In many cases, it is necessary or beneficial to “bin” the data,
i.e. divide the X axis into intervals and then calculate the average
values in these intervals. In R, you can use the
stats.bin() function from the fields package
for this purpose.
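The binned statistics printed below can be produced with a call along the following lines; the noisy linear data generated here is an assumption, since the original data-generation code is not shown.

```r
library (fields)

# Hypothetical noisy linear data spanning x = 1..100 (an assumption)
x <- 1:100
y <- x + rnorm (length (x), 0, 15)

# Divide the X range into about 10 bins and compute per-bin statistics
xy.sb <- stats.bin (x, y, N = 10)
xy.sb
```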
## $centers
## [1] 6.5 17.5 28.5 39.5 50.5 61.5 72.5 83.5 94.5
##
## $breaks
## [1] 1 12 23 34 45 56 67 78 89 100
##
## $stats
## 1 2 3 4 5 6
## N 12.000000 11.000000 11.000000 11.00000 11.00000 11.00000
## mean 11.677003 14.422808 25.815864 41.20827 46.77315 61.03136
## Std.Dev. 15.259525 10.357070 9.801423 12.28517 12.82800 10.08089
## min -12.757751 -4.510762 10.623311 23.38657 28.25050 46.20297
## Q1 1.414745 9.582796 21.673698 31.37182 36.15908 51.98240
## median 15.932712 17.444665 26.169323 38.61565 50.70907 65.65818
## Q3 23.977946 22.305422 29.911953 52.96734 56.86422 67.33589
## max 30.498196 23.014361 44.543366 54.92090 62.84830 75.36991
## missing values 0.000000 0.000000 0.000000 0.00000 0.00000 0.00000
## 7 8 9
## N 11.00000 11.00000 11.00000
## mean 70.34178 84.93131 95.37521
## Std.Dev. 14.75590 13.18000 12.70460
## min 54.08896 67.02167 75.79483
## Q1 60.50794 74.17385 85.70489
## median 62.25297 84.23083 99.79948
## Q3 80.30715 98.59010 101.21655
## max 93.55940 100.67813 117.15585
## missing values 0.00000 0.00000 0.00000
To make a “decent” plot, you would also need to include error bars -
they can be added in the ggplot2 package using the
geom_errorbar() or geom_pointrange()
functions.
library (dplyr)
library (magrittr)
library (ggplot2)
df.xy <- data.frame (x = x, y = y)
df.sb <- as.data.frame(t(xy.sb$stats))
df.sb %<>%
mutate (centers = xy.sb$centers,
std.err = 2*Std.Dev./sqrt(N))
ggplot(df.xy, aes (x = x, y = y)) +
geom_point (color = "gray50") +
geom_pointrange (data = df.sb,
aes (x = centers, y = mean, ymax = mean+std.err, ymin = mean-std.err),
color="red")

In Laboratory 6, we learned about the concept of Pearson correlation.
However, simply stating a linear correlation between two variables is
sometimes insufficient and we would like to find a functional
relationship between them in order to be able to
predict the values of one variable based on the other.
If the Pearson correlation is significant, linear regression may be an
appropriate model. In R, it is implemented through the
lm() function, but the way of entering the formula is
specific: R uses the tilde symbol ~ to express the
relationship between variables (e.g. y ~ x means that
y depends on x). Below we generate a randomly disturbed
linear relationship:
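The generation code itself is not shown, but it could look like the sketch below; the exact slope, intercept and noise level are assumptions chosen to resemble the fit reported further down.

```r
# Hypothetical data: a linear trend disturbed by Gaussian noise
x <- seq (0.5, 10, by = 0.5)                    # 20 points
y <- 2*x - 3.5 + rnorm (length (x), 0, 1.5)     # assumed coefficients and noise
plot (x, y, pch = 19)
```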
We perform linear regression and save the result of the procedure to
the variable xy.lm. After executing the
summary (xy.lm) command, we will receive all the details we
are interested in, in particular the p-value (the
Pr(>|t|) column and the last line of the summary)
informing us whether the fit of the linear model is statistically
justified. The null hypothesis of the test states that the true
value of the coefficient is zero, so there is no linear
relationship.
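The fit itself is a single call:

```r
xy.lm <- lm (y ~ x)   # regress y on x
summary (xy.lm)
```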
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3593 -0.8672 -0.3111 0.9124 2.6259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.57608 0.65189 -5.486 3.29e-05 ***
## x 2.11184 0.05442 38.807 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.403 on 18 degrees of freedom
## Multiple R-squared: 0.9882, Adjusted R-squared: 0.9875
## F-statistic: 1506 on 1 and 18 DF, p-value: < 2.2e-16
The \(R^2\) coefficient is also visible, which is equal to the square of the Pearson correlation coefficient between the variables.
## [1] 0.988189
The coefficients field contains the fitted model
coefficients, where coefficients[1] is the intercept and
coefficients[2] is the slope. We can extract the
coefficients and plot the fitted line.
## (Intercept) x
## -3.576081 2.111843
b <- xy.lm$coefficients[1]
a <- xy.lm$coefficients[2]
plot (x, y, pch=19)
lines (x, a*x+b, col="red", lwd=3)

A similar procedure can be performed for transformed variables. Here it is a randomized relationship \(y \sim x^{-2}\).
x <- c (1,2,5,10,20,50,100,200,500,1000)
y <- (x*(1+runif (length (x),-0.9,0.9)))^(-2)
plot (x, y, log="xy")
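The summary below corresponds to a fit performed on the logarithms of both variables (the power law becomes linear on the log scale):

```r
xy.lm <- lm (log (y) ~ log (x))
summary (xy.lm)
```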
## Call:
## lm(formula = log(y) ~ log(x))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7150 -0.5205 -0.2374 0.4479 1.8484
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5567 0.6234 -0.893 0.398
## log(x) -1.6608 0.1520 -10.928 4.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.064 on 8 degrees of freedom
## Multiple R-squared: 0.9372, Adjusted R-squared: 0.9294
## F-statistic: 119.4 on 1 and 8 DF, p-value: 4.358e-06
b <- exp (xy.lm$coefficients[1])
a <- xy.lm$coefficients[2]
plot (x, y, pch=19, log="xy")
lines (x, b*x^a, col="red", lwd=3)

Instead of rewriting the model coefficients into the formula
yourself, we can use the universal predict() function. Its
arguments are the object returned by the model generating function (in
this case by the lm() function) and a data frame containing
columns with explanatory variables (i.e. in this case the arguments of
the linear function). The function returns a vector of predicted
values of the outcome variable.
# Once again, an example from the mtcars dataset
fit <- lm (mpg ~ wt, data = mtcars)
new.data <- data.frame (wt = seq(4.1, 5.1, 0.1))
new.data$mpg <- predict (fit, new.data)
ggplot (mtcars, aes (x = wt, y = mpg)) +
geom_point () +
geom_point (data = new.data, color="red")

Sometimes we are interested in the effect of several explanatory
variables (predictors) on the outcome variable. The lm()
function allows you to include more advanced linear models.
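The multi-predictor summary below can be reproduced with a call like the following (the object name is ours):

```r
fit.multi <- lm (mpg ~ wt + cyl + hp, data = mtcars)
summary (fit.multi)
```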
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## cyl -0.94162 0.55092 -1.709 0.098480 .
## hp -0.01804 0.01188 -1.519 0.140015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
Note that the fit coefficients as well as p-values are different for the three-variable model than for the single-variable models.
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9814 -2.1185 0.2217 1.0717 7.5186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
## cyl -2.8758 0.3224 -8.92 6.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
## F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
Even more advanced linear models contain interactions between
variables, i.e. the dependence of the outcome variable on the
explanatory variable \(x_1\) depends on
another explanatory variable \(x_2\).
The lm() function will take into account all interactions
if we replace the + symbol with the * symbol
in the formula specifying the dependencies between variables. If we are
only interested in a selected interaction, we can use the :
symbol.
n <- 100
x1 <- rnorm (n, 175, 7)
x2 <- rnorm (n, 30, 8)
x3 <- abs (rnorm (n, 60, 30))
y <- x1 + 2*x2 - 3*x3 + 4*x1*x2 - 5*x1*x3 + 6*x2*x3 + 10 + rnorm(n, 0, 0.5)
df3 <- data.frame (x1, x2, x3, y)
y.lm <- lm (y ~ x1*x2*x3, data = df3)
summary (y.lm)

##
## Call:
## lm(formula = y ~ x1 * x2 * x3, data = df3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98313 -0.31526 0.01027 0.25425 1.33226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.502e+01 2.018e+01 0.745 0.45846
## x1 9.683e-01 1.150e-01 8.419 4.81e-13 ***
## x2 1.867e+00 5.856e-01 3.188 0.00196 **
## x3 -3.020e+00 3.501e-01 -8.628 1.75e-13 ***
## x1:x2 4.001e+00 3.345e-03 1196.078 < 2e-16 ***
## x1:x3 -5.000e+00 2.005e-03 -2493.200 < 2e-16 ***
## x2:x3 6.001e+00 1.019e-02 588.759 < 2e-16 ***
## x1:x2:x3 -6.447e-06 5.856e-05 -0.110 0.91259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5163 on 92 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.097e+10 on 7 and 92 DF, p-value: < 2.2e-16
Logistic regression is similar to linear regression, but here the outcome variable \(Y\) has a two-point distribution - e.g. success or failure. Because linear regression assumes normality of the outcome variable, we must first apply a transformation called the logistic (or logit) transformation. Instead of predicting one of the two categories directly (e.g. sick=1, healthy=0), we model the probability \(p\) that an individual falls into one of the two categories. However, the probability \(p\) is limited to the range \(\langle 0;1 \rangle\), which is still inconsistent with the assumptions of linear regression. If we replace the probability with the odds \(O = p/(1-p)\), defined as the ratio of the probability of success to the probability of failure, the range of values changes to \(\langle 0; +\infty )\), and after taking the logarithm to \((-\infty; +\infty )\). The outcome variable \(Z\) is the so-called logit, i.e. the natural logarithm of the odds of an event occurring:
\(Z = \ln(\frac{p}{1-p}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n\),
where the coefficients \(\beta_i\) determine the additive impact that a unit change of the variable \(x_i\) has on the log odds (equivalently, \(e^{\beta_i}\) is the factor by which the odds are multiplied).
The inverse function is given by:
\(p = \frac{e^Z}{1+e^Z} = \frac{1}{1+e^{-Z}}\).
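Both transformations are available in base R, which lets us verify the formulas numerically; this small illustration is ours, not part of the original example.

```r
p <- 0.8
z <- log (p/(1-p))   # the logit of p
plogis (z)           # built-in inverse logit, recovers 0.8
qlogis (p)           # built-in logit, equals z
```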
In R, logistic regression can be performed using the more
general function glm() (Generalized Linear
Models). When calling the function, the parameter
family = binomial must be provided, and the outcome variable
can be passed in one of three ways:
1. a factor variable specifying whether the observation
is positive or negative; in this case, the length of the vector will be
equal to the number of samples:

heart <- data.frame (heart.attack = factor (sample (c ("yes","no"), 1000, replace = TRUE)),
gender = factor (sample (c ("male","female"), 1000, replace = TRUE)),
smoking = factor (sample (c ("yes","no"), 1000, replace = TRUE)),
hypertension = factor (sample (c ("yes","no"), 1000, replace = TRUE)))
heart.lr1 <- glm (heart.attack ~ gender + smoking + hypertension, data = heart, family = binomial)
heart.lr1

##
## Call: glm(formula = heart.attack ~ gender + smoking + hypertension,
## family = binomial, data = heart)
##
## Coefficients:
## (Intercept) gendermale smokingyes hypertensionyes
## -0.023477 -0.083875 0.003454 -0.065942
##
## Degrees of Freedom: 999 Total (i.e. Null); 996 Residual
## Null Deviance: 1384
## Residual Deviance: 1383 AIC: 1391
2. a two-column matrix whose first column contains the counts of
successes and the second the counts of failures (built with
cbind()); the data must first be aggregated into this wide form:
library (tidyr)
heart %>%
table() %>%
as.data.frame() %>%
pivot_wider(names_from = heart.attack, values_from = Freq) -> heart.wide
heart.wide

## # A tibble: 8 × 5
## gender smoking hypertension no yes
## <fct> <fct> <fct> <int> <int>
## 1 female no no 63 51
## 2 male no no 73 67
## 3 female yes no 62 72
## 4 male yes no 60 52
## 5 female no yes 72 66
## 6 male no yes 52 52
## 7 female yes yes 63 57
## 8 male yes yes 79 59
heart.lr2 <- glm (cbind (yes,no) ~ gender + smoking + hypertension, data = heart.wide, family = binomial)
heart.lr2

##
## Call: glm(formula = cbind(yes, no) ~ gender + smoking + hypertension,
## family = binomial, data = heart.wide)
##
## Coefficients:
## (Intercept) gendermale smokingyes hypertensionyes
## -0.023477 -0.083875 0.003454 -0.065942
##
## Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
## Null Deviance: 4.007
## Residual Deviance: 3.311 AIC: 53.49
3. a vector of success proportions; in this case, the total number of
observations in each row must be passed via the
weights parameter:

heart.wide %<>% mutate (p = yes/(yes+no))
heart.lr3 <- glm (p ~ gender + smoking + hypertension, weights = yes+no, data = heart.wide, family = binomial)
heart.lr3

##
## Call: glm(formula = p ~ gender + smoking + hypertension, family = binomial,
## data = heart.wide, weights = yes + no)
##
## Coefficients:
## (Intercept) gendermale smokingyes hypertensionyes
## -0.023477 -0.083875 0.003454 -0.065942
##
## Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
## Null Deviance: 4.007
## Residual Deviance: 3.311 AIC: 53.49
All explanatory variables are discrete in the above example, but in fact, they can be both continuous and discrete, as in the following example.
# Let us consider a dataset containing information about candidates for second-cycle studies.
# It contains the following columns:
#
# admit - binary variable telling whether a student was admitted or not,
# gre - graduate record exam - result of the final examination at the end of first-cycle studies,
# gpa - grade point average - average grade from first-cycle studies,
# rank - a qualitative variable with values from 1 to 4, determines the prestige of the completed school,
# the lower the better
studies <- read.csv("studia.csv")
studies %<>%
mutate (admit = factor (admit),
rank = factor (rank))
summary (studies)

## admit gre gpa rank
## 0:273 Min. :220.0 Min. :2.260 1: 61
## 1:127 1st Qu.:520.0 1st Qu.:3.130 2:151
## Median :580.0 Median :3.395 3:121
## Mean :587.7 Mean :3.390 4: 67
## 3rd Qu.:660.0 3rd Qu.:3.670
## Max. :800.0 Max. :4.000
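The regression summary that follows comes from a model fitted along these lines (the object name studies.lr is used later in the text):

```r
studies.lr <- glm (admit ~ gre + gpa + rank, data = studies, family = binomial)
summary (studies.lr)
```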
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = binomial, data = studies)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
According to the summary shown above, the logistic regression model can be written as:
\(\ln(\frac{p}{1-p}) = -3.990 + 0.002 x_{gre} + 0.804 x_{gpa} - 0.675 x_{rank2} - 1.340 x_{rank3} - 1.551 x_{rank4}\),
where \(x_{gre}\) is the value of the gre variable, \(x_{gpa}\) is the value of the gpa variable, and \(x_{rank2}\), \(x_{rank3}\), \(x_{rank4}\) are indicator variables equal to one for the corresponding rank level.

What may be puzzling is the lack of a coefficient for the \(x_{rank1}\) variable in the above formula.
It is not listed explicitly because it is absorbed into the intercept \(\alpha =
-3.990\): rank \(1\) is the reference level for all other variants of the
rank variable. For example, the coefficient at \(x_{rank2}\) of \(-0.675\) means that the ratio of the odds of success
of a person from a university with a ranking of \(2\) to the odds of a person from a
university with a ranking of \(1\) is
\(\exp(-0.675) = 0.51\). You can also
say the opposite, that the odds of success for a person from a
university with a ranking of \(1\) is
\(1/\exp(-0.675) = 1.96\) times greater
than that of a person from a university with a ranking of \(2\). The exact value of the logarithm of
the odds of success for a person from a university with a ranking of
\(1\) would be clearly visible if the
only explanatory variable in the model was rank. It would
then be equal to the intercept \(\alpha\), as shown in the example of the
UCBAdmissions dataset later in this manual.
The coefficient for the variable \(x_{gpa}\) is \(0.804\), which means that increasing \(x_{gpa}\) by one increases the logarithm of the odds ratio by \(0.804\), i.e. increases the odds of getting into college by \(2.23\) times (\(\exp (0.804) = 2.23\)).
Using the predict() function will provide the
probability of success for individual candidates, and this can be
obtained in two ways.
# Method 1
studies$logit <- predict (studies.lr)
studies$p <- 1/(1+exp(-studies$logit))
# Method 2
studies$p2 <- predict (studies.lr, type = "response")
head (studies)

## admit gre gpa rank logit p p2
## 1 0 380 3.61 3 -1.5671256 0.1726265 0.1726265
## 2 1 660 3.67 3 -0.8848442 0.2921750 0.2921750
## 3 1 800 4.00 1 1.0377118 0.7384082 0.7384082
## 4 1 640 3.19 4 -1.5273305 0.1783846 0.1783846
## 5 0 520 2.93 4 -2.0081113 0.1183539 0.1183539
## 6 1 760 3.00 2 -0.5323458 0.3699699 0.3699699
Knowing the probabilities, we can easily calculate the odds and verify whether our interpretation of the coefficient for the \(x_{gpa}\) variable is correct. The most common final test score is \(620\). Of those with this score, \(13\) graduated from a university with a ranking of \(2\).
studies %>%
filter (gre==620, rank==2) %>%
mutate (gpa.diff = gpa - min (gpa),
exp.beta = exp (studies.lr$coefficients["gpa"]*gpa.diff)) -> studies2
studies2

## admit gre gpa rank logit p p2 gpa.diff exp.beta
## 1 1 620 3.18 2 -0.70463861 0.3307846 0.3307846 0.33 1.303864
## 2 0 620 3.07 2 -0.79308274 0.3115071 0.3115071 0.22 1.193498
## 3 1 620 3.17 2 -0.71267898 0.3290072 0.3290072 0.32 1.293423
## 4 0 620 3.40 2 -0.52775035 0.3710417 0.3710417 0.55 1.556159
## 5 1 620 3.37 2 -0.55187147 0.3654303 0.3654303 0.52 1.519072
## 6 0 620 3.58 2 -0.38302359 0.4053979 0.4053979 0.73 1.798490
## 7 0 620 3.05 2 -0.80916349 0.3080688 0.3080688 0.20 1.174459
## 8 0 620 3.22 2 -0.67247711 0.3379424 0.3379424 0.37 1.346480
## 9 1 620 3.45 2 -0.48754847 0.3804713 0.3804713 0.60 1.619994
## 10 0 620 2.85 2 -0.96997100 0.2748863 0.2748863 0.00 1.000000
## 11 0 620 3.63 2 -0.34282171 0.4151242 0.4151242 0.78 1.872266
## 12 1 620 3.75 2 -0.24633720 0.4387252 0.4387252 0.90 2.061912
## 13 0 620 4.00 2 -0.04532782 0.4886700 0.4886700 1.15 2.520969
odds <- studies2$p/(1-studies2$p) # Compute the odds of students
odds/odds[which(studies2$gpa.diff==0)] # Compute the odds ratios

## [1] 1.303864 1.193498 1.293423 1.556159 1.519072 1.798490 1.174459 1.346480
## [9] 1.619994 1.000000 1.872266 2.061912 2.520969
Using the predict() function, we can check the
probabilities of new candidates for studies. To do this, we create a new
data frame.
new.students <- data.frame (gre = mean(studies$gre), gpa = mean(studies$gpa), rank = factor(1:4))
new.students$p <- predict (studies.lr, newdata = new.students, type = "response")
new.students

## gre gpa rank p
## 1 587.7 3.3899 1 0.5166016
## 2 587.7 3.3899 2 0.3522846
## 3 587.7 3.3899 3 0.2186120
## 4 587.7 3.3899 4 0.1846684
For explanatory variables on a dichotomous scale (e.g., gender), \(e^\beta\) is an estimate of the odds of, say, men compared to women.
# We have the following data frame containing information about the results of a speed check on a road
# where the speed limit was 90 km/h.
speed.check <- data.frame (gender = rep ( c ("M","F"), c (60,40)),
speed = c (runif (60, 60, 120), runif (40, 50, 100)))
# Creates the logical vector with the information whether the speed limit was exceeded
speed.check$exceed <- speed.check$speed>90
head(speed.check)

## gender speed exceed
## 1 M 96.96644 TRUE
## 2 M 74.65544 FALSE
## 3 M 103.78498 TRUE
## 4 M 91.42864 TRUE
## 5 M 104.71827 TRUE
## 6 M 70.65747 FALSE
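The contingency tables and the chi-squared test printed below were presumably produced as follows (the object name tab is reused later in the text):

```r
tab <- table (speed.check$gender, speed.check$exceed)
tab               # counts of exceeders per gender
prop.table (tab)  # the same table as proportions
chisq.test (tab)  # test of independence between gender and exceeding
```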
##
## FALSE TRUE
## F 30 10
## M 30 30
##
## FALSE TRUE
## F 0.3 0.1
## M 0.3 0.3
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 5.2517, df = 1, p-value = 0.02192
# Logistic regression
speed.check.lr <- glm (exceed ~ gender, data=speed.check, family=binomial)
summary(speed.check.lr)

##
## Call:
## glm(formula = exceed ~ gender, family = binomial, data = speed.check)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0986 0.3651 -3.009 0.00262 **
## genderM 1.0986 0.4472 2.457 0.01403 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 134.60 on 99 degrees of freedom
## Residual deviance: 128.16 on 98 degrees of freedom
## AIC: 132.16
##
## Number of Fisher Scoring iterations: 4
# Calculates the probabilities and odds that a woman will exceed the speed limit
p.f <- tab["F","TRUE"]/sum(tab["F",])
p.f

## [1] 0.25
## [1] 0.3333333
# Calculates the probabilities and odds that a man will exceed the speed limit
p.m <- tab["M","TRUE"]/sum(tab["M",])
p.m

## [1] 0.5
## [1] 1
## [1] 3
## (Intercept) genderM
## 0.3333333 3.0000000
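The odds, the odds ratio and the exponentiated coefficients shown above can be reproduced as follows:

```r
o.f <- p.f/(1-p.f)   # odds that a woman exceeds the limit (0.333)
o.m <- p.m/(1-p.m)   # odds that a man exceeds the limit (1)
o.m/o.f              # odds ratio, equals exp of the genderM coefficient (3)
exp (speed.check.lr$coefficients)
```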
The UCBAdmissions collection from the
datasets package contains information about admissions to
the University of California, Berkeley. We will check how the
probability of admission depends on the faculty the candidate is trying
to get into.
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
## Admit Gender Dept Freq
## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
## 5 Admitted Male B 353
## 6 Rejected Male B 207
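The long-format data frame shown above is presumably obtained by flattening the contingency table:

```r
library (datasets)
admissions <- as.data.frame (UCBAdmissions)  # one row per Admit/Gender/Dept cell
head (admissions)
```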
# Transform table from long to wide
admissions %<>% pivot_wider (names_from = "Admit", values_from = "Freq")
# Ignore the Gender variable
admissions %<>% group_by (Dept) %>% summarise (admitted = sum (Admitted), rejected = sum (Rejected))
# Compute probabilities and odds
admissions %<>% mutate (p = admitted/(admitted+rejected)) %>% mutate (o = p/(1-p))
admissions

## # A tibble: 6 × 5
## Dept admitted rejected p o
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 A 601 332 0.644 1.81
## 2 B 370 215 0.632 1.72
## 3 C 322 596 0.351 0.540
## 4 D 269 523 0.340 0.514
## 5 E 147 437 0.252 0.336
## 6 F 46 668 0.0644 0.0689
# Perform logistic regression
admissions.lr <- glm (cbind (admitted,rejected) ~ Dept, data = admissions, family = binomial)
summary (admissions.lr)

##
## Call:
## glm(formula = cbind(admitted, rejected) ~ Dept, family = binomial,
## data = admissions)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.59346 0.06838 8.679 <2e-16 ***
## DeptB -0.05059 0.10968 -0.461 0.645
## DeptC -1.20915 0.09726 -12.432 <2e-16 ***
## DeptD -1.25833 0.10152 -12.395 <2e-16 ***
## DeptE -1.68296 0.11733 -14.343 <2e-16 ***
## DeptF -3.26911 0.16707 -19.567 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.5532e+02 on 5 degrees of freedom
## Residual deviance: 3.3595e-13 on 0 degrees of freedom
## AIC: 52.298
##
## Number of Fisher Scoring iterations: 3
## (Intercept) DeptB DeptC DeptD DeptE DeptF
## 1.81024096 0.95066362 0.29845113 0.28412811 0.18582302 0.03804039
## [1] 1.00000000 0.95066362 0.29845113 0.28412811 0.18582302 0.03804039
Let us consider the Titanic dataset from the
datasets library. It is a 4-dimensional
table containing information about 2,201 passengers and crew of the ill-fated
voyage.
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
## , , Age = Adult, Survived = Yes
##
## Sex
## Class Male Female
## 1st 57 140
## 2nd 14 80
## 3rd 75 76
## Crew 192 20
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
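The long-format data frame above is presumably the flattened contingency table:

```r
titanic <- as.data.frame (Titanic)  # one row per Class/Sex/Age/Survived cell
head (titanic)
```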
# Transform table from long to wide
titanic %<>% pivot_wider(names_from = "Survived", values_from = "Freq")
names(titanic)[4:5] <- c("Died","Survived")
head(titanic)

## # A tibble: 6 × 5
## Class Sex Age Died Survived
## <fct> <fct> <fct> <dbl> <dbl>
## 1 1st Male Child 0 5
## 2 2nd Male Child 0 11
## 3 3rd Male Child 35 13
## 4 Crew Male Child 0 0
## 5 1st Female Child 0 1
## 6 2nd Female Child 0 13
# Perform logistic regression
titanic.lr <- glm (cbind (Survived,Died) ~ Class+Sex+Age, data = titanic, family = binomial)
summary(titanic.lr)

##
## Call:
## glm(formula = cbind(Survived, Died) ~ Class + Sex + Age, family = binomial,
## data = titanic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6853 0.2730 2.510 0.0121 *
## Class2nd -1.0181 0.1960 -5.194 2.05e-07 ***
## Class3rd -1.7778 0.1716 -10.362 < 2e-16 ***
## ClassCrew -0.8577 0.1573 -5.451 5.00e-08 ***
## SexFemale 2.4201 0.1404 17.236 < 2e-16 ***
## AgeAdult -1.0615 0.2440 -4.350 1.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 671.96 on 13 degrees of freedom
## Residual deviance: 112.57 on 8 degrees of freedom
## AIC: 171.19
##
## Number of Fisher Scoring iterations: 5
The analysis shows that all variables are statistically significant. The results indicate that women on the ship had 11.25 times \((\exp(2.42)=11.25)\) greater odds of survival than men, and third-class passengers had 5.92 times \((1/\exp(-1.78)=5.92)\) lower odds of survival than first-class passengers. Moreover, the logistic regression model is multiplicative, which means that men from the third class had \(11.25 \cdot 5.92 = 66.6\) times lower odds of survival than women from the first class.
If the result variable \(Y\) is the frequency of an event (rate) or simply the number of counts in some period, we can use Poisson regression to find its dependence on the explanatory variables \(x_i\). The name of this regression comes from the assumption that \(Y\) comes from the Poisson distribution \(P(Y=y) = \lambda^y e^{-\lambda}/y!\). An example is the frequency of epileptic seizures depending on various factors (experiencing certain diseases, diet, taking certain medications). The Poisson regression model takes a similar form to the logistic regression model:
\(\ln(y) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n\),
where \(y\) is the expected frequency or number of occurrences for individuals with a given set of values \(x_1, \dots, x_n\), and \(\alpha, \beta_1, \dots, \beta_n\) are the regression coefficients. The exponents of individual coefficients are the estimated relative frequencies associated with the relevant variables. For only one explanatory variable we have:
\(y = e^\alpha \cdot e^{\beta x_1}\)
In R, Poisson regression can be performed using the
glm() (Generalized Linear Models) function with the
family = poisson parameter. As an example, let us consider
the warpbreaks dataset, which contains information on the
number of warp breaks per loom depending on the wool type and
tension.
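The listing below suggests the built-in dataset was first copied into a variable named warp, e.g.:

```r
warp <- warpbreaks  # work on a copy of the built-in dataset
warp
```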
## breaks wool tension
## 1 26 A L
## 2 30 A L
## 3 54 A L
## 4 25 A L
## 5 70 A L
## 6 52 A L
## 7 51 A L
## 8 26 A L
## 9 67 A L
## 10 18 A M
## 11 21 A M
## 12 29 A M
## 13 17 A M
## 14 12 A M
## 15 18 A M
## 16 35 A M
## 17 30 A M
## 18 36 A M
## 19 36 A H
## 20 21 A H
## 21 24 A H
## 22 18 A H
## 23 10 A H
## 24 43 A H
## 25 28 A H
## 26 15 A H
## 27 26 A H
## 28 27 B L
## 29 14 B L
## 30 29 B L
## 31 19 B L
## 32 29 B L
## 33 31 B L
## 34 41 B L
## 35 20 B L
## 36 44 B L
## 37 42 B M
## 38 26 B M
## 39 19 B M
## 40 16 B M
## 41 39 B M
## 42 28 B M
## 43 21 B M
## 44 39 B M
## 45 29 B M
## 46 20 B H
## 47 21 B H
## 48 24 B H
## 49 17 B H
## 50 13 B H
## 51 15 B H
## 52 15 B H
## 53 16 B H
## 54 28 B H
g <- ggplot(warp)
g +
geom_histogram (aes (x = breaks, y = after_stat(density)), breaks = seq (0,70,10), alpha = 0.5) +
geom_density (aes (x=breaks), color = "red", linewidth = 1.5)

# Poisson regression
warp.pr <- glm (breaks ~ wool+tension, data = warp, family = poisson)
summary (warp.pr)

##
## Call:
## glm(formula = breaks ~ wool + tension, family = poisson, data = warp)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
## woolB -0.20599 0.05157 -3.994 6.49e-05 ***
## tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
## tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 297.37 on 53 degrees of freedom
## Residual deviance: 210.39 on 50 degrees of freedom
## AIC: 493.06
##
## Number of Fisher Scoring iterations: 4
The above results indicate that wool type B produces about 0.814
times as many warp breaks as wool type A (exp(-0.206)=0.814).
In other words, changing the wool from A to B reduces the expected number of
breaks by about 18.6%. Let's check the predictive performance of the model
we built.
warp$predict <- predict (warp.pr, type="response")
warp %>%
group_by (wool, tension) %>%
summarise (breaks.mean = mean (breaks),
breaks.sderr = sd (breaks)/sqrt(n()),
             predict = mean (predict)) -> warp.summary

## `summarise()` has grouped output by 'wool'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## # Groups: wool [2]
## wool tension breaks.mean breaks.sderr predict
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 A L 44.6 6.03 40.1
## 2 A M 24 2.89 29.1
## 3 A H 24.6 3.42 23.9
## 4 B L 28.2 3.29 32.7
## 5 B M 28.8 3.14 23.7
## 6 B H 18.8 1.63 19.4
The second example of using Poisson regression is modeling the
number of plant species on the Galapagos Islands depending on various
factors. The gala collection is part of the
faraway package, which must be installed and included
before loading data.
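A minimal sketch of loading and inspecting the data:

```r
# install.packages ("faraway")  # if the package is not installed yet
library (faraway)
data (gala)
head (gala)
summary (gala)
```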
## Species Endemics Area Elevation Nearest Scruz Adjacent
## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84
## Species Endemics Area Elevation
## Min. : 2.00 Min. : 0.00 Min. : 0.0100 Min. : 25.00
## 1st Qu.: 13.00 1st Qu.: 7.25 1st Qu.: 0.2575 1st Qu.: 97.75
## Median : 42.00 Median :18.00 Median : 2.5900 Median : 192.00
## Mean : 85.23 Mean :26.10 Mean : 261.7087 Mean : 368.03
## 3rd Qu.: 96.00 3rd Qu.:32.25 3rd Qu.: 59.2375 3rd Qu.: 435.25
## Max. :444.00 Max. :95.00 Max. :4669.3200 Max. :1707.00
## Nearest Scruz Adjacent
## Min. : 0.20 Min. : 0.00 Min. : 0.03
## 1st Qu.: 0.80 1st Qu.: 11.03 1st Qu.: 0.52
## Median : 3.05 Median : 46.65 Median : 2.59
## Mean :10.06 Mean : 56.98 Mean : 261.10
## 3rd Qu.:10.03 3rd Qu.: 81.08 3rd Qu.: 59.24
## Max. :47.40 Max. :290.20 Max. :4669.32
g <- ggplot(gala)
g + geom_histogram (aes (x = Species, y = after_stat (density)), breaks = seq (0,450,50), alpha=0.5) +
  geom_density (aes (x = Species), color = "red", linewidth = 1.5)
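The model summary below corresponds to a Poisson fit on all remaining columns (the object name gala.pr is an assumption, consistent with gala.pr2 used later):

```r
gala.pr <- glm (Species ~ ., data = gala, family = poisson)
summary (gala.pr)
```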
## Call:
## glm(formula = Species ~ ., family = poisson, data = gala)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.828e+00 5.958e-02 47.471 < 2e-16 ***
## Endemics 3.388e-02 1.741e-03 19.459 < 2e-16 ***
## Area -1.067e-04 3.741e-05 -2.853 0.00433 **
## Elevation 2.638e-04 1.934e-04 1.364 0.17264
## Nearest 1.048e-02 1.611e-03 6.502 7.91e-11 ***
## Scruz -6.835e-04 5.802e-04 -1.178 0.23877
## Adjacent 4.539e-05 4.800e-05 0.946 0.34437
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 313.36 on 23 degrees of freedom
## AIC: 488.19
##
## Number of Fisher Scoring iterations: 5
The results obtained using Poisson regression suggest that the island's
highest elevation (Elevation), the distance from Santa Cruz
Island (Scruz) and the area of the adjacent island
(Adjacent) are not statistically significant predictors of the
number of species on an island. We can therefore build the model again,
this time without these variables.
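The reduced model keeps only the significant predictors; the object name gala.pr2 matches the predict() call used further down:

```r
# Refit without Elevation, Scruz and Adjacent
gala.pr2 <- glm (Species ~ Endemics + Area + Nearest, family = poisson, data = gala)
summary (gala.pr2)
```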
##
## Call:
## glm(formula = Species ~ Endemics + Area + Nearest, family = poisson,
## data = gala)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.868e+00 4.883e-02 58.729 < 2e-16 ***
## Endemics 3.551e-02 7.327e-04 48.462 < 2e-16 ***
## Area -4.542e-05 1.568e-05 -2.896 0.00378 **
## Nearest 9.289e-03 1.319e-03 7.043 1.88e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 330.84 on 26 degrees of freedom
## AIC: 499.67
##
## Number of Fisher Scoring iterations: 5
gala$predict <- predict (gala.pr2, type = "response")
gala %<>%
mutate (rel.err = (predict-Species)/Species)
gala
## Species Endemics Area Elevation Nearest Scruz Adjacent
## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84
## Daphne.Minor 24 0 0.08 93 6.0 12.0 0.34
## Darwin 10 7 2.33 168 34.1 290.2 2.85
## Eden 8 4 0.03 71 0.4 0.4 17.95
## Enderby 2 2 0.18 112 2.6 50.2 0.10
## Espanola 97 26 58.27 198 1.1 88.3 0.57
## Fernandina 93 35 634.49 1494 4.3 95.3 4669.32
## Gardner1 58 17 0.57 49 1.1 93.1 58.27
## Gardner2 5 4 0.78 227 4.6 62.2 0.21
## Genovesa 40 19 17.35 76 47.4 92.2 129.49
## Isabela 347 89 4669.32 1707 0.7 28.1 634.49
## Marchena 51 23 129.49 343 29.1 85.9 59.56
## Onslow 2 2 0.01 25 3.3 45.9 0.10
## Pinta 104 37 59.56 777 29.1 119.6 129.49
## Pinzon 108 33 17.95 458 10.7 10.7 0.03
## Las.Plazas 12 9 0.23 94 0.5 0.6 25.09
## Rabida 70 30 4.89 367 4.4 24.4 572.33
## SanCristobal 280 65 551.62 716 45.2 66.6 0.57
## SanSalvador 237 81 572.33 906 0.2 19.8 4.89
## SantaCruz 444 95 903.82 864 0.6 0.0 0.52
## SantaFe 62 28 24.08 259 16.5 16.5 0.52
## SantaMaria 285 73 170.92 640 2.6 49.2 0.10
## Seymour 44 16 1.84 147 0.6 9.6 25.09
## Tortuga 16 8 1.24 186 6.8 50.9 17.95
## Wolf 21 12 2.85 253 34.1 254.7 2.33
## predict rel.err
## Baltra 40.00437 -0.31026951
## Bartolome 37.30235 0.20330150
## Caldwell 20.09352 5.69783856
## Champion 24.65770 -0.01369189
## Coamano 18.56038 8.28018851
## Daphne.Major 28.01541 0.55641144
## Daphne.Minor 18.60821 -0.22465772
## Darwin 30.97144 2.09714399
## Eden 20.36098 1.54512296
## Enderby 19.35659 8.67829549
## Espanola 44.64085 -0.53978500
## Fernandina 61.66847 -0.33689815
## Gardner1 32.51494 -0.43939759
## Gardner2 21.17029 3.23405714
## Genovesa 53.62501 0.34062531
## Isabela 337.81672 -0.02646479
## Marchena 51.88202 0.01729450
## Onslow 19.48301 8.74150423
## Pinta 85.56342 -0.17727482
## Pinzon 62.69026 -0.41953465
## Las.Plazas 24.33899 1.02824875
## Rabida 53.18400 -0.24022861
## SanCristobal 262.61134 -0.06210236
## SanSalvador 304.86604 0.28635460
## SantaCruz 495.53694 0.11607418
## SantaFe 55.38243 -0.10673495
## SantaMaria 238.96896 -0.16151242
## Seymour 31.23346 -0.29014869
## Tortuga 24.90439 0.55652408
## Wolf 36.98754 0.76131121
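The rel.err column already shows that the model does poorly on species-poor islands (Caldwell, Coamano, Enderby, Onslow) while tracking the larger islands closely. A quick visual check of this pattern, plotting predicted against observed counts on a log-log scale with the identity line for reference (a sketch, not part of the original output):

```r
library (ggplot2)

# Points near the dashed y = x line indicate good predictions;
# log scales spread out the small islands where the fit is worst
ggplot (gala, aes (x = Species, y = predict)) +
  geom_point () +
  geom_abline (slope = 1, intercept = 0, linetype = "dashed") +
  scale_x_log10 () +
  scale_y_log10 ()
```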