DATA ANALYSIS AND VISUALIZATION IN R, WINTER 2025 EDITION



Binning

In many cases, it is necessary or beneficial to “bin” the data, i.e. divide the x axis into intervals and then calculate the average values of y within these intervals. In R, you can use the stats.bin() function from the fields package for this purpose.

library (fields)

x <- 1:100
y <- x + runif (length (x), -20, 20)

xy.sb <- stats.bin (x, y)
xy.sb
## $centers
## [1]  6.5 17.5 28.5 39.5 50.5 61.5 72.5 83.5 94.5
## 
## $breaks
##  [1]   1  12  23  34  45  56  67  78  89 100
## 
## $stats
##                         1         2         3        4        5        6
## N               12.000000 11.000000 11.000000 11.00000 11.00000 11.00000
## mean            11.677003 14.422808 25.815864 41.20827 46.77315 61.03136
## Std.Dev.        15.259525 10.357070  9.801423 12.28517 12.82800 10.08089
## min            -12.757751 -4.510762 10.623311 23.38657 28.25050 46.20297
## Q1               1.414745  9.582796 21.673698 31.37182 36.15908 51.98240
## median          15.932712 17.444665 26.169323 38.61565 50.70907 65.65818
## Q3              23.977946 22.305422 29.911953 52.96734 56.86422 67.33589
## max             30.498196 23.014361 44.543366 54.92090 62.84830 75.36991
## missing values   0.000000  0.000000  0.000000  0.00000  0.00000  0.00000
##                       7         8         9
## N              11.00000  11.00000  11.00000
## mean           70.34178  84.93131  95.37521
## Std.Dev.       14.75590  13.18000  12.70460
## min            54.08896  67.02167  75.79483
## Q1             60.50794  74.17385  85.70489
## median         62.25297  84.23083  99.79948
## Q3             80.30715  98.59010 101.21655
## max            93.55940 100.67813 117.15585
## missing values  0.00000   0.00000   0.00000
plot (x, y)
points (xy.sb$centers, xy.sb$stats["mean",], col="red", pch=19, cex=1.5)
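The same idea can be sketched in base R without the fields package, using cut() to assign each point to a bin and tapply() to average within bins. The bin edges below are chosen by hand and do not match stats.bin()'s defaults.

```r
# Base-R binning sketch: assign x to hand-picked bins, average y per bin
brk <- seq (0, 100, 10)
mid <- brk[-1] - 5                            # bin midpoints
avg <- tapply (y, cut (x, breaks = brk), mean)
points (mid, avg, col = "blue", pch = 17)     # add to the existing plot
```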

To make a “decent” plot, you would also need to include error bars; in the ggplot2 package they can be added with the geom_errorbar() or geom_pointrange() functions.

library (dplyr)
library (magrittr)
library (ggplot2)

df.xy <- data.frame (x = x, y = y)
df.sb <- as.data.frame(t(xy.sb$stats))
df.sb %<>%
  mutate (centers = xy.sb$centers,
          std.err = 2*Std.Dev./sqrt(N))

ggplot(df.xy, aes (x = x, y = y)) +
  geom_point (color = "gray50") +
  geom_pointrange (data = df.sb, 
                   aes (x = centers, y = mean, ymax = mean+std.err, ymin = mean-std.err), 
                   color="red")


Linear regression

In Laboratory 6, we learned about the concept of Pearson correlation. However, simply stating that two variables are linearly correlated is sometimes insufficient; we would like to find a functional relationship between them in order to predict the values of one variable from the other. If the Pearson correlation is significant, linear regression may be an appropriate model. In R it is implemented by the lm() function, but the formula notation is specific: R uses the tilde symbol ~ to express the relationship between variables (e.g. y ~ x means that y depends on x). Below we generate a randomly perturbed linear relationship:

x <- 1:20
y <- 2*x-2 + runif (length (x), -3, 3)
plot (x, y)

We perform linear regression and save the result to the variable xy.lm. Executing summary (xy.lm) prints all the details we are interested in, in particular the p-values (the Pr(>|t|) column and the last line of the summary), which tell us whether the linear fit is statistically justified. The null hypothesis of each test states that the true value of the corresponding coefficient is zero, i.e. that there is no linear relationship.

xy.lm <- lm (y ~ x)
summary (xy.lm)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3593 -0.8672 -0.3111  0.9124  2.6259 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.57608    0.65189  -5.486 3.29e-05 ***
## x            2.11184    0.05442  38.807  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.403 on 18 degrees of freedom
## Multiple R-squared:  0.9882, Adjusted R-squared:  0.9875 
## F-statistic:  1506 on 1 and 18 DF,  p-value: < 2.2e-16

Also shown is the \(R^2\) coefficient, which equals the square of the Pearson correlation coefficient between the variables.

cor (x,y)^2
## [1] 0.988189

The coefficients field contains the fitted model coefficients: coefficients[1] is the intercept and coefficients[2] is the slope. We can extract them and plot the fitted line.

xy.lm$coefficients
## (Intercept)           x 
##   -3.576081    2.111843
b <- xy.lm$coefficients[1]
a <- xy.lm$coefficients[2]
plot (x, y, pch=19)
lines (x, a*x+b, col="red", lwd=3)
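Base R can also draw the regression line directly from the model object with abline(), which reads the intercept and slope itself, so no manual coefficient extraction is needed:

```r
# Equivalent plot using abline() on the fitted lm object
plot (x, y, pch=19)
abline (xy.lm, col="red", lwd=3)
```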

A similar procedure can be performed on transformed variables. Here we use a noisy power-law relationship \(y \sim x^{-2}\).

x <- c (1,2,5,10,20,50,100,200,500,1000)
y <- (x*(1+runif (length (x),-0.9,0.9)))^(-2)
plot (x, y, log="xy")

xy.lm <- lm (log(y) ~ log(x))
summary (xy.lm)
## 
## Call:
## lm(formula = log(y) ~ log(x))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7150 -0.5205 -0.2374  0.4479  1.8484 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.5567     0.6234  -0.893    0.398    
## log(x)       -1.6608     0.1520 -10.928 4.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.064 on 8 degrees of freedom
## Multiple R-squared:  0.9372, Adjusted R-squared:  0.9294 
## F-statistic: 119.4 on 1 and 8 DF,  p-value: 4.358e-06
b <- exp (xy.lm$coefficients[1])
a <- xy.lm$coefficients[2]
plot (x, y, pch=19, log="xy")
lines (x, b*x^a, col="red", lwd=3)

Instead of copying the model coefficients into a formula ourselves, we can use the generic predict() function. Its arguments are the object returned by the model-fitting function (in this case lm()) and a data frame containing columns with the explanatory variables (in this case, the arguments of the linear function). The function returns a vector of predicted values of the outcome variable.

# Once again, an example from the mtcars dataset
fit <- lm (mpg ~ wt, data = mtcars)
new.data <- data.frame (wt = seq(4.1, 5.1, 0.1))
new.data$mpg <- predict (fit, new.data)

ggplot (mtcars, aes (x = wt, y = mpg)) + 
  geom_point () + 
  geom_point (data = new.data, color="red")
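ggplot2 can also fit and draw the same linear model in one step with geom_smooth(); a short sketch:

```r
library (ggplot2)

# geom_smooth() fits mpg ~ wt internally and draws the line
# together with a confidence band
ggplot (mtcars, aes (x = wt, y = mpg)) +
  geom_point () +
  geom_smooth (method = "lm")
```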


Multivariate linear regression

Sometimes we are interested in the effect of several explanatory variables (predictors) on the outcome variable. The lm() function can fit such more advanced linear models as well.

fit2 <- lm (mpg ~ wt+cyl+hp, data = mtcars)
summary (fit2)
## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## hp          -0.01804    0.01188  -1.519 0.140015    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

Note that both the fitted coefficients and the p-values differ between the three-variable model and the single-variable models.

lm (mpg ~ wt, data = mtcars) %>% summary()
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
lm (mpg ~ cyl, data = mtcars) %>% summary()
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9814 -2.1185  0.2217  1.0717  7.5186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
## cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared:  0.7262, Adjusted R-squared:  0.7171 
## F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10
lm (mpg ~ hp, data = mtcars) %>% summary()
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Even more advanced linear models include interactions between variables, i.e. cases where the dependence of the outcome variable on one explanatory variable \(x_1\) itself depends on another explanatory variable \(x_2\). The lm() function will include all interactions if we replace the + symbol with the * symbol in the model formula. If we are only interested in a selected interaction, we can use the : symbol.

n  <- 100
x1 <- rnorm (n, 175, 7)
x2 <- rnorm (n, 30, 8)
x3 <- abs (rnorm (n, 60, 30))

y <- x1 + 2*x2 - 3*x3 + 4*x1*x2 - 5*x1*x3 + 6*x2*x3 + 10 + rnorm(n, 0, 0.5)

df3 <- data.frame (x1, x2, x3, y)
y.lm <- lm (y ~ x1*x2*x3, data = df3)
summary (y.lm)
## 
## Call:
## lm(formula = y ~ x1 * x2 * x3, data = df3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98313 -0.31526  0.01027  0.25425  1.33226 
## 
## Coefficients:
##               Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)  1.502e+01  2.018e+01     0.745  0.45846    
## x1           9.683e-01  1.150e-01     8.419 4.81e-13 ***
## x2           1.867e+00  5.856e-01     3.188  0.00196 ** 
## x3          -3.020e+00  3.501e-01    -8.628 1.75e-13 ***
## x1:x2        4.001e+00  3.345e-03  1196.078  < 2e-16 ***
## x1:x3       -5.000e+00  2.005e-03 -2493.200  < 2e-16 ***
## x2:x3        6.001e+00  1.019e-02   588.759  < 2e-16 ***
## x1:x2:x3    -6.447e-06  5.856e-05    -0.110  0.91259    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5163 on 92 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.097e+10 on 7 and 92 DF,  p-value: < 2.2e-16
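The relationship between the * and : operators can be illustrated with the df3 data frame from above; x1*x2 is shorthand for the main effects plus their interaction:

```r
# These two calls specify the same model: main effects plus the interaction
m1 <- lm (y ~ x1*x2, data = df3)
m2 <- lm (y ~ x1 + x2 + x1:x2, data = df3)

# Only one selected interaction, without x1:x3, x2:x3 or x1:x2:x3
m3 <- lm (y ~ x1 + x2 + x3 + x1:x2, data = df3)
```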


Logistic regression

Logistic regression is similar to linear regression, but here the outcome variable \(Y\) has a two-point (binary) distribution, e.g. success or failure. Because linear regression assumes a normally distributed outcome, we must apply a transformation called the logistic (or logit) transformation. Instead of predicting one of the two categories directly (e.g. sick = 1, healthy = 0), we model the probability \(p\) that an individual falls into one of the two categories. The probability \(p\), however, is restricted to the range \(\langle 0;1 \rangle\), which is still inconsistent with the assumptions of linear regression. If we replace the probability with the odds \(O = p/(1-p)\), defined as the ratio of the probability of success to the probability of failure, the range of values changes to \(\langle 0; +\infty )\), and after taking the logarithm to \((-\infty; +\infty )\). The outcome variable \(Z\) is the so-called logit, i.e. the natural logarithm of the odds of the event occurring:

\(Z = \ln(\frac{p}{1-p}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n\),

where each coefficient \(\beta_i\) determines the additive effect that a unit change in the variable \(x_i\) has on the log odds.

The inverse function is given by:

\(p = \frac{e^Z}{1+e^Z} = \frac{1}{1+e^{-Z}}\).
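The transformation and its inverse can be written as plain R functions (a sketch; in practice glm() handles this internally through the link function):

```r
# Logit and its inverse (the inverse is also available as plogis() in base R)
logit     <- function (p) log (p/(1-p))
inv.logit <- function (z) 1/(1+exp(-z))

inv.logit (logit (0.75))   # recovers 0.75
```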

In R, logistic regression can be performed using the more general glm() function (Generalized Linear Models). When calling the function, the parameter family = binomial must be provided, and the outcome variable can be passed in one of three ways:

heart <- data.frame (heart.attack = factor (sample (c ("yes","no"), 1000, replace = TRUE)),
                     gender = factor (sample (c ("male","female"), 1000, replace = TRUE)),
                     smoking = factor (sample (c ("yes","no"), 1000, replace = TRUE)),
                     hypertension = factor (sample (c ("yes","no"), 1000, replace = TRUE)))

heart.lr1 <- glm (heart.attack ~ gender + smoking + hypertension, data = heart, family = binomial)
heart.lr1
## 
## Call:  glm(formula = heart.attack ~ gender + smoking + hypertension, 
##     family = binomial, data = heart)
## 
## Coefficients:
##     (Intercept)       gendermale       smokingyes  hypertensionyes  
##       -0.023477        -0.083875         0.003454        -0.065942  
## 
## Degrees of Freedom: 999 Total (i.e. Null);  996 Residual
## Null Deviance:       1384 
## Residual Deviance: 1383  AIC: 1391
library (tidyr)
heart %>%
  table() %>% 
  as.data.frame() %>% 
  pivot_wider(names_from = heart.attack, values_from = Freq) -> heart.wide
heart.wide
## # A tibble: 8 × 5
##   gender smoking hypertension    no   yes
##   <fct>  <fct>   <fct>        <int> <int>
## 1 female no      no              63    51
## 2 male   no      no              73    67
## 3 female yes     no              62    72
## 4 male   yes     no              60    52
## 5 female no      yes             72    66
## 6 male   no      yes             52    52
## 7 female yes     yes             63    57
## 8 male   yes     yes             79    59
heart.lr2 <- glm (cbind (yes,no) ~ gender + smoking + hypertension, data = heart.wide, family = binomial)
heart.lr2
## 
## Call:  glm(formula = cbind(yes, no) ~ gender + smoking + hypertension, 
##     family = binomial, data = heart.wide)
## 
## Coefficients:
##     (Intercept)       gendermale       smokingyes  hypertensionyes  
##       -0.023477        -0.083875         0.003454        -0.065942  
## 
## Degrees of Freedom: 7 Total (i.e. Null);  4 Residual
## Null Deviance:       4.007 
## Residual Deviance: 3.311     AIC: 53.49
heart.wide %<>% mutate (p = yes/(yes+no))
heart.lr3 <- glm (p ~ gender + smoking + hypertension, weights = yes+no, data = heart.wide, family = binomial)
heart.lr3
## 
## Call:  glm(formula = p ~ gender + smoking + hypertension, family = binomial, 
##     data = heart.wide, weights = yes + no)
## 
## Coefficients:
##     (Intercept)       gendermale       smokingyes  hypertensionyes  
##       -0.023477        -0.083875         0.003454        -0.065942  
## 
## Degrees of Freedom: 7 Total (i.e. Null);  4 Residual
## Null Deviance:       4.007 
## Residual Deviance: 3.311     AIC: 53.49

In the example above all explanatory variables are discrete, but in general they can be both continuous and discrete, as in the following example.

# Let us consider a dataset containing information about candidates for second-cycle studies. 
# It contains the following columns:
#
# admit - binary variable telling whether a student was admitted or not,
# gre - graduate record exam - result of the final examination at the end of first-cycle studies,
# gpa - grade point average - average grade from first-cycle studies,
# rank - a qualitative variable with values from 1 to 4, determines the prestige of the completed school, 
#        the lower the better

studies <- read.csv("studia.csv")
studies %<>%
  mutate (admit = factor (admit),
          rank = factor (rank))
summary (studies)
##  admit        gre             gpa        rank   
##  0:273   Min.   :220.0   Min.   :2.260   1: 61  
##  1:127   1st Qu.:520.0   1st Qu.:3.130   2:151  
##          Median :580.0   Median :3.395   3:121  
##          Mean   :587.7   Mean   :3.390   4: 67  
##          3rd Qu.:660.0   3rd Qu.:3.670          
##          Max.   :800.0   Max.   :4.000
studies.lr <- glm (admit ~ gre+gpa+rank, data=studies, family=binomial)
summary (studies.lr)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = binomial, data = studies)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

According to the summary shown above, the logistic regression model can be written as:

\(\ln(\frac{p}{1-p}) = -3.990 + 0.002 x_{gre} + 0.804 x_{gpa} - 0.675 x_{rank2} - 1.340 x_{rank3} - 1.551 x_{rank4}\),

where \(x_{rank2}\), \(x_{rank3}\) and \(x_{rank4}\) are indicator (dummy) variables equal to \(1\) when the candidate graduated from a school of the corresponding rank and \(0\) otherwise.

What may be puzzling is the lack of a coefficient for the \(x_{rank1}\) variable in the above formula. It does not appear explicitly because rank \(1\) is the reference level for all other variants of the rank variable and is absorbed into the intercept \(\alpha = -3.990\). For example, the coefficient \(-0.675\) at \(x_{rank2}\) means that the ratio of the odds of success of a person from a university with rank \(2\) to the odds of a person from a university with rank \(1\) is \(\exp(-0.675) = 0.51\). Equivalently, the odds of success of a person from a rank-\(1\) university are \(1/\exp(-0.675) = 1.96\) times greater than those of a person from a rank-\(2\) university. The log odds of success for a person from a rank-\(1\) university would be directly visible if rank were the only explanatory variable in the model; it would then equal the intercept \(\alpha\), as shown for the UCBAdmissions dataset later in this manual.
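This can also be seen with a rank-only model on the studies data frame defined above (a sketch; rank.lr is a hypothetical variable name):

```r
# With rank as the only predictor, the intercept is the log odds of
# admission for rank-1 candidates (the reference level)
rank.lr <- glm (admit ~ rank, data = studies, family = binomial)
exp (rank.lr$coefficients["(Intercept)"])   # odds of admission for rank 1
```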

The coefficient of the variable \(x_{gpa}\) is \(0.804\), which means that increasing \(x_{gpa}\) by one increases the log odds by \(0.804\), i.e. multiplies the odds of getting into college by \(2.23\) (\(\exp (0.804) = 2.23\)).

The predict() function provides the probability of success for individual candidates; it can be obtained in two ways.

# Method 1
studies$logit <- predict (studies.lr)
studies$p <- 1/(1+exp(-studies$logit))

# Method 2
studies$p2 <- predict (studies.lr, type = "response")

head (studies)
##   admit gre  gpa rank      logit         p        p2
## 1     0 380 3.61    3 -1.5671256 0.1726265 0.1726265
## 2     1 660 3.67    3 -0.8848442 0.2921750 0.2921750
## 3     1 800 4.00    1  1.0377118 0.7384082 0.7384082
## 4     1 640 3.19    4 -1.5273305 0.1783846 0.1783846
## 5     0 520 2.93    4 -2.0081113 0.1183539 0.1183539
## 6     1 760 3.00    2 -0.5323458 0.3699699 0.3699699

Knowing the probabilities, we can easily calculate the odds and verify whether our interpretation of the coefficient of the \(x_{gpa}\) variable is correct. The most common exam (gre) score is \(620\); \(13\) of the candidates with this score graduated from a university with rank \(2\).

studies %>%
  filter (gre==620, rank==2) %>%
  mutate (gpa.diff = gpa - min (gpa),
          exp.beta = exp (studies.lr$coefficients["gpa"]*gpa.diff)) -> studies2
studies2
##    admit gre  gpa rank       logit         p        p2 gpa.diff exp.beta
## 1      1 620 3.18    2 -0.70463861 0.3307846 0.3307846     0.33 1.303864
## 2      0 620 3.07    2 -0.79308274 0.3115071 0.3115071     0.22 1.193498
## 3      1 620 3.17    2 -0.71267898 0.3290072 0.3290072     0.32 1.293423
## 4      0 620 3.40    2 -0.52775035 0.3710417 0.3710417     0.55 1.556159
## 5      1 620 3.37    2 -0.55187147 0.3654303 0.3654303     0.52 1.519072
## 6      0 620 3.58    2 -0.38302359 0.4053979 0.4053979     0.73 1.798490
## 7      0 620 3.05    2 -0.80916349 0.3080688 0.3080688     0.20 1.174459
## 8      0 620 3.22    2 -0.67247711 0.3379424 0.3379424     0.37 1.346480
## 9      1 620 3.45    2 -0.48754847 0.3804713 0.3804713     0.60 1.619994
## 10     0 620 2.85    2 -0.96997100 0.2748863 0.2748863     0.00 1.000000
## 11     0 620 3.63    2 -0.34282171 0.4151242 0.4151242     0.78 1.872266
## 12     1 620 3.75    2 -0.24633720 0.4387252 0.4387252     0.90 2.061912
## 13     0 620 4.00    2 -0.04532782 0.4886700 0.4886700     1.15 2.520969
odds <- studies2$p/(1-studies2$p) # Compute the odds of students
odds/odds[which(studies2$gpa.diff==0)] # Compute the odds ratio
##  [1] 1.303864 1.193498 1.293423 1.556159 1.519072 1.798490 1.174459 1.346480
##  [9] 1.619994 1.000000 1.872266 2.061912 2.520969

Using the predict() function, we can compute admission probabilities for new candidates. To do this, we create a new data frame.

new.students <- data.frame (gre = mean(studies$gre), gpa = mean(studies$gpa), rank = factor(1:4))
new.students$p <- predict (studies.lr, newdata = new.students, type = "response")
new.students
##     gre    gpa rank         p
## 1 587.7 3.3899    1 0.5166016
## 2 587.7 3.3899    2 0.3522846
## 3 587.7 3.3899    3 0.2186120
## 4 587.7 3.3899    4 0.1846684


For explanatory variables on a dichotomous scale (e.g. gender), \(e^\beta\) estimates the odds ratio of, say, men compared to women.

# We have the following data frame containing information about the results of a speed check on a road 
# where the speed limit was 90 km/h.
speed.check <- data.frame (gender = rep ( c ("M","F"), c (60,40)),
                           speed = c (runif (60, 60, 120), runif (40, 50, 100)))

# Creates the logical vector with the information whether the speed limit was exceeded
speed.check$exceed <- speed.check$speed>90
head(speed.check)
##   gender     speed exceed
## 1      M  96.96644   TRUE
## 2      M  74.65544  FALSE
## 3      M 103.78498   TRUE
## 4      M  91.42864   TRUE
## 5      M 104.71827   TRUE
## 6      M  70.65747  FALSE
# Contingency table
tab <- table (speed.check$gender, speed.check$exceed)
tab
##    
##     FALSE TRUE
##   F    30   10
##   M    30   30
tab/sum(tab)
##    
##     FALSE TRUE
##   F   0.3  0.1
##   M   0.3  0.3
# Chi^2 test
chisq.test (tab)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab
## X-squared = 5.2517, df = 1, p-value = 0.02192
# Logistic regression
speed.check.lr <- glm (exceed ~ gender, data=speed.check, family=binomial)
summary(speed.check.lr)
## 
## Call:
## glm(formula = exceed ~ gender, family = binomial, data = speed.check)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -1.0986     0.3651  -3.009  0.00262 **
## genderM       1.0986     0.4472   2.457  0.01403 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 134.60  on 99  degrees of freedom
## Residual deviance: 128.16  on 98  degrees of freedom
## AIC: 132.16
## 
## Number of Fisher Scoring iterations: 4
# Calculates the probabilities and odds that a woman will exceed the speed limit
p.f <- tab["F","TRUE"]/sum(tab["F",])
p.f
## [1] 0.25
o.f <- p.f/(1-p.f)
o.f
## [1] 0.3333333
# Calculates the probabilities and odds that a man will exceed the speed limit
p.m <- tab["M","TRUE"]/sum(tab["M",])
p.m
## [1] 0.5
o.m <- p.m/(1-p.m)
o.m
## [1] 1
# Odds ratio
o.m/o.f
## [1] 3
# Exponent of logistic regression coefficients
exp (speed.check.lr$coefficients)
## (Intercept)     genderM 
##   0.3333333   3.0000000


The UCBAdmissions dataset from the datasets package contains information about admissions to the University of California, Berkeley. We will check how the probability of admission depends on the department to which the candidate applied.

UCBAdmissions
## , , Dept = A
## 
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
## 
## , , Dept = B
## 
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8
## 
## , , Dept = C
## 
##           Gender
## Admit      Male Female
##   Admitted  120    202
##   Rejected  205    391
## 
## , , Dept = D
## 
##           Gender
## Admit      Male Female
##   Admitted  138    131
##   Rejected  279    244
## 
## , , Dept = E
## 
##           Gender
## Admit      Male Female
##   Admitted   53     94
##   Rejected  138    299
## 
## , , Dept = F
## 
##           Gender
## Admit      Male Female
##   Admitted   22     24
##   Rejected  351    317
# Transform array into data frame
admissions <- as.data.frame (UCBAdmissions)
head(admissions)
##      Admit Gender Dept Freq
## 1 Admitted   Male    A  512
## 2 Rejected   Male    A  313
## 3 Admitted Female    A   89
## 4 Rejected Female    A   19
## 5 Admitted   Male    B  353
## 6 Rejected   Male    B  207
# Transform table from long to wide
admissions %<>% pivot_wider (names_from = "Admit", values_from = "Freq")

# Ignore the Gender variable
admissions %<>% group_by (Dept) %>% summarise (admitted = sum (Admitted), rejected = sum (Rejected))

# Compute probabilities and odds
admissions %<>% mutate (p = admitted/(admitted+rejected)) %>% mutate (o = p/(1-p))
admissions
## # A tibble: 6 × 5
##   Dept  admitted rejected      p      o
##   <fct>    <dbl>    <dbl>  <dbl>  <dbl>
## 1 A          601      332 0.644  1.81  
## 2 B          370      215 0.632  1.72  
## 3 C          322      596 0.351  0.540 
## 4 D          269      523 0.340  0.514 
## 5 E          147      437 0.252  0.336 
## 6 F           46      668 0.0644 0.0689
# Perform logistic regression
admissions.lr <- glm (cbind (admitted,rejected) ~ Dept, data = admissions, family = binomial)
summary (admissions.lr)
## 
## Call:
## glm(formula = cbind(admitted, rejected) ~ Dept, family = binomial, 
##     data = admissions)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.59346    0.06838   8.679   <2e-16 ***
## DeptB       -0.05059    0.10968  -0.461    0.645    
## DeptC       -1.20915    0.09726 -12.432   <2e-16 ***
## DeptD       -1.25833    0.10152 -12.395   <2e-16 ***
## DeptE       -1.68296    0.11733 -14.343   <2e-16 ***
## DeptF       -3.26911    0.16707 -19.567   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.5532e+02  on 5  degrees of freedom
## Residual deviance: 3.3595e-13  on 0  degrees of freedom
## AIC: 52.298
## 
## Number of Fisher Scoring iterations: 3
exp (admissions.lr$coefficients)
## (Intercept)       DeptB       DeptC       DeptD       DeptE       DeptF 
##  1.81024096  0.95066362  0.29845113  0.28412811  0.18582302  0.03804039
admissions$o/admissions$o[1]
## [1] 1.00000000 0.95066362 0.29845113 0.28412811 0.18582302 0.03804039


Let us consider the Titanic dataset from the datasets package. It is a 4-dimensional table containing information about 2,201 people aboard the ill-fated liner.

Titanic
## , , Age = Child, Survived = No
## 
##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
## 
## , , Age = Adult, Survived = No
## 
##       Sex
## Class  Male Female
##   1st   118      4
##   2nd   154     13
##   3rd   387     89
##   Crew  670      3
## 
## , , Age = Child, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st     5      1
##   2nd    11     13
##   3rd    13     14
##   Crew    0      0
## 
## , , Age = Adult, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st    57    140
##   2nd    14     80
##   3rd    75     76
##   Crew  192     20
# Transform array into data frame
titanic <- as.data.frame(Titanic)
head(titanic)
##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0
# Transform table from long to wide
titanic %<>% pivot_wider(names_from = "Survived", values_from = "Freq")
names(titanic)[4:5] <- c("Died","Survived")
head(titanic)
## # A tibble: 6 × 5
##   Class Sex    Age    Died Survived
##   <fct> <fct>  <fct> <dbl>    <dbl>
## 1 1st   Male   Child     0        5
## 2 2nd   Male   Child     0       11
## 3 3rd   Male   Child    35       13
## 4 Crew  Male   Child     0        0
## 5 1st   Female Child     0        1
## 6 2nd   Female Child     0       13
# Perform logistic regression
titanic.lr <- glm (cbind (Survived,Died) ~ Class+Sex+Age, data = titanic, family = binomial)
summary(titanic.lr)
## 
## Call:
## glm(formula = cbind(Survived, Died) ~ Class + Sex + Age, family = binomial, 
##     data = titanic)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.6853     0.2730   2.510   0.0121 *  
## Class2nd     -1.0181     0.1960  -5.194 2.05e-07 ***
## Class3rd     -1.7778     0.1716 -10.362  < 2e-16 ***
## ClassCrew    -0.8577     0.1573  -5.451 5.00e-08 ***
## SexFemale     2.4201     0.1404  17.236  < 2e-16 ***
## AgeAdult     -1.0615     0.2440  -4.350 1.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 671.96  on 13  degrees of freedom
## Residual deviance: 112.57  on  8  degrees of freedom
## AIC: 171.19
## 
## Number of Fisher Scoring iterations: 5

The analysis shows that all variables are statistically significant. The results indicate that the odds of survival for women on board were \(11.25\) times \((\exp(2.42)=11.25)\) higher than for men, and that third-class passengers had \(5.92\) times \((1/\exp(-1.78)=5.92)\) lower odds of survival than first-class passengers. Moreover, the logistic regression model is multiplicative on the odds scale, so a man from third class had \(11.25 \cdot 5.92 = 66.6\) times lower odds of survival than a woman from first class.
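The multiplicative reading can be checked directly from the fitted coefficients of the titanic.lr model from above (a sketch):

```r
# Odds ratio of survival: first-class woman vs third-class man
cf <- titanic.lr$coefficients
exp (cf["SexFemale"] - cf["Class3rd"])   # approx. 11.25 * 5.92 = 66.6
```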


Poisson regression

If the outcome variable \(Y\) is the frequency of an event (a rate) or simply a count over some period, we can use Poisson regression to model its dependence on the explanatory variables \(x_i\). The name of this regression comes from the assumption that \(Y\) follows the Poisson distribution \(P(Y=y) = \lambda^y e^{-\lambda}/y!\). An example is the frequency of epileptic seizures as a function of various factors (comorbidities, diet, particular medications). The Poisson regression model takes a form similar to the logistic regression model:

\(\ln(y) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n\),

where \(y\) is the expected frequency or number of occurrences for individuals with a given set of values \(x_1, \dots, x_n\), and \(\alpha, \beta_1, \dots, \beta_n\) are the regression coefficients. The exponents of the individual coefficients are the estimated relative frequencies (rate ratios) associated with the corresponding variables. With only one explanatory variable we have:

\(y = e^\alpha \cdot e^{\beta x_1}\)
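Exponentiating makes the multiplicative structure explicit: increasing \(x_1\) by one unit multiplies the expected count by \(e^{\beta_1}\). (Note also that the Poisson distribution has \(E[Y] = \mathrm{Var}[Y] = \lambda\), so the model implicitly assumes the variance of the counts equals their mean.)

```latex
\frac{y(x_1+1)}{y(x_1)}
  = \frac{e^{\alpha}\, e^{\beta_1 (x_1+1)}}{e^{\alpha}\, e^{\beta_1 x_1}}
  = e^{\beta_1}
```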

In R, Poisson regression can be performed using the glm() (Generalized Linear Models) function with the family = poisson argument. As an example, let us consider the warpbreaks dataset, which contains the number of warp breaks per loom depending on the wool type and the tension.

warp <- warpbreaks
warp
##    breaks wool tension
## 1      26    A       L
## 2      30    A       L
## 3      54    A       L
## 4      25    A       L
## 5      70    A       L
## 6      52    A       L
## 7      51    A       L
## 8      26    A       L
## 9      67    A       L
## 10     18    A       M
## 11     21    A       M
## 12     29    A       M
## 13     17    A       M
## 14     12    A       M
## 15     18    A       M
## 16     35    A       M
## 17     30    A       M
## 18     36    A       M
## 19     36    A       H
## 20     21    A       H
## 21     24    A       H
## 22     18    A       H
## 23     10    A       H
## 24     43    A       H
## 25     28    A       H
## 26     15    A       H
## 27     26    A       H
## 28     27    B       L
## 29     14    B       L
## 30     29    B       L
## 31     19    B       L
## 32     29    B       L
## 33     31    B       L
## 34     41    B       L
## 35     20    B       L
## 36     44    B       L
## 37     42    B       M
## 38     26    B       M
## 39     19    B       M
## 40     16    B       M
## 41     39    B       M
## 42     28    B       M
## 43     21    B       M
## 44     39    B       M
## 45     29    B       M
## 46     20    B       H
## 47     21    B       H
## 48     24    B       H
## 49     17    B       H
## 50     13    B       H
## 51     15    B       H
## 52     15    B       H
## 53     16    B       H
## 54     28    B       H
g <- ggplot(warp)
g + 
  geom_histogram (aes (x = breaks, y = after_stat(density)), breaks = seq (0,70,10), alpha = 0.5) +
  geom_density (aes (x=breaks), color = "red", linewidth = 1.5)

# Poisson regression
warp.pr <- glm (breaks ~ wool+tension, data = warp, family = poisson)
summary (warp.pr)
## 
## Call:
## glm(formula = breaks ~ wool + tension, family = poisson, data = warp)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
## woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
## tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
## tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 297.37  on 53  degrees of freedom
## Residual deviance: 210.39  on 50  degrees of freedom
## AIC: 493.06
## 
## Number of Fisher Scoring iterations: 4

The above results indicate that the expected number of warp breaks for wool type B is 0.814 times \((e^{-0.206}=0.814)\) that for wool type A. In other words, changing the wool from A to B reduces the expected number of breaks by about 18.6%. Let’s check the predictive effectiveness of the model we built.
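The rate ratio quoted above can be pulled directly from the fitted model object. A short sketch (refitting the model so the snippet is self-contained):

```r
# Fit the Poisson model on the built-in warpbreaks dataset
warp.pr <- glm (breaks ~ wool + tension, data = warpbreaks, family = poisson)

# Exponentiated coefficient = rate ratio for wool B relative to wool A
rr <- exp (coef (warp.pr)["woolB"])
round (rr, 3)              # about 0.814
round (100 * (1 - rr), 1)  # percentage reduction, about 18.6
```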

warp$predict <- predict (warp.pr, type="response")
warp %>%
  group_by (wool, tension) %>%
  summarise (breaks.mean = mean (breaks),
             breaks.sderr = sd (breaks)/sqrt(n()),
             predict = mean (predict)) -> warp.summary
## `summarise()` has grouped output by 'wool'. You can override using the
## `.groups` argument.
warp.summary
## # A tibble: 6 × 5
## # Groups:   wool [2]
##   wool  tension breaks.mean breaks.sderr predict
##   <fct> <fct>         <dbl>        <dbl>   <dbl>
## 1 A     L              44.6         6.03    40.1
## 2 A     M              24           2.89    29.1
## 3 A     H              24.6         3.42    23.9
## 4 B     L              28.2         3.29    32.7
## 5 B     M              28.8         3.14    23.7
## 6 B     H              18.8         1.63    19.4
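The fitted means in the predict column can be reproduced by hand from the coefficients: the reference level (wool A, tension L) corresponds to the intercept alone, and the other cells add the relevant coefficients before exponentiating. For example:

```latex
\begin{aligned}
\hat{y}_{A,L} &= e^{3.69196} \approx 40.1 \\
\hat{y}_{A,M} &= e^{3.69196 - 0.32132} \approx 29.1 \\
\hat{y}_{B,H} &= e^{3.69196 - 0.20599 - 0.51849} \approx 19.4
\end{aligned}
```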


The second example of using Poisson regression is modeling the number of plant species on the Galapagos Islands as a function of various factors. The gala dataset is part of the faraway package, which must be installed and loaded before the data can be accessed.

library(faraway)
data (gala)
head (gala)
##              Species Endemics  Area Elevation Nearest Scruz Adjacent
## Baltra            58       23 25.09       346     0.6   0.6     1.84
## Bartolome         31       21  1.24       109     0.6  26.3   572.33
## Caldwell           3        3  0.21       114     2.8  58.7     0.78
## Champion          25        9  0.10        46     1.9  47.4     0.18
## Coamano            2        1  0.05        77     1.9   1.9   903.82
## Daphne.Major      18       11  0.34       119     8.0   8.0     1.84
summary (gala)
##     Species          Endemics          Area             Elevation      
##  Min.   :  2.00   Min.   : 0.00   Min.   :   0.0100   Min.   :  25.00  
##  1st Qu.: 13.00   1st Qu.: 7.25   1st Qu.:   0.2575   1st Qu.:  97.75  
##  Median : 42.00   Median :18.00   Median :   2.5900   Median : 192.00  
##  Mean   : 85.23   Mean   :26.10   Mean   : 261.7087   Mean   : 368.03  
##  3rd Qu.: 96.00   3rd Qu.:32.25   3rd Qu.:  59.2375   3rd Qu.: 435.25  
##  Max.   :444.00   Max.   :95.00   Max.   :4669.3200   Max.   :1707.00  
##     Nearest          Scruz           Adjacent      
##  Min.   : 0.20   Min.   :  0.00   Min.   :   0.03  
##  1st Qu.: 0.80   1st Qu.: 11.03   1st Qu.:   0.52  
##  Median : 3.05   Median : 46.65   Median :   2.59  
##  Mean   :10.06   Mean   : 56.98   Mean   : 261.10  
##  3rd Qu.:10.03   3rd Qu.: 81.08   3rd Qu.:  59.24  
##  Max.   :47.40   Max.   :290.20   Max.   :4669.32
g <- ggplot(gala)

g + geom_histogram (aes (x = Species, y = after_stat (density)), breaks = seq (0,450,50), alpha=0.5) +
  geom_density (aes (x = Species), color = "red", linewidth = 1.5)

gala.pr <- glm (Species ~ ., data = gala, family = poisson)
summary (gala.pr)
## 
## Call:
## glm(formula = Species ~ ., family = poisson, data = gala)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.828e+00  5.958e-02  47.471  < 2e-16 ***
## Endemics     3.388e-02  1.741e-03  19.459  < 2e-16 ***
## Area        -1.067e-04  3.741e-05  -2.853  0.00433 ** 
## Elevation    2.638e-04  1.934e-04   1.364  0.17264    
## Nearest      1.048e-02  1.611e-03   6.502 7.91e-11 ***
## Scruz       -6.835e-04  5.802e-04  -1.178  0.23877    
## Adjacent     4.539e-05  4.800e-05   0.946  0.34437    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  313.36  on 23  degrees of freedom
## AIC: 488.19
## 
## Number of Fisher Scoring iterations: 5

The results obtained using Poisson regression suggest that the height of the highest point on the island (Elevation), the distance from Santa Cruz Island (Scruz) and the area of the adjacent island (Adjacent) have no statistically significant influence on the number of species on an island. We can rebuild the model without these variables.

gala.pr2 <- glm (Species ~ Endemics+Area+Nearest, data = gala, family = poisson)
summary (gala.pr2)
## 
## Call:
## glm(formula = Species ~ Endemics + Area + Nearest, family = poisson, 
##     data = gala)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.868e+00  4.883e-02  58.729  < 2e-16 ***
## Endemics     3.551e-02  7.327e-04  48.462  < 2e-16 ***
## Area        -4.542e-05  1.568e-05  -2.896  0.00378 ** 
## Nearest      9.289e-03  1.319e-03   7.043 1.88e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  330.84  on 26  degrees of freedom
## AIC: 499.67
## 
## Number of Fisher Scoring iterations: 5
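On the exponential scale, each additional endemic species multiplies the expected total species count by about 1.036 (an increase of roughly 3.6%), and each additional kilometre of distance to the nearest island multiplies it by about 1.009:

```latex
e^{0.03551} \approx 1.036, \qquad e^{0.009289} \approx 1.009
```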
gala$predict <- predict (gala.pr2, type = "response")
gala %<>%   # %<>% (from magrittr) pipes gala through and assigns the result back
  mutate (rel.err = (predict - Species) / Species)
gala
##              Species Endemics    Area Elevation Nearest Scruz Adjacent
## Baltra            58       23   25.09       346     0.6   0.6     1.84
## Bartolome         31       21    1.24       109     0.6  26.3   572.33
## Caldwell           3        3    0.21       114     2.8  58.7     0.78
## Champion          25        9    0.10        46     1.9  47.4     0.18
## Coamano            2        1    0.05        77     1.9   1.9   903.82
## Daphne.Major      18       11    0.34       119     8.0   8.0     1.84
## Daphne.Minor      24        0    0.08        93     6.0  12.0     0.34
## Darwin            10        7    2.33       168    34.1 290.2     2.85
## Eden               8        4    0.03        71     0.4   0.4    17.95
## Enderby            2        2    0.18       112     2.6  50.2     0.10
## Espanola          97       26   58.27       198     1.1  88.3     0.57
## Fernandina        93       35  634.49      1494     4.3  95.3  4669.32
## Gardner1          58       17    0.57        49     1.1  93.1    58.27
## Gardner2           5        4    0.78       227     4.6  62.2     0.21
## Genovesa          40       19   17.35        76    47.4  92.2   129.49
## Isabela          347       89 4669.32      1707     0.7  28.1   634.49
## Marchena          51       23  129.49       343    29.1  85.9    59.56
## Onslow             2        2    0.01        25     3.3  45.9     0.10
## Pinta            104       37   59.56       777    29.1 119.6   129.49
## Pinzon           108       33   17.95       458    10.7  10.7     0.03
## Las.Plazas        12        9    0.23        94     0.5   0.6    25.09
## Rabida            70       30    4.89       367     4.4  24.4   572.33
## SanCristobal     280       65  551.62       716    45.2  66.6     0.57
## SanSalvador      237       81  572.33       906     0.2  19.8     4.89
## SantaCruz        444       95  903.82       864     0.6   0.0     0.52
## SantaFe           62       28   24.08       259    16.5  16.5     0.52
## SantaMaria       285       73  170.92       640     2.6  49.2     0.10
## Seymour           44       16    1.84       147     0.6   9.6    25.09
## Tortuga           16        8    1.24       186     6.8  50.9    17.95
## Wolf              21       12    2.85       253    34.1 254.7     2.33
##                predict     rel.err
## Baltra        40.00437 -0.31026951
## Bartolome     37.30235  0.20330150
## Caldwell      20.09352  5.69783856
## Champion      24.65770 -0.01369189
## Coamano       18.56038  8.28018851
## Daphne.Major  28.01541  0.55641144
## Daphne.Minor  18.60821 -0.22465772
## Darwin        30.97144  2.09714399
## Eden          20.36098  1.54512296
## Enderby       19.35659  8.67829549
## Espanola      44.64085 -0.53978500
## Fernandina    61.66847 -0.33689815
## Gardner1      32.51494 -0.43939759
## Gardner2      21.17029  3.23405714
## Genovesa      53.62501  0.34062531
## Isabela      337.81672 -0.02646479
## Marchena      51.88202  0.01729450
## Onslow        19.48301  8.74150423
## Pinta         85.56342 -0.17727482
## Pinzon        62.69026 -0.41953465
## Las.Plazas    24.33899  1.02824875
## Rabida        53.18400 -0.24022861
## SanCristobal 262.61134 -0.06210236
## SanSalvador  304.86604  0.28635460
## SantaCruz    495.53694  0.11607418
## SantaFe       55.38243 -0.10673495
## SantaMaria   238.96896 -0.16151242
## Seymour       31.23346 -0.29014869
## Tortuga       24.90439  0.55652408
## Wolf          36.98754  0.76131121