DATA ANALYSIS AND VISUALIZATION IN R, WINTER 2025 EDITION



Pearson and Spearman correlations

A popular measure of the relationship between two variables is the Pearson correlation coefficient, defined at the population level for two random variables \(X\) and \(Y\) as

\(\rho_{XY} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}\),

while its sample estimator, for two \(n\)-element data series \(x_i\) and \(y_i\), is

\(r = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n(y_i - \bar{y})^2}}\).
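
This formula translates directly into R; as a quick sanity check on some invented numbers, the manual implementation below (the helper name pearson.manual is ours) should agree with R's built-in cor():

```r
# Manual implementation of the sample Pearson correlation coefficient
pearson.manual <- function (x, y) {
  sum ((x - mean (x)) * (y - mean (y))) /
    (sqrt (sum ((x - mean (x))^2)) * sqrt (sum ((y - mean (y))^2)))
}

x <- c(1.2, 3.4, 2.2, 5.6, 4.1)
y <- c(2.0, 3.9, 2.5, 6.1, 3.8)
all.equal (pearson.manual (x, y), cor (x, y))  # TRUE
```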

The coefficient can be easily calculated in R using the cor() function.

lb2kg <- 0.4536
gallons2liter <- 3.78541178
miles2km <- 1.609344

mass <- 1000*lb2kg*mtcars$wt # mass in kilograms
fuel.consum <- 100*gallons2liter/(mtcars$mpg*miles2km) # fuel consumption in L/100 km
plot (mass, fuel.consum)

cor (mass, fuel.consum)
## [1] 0.8898927

If we are dealing with a non-linear (but still monotonic) relationship, we use the Spearman coefficient, cor (..., method = "spearman"), which is in fact the Pearson coefficient computed on the data transformed into ranks.

x <- 2:100
cor (x, log (log (x)))
## [1] 0.8042407
cor (x, log (log (x)), method = "spearman")
## [1] 1
cor (mass, fuel.consum, method="spearman")
## [1] 0.886422
mass
##  [1] 1188.4320 1304.1000 1052.3520 1458.3240 1560.3840 1569.4560 1619.3520
##  [8] 1446.9840 1428.8400 1560.3840 1560.3840 1846.1520 1691.9280 1714.6080
## [15] 2381.4000 2460.3264 2424.4920  997.9200  732.5640  832.3560 1118.1240
## [22] 1596.6720 1558.1160 1741.8240 1744.0920  877.7160  970.7040  686.2968
## [29] 1437.9120 1256.4720 1619.3520 1261.0080
rank (mass)
##  [1]  9.0 12.0  7.0 16.0 19.0 21.0 23.5 15.0 13.0 19.0 19.0 29.0 25.0 26.0 30.0
## [16] 32.0 31.0  6.0  2.0  3.0  8.0 22.0 17.0 27.0 28.0  4.0  5.0  1.0 14.0 10.0
## [31] 23.5 11.0
fuel.consum
##  [1] 11.200694 11.200694 10.316429 10.991336 12.578320 12.995281 16.448572
##  [8]  9.639942 10.316429 12.250760 13.214302 14.342353 13.596219 15.474644
## [15] 22.616787 22.616787 16.000992  7.259709  7.737322  6.938483 10.940213
## [22] 15.175134 15.474644 17.685307 12.250760  8.615919  9.046715  7.737322
## [29] 14.886999 11.939827 15.680972 10.991336
rank (fuel.consum)
##  [1] 13.5 13.5  8.5 11.5 18.0 19.0 29.0  7.0  8.5 16.5 20.0 22.0 21.0 25.5 31.5
## [16] 31.5 28.0  2.0  3.5  1.0 10.0 24.0 25.5 30.0 16.5  5.0  6.0  3.5 23.0 15.0
## [31] 27.0 11.5
cor (rank (mass), rank (fuel.consum))
## [1] 0.886422

If we pass a data frame to the cor() function, the correlation coefficients between each pair of variables will be calculated and presented in the form of a matrix.

cor (mtcars)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
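
The matrix form is convenient for programmatic queries; for example, a small sketch for finding the most strongly correlated pair of distinct variables:

```r
cm <- cor (mtcars)
diag (cm) <- 0                      # zero out the trivial self-correlations
idx <- which (abs (cm) == max (abs (cm)), arr.ind = TRUE)[1, ]
cm[idx["row"], idx["col"]]          # the cyl-disp pair, r = 0.9020329
```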

It is also worth mentioning the corrplot package, which provides the corrplot() function for visualizing correlation matrices.

library (corrplot)
## corrplot 0.95 loaded
corrplot (cor (mtcars))

To perform a proper statistical test of the Pearson correlation, use the cor.test() function.

cor.test (mass, fuel.consum)
## 
##  Pearson's product-moment correlation
## 
## data:  mass and fuel.consum
## t = 10.685, df = 30, p-value = 9.566e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7846874 0.9452694
## sample estimates:
##       cor 
## 0.8898927

In the above case, the p-value of the test is very small, so we can reject the null hypothesis that the correlation between the variables is zero. A 95% confidence level does not imply a 95% probability that the true parameter lies within a particular calculated interval. The confidence level instead reflects the long-run reliability of the method used to generate the interval. In other words, if the same sampling procedure were repeated 100 times from the same distributions, approximately 95 of the resulting intervals would be expected to contain the true value of the Pearson correlation.
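
This long-run interpretation can be illustrated with a small simulation (a sketch; the true correlation of 0.8, the sample size, and the number of trials are arbitrary choices):

```r
set.seed (42)
rho <- 0.8; n <- 50; trials <- 1000
covered <- replicate (trials, {
  x <- rnorm (n)
  y <- rho * x + sqrt (1 - rho^2) * rnorm (n)  # cor(X, Y) = rho by construction
  ci <- cor.test (x, y)$conf.int
  ci[1] <= rho && rho <= ci[2]
})
mean (covered)  # fraction of intervals covering rho, close to 0.95
```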

It is important to remember that the Pearson correlation measures a linear relationship between two variables and is insensitive to non-monotonic relationships.

x <- seq(0,6,0.5)
y <- x^2-6*x+8
plot (x, y)

cor.test (x,y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 0, df = 11, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5509853  0.5509853
## sample estimates:
## cor 
##   0
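
Spearman's coefficient does not help here either, since the relationship is not monotonic: by the symmetry of the parabola, the rank correlation also vanishes.

```r
x <- seq (0, 6, 0.5)
y <- x^2 - 6*x + 8
cor (x, y, method = "spearman")  # zero, just like the Pearson coefficient
```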


Pearson’s \(\chi^2\) test for contingency tables

If we are dealing with two qualitative variables and want to check whether they are independent, we can perform Pearson's \(\chi^2\) test using the chisq.test() function introduced earlier.

gearbox <- factor (mtcars$am, labels = c("automatic","manual"))
cylinders <- factor (mtcars$cyl)
tab <- table (gearbox, cylinders)
tab
##            cylinders
## gearbox      4  6  8
##   automatic  3  4 12
##   manual     8  3  2
chisq.test (tab)
## Warning in chisq.test(tab): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 8.7407, df = 2, p-value = 0.01265
chisq.test (gearbox, cylinders)
## Warning in chisq.test(gearbox, cylinders): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  gearbox and cylinders
## X-squared = 8.7407, df = 2, p-value = 0.01265

A low p-value indicates that the null hypothesis of independence should be rejected. We can also inspect the matrix of expected counts (and probabilities) that would arise if the variables were independent, and check whether it equals the product of the marginal distributions.

tab.exp <- chisq.test (tab)$expected
## Warning in chisq.test(tab): Chi-squared approximation may be incorrect
tab.exp
##            cylinders
## gearbox           4       6      8
##   automatic 6.53125 4.15625 8.3125
##   manual    4.46875 2.84375 5.6875
tab.exp / sum( tab.exp)
##            cylinders
## gearbox             4          6         8
##   automatic 0.2041016 0.12988281 0.2597656
##   manual    0.1396484 0.08886719 0.1777344

The table above shows what the probability of each pair would be if the variables were independent, i.e. if the values co-occurred at random. These values can also be obtained from our original table by calculating the marginal distributions.

gearbox.marg <- rowSums (tab) / sum (tab)
# summary (gearbox)/length (gearbox)
gearbox.marg
## automatic    manual 
##   0.59375   0.40625
cylinders.marg <- colSums (tab) / sum (tab)
# summary (cylinders)/length (cylinders)
cylinders.marg
##       4       6       8 
## 0.34375 0.21875 0.43750
gearbox.marg %o% cylinders.marg
##                   4          6         8
## automatic 0.2041016 0.12988281 0.2597656
## manual    0.1396484 0.08886719 0.1777344

However, the actual probabilities, obtained by simply dividing the individual counts by the total number of cases, are clearly different, which is consistent with the test result.

tab / sum(tab)
##            cylinders
## gearbox           4       6       8
##   automatic 0.09375 0.12500 0.37500
##   manual    0.25000 0.09375 0.06250
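
The test statistic itself can be reproduced by hand from the observed and expected counts, \(X^2 = \sum (O - E)^2 / E\):

```r
gearbox <- factor (mtcars$am, labels = c("automatic","manual"))
cylinders <- factor (mtcars$cyl)
tab <- table (gearbox, cylinders)
tab.exp <- rowSums (tab) %o% colSums (tab) / sum (tab)  # expected counts under independence
sum ((tab - tab.exp)^2 / tab.exp)  # matches X-squared = 8.7407 above
```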


Time-Series objects

Any set of measurements taken at regular intervals can be treated as a time series. R provides a dedicated object class, ts, for storing and operating on time series. The most important arguments of the ts (data, frequency, start, ...) constructor are:

- data — a vector (or matrix) of observations,
- frequency — the number of observations per unit of time (e.g. 12 for monthly data with a yearly cycle),
- start — the time of the first observation, e.g. c(1946, 1) for January 1946.
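
Before turning to real data, a minimal self-contained illustration (the numbers are invented): eight quarterly observations starting in the third quarter of 2023, from which a single year can be extracted with window().

```r
# Synthetic quarterly data: frequency 4, first observation in Q3 2023
sales <- ts (c(10, 12, 9, 14, 11, 13, 10, 15), frequency = 4, start = c(2023, 3))
frequency (sales)                                # 4
window (sales, start = c(2024, 1), end = c(2024, 4))  # the four 2024 observations
```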

# Number of births in New York from January 1946 to December 1959
nybirths <- read.table ("nybirths.dat")
nybirths.ts <- ts (nybirths, frequency = 12, start = c(1946,1))
nybirths.ts
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 1946 26663 23598 26931 24740 25806 24364 24477 23901 23175 23227 21672 21870
## 1947 21439 21089 23709 21669 21752 20761 23479 23824 23105 23110 21759 22073
## 1948 21937 20035 23590 21672 22222 22123 23950 23504 22238 23142 21059 21573
## 1949 21548 20000 22424 20615 21761 22874 24104 23748 23262 22907 21519 22025
## 1950 22604 20894 24677 23673 25320 23583 24671 24454 24122 24252 22084 22991
## 1951 23287 23049 25076 24037 24430 24667 26451 25618 25014 25110 22964 23981
## 1952 23798 22270 24775 22646 23988 24737 26276 25816 25210 25199 23162 24707
## 1953 24364 22644 25565 24062 25431 24635 27009 26606 26268 26462 25246 25180
## 1954 24657 23304 26982 26199 27210 26122 26706 26878 26152 26379 24712 25688
## 1955 24990 24239 26721 23475 24767 26219 28361 28599 27914 27784 25693 26881
## 1956 26217 24218 27914 26975 28527 27139 28982 28169 28056 29136 26291 26987
## 1957 26589 24848 27543 26896 28878 27390 28065 28141 29048 28484 26634 27735
## 1958 27132 24924 28963 26589 27931 28009 29229 28759 28405 27945 25912 26619
## 1959 26076 25286 27660 25951 26398 25565 28865 30000 29261 29012 26992 27897
# Data downloaded in 15-minute time windows from Twitter during the 2012 London Olympics
twitter <- read.table ("cyber.dat", header = TRUE)
head (twitter)
##   t       date     time comments     neg    pos    emo
## 1 1 2012-07-11 16:44:00      330 -1.7303 1.9970 0.2667
## 2 2 2012-07-11 16:59:00      292 -1.7808 2.0479 0.2671
## 3 3 2012-07-11 17:14:00      373 -1.7641 2.0080 0.2440
## 4 4 2012-07-11 17:29:00      372 -1.9167 1.9704 0.0538
## 5 5 2012-07-11 17:44:00      359 -1.7855 2.0084 0.2228
## 6 6 2012-07-11 17:59:00      351 -1.7892 1.9772 0.1880
twitter.comments.ts <- ts (twitter$comments, frequency = 96)
head (twitter.comments.ts)
## Time Series:
## Start = c(1, 1) 
## End = c(1, 6) 
## Frequency = 96 
## [1] 330 292 373 372 359 351
# Multidimensional time series
twitter.mts <- ts (twitter[,4:7], frequency = 96)
head (twitter.mts)
## Time Series:
## Start = c(1, 1) 
## End = c(1, 6) 
## Frequency = 96 
##          comments     neg    pos    emo
## 1.000000      330 -1.7303 1.9970 0.2667
## 1.010417      292 -1.7808 2.0479 0.2671
## 1.020833      373 -1.7641 2.0080 0.2440
## 1.031250      372 -1.9167 1.9704 0.0538
## 1.041667      359 -1.7855 2.0084 0.2228
## 1.052083      351 -1.7892 1.9772 0.1880


Time series plots

Using the plot.ts() function, you can easily create time series plots with the appropriate x-axis labeling.

plot.ts (nybirths.ts)

plot.ts (twitter.comments.ts)

plot.ts (twitter.mts)

The plot.ts() function is easy to use and is therefore best suited for quick looks at a time series. To create more elegant plots, the ggplot2 package is recommended. A properly labelled x-axis can be obtained by converting a character variable into a date-time object; for this we will use the as_datetime() function from the lubridate package.

library (ggplot2)
library (dplyr)
library (magrittr)
library (lubridate)

# Set an English locale so that date labels on the axes are in English
Sys.setlocale (category = "LC_TIME", locale = "en_GB.UTF-8")
## [1] "en_GB.UTF-8"
twitter %<>%
  mutate (date.time = as_datetime (paste (date, time),
                                   format = "%Y-%m-%d %H:%M:%S"))

g <- ggplot(twitter, aes (x = date.time))

g + theme_bw() +
  geom_line (aes(y = comments)) + 
  labs (x = "date", y = "number of comments")

g + theme_bw() +
  geom_line (aes(y = neg, color = "negative")) + 
  geom_line (aes(y = pos, color = "positive")) +
  scale_color_manual (values = c("purple","gold")) +
  labs (x = "date", y = "average arousal level", color = "emotions")

A multiple-panel figure can be easily obtained using the facet_grid() function, provided the data are in tidy (long) format.

library (tidyr)

twitter.long <- pivot_longer (twitter, cols = 4:7)

g <- ggplot (twitter.long, aes (x = date.time, y = value))
g +
  theme_bw() +
  geom_line ( aes(color = name)) +
  facet_grid(rows = vars(name), scales = "free_y")


Time series decomposition

Each time series \(Y(t)\) can be expressed as the sum or product of three components: the trend \(T(t)\), the seasonal component \(S(t)\), and the random noise \(\epsilon(t)\).

For an additive time series \(Y(t) = T(t) + S(t) + \epsilon(t)\) holds, while for a multiplicative time series \(Y(t) = T(t) \cdot S(t) \cdot \epsilon(t)\). A multiplicative time series can be converted to an additive one by taking its logarithm, i.e. \(\log Y(t) = \log T(t) + \log S(t) + \log \epsilon(t)\).

The decompose() function decomposes a time series into trend, seasonality, and noise components. It can be used for both additive and multiplicative time series. Its most important argument is an object of class ts with a correctly specified frequency. The result of decompose() is a list whose key elements are trend, seasonal, and random (alongside x, the original series, and type, the type of decomposition).

comments.comp <- decompose (twitter.comments.ts)
plot (comments.comp)

plot.ts (comments.comp$trend)
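
By construction of the additive model, the components returned by decompose() sum back to the original series (NA values appear at the edges, where the moving-average trend is undefined). This can be checked on the built-in co2 dataset (monthly CO2 concentrations):

```r
co2.comp <- decompose (co2)
recomposed <- co2.comp$trend + co2.comp$seasonal + co2.comp$random
max (abs (co2 - recomposed), na.rm = TRUE)  # essentially zero
```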


Autocorrelation function

Using the acf() function we obtain the values (and a default plot) of the time series autocorrelation function. Autocorrelation is simply the correlation coefficient of a series with a copy of itself shifted by a certain lag.

Calling acf() automatically creates a plot of the autocorrelation function. It is worth noting that the function computes the autocorrelation using the following formula (assuming that our series \(X\) consists of \(N\) samples and \(l\) is the lag):

\(r(l) = \frac{\sum\limits_{t=1}^{t=N-l} (X_t - \langle X \rangle) (X_{t+l} - \langle X \rangle)}{\sum\limits_{t=1}^{t=N} (X_t - \langle X \rangle)^2}\)

As a result, for example, for the series \(x = [1,2,...,10]\) we get this seemingly strange graph:

x <- 1:10
acf(x)
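
The formula can be verified directly. In the sketch below (the helper name acf.manual is ours), note that the denominator runs over all \(N\) samples while the numerator has only \(N-l\) terms, which is exactly why short series give values noticeably below 1 even at small lags:

```r
# Autocorrelation at lag l, computed exactly as in the formula above
acf.manual <- function (x, l) {
  n <- length (x)
  m <- mean (x)
  sum ((x[1:(n-l)] - m) * (x[(l+1):n] - m)) / sum ((x - m)^2)
}

x <- 1:10
acf.manual (x, 1)             # 0.7
acf (x, plot = FALSE)$acf[2]  # the same value, as reported by acf()
```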

Of course, when the series is long enough, such problems no longer exist.

acf (1:100)

acf (twitter$emo, lag.max = 100)

acf (twitter.comments.ts)

After saving the result of the acf() function to a variable, we have access to the autocorrelation values as a function of the lag.

emo.acf <- acf (twitter$emo, plot = FALSE)
comments.acf <- acf (twitter$comments, plot = FALSE)

acf.df <- data.frame (lag = emo.acf$lag, comments = comments.acf$acf, emo = emo.acf$acf)
acf.df
##    lag  comments       emo
## 1    0 1.0000000 1.0000000
## 2    1 0.9399225 0.3665214
## 3    2 0.9000480 0.3543908
## 4    3 0.8618741 0.3431365
## 5    4 0.8250931 0.3368175
## 6    5 0.7860484 0.3003519
## 7    6 0.7343220 0.3274854
## 8    7 0.6781029 0.2916379
## 9    8 0.6345612 0.2789155
## 10   9 0.5963326 0.2956152
## 11  10 0.5552265 0.2765411
## 12  11 0.5230479 0.2867227
## 13  12 0.4953809 0.2992242
## 14  13 0.4714417 0.2818484
## 15  14 0.4510583 0.2493000
## 16  15 0.4293892 0.2754497
## 17  16 0.4081743 0.2518896
## 18  17 0.3876031 0.2403999
## 19  18 0.3659647 0.2594402
## 20  19 0.3464604 0.2634851
## 21  20 0.3293129 0.2469580
## 22  21 0.3144594 0.2528576
## 23  22 0.3019226 0.2452463
## 24  23 0.2899641 0.2313643
## 25  24 0.2754574 0.2506975
## 26  25 0.2638865 0.2399968
## 27  26 0.2532810 0.2314394
## 28  27 0.2449191 0.2390851
## 29  28 0.2386572 0.2270953
## 30  29 0.2310675 0.2140408
## 31  30 0.2252918 0.2298475
## 32  31 0.2212231 0.2109398
## 33  32 0.2176836 0.2165425
## 34  33 0.2163366 0.2168547
## 35  34 0.2152734 0.2066870
## 36  35 0.2140204 0.2055577
## 37  36 0.2167229 0.2294540
## 38  37 0.2182613 0.2171979
## 39  38 0.2208866 0.2227386
ggplot (acf.df, aes (x = lag)) + 
  geom_point (aes (y = comments, color="comments"), size=3) + 
  geom_point (aes (y = emo, color="emo"), size=3) +
  labs (x = "lag", y = "autocorrelation", color = "series") +
  theme_bw()

If we want to calculate the correlation between two series shifted relative to each other, we can use the cross-correlation function ccf().

twitter.ccf <- ccf (twitter$comments, twitter$emo, lag.max = 2000)

l <- twitter.ccf$lag[which.min (twitter.ccf$acf)]
len <- length(twitter$comments)
cor.test (twitter$comments[(l+1):len], twitter$emo[1:(len-l)])
## 
##  Pearson's product-moment correlation
## 
## data:  twitter$comments[(l + 1):len] and twitter$emo[1:(len - l)]
## t = -26.609, df = 5287, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3672118 -0.3196712
## sample estimates:
##        cor 
## -0.3436617
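
The sign convention of ccf() is worth remembering: the value at lag \(k\) estimates the correlation between x[t+k] and y[t]. A synthetic example with a known shift makes this explicit (here x reproduces y with a delay of 5 samples, so the peak appears at lag +5):

```r
set.seed (1)
y <- rnorm (500)
x <- c(rnorm (5), y[1:495])  # x trails y by 5 samples
shift.ccf <- ccf (x, y, lag.max = 20, plot = FALSE)
shift.ccf$lag[which.max (shift.ccf$acf)]  # 5
```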