DATA ANALYSIS AND VISUALIZATION IN R, WINTER 2025 EDITION



Principal Component Analysis

One of the tasks of data mining is dimension reduction, i.e. determining which components of the observation vector are irrelevant, or which combinations of components may prove useful for further analysis. The standard method of dimension reduction is Principal Component Analysis (PCA). It finds a new direction that maximizes the variance of the observations projected onto it; then it looks for the next direction, again with the highest possible variance but orthogonal to the previous one, and so on. The PCA algorithm is implemented in R by the princomp() function.

# Simple linear relation
x <- seq(-5, 5, by=.1)
y <- 2*x

# Add some noise for more variance
x <- x + runif(101, max = 1)
y <- y + runif(101, max = 1)
test <- data.frame(x, y)
plot (test)

# Perform PCA
test.pc <- princomp (~., data=test)
summary (test.pc)
## Importance of components:
##                           Comp.1      Comp.2
## Standard deviation     6.5031976 0.261490078
## Proportion of Variance 0.9983858 0.001614191
## Cumulative Proportion  0.9983858 1.000000000

The vectors corresponding to the new directions are stored in the loadings field, and the original data projected onto these directions are available in the scores field. A princomp object can also be plotted to show the variances of the components.

test.pc$loadings
## 
## Loadings:
##   Comp.1 Comp.2
## x  0.444  0.896
## y  0.896 -0.444
## 
##                Comp.1 Comp.2
## SS loadings       1.0    1.0
## Proportion Var    0.5    0.5
## Cumulative Var    0.5    1.0
plot (test.pc)

plot (test.pc$scores)
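As a quick check (a minimal sketch on freshly generated data, separate from the example above), the scores can be reproduced by centering the data and projecting it onto the loadings:

```r
# Sketch: princomp scores equal the centered data projected onto the loadings
set.seed(1)
demo <- data.frame(x = rnorm(50), y = rnorm(50))
pc <- princomp(~ ., data = demo)
centered <- scale(as.matrix(demo), center = TRUE, scale = FALSE)
manual_scores <- centered %*% unclass(pc$loadings)
# Should agree with pc$scores up to numerical precision
stopifnot(isTRUE(all.equal(manual_scores, unclass(pc$scores),
                           check.attributes = FALSE)))
```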

The PCA method has a strong theoretical foundation: it can be proved that the successive components are exactly the eigenvectors of the covariance (or correlation) matrix, ordered by decreasing eigenvalue.

# Get the covariance and correlation matrices
test.cov <- cov (test)
test.cor <- cor (test)
# Compute eigenvalues and eigenvectors
eigen (test.cov)
## eigen() decomposition
## $values
## [1] 42.71449509  0.06906083
## 
## $vectors
##           [,1]       [,2]
## [1,] 0.4437879 -0.8961319
## [2,] 0.8961319  0.4437879
eigen (test.cor)
## eigen() decomposition
## $values
## [1] 1.994911235 0.005088765
## 
## $vectors
##           [,1]       [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068  0.7071068
# Scale data
test.scaled <- as.data.frame (scale (test))
eigen (cov (test.scaled))
## eigen() decomposition
## $values
## [1] 1.994911235 0.005088765
## 
## $vectors
##            [,1]       [,2]
## [1,] -0.7071068  0.7071068
## [2,] -0.7071068 -0.7071068
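The sign flips visible above are expected, since eigenvectors are only defined up to sign. A small self-contained sketch (regenerating data similar to the example) confirms that the princomp directions match the covariance eigenvectors up to sign:

```r
# Sketch: principal directions vs. covariance eigenvectors (equal up to sign)
set.seed(1)
x <- seq(-5, 5, by = .1)
y <- 2 * x
demo <- data.frame(x = x + runif(101), y = y + runif(101))
pc <- princomp(~ ., data = demo)
ev <- eigen(cov(demo))$vectors
# Compare column by column, ignoring the arbitrary sign of each eigenvector
stopifnot(isTRUE(all.equal(abs(unclass(pc$loadings)), abs(ev),
                           check.attributes = FALSE)))
```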


K-Means Clustering

One of the unsupervised classification methods is cluster analysis: observations are partitioned into groups so that points within a group are similar to each other and dissimilar to points in other groups.

In practice, the K-Means algorithm randomly initializes the centers of \(K\) clusters (where \(K\) is a parameter); in subsequent iterations it assigns each point to its nearest center and recomputes the centers as cluster means, repeating until the assignments no longer change. The function implementing this algorithm in R is kmeans().
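The iteration described above can be sketched in a few lines (an illustrative toy implementation on random data, not a replacement for kmeans(), which also handles empty clusters and multiple restarts):

```r
# Toy Lloyd iteration: assign points to nearest center, recompute means, repeat
set.seed(42)
pts <- matrix(rnorm(60), ncol = 2)            # 30 random 2-D points
K <- 3
centers <- pts[sample(nrow(pts), K), ]        # random initial centers
repeat {
  # Distance of every point to every center
  d <- as.matrix(dist(rbind(centers, pts)))[-(1:K), 1:K]
  assignment <- max.col(-d)                   # index of the nearest center
  new_centers <- centers
  for (k in 1:K) {                            # recompute non-empty centers
    idx <- assignment == k
    if (any(idx)) new_centers[k, ] <- colMeans(pts[idx, , drop = FALSE])
  }
  if (all(abs(new_centers - centers) < 1e-9)) break
  centers <- new_centers
}
table(assignment)
```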

library (MASS)
set.seed (111)

n <- 30
sigma <- matrix(c(1,0,0,1),2,2)

mu1 <- c(5,5)
mu2 <- c(1,1)
mu3 <- c(4,-2)

my_colors <- c (rep ("orange", n), rep ("violet", n), rep ("green", n))

clust1 <- mvrnorm (n, mu1, sigma)
clust2 <- mvrnorm (n, mu2, sigma)
clust3 <- mvrnorm (n, mu3, sigma)

all_points <- rbind (clust1, clust2, clust3)

xrange <- range (all_points[,1])
yrange <- range (all_points[,2])

par(mfrow = c(2,2))

plot(all_points, col=my_colors, pch=19, cex=2, xlab="X", ylab="Y", xlim=xrange, ylim=yrange)
title("Clusters", cex.main=1.4, font=2)

cl <- kmeans (all_points, 3)

plot(all_points, col=my_colors, pch=19, cex=2, xlab="X", ylab="Y", xlim=xrange, ylim=yrange)
text(all_points[,1], all_points[,2], cl$cluster, font=2)
title("3-Means", cex.main=1.4, font=2)

cl1 <- kmeans(all_points, 4)

plot(all_points, col=cl1$cluster, pch=19, cex=2, xlab="X", ylab="Y", xlim=xrange, ylim=yrange)
title("4-Means", cex.main=1.4, font=2)

cl2 <- kmeans(all_points, 5)

plot(all_points, col=cl2$cluster, pch=19, cex=2, xlab="X", ylab="Y", xlim=xrange, ylim=yrange)
title("5-Means", cex.main=1.4, font=2)


Optimal number of clusters

A key element of the K-Means algorithm is choosing the right number of groups. Merely comparing the within-cluster sum of squares \(W\) is insufficient, because it always decreases as the number of clusters \(K\) grows. To determine the optimal number of clusters \(K^\ast\), one can use the gap statistic \(GS = \ln \langle W_{rand} \rangle - \ln \langle W_{data} \rangle\), where \(\langle W_{rand} \rangle\) is the within-cluster sum of squares, averaged over many realizations, computed for random data drawn from a uniform distribution over the bounding box of the real data. The optimal number of clusters is then given by:

\(K^\ast = \arg \max_K GS(K)\)

random_points <- function(n, K, xrange, yrange) {
  x <- runif (n, xrange[1], xrange[2])
  y <- runif (n, yrange[1], yrange[2])
  cl <- kmeans (cbind (x,y), K)
  return (cl$tot.withinss)
}

K <- 1:10
W.rand <- sapply (K, function(k) mean (sapply (1:20, function(i) random_points (3*n, k, xrange, yrange))))
W.data <- sapply (K, function(k) mean (sapply (1:20, function(i) kmeans(all_points, k)$tot.withinss)))

W.df <- data.frame (K = K,
                    W.rand = W.rand,
                    W.data = W.data,
                    GS = log(W.rand/W.data))

library(ggplot2)
g <- ggplot(W.df)

g + theme_bw() + 
  geom_point ( aes (x=K, y=W.rand, color="random"), size=3) +
  geom_point ( aes (x=K, y=W.data, color="data"), size=3) +
  labs (x = "K", y="W", color="data")

g + theme_bw() +
  geom_point ( aes (x=K, y=GS), size=3) +
  labs (x="K", y=expression(paste("GS = ",lnW[rnd]-lnW[data])))

In the \(GS(K)\) graph above, it is clearly visible that \(K^\ast = 3\). Let’s calculate the confusion matrix and classification accuracy for this number of clusters.

library(caret)
cluster.predict <- as.factor (cl$cluster)
# k-means labels clusters arbitrarily; the true labels are permuted here
# as (1, 3, 2) to match the labels assigned by this particular run
cluster.true <- as.factor (rep (c(1,3,2), each = n))
cluster.cm <- confusionMatrix (cluster.predict, cluster.true)
cluster.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 30  0  0
##          2  0 30  1
##          3  0  0 29
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9889          
##                  95% CI : (0.9396, 0.9997)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9833          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            1.0000   1.0000   0.9667
## Specificity            1.0000   0.9833   1.0000
## Pos Pred Value         1.0000   0.9677   1.0000
## Neg Pred Value         1.0000   1.0000   0.9836
## Prevalence             0.3333   0.3333   0.3333
## Detection Rate         0.3333   0.3333   0.3222
## Detection Prevalence   0.3333   0.3444   0.3222
## Balanced Accuracy      1.0000   0.9917   0.9833

If the \(GS(K)\) function has no visible maximum, its discrete difference \(\Delta GS(K) = GS(K) - GS(K-1)\) can be examined instead.

W.df$dGS = c(0, diff(W.df$GS))

ggplot(W.df) + theme_bw() +
  geom_point ( aes (x=K, y=dGS), size=3) +
  labs (x="K", y=expression(paste(Delta,"GS",sep="")))


Hierarchical Clustering

Another way to perform cluster analysis is to use hierarchical methods. In the agglomerative variant implemented by hclust(), each observation starts as its own cluster, and at every step the two closest clusters are merged, until a single cluster remains.

Using the hclust() function we will create a simple dendrogram for a subsample of the data from the previous example.

sub_points <- rbind (clust1[1:10,], clust2[1:10,], clust3[1:10,])
hc <- hclust (dist (sub_points))
plot(hc, hang = -1)

Key information about the merge order and the distances at which clusters form is stored in the merge and height fields. In merge, negative values denote single observations, while positive values refer to clusters created in earlier steps of the algorithm; e.g. the row -24 1 means that observation no. 24 was joined with the cluster created in step 1.

hc$height
##  [1]  0.2893393  0.4030210  0.4246878  0.4772576  0.5104020  0.5946472
##  [7]  0.6635294  0.7216255  0.8254000  0.8572473  0.8693177  0.8955508
## [13]  0.9395204  1.0086786  1.0677816  1.3723851  1.7232404  1.7962182
## [19]  1.8155090  2.2566573  2.3432081  2.3546992  2.6193467  3.6627558
## [25]  3.6655308  3.8337065  4.5464827  7.7416469 10.1880062
hc$merge
##       [,1] [,2]
##  [1,]  -28  -30
##  [2,]   -2  -10
##  [3,]  -24    1
##  [4,]  -12  -15
##  [5,]   -3   -6
##  [6,]  -23  -26
##  [7,]   -7   -8
##  [8,]  -21  -27
##  [9,]   -9    2
## [10,]  -13  -16
## [11,]  -14    4
## [12,]  -11  -17
## [13,]   -1   -5
## [14,]  -19   10
## [15,]  -22  -25
## [16,]  -29    8
## [17,]    7    9
## [18,]  -20   14
## [19,]    3   16
## [20,]  -18   12
## [21,]   -4   17
## [22,]    6   19
## [23,]   11   20
## [24,]   15   22
## [25,]   13   21
## [26,]   18   23
## [27,]    5   25
## [28,]   24   26
## [29,]   27   28
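This convention can be verified on a tiny, freshly generated data set: the first merge always joins the two closest single observations, so the first row of merge contains two negative entries.

```r
# Sketch: decoding hclust's merge and height fields on a tiny data set
set.seed(7)
pts <- matrix(rnorm(20), ncol = 2)   # 10 observations
hc <- hclust(dist(pts))
# Negative entries = observation numbers, positive = earlier merge steps
cbind(hc$merge, height = hc$height)
```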

The optimal number of clusters can be estimated by looking for the largest height difference between successive merges.

height.diff.max <- hc$height[which.max (diff (hc$height))+1]
plot (hc, hang = -1)
abline (h=height.diff.max, col="red", lwd=2)

To extract clusters from an hclust object we can use the cutree() function, which returns a vector of group memberships for a given cut height or a desired number of groups.

hc <- hclust (dist (all_points))
cluster.predict <- as.factor (cutree (hc, k=3))
cluster.true <- as.factor (rep (c(1,2,3), each = n))
cluster.cm <- confusionMatrix (cluster.predict, cluster.true)
cluster.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1 29  0  0
##          2  1 30  2
##          3  0  0 28
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.9057, 0.9931)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.9667   1.0000   0.9333
## Specificity            1.0000   0.9500   1.0000
## Pos Pred Value         1.0000   0.9091   1.0000
## Neg Pred Value         0.9836   1.0000   0.9677
## Prevalence             0.3333   0.3333   0.3333
## Detection Rate         0.3222   0.3333   0.3111
## Detection Prevalence   0.3222   0.3667   0.3111
## Balanced Accuracy      0.9833   0.9750   0.9667