DATA ANALYSIS AND VISUALIZATION IN R, WINTER 2025 EDITION



Data mining

Data mining is the process of discovering significant new relationships, patterns, and trends by searching large amounts of data stored in databases with mathematical methods. We will deal with the problem of classification, i.e. how to assign objects to specific classes based on their characteristics (in other words, how to divide them into individual classes), as well as with the issue of dimension reduction. In general, we can distinguish two types of classification: supervised and unsupervised learning. In the first case, we need a data set that already contains classified data (i.e. data with an assigned class, the so-called “training sample”). The algorithm “learns” the features associated with the objects and then uses this knowledge to classify new data. In unsupervised classification, the division into classes takes place without a prior learning process.


Classifier efficiency evaluation

To estimate the effectiveness of a given classifier (predictive model), a test sample is necessary, which should be separate from the training sample. We use the model to predict classes of the observations from the test sample and then compare them to the actual classes.
In this way, we can construct a confusion matrix, whose fields are named as shown below (the nomenclature comes from medical tests that determine whether a patient carries a given disease).

                     classified as ILL       classified as HEALTHY
actually ILL         TRUE POSITIVE (TP)      FALSE NEGATIVE (FN)
actually HEALTHY     FALSE POSITIVE (FP)     TRUE NEGATIVE (TN)

All \(n=P+N\) cases are divided into infected \(P=TP+FN\) and healthy \(N=TN+FP\) individuals. Based on the confusion matrix, the following measures can be defined to help compare classifiers:

\(\hspace{1.2cm} TPR = \frac{TP}{P}\) (sensitivity, recall),

\(\hspace{1.2cm} TNR = \frac{TN}{N}\) (specificity),

\(\hspace{1.2cm} PPV = \frac{TP}{TP+FP}\) (precision, positive predictive value),

\(\hspace{1.2cm} NPV = \frac{TN}{TN+FN}\) (negative predictive value),

\(\hspace{1.2cm} ACC = \frac{TN + TP}{n}\) (accuracy),

\(\hspace{1.2cm} F_1 = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR} = \frac{2 TP}{2TP + FP + FN}\) (the \(F_1\) score, i.e. the harmonic mean of precision and recall).

Accuracy measures the overall fraction of correctly classified cases, regardless of class. Recall and specificity are conditional probabilities of correct classification given that the object actually belongs to a given class; they are particularly important when the class sizes are uneven. As an example, let’s calculate the above measures for a logistic regression model fitted to the Pima dataset from the MASS library.

library(MASS)

# Logistic regression based on the statistically significant predictors: glu and ped
pima.lr <- glm (type~glu+ped, data = Pima.tr, family = binomial)

# Prediction
pima.predict <- predict (pima.lr, newdata = Pima.te, type = "response")
pima.result <- ifelse (pima.predict>0.50, "Yes", "No")
pima.true <- Pima.te[,8]

# Confusion matrix
pima.cm <- table(pima.true,pima.result)
pima.cm
##          pima.result
## pima.true  No Yes
##       No  205  18
##       Yes  50  59
TN <- pima.cm[1,1]
TP <- pima.cm[2,2]
FP <- pima.cm[1,2]
FN <- pima.cm[2,1]
P <- TP+FN
N <- TN+FP

# Recall
TP/P
## [1] 0.5412844
# Specificity
TN/N
## [1] 0.9192825
# Precision
TP/(TP+FP)
## [1] 0.7662338
# Negative Predictive Value
TN/(TN+FN)
## [1] 0.8039216
# Accuracy
(TN+TP)/sum(pima.cm)
## [1] 0.7951807
# F1 score
2*TP/(2*TP+FP+FN)
## [1] 0.6344086

An easier way to calculate the above values is to use the confusionMatrix() function from the caret library. Its arguments can be either the contingency table returned by the table() function or the class membership vectors (predicted and actual).

library(caret)
pima.conf <- confusionMatrix (as.factor(pima.result), pima.true, positive = "Yes")
pima.conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  205  50
##        Yes  18  59
##                                           
##                Accuracy : 0.7952          
##                  95% CI : (0.7477, 0.8373)
##     No Information Rate : 0.6717          
##     P-Value [Acc > NIR] : 4.282e-07       
##                                           
##                   Kappa : 0.4979          
##                                           
##  Mcnemar's Test P-Value : 0.0001704       
##                                           
##             Sensitivity : 0.5413          
##             Specificity : 0.9193          
##          Pos Pred Value : 0.7662          
##          Neg Pred Value : 0.8039          
##              Prevalence : 0.3283          
##          Detection Rate : 0.1777          
##    Detection Prevalence : 0.2319          
##       Balanced Accuracy : 0.7303          
##                                           
##        'Positive' Class : Yes             
## 

The above function also returns confidence intervals for accuracy and tests the statistical hypothesis whether accuracy is greater than the “no information rate”, which is equal to the percentage of the larger class. The \(F_1\) score is not shown by default and must be accessed via the byClass field.

pima.conf$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.5412844            0.9192825            0.7662338 
##       Neg Pred Value            Precision               Recall 
##            0.8039216            0.7662338            0.5412844 
##                   F1           Prevalence       Detection Rate 
##            0.6344086            0.3283133            0.1777108 
## Detection Prevalence    Balanced Accuracy 
##            0.2319277            0.7302835


ROC curve

In the previous example, we assumed that the probability threshold for classifying an individual as diabetic is \(0.5\). A more appropriate approach is to test multiple thresholds and examine the recall and specificity at each of them. Using the sapply function, we can calculate the confusion matrix for any number of thresholds; \(0.04\) is the lowest threshold at which both predicted classes have non-zero representation.

recall.tnr <- function (x, p, threshold) {
  require (caret)
  predictions <- as.factor (ifelse (p>threshold, "Yes", "No"))
  cm <- confusionMatrix (predictions, x, positive = "Yes")
  return (cm$byClass[c("Sensitivity","Specificity")])
}

thres <- seq (0.04, 0.99, 0.01)

roc <- sapply (thres, function(t) recall.tnr (pima.true, pima.predict, t)) 

plot (1-roc["Specificity",], roc["Sensitivity",], xlab="1-TNR", ylab="TPR", main="ROC", type = "l")

The above curve is called the Receiver Operating Characteristic (ROC) and illustrates the quality of the classifier. The optimal cut-off point is the one that maximizes the Youden Index, given by \(YI = TPR + TNR - 1 = TPR - FPR\). There are many libraries that help you plot a ROC curve and even calculate the optimal cut-off and the AUC (Area Under the Curve), which is another measure of classifier performance.
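Before turning to dedicated packages, the AUC can be approximated directly with the trapezoidal rule. The sketch below reuses the `roc` matrix of sensitivity/specificity values computed with sapply above; it only covers the sampled threshold range, so the result is an approximation.

```r
# A minimal sketch: approximate AUC by the trapezoidal rule,
# using the TPR/FPR pairs stored in the `roc` matrix above.
fpr <- rev(1 - roc["Specificity", ])   # reorder so FPR is increasing
tpr <- rev(roc["Sensitivity", ])
auc.trapz <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc.trapz  # approximate AUC over the sampled threshold range
```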

library(PRROC)
## Loading required package: rlang
pima.true <- ifelse(Pima.te[,8]=="Yes", 1, 0)
prroc.obj <- roc.curve (scores.class0 = pima.predict, weights.class0 = pima.true, curve = TRUE)
plot (prroc.obj)

library(ROCit)
rocit.obj <- rocit (score = pima.predict, class = Pima.te[,8])
summary (rocit.obj)
##                            
##  Method used: empirical    
##  Number of positive(s): 109
##  Number of negative(s): 223
##  Area under curve: 0.8245
plot (rocit.obj)

We can easily calculate the threshold that maximizes the Youden Index \(YI = TPR - FPR\) using the fields of the rocit object.

youden.max <- which.max (rocit.obj$TPR - rocit.obj$FPR)
best.cutoff <- rocit.obj$Cutoff[youden.max]
best.tpr <- rocit.obj$TPR[youden.max]
best.fpr <- rocit.obj$FPR[youden.max]
sprintf("Best Cutoff = %.2f (TPR = %.3f, FPR = %.3f)", best.cutoff, best.tpr, best.fpr)
## [1] "Best Cutoff = 0.23 (TPR = 0.862, FPR = 0.359)"
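As a quick check, we can rebuild the confusion matrix at the Youden-optimal cutoff found above (about 0.23) instead of the default 0.5; this reuses pima.predict from the logistic model fitted earlier. Note that the lower cutoff trades some specificity for a higher recall.

```r
# Sketch: re-evaluate the classifier at the Youden-optimal cutoff (~0.23)
library(MASS)
library(caret)
pima.result.yi <- as.factor(ifelse(pima.predict > 0.23, "Yes", "No"))
confusionMatrix(pima.result.yi, Pima.te$type, positive = "Yes")
```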


Cross-validation

Classic model validation involves dividing the data set into two disjoint subsets: the training set and the test set. This division can be easily done using the createDataPartition() function from the caret library.

data("swiss")
# The data concerns fertility in 47 Swiss provinces in 1888
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
set.seed(123) # Random number generator seed
swiss.samples <- createDataPartition (swiss$Fertility, p = 0.8, list = FALSE)
swiss.train <- swiss[swiss.samples,]
swiss.test <- swiss[-swiss.samples,]
swiss.lm <- lm (Fertility ~., data = swiss.train)
swiss.predict <- predict (swiss.lm, newdata=swiss.test)
data.frame (R2 = R2 (swiss.predict, swiss.test$Fertility),
            RMSE = RMSE (swiss.predict, swiss.test$Fertility),
            MAE = MAE (swiss.predict, swiss.test$Fertility))
##          R2     RMSE      MAE
## 1 0.5946201 6.410914 5.651552
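To make the three measures concrete, they can also be computed from their definitions, using the swiss.predict vector and swiss.test set from above. Note that caret's R2() reports the squared correlation between predictions and observations, which is what the sketch below mirrors.

```r
# Sketch: R2, RMSE and MAE computed by hand from their definitions
err <- swiss.test$Fertility - swiss.predict
rmse.manual <- sqrt(mean(err^2))         # root mean squared error
mae.manual  <- mean(abs(err))            # mean absolute error
r2.manual   <- cor(swiss.predict, swiss.test$Fertility)^2  # squared correlation
c(R2 = r2.manual, RMSE = rmse.manual, MAE = mae.manual)
```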

In the example above, we used three error measures: \(R^2\), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). With a very large data set, classical validation is usually sufficient to estimate the effectiveness of the model. When little data is available, however, every observation is valuable and may affect the characteristics of the model, so excluding some observations from the training set is not advisable. In such situations, cross-validation is used in one of four variants:

  1. Bootstrapping – multiple test sets are generated by sampling with replacement from the original set. Typically, hundreds or thousands of such sets are generated, and the effectiveness of the model is calculated for each of them. The final result is the average (or another function) of all calculated performances.

  2. Leave-one-out cross-validation (LOOCV) – only one observation is selected for the test set, and the remaining observations are used to train the model. This is repeated for every observation in the set, and the final effectiveness is the average over all repetitions.

  3. K-fold cross-validation (CV) – the dataset is divided into \(K\) random subsets; we then train and test the model \(K\) times, each time selecting a different subset as the test sample and the rest as the training sample.

  4. Repeated K-fold cross-validation (RCV) – K-fold cross-validation is performed several times, each time drawing different \(K\) subsets.
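The K-fold scheme is simple enough to implement by hand. The sketch below is a minimal base-R version for the swiss linear model, assuming 10 folds; it returns the average RMSE over the folds.

```r
# A minimal hand-rolled 10-fold cross-validation sketch
data("swiss")
set.seed(123)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(swiss)))  # random fold labels
rmse <- sapply(1:K, function(k) {
  fit  <- lm(Fertility ~ ., data = swiss[folds != k, ])   # train on K-1 folds
  pred <- predict(fit, newdata = swiss[folds == k, ])     # test on fold k
  sqrt(mean((swiss[folds == k, "Fertility"] - pred)^2))
})
mean(rmse)  # average RMSE over the 10 folds
```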

The caret library implements the train() function, which is a wrapper for over two hundred different statistical models and allows for easy cross-validation.

# Print models compatible with the train() function
names(getModelInfo())
##   [1] "ada"                 "AdaBag"              "AdaBoost.M1"        
##   [4] "adaboost"            "amdai"               "ANFIS"              
##   [7] "avNNet"              "awnb"                "awtan"              
##  [10] "bag"                 "bagEarth"            "bagEarthGCV"        
##  [13] "bagFDA"              "bagFDAGCV"           "bam"                
##  [16] "bartMachine"         "bayesglm"            "binda"              
##  [19] "blackboost"          "blasso"              "blassoAveraged"     
##  [22] "bridge"              "brnn"                "BstLm"              
##  [25] "bstSm"               "bstTree"             "C5.0"               
##  [28] "C5.0Cost"            "C5.0Rules"           "C5.0Tree"           
##  [31] "cforest"             "chaid"               "CSimca"             
##  [34] "ctree"               "ctree2"              "cubist"             
##  [37] "dda"                 "deepboost"           "DENFIS"             
##  [40] "dnn"                 "dwdLinear"           "dwdPoly"            
##  [43] "dwdRadial"           "earth"               "elm"                
##  [46] "enet"                "evtree"              "extraTrees"         
##  [49] "fda"                 "FH.GBML"             "FIR.DM"             
##  [52] "foba"                "FRBCS.CHI"           "FRBCS.W"            
##  [55] "FS.HGD"              "gam"                 "gamboost"           
##  [58] "gamLoess"            "gamSpline"           "gaussprLinear"      
##  [61] "gaussprPoly"         "gaussprRadial"       "gbm_h2o"            
##  [64] "gbm"                 "gcvEarth"            "GFS.FR.MOGUL"       
##  [67] "GFS.LT.RS"           "GFS.THRIFT"          "glm.nb"             
##  [70] "glm"                 "glmboost"            "glmnet_h2o"         
##  [73] "glmnet"              "glmStepAIC"          "gpls"               
##  [76] "hda"                 "hdda"                "hdrda"              
##  [79] "HYFIS"               "icr"                 "J48"                
##  [82] "JRip"                "kernelpls"           "kknn"               
##  [85] "knn"                 "krlsPoly"            "krlsRadial"         
##  [88] "lars"                "lars2"               "lasso"              
##  [91] "lda"                 "lda2"                "leapBackward"       
##  [94] "leapForward"         "leapSeq"             "Linda"              
##  [97] "lm"                  "lmStepAIC"           "LMT"                
## [100] "loclda"              "logicBag"            "LogitBoost"         
## [103] "logreg"              "lssvmLinear"         "lssvmPoly"          
## [106] "lssvmRadial"         "lvq"                 "M5"                 
## [109] "M5Rules"             "manb"                "mda"                
## [112] "Mlda"                "mlp"                 "mlpKerasDecay"      
## [115] "mlpKerasDecayCost"   "mlpKerasDropout"     "mlpKerasDropoutCost"
## [118] "mlpML"               "mlpSGD"              "mlpWeightDecay"     
## [121] "mlpWeightDecayML"    "monmlp"              "msaenet"            
## [124] "multinom"            "mxnet"               "mxnetAdam"          
## [127] "naive_bayes"         "nb"                  "nbDiscrete"         
## [130] "nbSearch"            "neuralnet"           "nnet"               
## [133] "nnls"                "nodeHarvest"         "null"               
## [136] "OneR"                "ordinalNet"          "ordinalRF"          
## [139] "ORFlog"              "ORFpls"              "ORFridge"           
## [142] "ORFsvm"              "ownn"                "pam"                
## [145] "parRF"               "PART"                "partDSA"            
## [148] "pcaNNet"             "pcr"                 "pda"                
## [151] "pda2"                "penalized"           "PenalizedLDA"       
## [154] "plr"                 "pls"                 "plsRglm"            
## [157] "polr"                "ppr"                 "pre"                
## [160] "PRIM"                "protoclass"          "qda"                
## [163] "QdaCov"              "qrf"                 "qrnn"               
## [166] "randomGLM"           "ranger"              "rbf"                
## [169] "rbfDDA"              "Rborist"             "rda"                
## [172] "regLogistic"         "relaxo"              "rf"                 
## [175] "rFerns"              "RFlda"               "rfRules"            
## [178] "ridge"               "rlda"                "rlm"                
## [181] "rmda"                "rocc"                "rotationForest"     
## [184] "rotationForestCp"    "rpart"               "rpart1SE"           
## [187] "rpart2"              "rpartCost"           "rpartScore"         
## [190] "rqlasso"             "rqnc"                "RRF"                
## [193] "RRFglobal"           "rrlda"               "RSimca"             
## [196] "rvmLinear"           "rvmPoly"             "rvmRadial"          
## [199] "SBC"                 "sda"                 "sdwd"               
## [202] "simpls"              "SLAVE"               "slda"               
## [205] "smda"                "snn"                 "sparseLDA"          
## [208] "spikeslab"           "spls"                "stepLDA"            
## [211] "stepQDA"             "superpc"             "svmBoundrangeString"
## [214] "svmExpoString"       "svmLinear"           "svmLinear2"         
## [217] "svmLinear3"          "svmLinearWeights"    "svmLinearWeights2"  
## [220] "svmPoly"             "svmRadial"           "svmRadialCost"      
## [223] "svmRadialSigma"      "svmRadialWeights"    "svmSpectrumString"  
## [226] "tan"                 "tanSearch"           "treebag"            
## [229] "vbmpRadial"          "vglmAdjCat"          "vglmContRatio"      
## [232] "vglmCumulative"      "widekernelpls"       "WM"                 
## [235] "wsrf"                "xgbDART"             "xgbLinear"          
## [238] "xgbTree"             "xyf"
# Bootstrap
train.control <- trainControl(method="boot", number=100)
swiss.boot <- train(Fertility ~ ., data = swiss, method = "lm", trControl = train.control)
print (swiss.boot)
## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (100 reps) 
## Summary of sample sizes: 47, 47, 47, 47, 47, 47, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   8.437493  0.5887206  6.772303
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Leave one out cross validation (LOOCV)
train.control <- trainControl(method = "LOOCV")
swiss.loocv <- train(Fertility ~ ., data = swiss, method = "lm", trControl = train.control)
print (swiss.loocv)
## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   7.738618  0.6128307  6.116021
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# 10-fold crossvalidation
train.control <- trainControl(method = "cv", number=10)
swiss.cv <- train(Fertility ~ ., data = swiss, method = "lm", trControl = train.control)
print (swiss.cv)
## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 43, 42, 43, 43, 44, 41, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   7.278777  0.7008316  6.080132
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# 10-fold crossvalidation repeated 3 times
train.control <- trainControl(method = "repeatedcv", number=10, repeats=3)
swiss.repeatedcv <- train(Fertility ~ ., data = swiss, method = "lm", trControl = train.control)
print (swiss.repeatedcv)
## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 42, 41, 43, 42, 43, 41, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   7.320856  0.7033335  6.156657
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE