Data mining is the process of discovering significant new relationships, patterns, and trends by searching through large amounts of data stored in databases using mathematical methods. We will deal with the problem of classification, i.e. how to assign objects to specific classes based on their characteristics (in other words, how to divide them into individual classes), as well as with the issue of dimension reduction. In general, we can distinguish two types of classification: supervised and unsupervised learning. In the first case, we must have a data set that is already classified (i.e. data with an assigned class, the so-called "training sample"). The algorithm "learns" the features of the objects and then uses this knowledge to classify new data. In the case of unsupervised classification, the division into classes takes place without a prior learning process.
To estimate the effectiveness of a given classifier (predictive
model), a test sample is necessary, which should be separate from the
training sample. We use the model to predict classes of the observations
from the test sample and then compare them to the actual classes.
In this way, we can construct a confusion matrix, whose fields
have the names shown in the table below (the nomenclature comes
from medical tests aimed at determining whether a patient carries
a given disease).
| | classified as ILL | classified as HEALTHY |
|---|---|---|
| actually ILL | TRUE POSITIVE (TP) | FALSE NEGATIVE (FN) |
| actually HEALTHY | FALSE POSITIVE (FP) | TRUE NEGATIVE (TN) |
All \(n=P+N\) cases are divided into infected \(P=TP+FN\) and healthy \(N=TN+FP\) individuals. Based on the confusion matrix, the following measures can be defined to help compare classifiers: recall (sensitivity, TPR), specificity (TNR), precision (PPV), negative predictive value (NPV), accuracy (ACC), and the \(F_1\) score:
\(\hspace{1.2cm} TPR = \frac{TP}{P}\),
\(\hspace{1.2cm} TNR = \frac{TN}{N}\),
\(\hspace{1.2cm} PPV = \frac{TP}{TP+FP}\),
\(\hspace{1.2cm} NPV = \frac{TN}{TN+FN}\)
\(\hspace{1.2cm} ACC = \frac{TN + TP}{n}\),
\(\hspace{1.2cm} F_1 = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR} = \frac{2 TP}{2TP + FP + FN}\).
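These definitions can be collected into a small helper function (a sketch of our own; the name class.measures is not from any library):

```r
# compute all six measures from the four confusion-matrix counts
class.measures <- function (TP, FN, FP, TN) {
  P <- TP + FN   # all actually positive cases
  N <- TN + FP   # all actually negative cases
  c(TPR = TP / P,
    TNR = TN / N,
    PPV = TP / (TP + FP),
    NPV = TN / (TN + FN),
    ACC = (TP + TN) / (P + N),
    F1  = 2 * TP / (2 * TP + FP + FN))
}
```

Called with the counts from the Pima example below (TP = 59, FN = 50, FP = 18, TN = 205), it reproduces the values reported there.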
Accuracy measures the overall proportion of correctly classified
cases, regardless of class. Recall and specificity are the conditional
probabilities of correct classification given that the object actually
belongs to a given class. They are particularly important when the
class sizes are uneven. As an example, let's calculate the above
measures for a logistic regression model fitted to the Pima dataset
from the MASS library.
library(MASS)
# Logistic regression based on the statistically significant predictors: glu and ped
pima.lr <- glm (type~glu+ped, data = Pima.tr, family = binomial)
# Prediction
pima.predict <- predict (pima.lr, newdata = Pima.te, type = "response")
pima.result <- ifelse (pima.predict>0.50, "Yes", "No")
pima.true <- Pima.te[,8]
# Confusion matrix
pima.cm <- table(pima.true,pima.result)
pima.cm
## pima.result
## pima.true No Yes
## No 205 18
## Yes 50 59
TN <- pima.cm[1,1]
TP <- pima.cm[2,2]
FP <- pima.cm[1,2]
FN <- pima.cm[2,1]
P <- TP+FN
N <- TN+FP
# Recall (TPR)
TP/P
## [1] 0.5412844
# Specificity (TNR)
TN/N
## [1] 0.9192825
# Precision (PPV)
TP/(TP+FP)
## [1] 0.7662338
# Negative predictive value (NPV)
TN/(TN+FN)
## [1] 0.8039216
# Accuracy (ACC)
(TP+TN)/(P+N)
## [1] 0.7951807
# F1 score
2*TP/(2*TP+FP+FN)
## [1] 0.6344086
An easier way to calculate the above values is to use the
confusionMatrix() function from the caret
library. This function accepts either the contingency table
returned by the table() function or the class membership
vectors (predicted and actual).
library(caret)
pima.conf <- confusionMatrix (as.factor(pima.result), pima.true, positive = "Yes")
pima.conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 205 50
## Yes 18 59
##
## Accuracy : 0.7952
## 95% CI : (0.7477, 0.8373)
## No Information Rate : 0.6717
## P-Value [Acc > NIR] : 4.282e-07
##
## Kappa : 0.4979
##
## Mcnemar's Test P-Value : 0.0001704
##
## Sensitivity : 0.5413
## Specificity : 0.9193
## Pos Pred Value : 0.7662
## Neg Pred Value : 0.8039
## Prevalence : 0.3283
## Detection Rate : 0.1777
## Detection Prevalence : 0.2319
## Balanced Accuracy : 0.7303
##
## 'Positive' Class : Yes
##
The above function also returns a confidence interval for the accuracy
and tests the hypothesis that the accuracy is greater than the
"no information rate", which equals the proportion of the larger
class. The \(F_1\) score is not shown
by default and must be accessed via the byClass field.
pima.conf$byClass
## Sensitivity Specificity Pos Pred Value
## 0.5412844 0.9192825 0.7662338
## Neg Pred Value Precision Recall
## 0.8039216 0.7662338 0.5412844
## F1 Prevalence Detection Rate
## 0.6344086 0.3283133 0.1777108
## Detection Prevalence Balanced Accuracy
## 0.2319277 0.7302835
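As mentioned above, confusionMatrix() also accepts a pre-computed contingency table instead of the two class vectors. Note that caret then interprets the rows as predictions and the columns as actual classes, i.e. the transpose of the pima.cm table built earlier, so we construct the table in that orientation:

```r
# contingency table with predictions in rows, actual classes in columns
pima.tab <- table (pima.result, pima.true)
confusionMatrix (pima.tab, positive = "Yes")
```

This call returns exactly the same statistics as the vector-based call above.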
In the previous example, we assumed that the probability threshold
for considering an individual to be diabetic is \(0.5\). A more appropriate approach would be
to test multiple thresholds and examine the recall and specificity for
each of them. Using the sapply function, you can
calculate the confusion matrix for any number of thresholds; here
\(0.04\) is the smallest threshold at
which both classes have non-zero representation.
recall.tnr <- function (x, p, threshold) {
  require (caret)
  predictions <- as.factor (ifelse (p > threshold, "Yes", "No"))
  cm <- confusionMatrix (predictions, x, positive = "Yes")
  return (cm$byClass[c("Sensitivity","Specificity")])
}
thres <- seq (0.04, 0.99, 0.01)
roc <- sapply (thres, function(t) recall.tnr (pima.true, pima.predict, t))
plot (1-roc["Specificity",], roc["Sensitivity",], xlab="1-TNR", ylab="TPR", main="ROC", type = "l")
The above curve is called the Receiver Operating Characteristic (ROC) and illustrates the quality of the classifier. The optimal split point is the one that maximizes the Youden index, given by \(YI = TPR + TNR - 1 = TPR - FPR\). There are many libraries that help you plot a ROC curve and even calculate the optimal split point and the AUC (Area Under the Curve), which is another measure of classifier performance.
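Before reaching for a dedicated package, the Youden-optimal threshold can be read off directly from the roc matrix and the thres vector computed above:

```r
# Youden index at every tested threshold
yi <- roc["Sensitivity",] + roc["Specificity",] - 1
thres[which.max (yi)]  # threshold maximizing TPR + TNR - 1
```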
library (PRROC)
## Loading required package: rlang
pima.true <- ifelse(Pima.te[,8]=="Yes", 1, 0)
prroc.obj <- roc.curve (scores.class0 = pima.predict, weights.class0 = pima.true, curve = TRUE)
plot (prroc.obj)
##
## Method used: empirical
## Number of positive(s): 109
## Number of negative(s): 223
## Area under curve: 0.8245
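The empirical AUC reported above can be cross-checked without any ROC package: it equals the Mann–Whitney statistic, i.e. the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (ties counted as 1/2):

```r
pos <- pima.predict[pima.true == 1]  # scores of the diabetic cases
neg <- pima.predict[pima.true == 0]  # scores of the healthy cases
# fraction of positive-negative pairs ranked correctly (+ half credit for ties)
mean (outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))  # close to 0.8245
```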
We can easily calculate the threshold that maximizes the Youden index
\(YI = TPR - FPR\) using the fields of
the rocit object from the ROCit library.
library (ROCit)
rocit.obj <- rocit (score = pima.predict, class = pima.true, negref = 0)
youden.max <- which.max (rocit.obj$TPR - rocit.obj$FPR)
best.cutoff <- rocit.obj$Cutoff[youden.max]
best.tpr <- rocit.obj$TPR[youden.max]
best.fpr <- rocit.obj$FPR[youden.max]
sprintf("Best Cutoff = %.2f (TPR = %.3f, FPR = %.3f)", best.cutoff, best.tpr, best.fpr)## [1] "Best Cutoff = 0.23 (TPR = 0.862, FPR = 0.359)"
Classic model validation involves dividing the data set into two
disjoint subsets: the training set and the test set. This division can
be easily done using the createDataPartition() function
from the caret library; we illustrate it with the built-in swiss dataset.
head (swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
set.seed(123) # Random number generator seed
swiss.samples <- createDataPartition (swiss$Fertility, p = 0.8, list = FALSE)
swiss.train <- swiss[swiss.samples,]
swiss.test <- swiss[-swiss.samples,]
swiss.lm <- lm (Fertility ~., data = swiss.train)
swiss.predict <- predict (swiss.lm, newdata=swiss.test)
data.frame (R2 = R2 (swiss.predict, swiss.test$Fertility),
RMSE = RMSE (swiss.predict, swiss.test$Fertility),
MAE = MAE (swiss.predict, swiss.test$Fertility))
## R2 RMSE MAE
## 1 0.5946201 6.410914 5.651552
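These three caret helpers are thin wrappers around simple formulas. Written out by hand on the same split (note that R2() computes the squared correlation by default, not the traditional coefficient of determination):

```r
err <- swiss.test$Fertility - swiss.predict  # test-set residuals
c(R2   = cor (swiss.predict, swiss.test$Fertility)^2,  # squared correlation
  RMSE = sqrt (mean (err^2)),                          # root mean squared error
  MAE  = mean (abs (err)))                             # mean absolute error
```

The values should match those in the data frame above.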
In the example above, we used three error measures: \(R^2\), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). If we have a very large data set, classical validation should be sufficient to estimate the effectiveness of the model. However, if little data has been collected, each observation is valuable and may affect the characteristics of the model, so excluding some observations from the training set is not advisable. In such situations, cross-validation is used in one of four variants:
Bootstrapping – it involves generating multiple training sets by sampling with replacement from the original set; each model is then evaluated on the observations left out of its sample. Typically, hundreds or thousands of such sets are generated, and the final result is the average (or another summary) of all calculated performances.
Leave one out cross validation (LOOCV) – it involves selecting only one observation for the test set. The remaining observations are used to train the model. This operation is repeated for all observations in the set, and the final effectiveness is the average of all calculated effectiveness.
K-fold cross-validation (CV) – the dataset is divided into \(K\) random subsets, then we train and test the model \(K\) times, each time selecting a different subset as the test sample and the rest as the training sample.
Repeated K-fold cross-validation (RCV) – it involves performing K-fold cross-validation several times, each time drawing different \(K\) subsets.
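The mechanics of K-fold cross-validation are easy to sketch in base R before handing the work over to caret (a minimal illustration on the swiss data; the random fold assignment via sample() is our own choice):

```r
set.seed(123)
K <- 10
# assign each observation to one of K folds at random
folds <- sample(rep(1:K, length.out = nrow(swiss)))
rmse <- sapply(1:K, function(k) {
  fit  <- lm(Fertility ~ ., data = swiss[folds != k, ])  # train on K-1 folds
  pred <- predict(fit, newdata = swiss[folds == k, ])    # test on held-out fold
  sqrt(mean((swiss$Fertility[folds == k] - pred)^2))
})
mean(rmse)  # cross-validated RMSE
```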
The caret library implements the train()
function, which is a wrapper for over two hundred different statistical
models and allows for easy cross-validation. The available model tags
can be listed with names(getModelInfo()):
names (getModelInfo())
## [1] "ada" "AdaBag" "AdaBoost.M1"
## [4] "adaboost" "amdai" "ANFIS"
## [7] "avNNet" "awnb" "awtan"
## [10] "bag" "bagEarth" "bagEarthGCV"
## [13] "bagFDA" "bagFDAGCV" "bam"
## [16] "bartMachine" "bayesglm" "binda"
## [19] "blackboost" "blasso" "blassoAveraged"
## [22] "bridge" "brnn" "BstLm"
## [25] "bstSm" "bstTree" "C5.0"
## [28] "C5.0Cost" "C5.0Rules" "C5.0Tree"
## [31] "cforest" "chaid" "CSimca"
## [34] "ctree" "ctree2" "cubist"
## [37] "dda" "deepboost" "DENFIS"
## [40] "dnn" "dwdLinear" "dwdPoly"
## [43] "dwdRadial" "earth" "elm"
## [46] "enet" "evtree" "extraTrees"
## [49] "fda" "FH.GBML" "FIR.DM"
## [52] "foba" "FRBCS.CHI" "FRBCS.W"
## [55] "FS.HGD" "gam" "gamboost"
## [58] "gamLoess" "gamSpline" "gaussprLinear"
## [61] "gaussprPoly" "gaussprRadial" "gbm_h2o"
## [64] "gbm" "gcvEarth" "GFS.FR.MOGUL"
## [67] "GFS.LT.RS" "GFS.THRIFT" "glm.nb"
## [70] "glm" "glmboost" "glmnet_h2o"
## [73] "glmnet" "glmStepAIC" "gpls"
## [76] "hda" "hdda" "hdrda"
## [79] "HYFIS" "icr" "J48"
## [82] "JRip" "kernelpls" "kknn"
## [85] "knn" "krlsPoly" "krlsRadial"
## [88] "lars" "lars2" "lasso"
## [91] "lda" "lda2" "leapBackward"
## [94] "leapForward" "leapSeq" "Linda"
## [97] "lm" "lmStepAIC" "LMT"
## [100] "loclda" "logicBag" "LogitBoost"
## [103] "logreg" "lssvmLinear" "lssvmPoly"
## [106] "lssvmRadial" "lvq" "M5"
## [109] "M5Rules" "manb" "mda"
## [112] "Mlda" "mlp" "mlpKerasDecay"
## [115] "mlpKerasDecayCost" "mlpKerasDropout" "mlpKerasDropoutCost"
## [118] "mlpML" "mlpSGD" "mlpWeightDecay"
## [121] "mlpWeightDecayML" "monmlp" "msaenet"
## [124] "multinom" "mxnet" "mxnetAdam"
## [127] "naive_bayes" "nb" "nbDiscrete"
## [130] "nbSearch" "neuralnet" "nnet"
## [133] "nnls" "nodeHarvest" "null"
## [136] "OneR" "ordinalNet" "ordinalRF"
## [139] "ORFlog" "ORFpls" "ORFridge"
## [142] "ORFsvm" "ownn" "pam"
## [145] "parRF" "PART" "partDSA"
## [148] "pcaNNet" "pcr" "pda"
## [151] "pda2" "penalized" "PenalizedLDA"
## [154] "plr" "pls" "plsRglm"
## [157] "polr" "ppr" "pre"
## [160] "PRIM" "protoclass" "qda"
## [163] "QdaCov" "qrf" "qrnn"
## [166] "randomGLM" "ranger" "rbf"
## [169] "rbfDDA" "Rborist" "rda"
## [172] "regLogistic" "relaxo" "rf"
## [175] "rFerns" "RFlda" "rfRules"
## [178] "ridge" "rlda" "rlm"
## [181] "rmda" "rocc" "rotationForest"
## [184] "rotationForestCp" "rpart" "rpart1SE"
## [187] "rpart2" "rpartCost" "rpartScore"
## [190] "rqlasso" "rqnc" "RRF"
## [193] "RRFglobal" "rrlda" "RSimca"
## [196] "rvmLinear" "rvmPoly" "rvmRadial"
## [199] "SBC" "sda" "sdwd"
## [202] "simpls" "SLAVE" "slda"
## [205] "smda" "snn" "sparseLDA"
## [208] "spikeslab" "spls" "stepLDA"
## [211] "stepQDA" "superpc" "svmBoundrangeString"
## [214] "svmExpoString" "svmLinear" "svmLinear2"
## [217] "svmLinear3" "svmLinearWeights" "svmLinearWeights2"
## [220] "svmPoly" "svmRadial" "svmRadialCost"
## [223] "svmRadialSigma" "svmRadialWeights" "svmSpectrumString"
## [226] "tan" "tanSearch" "treebag"
## [229] "vbmpRadial" "vglmAdjCat" "vglmContRatio"
## [232] "vglmCumulative" "widekernelpls" "WM"
## [235] "wsrf" "xgbDART" "xgbLinear"
## [238] "xgbTree" "xyf"
# Bootstrap
train.control <- trainControl(method="boot", number=100)
swiss.boot <- train(Fertility ~ ., data = swiss, method = "lm", trControl = train.control)
print (swiss.boot)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 47, 47, 47, 47, 47, 47, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 8.437493 0.5887206 6.772303
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Leave one out cross validation (LOOCV)
train.control <- trainControl(method = "LOOCV")
swiss.loocv <- train(Fertility ~ ., data = swiss, method = "lm", trControl = train.control)
print (swiss.loocv)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.738618 0.6128307 6.116021
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# 10-fold cross-validation
train.control <- trainControl(method = "cv", number=10)
swiss.cv <- train(Fertility ~ ., data = swiss, method = "lm", trControl = train.control)
print (swiss.cv)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 43, 42, 43, 43, 44, 41, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.278777 0.7008316 6.080132
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# 10-fold cross-validation repeated 3 times
train.control <- trainControl(method = "repeatedcv", number=10, repeats=3)
swiss.repeatedcv <- train(Fertility ~ ., data = swiss, method = "lm", trControl = train.control)
print (swiss.repeatedcv)
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 42, 41, 43, 42, 43, 41, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.320856 0.7033335 6.156657
##
## Tuning parameter 'intercept' was held constant at a value of TRUE