Probably the simplest supervised classification method is Linear Discriminant Analysis (LDA). It rests on the assumption that the variables in each class follow a multivariate Gaussian distribution and, moreover, that the covariance matrices of all classes are equal. It can then be proven that the two classes are best separated by a hyperplane (i.e., a straight line in two dimensions) given by the equation
\(\ln \frac{\pi_1}{\pi_2} - \frac{1}{2}(\mathbf{m}_1-\mathbf{m}_2)^T \mathbf{S}^{-1}(\mathbf{m}_1+\mathbf{m}_2) + (\mathbf{m}_1-\mathbf{m}_2)^T \mathbf{S}^{-1} \mathbf{x} = 0\),
where \(\pi_1,\pi_2\) are the a priori probabilities of the two classes, \(\mathbf{m}_1, \mathbf{m}_2\) are the means of the distributions, and \(\mathbf{S}\) is the common covariance matrix.
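The boundary coefficients can be computed directly from the discriminant functions; a minimal sketch in base R, using the same means, priors, and shared covariance matrix as the simulated example that follows:

```r
# Shared covariance matrix, class means, and a priori probabilities
S  <- matrix (c (3, 0, 0, 3), 2, 2)
m1 <- c (2, 2)
m2 <- c (-1, -1)
p1 <- 0.75
p2 <- 0.25

Sinv <- solve (S)
a  <- as.vector (Sinv %*% (m1 - m2))   # coefficients of x in the linear term
a0 <- log (p1 / p2) -
      0.5 * drop (t (m1 - m2) %*% Sinv %*% (m1 + m2))  # constant term
# The separating line is a0 + a[1]*x + a[2]*y = 0
```

Points with \(a_0 + \mathbf{a}^T\mathbf{x} > 0\) fall on the class-1 side; in particular, the class means themselves land on opposite sides of this line.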
We start with a simple example of two classes, each represented by points drawn from a two-dimensional normal distribution (the mvrnorm() function from the MASS library):
library(MASS)
library(ggplot2)
S <- matrix (c(3,0,0,3),2,2)
m1 <- c (2,2)
m2 <- c (-1,-1)
n1 <- 60
n2 <- 20
n <- n1 + n2
v1 <- mvrnorm (n1, m1, S)
v2 <- mvrnorm (n2, m2, S)
df <- data.frame (x = c (v1[,1],v2[,1]),
y = c (v1[,2],v2[,2]),
class = as.factor (rep (c ("1","2"), c (n1,n2))))
theme_set (theme_bw())
ggplot(df) + geom_point(aes(x = x, y = y, color = class), shape = 19, size = 3)
In R, LDA is implemented by the lda() function from the MASS library, to which we pass a formula describing the dependence between the columns of the data frame (e.g., z ~ x + y). As a result, we obtain the a priori probabilities (class frequencies), the expected values of the points belonging to each class, and the direction onto which the observations should be projected to achieve the best separation of the classes. Unfortunately, the optimal split point is not returned, and you have to determine it yourself by estimating the covariance matrix and using the formula above.
df.lda <- lda (class ~ x + y, data = df)
df.lda
## Call:
## lda(class ~ x + y, data = df)
##
## Prior probabilities of groups:
## 1 2
## 0.75 0.25
##
## Group means:
## x y
## 1 2.3803385 1.554330
## 2 -0.2358954 -1.283169
##
## Coefficients of linear discriminants:
## LD1
## x -0.3153548
## y -0.5088512
project <- as.matrix (df[,1:2]) %*% df.lda$scaling
df$project <- project[,1]
pp <- 0  # split point (placeholder; the exact value follows from the formula above)
ggplot (df) +
geom_point (aes (x = project, y = 0, color = class), shape = 21, size = 5) +
geom_vline (xintercept = pp)
Of course, the “best” separation of classes does not mean that we will obtain a division in which all points from class “1” are classified as belonging to that group, as can be seen in the figure above. Using the klaR library and the partimat() function implemented in it, you can obtain an explicit straight line separating the classes and quickly verify which points have been misclassified. We will quantitatively evaluate the classification accuracy using a confusion matrix.
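Before turning to caret, a confusion matrix can also be tabulated by hand with table(); a self-contained sketch that re-simulates the two-class data from above (the seed and the names sim, fit are illustrative assumptions):

```r
library (MASS)
set.seed (1)
S  <- matrix (c (3, 0, 0, 3), 2, 2)
v1 <- mvrnorm (60, c (2, 2), S)
v2 <- mvrnorm (20, c (-1, -1), S)
sim <- data.frame (x = c (v1[, 1], v2[, 1]),
                   y = c (v1[, 2], v2[, 2]),
                   class = as.factor (rep (c ("1", "2"), c (60, 20))))
fit  <- lda (class ~ x + y, data = sim)
pred <- predict (fit, sim)$class
cm   <- table (Predicted = pred, Reference = sim$class)  # confusion matrix
acc  <- sum (diag (cm)) / sum (cm)                       # overall accuracy
```

The diagonal of cm counts the correctly classified observations of each class; caret's confusionMatrix() adds the derived statistics (sensitivity, specificity, kappa, etc.) on top of the same table.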
library(caret)
df.pred <- predict (df.lda, df)
df.conf <- confusionMatrix (df.pred$class, df$class, positive = "1")
df.conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 55 6
## 2 5 14
##
## Accuracy : 0.8625
## 95% CI : (0.7673, 0.9293)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.01064
##
## Kappa : 0.6271
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.9167
## Specificity : 0.7000
## Pos Pred Value : 0.9016
## Neg Pred Value : 0.7368
## Prevalence : 0.7500
## Detection Rate : 0.6875
## Detection Prevalence : 0.7625
## Balanced Accuracy : 0.8083
##
## 'Positive' Class : 1
##
When the assumption of equal covariance matrices in all groups is not met, the effectiveness of LDA is much lower, and it is then better to use Quadratic Discriminant Analysis (QDA), which estimates a separate covariance matrix in each class. We will compare both methods on an artificial set containing observations from three groups with different covariance matrices.
S1 <- matrix (c (4,2,2,4), 2, 2)
S2 <- matrix (c (2, 0.5, 0.5, 1), 2, 2)    # covariance matrices must be symmetric
S3 <- matrix (c (1, -0.5, -0.5, 4), 2, 2)
m1 <- c (3,4)
m2 <- c (0,-1)
m3 <- c (4,-3)
n1 <- 120
n2 <- 80
n3 <- 100
n <- n1 + n2 + n3
v1 <- mvrnorm (n1, m1, S1)
v2 <- mvrnorm (n2, m2, S2)
v3 <- mvrnorm (n3, m3, S3)
df <- data.frame (x = c (v1[,1], v2[,1], v3[,1]),
y = c (v1[,2], v2[,2], v3[,2]),
class = as.factor (rep (c ("1", "2", "3"), c (n1, n2, n3))))
ggplot (df) + geom_point (aes(x = x, y = y, color = class), shape = 19, size = 3)
# Train LDA and QDA and test with 10-fold cross-validation
trContr <- trainControl (method = "cv", number = 10)
df.lda.cv <- train (class ~ x + y, data = df, method="lda", trControl = trContr)
df.qda.cv <- train (class ~ x + y, data = df, method="qda", trControl = trContr)
df.lda.cv
## Linear Discriminant Analysis
##
## 300 samples
## 2 predictor
## 3 classes: '1', '2', '3'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 270, 270, 270, 270, 270, 270, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9266667 0.8895769
df.qda.cv
## Quadratic Discriminant Analysis
##
## 300 samples
## 2 predictor
## 3 classes: '1', '2', '3'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 270, 270, 270, 270, 270, 270, ...
## Resampling results:
##
## Accuracy Kappa
## 0.94 0.9089515
# Observations projected onto separating lines found by LDA
df.lda <- lda (class ~ x + y, data = df)
df.lda
## Call:
## lda(class ~ x + y, data = df)
##
## Prior probabilities of groups:
## 1 2 3
## 0.4000000 0.2666667 0.3333333
##
## Group means:
## x y
## 1 2.9653630 3.839433
## 2 0.1190764 -1.019548
## 3 4.0480741 -2.970206
##
## Coefficients of linear discriminants:
## LD1 LD2
## x 0.3164014 0.60203401
## y -0.6065254 0.09468834
##
## Proportion of trace:
## LD1 LD2
## 0.797 0.203
project <- as.matrix (df[,1:2]) %*% df.lda$scaling
df.project <- data.frame (project, class = df$class)
g <- ggplot (df.project, aes (x = LD1, y = LD2))
g + geom_point (aes (color = class), size = 3)
Support Vector Machines (SVM) are a relatively modern (dating from the 1990s) supervised learning method. Their main feature is a new approach to the problem of the separating hyperplane: the task is to find the widest possible margin separating points belonging to different classes. The margin hyperplanes themselves must pass through specific elements of the training sample, called support vectors. If we denote an observation vector by \(\mathbf{x}_i\) and its class by \(y_i \in \{-1,1\}\), the normal vector of the separating hyperplane can be written as
\(\mathbf{w} = \sum_{i=1}^{n} y_i \alpha_i \mathbf{x}_i\),
where the coefficients \(\alpha_i\) are non-zero only for the support vectors. The intercept equals
\(\rho = \frac{1}{2}(\mathbf{w}\cdot\mathbf{x}_1^* + \mathbf{w}\cdot\mathbf{x}_{-1}^*)\),
where \(\mathbf{x}_1^*, \mathbf{x}_{-1}^*\) are support vectors of the two classes lying on the margin hyperplanes. As an illustration, we will use a simple example with 12 observations from two classes.
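As a quick numeric illustration of the intercept formula, with assumed values for \(\mathbf{w}\) and the two marginal support vectors (w_ex, sv_pos, sv_neg are made-up names and numbers, not taken from any fitted model):

```r
# Assumed normal vector and one support vector from each class,
# both lying on their respective margin hyperplanes
w_ex   <- c (-1.3, -1.1)
sv_pos <- c (3.5, 2.0)   # support vector of class  1
sv_neg <- c (2.0, 5.5)   # support vector of class -1
rho    <- 0.5 * (sum (w_ex * sv_pos) + sum (w_ex * sv_neg))  # intercept
```

The intercept is simply the average of the projections of the two marginal support vectors onto \(\mathbf{w}\), which places the separating hyperplane midway between the margin hyperplanes.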
x <- c (1.0, 1.5, 1.0, 1.5, 2.5, 3.5, 2.0, 3.0, 3.5, 4.5, 5.0, 5.0)
y <- c (2.0, 1.0, 4.0, 3.0, 2.0, 2.0, 5.5, 5.0, 4.0, 4.5, 2.0, 3.5)
df <- data.frame (x = x,
y = y,
class = as.factor (rep (c("-1", "1"), each=6)))
g <- ggplot (df) + geom_point (aes (x = x, y = y, color = class), shape = 19, size = 3)
g
For classification we will use the svm() function from the e1071 library. It has many parameters, including:
type - the type of machine we want to use; there are five options to choose from, three for classification and two for regression,
kernel - the kernel used for training and prediction; the available choices are linear, polynomial, radial and sigmoid,
scale - a logical variable deciding whether the data should be scaled (subtracting the mean and dividing by the standard deviation); the default is TRUE,
cost - a parameter regulating the penalty for exceeding the margin boundary.
library(e1071)
df.svm <- svm (class ~ x + y,
data = df,
type = "C-classification",
kernel = "linear",
cost = 10,
scale = FALSE)
# Support vectors
print (df.svm$SV)
## x y
## 6 3.5 2.0
## 7 2.0 5.5
## 11 5.0 2.0
# Alpha coefficients y_i * alpha_i (the sign encodes the class)
print (df.svm$coefs)
## [,1]
## [1,] 1.5416840
## [2,] -0.3264949
## [3,] -1.2151892
As one can see, only three points are support vectors (SV), and the sign of each \(\alpha\) coefficient immediately encodes the class of the corresponding support vector. We will determine the \(\mathbf{w}\) vector and use it to plot the separating hyperplane and the margin hyperplanes on the graph.
# Normal vector w of the separating hyperplane and the intercept
w <- t (df.svm$SV) %*% df.svm$coefs
print (w)
b <- df.svm$rho
print (b)
## [,1]
## x -1.333041
## y -1.142732
## [1] -7.950963
xx <- 0:6
hyper <- data.frame (xx = xx,
r = -w[1]/w[2]*xx + b/w[2],
m1 = -w[1]/w[2]*xx + (b+1)/w[2],
m2 = -w[1]/w[2]*xx + (b-1)/w[2])
# Separating hyperplane
r <- geom_line (data = hyper, aes (x=xx, y=r), color = "red")
# Margin hyperplanes
m1 <- geom_line (data = hyper, aes (x=xx, y=m1))
m2 <- geom_line (data = hyper, aes (x=xx, y=m2))
g + r + m1 + m2
Of course, perfect class separability is desirable, but not always achievable. In such cases, the value of the “penalty” for exceeding the margins plays a key role.
S <- matrix (c (3,1,1,3),2,2)
m1 <- c (3,3)
m2 <- c (-1,-1)
n1 <- 30
n2 <- 20
n <- n1 + n2
v1 <- mvrnorm (n1, m1, S)
v2 <- mvrnorm (n2, m2, S)
df <- data.frame (x = c (v1[,1], v2[,1]),
y = c (v1[,2], v2[,2]),
class = as.factor (rep (c("1","2"), c(n1,n2))))
g <- ggplot(df) + geom_point (aes (x = x, y = y, color = class), shape = 19, size = 3)
g
df.svm <- svm (class ~ x + y,
data = df,
type = "C-classification",
kernel = "linear",
cost = 10,
scale = FALSE)
# Support vectors and alpha coefficients
print (cbind (df.svm$SV, df.svm$coefs))
## x y
## 12 1.9287689 0.09670212 10.00000
## 15 0.4927069 3.12663755 4.61808
## 18 0.9622093 -1.02128873 10.00000
## 21 1.6090437 0.78932677 10.00000
## 28 -0.1890131 0.98394459 10.00000
## 33 -1.9260764 3.41839590 -6.28876
## 36 0.8761095 2.41351080 -10.00000
## 40 -0.4542006 2.04466948 -10.00000
## 45 2.3546441 -3.52691678 -8.32932
## 46 3.2773316 -1.43270609 -10.00000
# Normal vector w and intercept for the soft-margin case
w <- t (df.svm$SV) %*% df.svm$coefs
print (w)
b <- df.svm$rho
print (b)
## [,1]
## x 0.8930932
## y 0.5505165
## [1] 1.161434
xx <- -4:4
hyper <- data.frame (xx = xx,
r = -w[1]/w[2]*xx + b/w[2],
m1 = -w[1]/w[2]*xx + (b+1)/w[2],
m2 = -w[1]/w[2]*xx + (b-1)/w[2])
# Separating hyperplane
r <- geom_line (data = hyper, aes (x=xx, y=r), color = "red")
# Margin hyperplanes
m1 <- geom_line (data = hyper, aes (x=xx, y=m1))
m2 <- geom_line (data = hyper, aes (x=xx, y=m2))
g + r + m1 + m2
The choice of an appropriate kernel affects the shape of the margin hyperplanes. Below is an example of using a radial kernel.
n <- 50
s <- 0.25
df <- data.frame (x = rnorm(n, sd = s),
y = rnorm(n, sd = s))
df$class <- ifelse (sqrt (df$x^2+df$y^2) < s, "1", "2")
ggplot(df) + geom_point (aes (x = x, y = y, color = class), shape = 19, size = 3)
df.svm <- svm (class ~ x + y,
data = df,
type = "C-classification",
kernel = "radial",
cost = 100,
scale = FALSE)
# Support vectors and alpha coefficients
print (cbind (df.svm$SV, df.svm$coefs))
## x y
## 4 -0.165310306 0.24918957 100.00000
## 7 -0.187382595 0.23177701 100.00000
## 15 -0.004327054 -0.25178846 100.00000
## 19 0.226950408 -0.12589336 100.00000
## 20 -0.080222352 -0.24190730 100.00000
## 34 0.043514463 -0.25048192 100.00000
## 35 -0.257640065 0.07303283 100.00000
## 38 0.116533543 0.24184943 100.00000
## 41 -0.299725065 0.04736016 100.00000
## 43 0.031132833 0.29126741 100.00000
## 48 0.046874603 0.34197370 13.46842
## 50 0.001214229 -0.29315410 93.33992
## 2 -0.218180525 0.06082180 -100.00000
## 9 -0.209028560 0.10189657 -100.00000
## 11 -0.003372804 -0.21904507 -100.00000
## 14 0.048521603 0.16902783 -100.00000
## 16 -0.093232405 0.14712899 -100.00000
## 18 0.069430423 -0.17241849 -100.00000
## 22 0.056805165 -0.14045407 -100.00000
## 27 0.150049811 0.01558576 -28.41748
## 30 -0.209307956 0.02210783 -78.39086
## 42 -0.124198884 0.15875457 -100.00000
## 45 -0.121178744 -0.20260750 -100.00000
## 46 0.146837725 0.11333645 -100.00000
# Prediction for a dense point grid using a trained model
df.predict <- expand.grid (x = seq (-1, 1, 0.01), y = seq (-1, 1, 0.01))
df.predict$class <- predict (df.svm, df.predict)
ggplot (df.predict) +
geom_tile (aes (x = x, y = y, fill = class), alpha = 0.25) +
geom_point (data = df, aes (x = x, y = y, color = class), shape = 19, size = 3)
A classification tree is a very effective supervised classification algorithm that consists of two elements:
An example tree structure is shown below. We distinguish the following tree elements:
Conventionally, trees are drawn growing from top to bottom. There is exactly one path from the root to each leaf. The entire training sample (TS) is concentrated in the root, and its elements are moved down along the branches. The decision on which branch a TS element follows is made at the nodes using splitting rules. Below is a simple example of data that can be classified very easily using a tree with just five nodes.
set.seed(123)
n <- 30
m1 <- c (2, 2)
S1 <- matrix (c (1,0,0,1), 2, 2)
v1 <- mvrnorm (n, m1, S1)
r <- 6
x2 <- seq (0, r, length.out = n)
y2 <- sqrt(r^2-x2^2)+runif(n,-0.5,0.5)
df <- data.frame (x = c (v1[,1], x2),
y = c (v1[,2], y2),
class = as.factor (rep (c ("1","2"), each = n)))
ggplot (df, aes (x = x, y = y)) +
geom_point (aes (color = class), size = 2) +
geom_vline (xintercept = 4) +
geom_hline (yintercept = 4)
In the above chart it is very easy to see simple splitting rules based on the variables \(x\) and \(y\). A tree trained on such a training sample could look like this:
The partition of the training subsample is based only on the TS elements present in a given node and consists of splitting the subsample into the two parts that go to the child nodes. The best partition is the one that maximizes the difference in class diversity between the parent and the resulting subsets. To find it, one needs:
Let us consider a classification problem with classes \(k = 1,2,3,\dots,g\). Let \(n(m)\) be the sample size at node \(m\) and \(n_j(m)\) be the number of observations from class \(j\) at node \(m\). Then the fraction of observations from class \(j\) at node \(m\) can be written as \(\hat{p}_j(m) = n_j(m)/n(m)\). Observations in node \(m\) are assigned to the class \(k\) that satisfies the relationship
\(k(m) = \arg\max_{j = 1,2,\dots,g} \hat{p}_j(m)\).
If node \(m\) is a leaf, \(k(m)\) is the final result of the classification of observations in node \(m\), otherwise \(k(m)\) is information about which class is most represented in the node.
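The majority rule above is a one-liner in R; for instance, with assumed class counts \(n_j(m)\) at a node (the values below are made up for illustration):

```r
nj    <- c (30, 8, 2)        # assumed class counts n_j(m) at node m
p_hat <- nj / sum (nj)       # class fractions p_hat_j(m)
k_m   <- which.max (p_hat)   # majority class k(m)
```

Here the first class dominates the node, so \(k(m) = 1\).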
A reasonable diversity measure is one that takes the value 0 when all observations belong to one class and its maximum value when the classes are uniformly distributed in the node, i.e. \(\hat{p}_1(m) = \hat{p}_2(m) = \dots = \hat{p}_g(m) = 1/g\). Examples of diversity measures are the misclassification error, the Gini index, and the entropy.
Let us consider the case where nodes \(m_L\) and \(m_R\) are children of node \(m\), and \(\hat{p}_L = n(m_L)/n(m)\) is the fraction of TS elements that from node \(m\) passed to node \(m_L\). Accordingly, \(\hat{p}_R = 1-\hat{p}_L\) is the fraction of TS elements that passed from node \(m\) to node \(m_R\). The difference between the class diversity of parent node and child nodes is defined as
\(\Delta Q(m) = Q(m) - Q(m_L,m_R)\),
where \(Q(m_L,m_R) = \hat{p}_L Q(m_L) + \hat{p}_R Q(m_R)\) is the joint measure of class diversity of the children of node \(m\). It is generally accepted that the Gini index and entropy are more sensitive to changes in class diversity at the nodes. The algorithm checks in turn all possible monotonic splits (i.e. of the form \(x^{(l)} \leqslant c\), where \(c\) is some value of the variable observed in the TS) for each observation attribute and selects the split that maximizes \(\Delta Q(m)\).
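The split-selection rule can be sketched in a few lines of R; gini() and delta_q() below are illustrative helper names, not functions from any package:

```r
# Gini index of a vector of class labels
gini <- function (cls) {
  p <- as.numeric (table (cls)) / length (cls)
  sum (p * (1 - p))
}
# Decrease of diversity Delta Q(m) for a split given by a logical vector
delta_q <- function (cls, goes_left) {
  pL <- mean (goes_left)                     # fraction sent to the left child
  gini (cls) - (pL * gini (cls[goes_left]) +
                (1 - pL) * gini (cls[!goes_left]))
}
cls <- rep (c ("1", "2"), c (6, 4))
dq  <- delta_q (cls, cls == "1")   # a perfect split: both children are pure
```

For this perfect split the children have zero diversity, so \(\Delta Q(m)\) equals the full Gini index of the parent, \(1 - (0.6^2 + 0.4^2) = 0.48\).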
Even if you know the tree-growing algorithm, it is not obvious when to stop. One might naively think that the tree should be expanded until only observations from one class remain in the leaves (i.e. the class diversity in the leaves is zero). However, this approach leads to overfitting the tree to the training sample. The most common tree-construction scheme is:
We expand the tree as long as possible or until a certain natural rule of stopping construction is met, e.g.
We check the classification ability of the tree on a test or validation sample.
We prune the tree to obtain the highest classification ability.
There are various tree pruning algorithms, e.g. the cost-complexity algorithm, which seeks a compromise between the cost of misclassification and the cost of building the tree (proportional to the number of leaves). The goal of the algorithm is to minimize \(R_\alpha(\mathcal{T}) = \rho(\mathcal{T}) + \alpha N_l(\mathcal{T})\), where \(\mathcal{T}\) denotes the given tree, \(\rho(\mathcal{T}) = 1 - ACC\) is the fraction of misclassifications of the tree \(\mathcal{T}\), \(N_l(\mathcal{T})\) is the number of leaves in this tree, and \(\alpha\) is the complexity factor, which is a parameter of the algorithm.
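In rpart this trade-off is exposed through the complexity parameter cp; a sketch of pruning back an intentionally overgrown tree (the use of the built-in iris data here is purely illustrative):

```r
library (rpart)
# Grow a deliberately deep tree by relaxing the stopping rules
fit <- rpart (Species ~ ., data = iris, minsplit = 2, cp = 0.001)
# cptable lists, for each value of cp, the cross-validated error xerror
best_cp <- fit$cptable[which.min (fit$cptable[, "xerror"]), "CP"]
pruned  <- prune (fit, cp = best_cp)   # cut branches that do not pay off
```

Choosing cp by the minimum cross-validated error is a common heuristic; the pruned tree can only be smaller than (or equal to) the one it was cut from.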
In R, a classification tree can be created quickly with the rpart() function from the rpart library. The minsplit parameter specifies the minimum number of elements that must be present in a node for it to be split, while minbucket sets the minimum number of observations a node must contain to become a leaf. Other parameters include cp (the complexity parameter) and maxdepth (the largest distance between the root and a leaf).
library (rpart)
# Train a classification tree
tree <- rpart (class ~ x + y, df, minsplit = 2, minbucket = 1)
tree
## n= 60
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 60 30 1 (0.5000000 0.5000000)
## 2) y< 3.91716 38 8 1 (0.7894737 0.2105263)
## 4) x< 4.050238 30 0 1 (1.0000000 0.0000000) *
## 5) x>=4.050238 8 0 2 (0.0000000 1.0000000) *
## 3) y>=3.91716 22 0 2 (0.0000000 1.0000000) *
summary (tree)
## Call:
## rpart(formula = class ~ x + y, data = df, minsplit = 2, minbucket = 1)
## n= 60
##
## CP nsplit rel error xerror xstd
## 1 0.7333333 0 1.0000000 1.4000000 0.1183216
## 2 0.2666667 1 0.2666667 0.5000000 0.1118034
## 3 0.0100000 2 0.0000000 0.2333333 0.0828877
##
## Variable importance
## y x
## 55 45
##
## Node number 1: 60 observations, complexity param=0.7333333
## predicted class=1 expected loss=0.5 P(node) =1
## class counts: 30 30
## probabilities: 0.500 0.500
## left son=2 (38 obs) right son=3 (22 obs)
## Primary splits:
## y < 3.91716 to the left, improve=17.368420, (0 missing)
## x < 3.287871 to the left, improve= 7.511111, (0 missing)
## Surrogate splits:
## x < 0.4486612 to the right, agree=0.667, adj=0.091, (0 split)
##
## Node number 2: 38 observations, complexity param=0.2666667
## predicted class=1 expected loss=0.2105263 P(node) =0.6333333
## class counts: 30 8
## probabilities: 0.789 0.211
## left son=4 (30 obs) right son=5 (8 obs)
## Primary splits:
## x < 4.050238 to the left, improve=12.631580, (0 missing)
## y < 3.296093 to the left, improve= 1.194079, (0 missing)
##
## Node number 3: 22 observations
## predicted class=2 expected loss=0 P(node) =0.3666667
## class counts: 0 22
## probabilities: 0.000 1.000
##
## Node number 4: 30 observations
## predicted class=1 expected loss=0 P(node) =0.5
## class counts: 30 0
## probabilities: 1.000 0.000
##
## Node number 5: 8 observations
## predicted class=2 expected loss=0 P(node) =0.1333333
## class counts: 0 8
## probabilities: 0.000 1.000
The resulting drawing is not very neat. To improve it, we will use the rpart.plot() function from the rpart.plot library.
Let's test the classification tree on more demanding data, e.g. the Pima dataset from the MASS library.
pima.tree <- rpart (type ~ ., data = Pima.tr)
summary (pima.tree)
## Call:
## rpart(formula = type ~ ., data = Pima.tr)
## n= 200
##
## CP nsplit rel error xerror xstd
## 1 0.22058824 0 1.0000000 1.0000000 0.09851844
## 2 0.16176471 1 0.7794118 0.8529412 0.09437007
## 3 0.07352941 2 0.6176471 0.8088235 0.09286273
## 4 0.05882353 3 0.5441176 0.8235294 0.09337946
## 5 0.01470588 4 0.4852941 0.7647059 0.09122390
## 6 0.01000000 7 0.4411765 0.8823529 0.09530501
##
## Variable importance
## glu age bmi bp ped npreg skin
## 38 13 11 11 11 8 8
##
## Node number 1: 200 observations, complexity param=0.2205882
## predicted class=No expected loss=0.34 P(node) =1
## class counts: 132 68
## probabilities: 0.660 0.340
## left son=2 (109 obs) right son=3 (91 obs)
## Primary splits:
## glu < 123.5 to the left, improve=19.624700, (0 missing)
## age < 28.5 to the left, improve=15.016410, (0 missing)
## npreg < 6.5 to the left, improve=10.465630, (0 missing)
## bmi < 27.35 to the left, improve= 9.727105, (0 missing)
## skin < 22.5 to the left, improve= 8.201159, (0 missing)
## Surrogate splits:
## age < 30.5 to the left, agree=0.685, adj=0.308, (0 split)
## bp < 77 to the left, agree=0.650, adj=0.231, (0 split)
## npreg < 6.5 to the left, agree=0.640, adj=0.209, (0 split)
## skin < 32.5 to the left, agree=0.635, adj=0.198, (0 split)
## bmi < 30.85 to the left, agree=0.575, adj=0.066, (0 split)
##
## Node number 2: 109 observations, complexity param=0.01470588
## predicted class=No expected loss=0.1376147 P(node) =0.545
## class counts: 94 15
## probabilities: 0.862 0.138
## left son=4 (74 obs) right son=5 (35 obs)
## Primary splits:
## age < 28.5 to the left, improve=3.2182780, (0 missing)
## npreg < 6.5 to the left, improve=2.4578310, (0 missing)
## bmi < 33.5 to the left, improve=1.6403660, (0 missing)
## bp < 59 to the left, improve=0.9851960, (0 missing)
## skin < 24 to the left, improve=0.8342926, (0 missing)
## Surrogate splits:
## npreg < 4.5 to the left, agree=0.798, adj=0.371, (0 split)
## bp < 77 to the left, agree=0.734, adj=0.171, (0 split)
## skin < 36.5 to the left, agree=0.725, adj=0.143, (0 split)
## bmi < 38.85 to the left, agree=0.716, adj=0.114, (0 split)
## glu < 66 to the right, agree=0.688, adj=0.029, (0 split)
##
## Node number 3: 91 observations, complexity param=0.1617647
## predicted class=Yes expected loss=0.4175824 P(node) =0.455
## class counts: 38 53
## probabilities: 0.418 0.582
## left son=6 (35 obs) right son=7 (56 obs)
## Primary splits:
## ped < 0.3095 to the left, improve=6.528022, (0 missing)
## bmi < 28.65 to the left, improve=6.473260, (0 missing)
## skin < 19.5 to the left, improve=4.778504, (0 missing)
## glu < 166 to the left, improve=4.104532, (0 missing)
## age < 39.5 to the left, improve=3.607390, (0 missing)
## Surrogate splits:
## glu < 126.5 to the left, agree=0.670, adj=0.143, (0 split)
## bp < 93 to the right, agree=0.659, adj=0.114, (0 split)
## bmi < 27.45 to the left, agree=0.659, adj=0.114, (0 split)
## npreg < 9.5 to the right, agree=0.648, adj=0.086, (0 split)
## skin < 20.5 to the left, agree=0.637, adj=0.057, (0 split)
##
## Node number 4: 74 observations
## predicted class=No expected loss=0.05405405 P(node) =0.37
## class counts: 70 4
## probabilities: 0.946 0.054
##
## Node number 5: 35 observations, complexity param=0.01470588
## predicted class=No expected loss=0.3142857 P(node) =0.175
## class counts: 24 11
## probabilities: 0.686 0.314
## left son=10 (9 obs) right son=11 (26 obs)
## Primary splits:
## glu < 90 to the left, improve=2.3934070, (0 missing)
## bmi < 33.4 to the left, improve=1.3714290, (0 missing)
## bp < 68 to the right, improve=0.9657143, (0 missing)
## ped < 0.334 to the left, improve=0.9475564, (0 missing)
## skin < 39.5 to the right, improve=0.7958592, (0 missing)
## Surrogate splits:
## ped < 0.1795 to the left, agree=0.8, adj=0.222, (0 split)
##
## Node number 6: 35 observations, complexity param=0.05882353
## predicted class=No expected loss=0.3428571 P(node) =0.175
## class counts: 23 12
## probabilities: 0.657 0.343
## left son=12 (27 obs) right son=13 (8 obs)
## Primary splits:
## glu < 166 to the left, improve=3.438095, (0 missing)
## ped < 0.2545 to the right, improve=1.651429, (0 missing)
## skin < 25.5 to the left, improve=1.651429, (0 missing)
## npreg < 3.5 to the left, improve=1.078618, (0 missing)
## bp < 73 to the right, improve=1.078618, (0 missing)
## Surrogate splits:
## bp < 94.5 to the left, agree=0.8, adj=0.125, (0 split)
##
## Node number 7: 56 observations, complexity param=0.07352941
## predicted class=Yes expected loss=0.2678571 P(node) =0.28
## class counts: 15 41
## probabilities: 0.268 0.732
## left son=14 (11 obs) right son=15 (45 obs)
## Primary splits:
## bmi < 28.65 to the left, improve=5.778427, (0 missing)
## age < 39.5 to the left, improve=3.259524, (0 missing)
## npreg < 6.5 to the left, improve=2.133215, (0 missing)
## ped < 0.8295 to the left, improve=1.746894, (0 missing)
## skin < 22 to the left, improve=1.474490, (0 missing)
## Surrogate splits:
## skin < 19.5 to the left, agree=0.839, adj=0.182, (0 split)
##
## Node number 10: 9 observations
## predicted class=No expected loss=0 P(node) =0.045
## class counts: 9 0
## probabilities: 1.000 0.000
##
## Node number 11: 26 observations, complexity param=0.01470588
## predicted class=No expected loss=0.4230769 P(node) =0.13
## class counts: 15 11
## probabilities: 0.577 0.423
## left son=22 (19 obs) right son=23 (7 obs)
## Primary splits:
## bp < 68 to the right, improve=1.6246390, (0 missing)
## bmi < 33.4 to the left, improve=1.6173080, (0 missing)
## npreg < 6.5 to the left, improve=0.9423077, (0 missing)
## skin < 39.5 to the right, improve=0.6923077, (0 missing)
## ped < 0.334 to the left, improve=0.4923077, (0 missing)
## Surrogate splits:
## glu < 94.5 to the right, agree=0.808, adj=0.286, (0 split)
## ped < 0.2105 to the right, agree=0.808, adj=0.286, (0 split)
##
## Node number 12: 27 observations
## predicted class=No expected loss=0.2222222 P(node) =0.135
## class counts: 21 6
## probabilities: 0.778 0.222
##
## Node number 13: 8 observations
## predicted class=Yes expected loss=0.25 P(node) =0.04
## class counts: 2 6
## probabilities: 0.250 0.750
##
## Node number 14: 11 observations
## predicted class=No expected loss=0.2727273 P(node) =0.055
## class counts: 8 3
## probabilities: 0.727 0.273
##
## Node number 15: 45 observations
## predicted class=Yes expected loss=0.1555556 P(node) =0.225
## class counts: 7 38
## probabilities: 0.156 0.844
##
## Node number 22: 19 observations
## predicted class=No expected loss=0.3157895 P(node) =0.095
## class counts: 13 6
## probabilities: 0.684 0.316
##
## Node number 23: 7 observations
## predicted class=Yes expected loss=0.2857143 P(node) =0.035
## class counts: 2 5
## probabilities: 0.286 0.714
pima.predict <- predict (pima.tree, newdata = Pima.te, type = "class")
pima.true <- Pima.te[,8]
pima.cm <- confusionMatrix (pima.predict, pima.true, positive = "Yes")
pima.cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 182 48
## Yes 41 61
##
## Accuracy : 0.7319
## 95% CI : (0.6808, 0.7788)
## No Information Rate : 0.6717
## P-Value [Acc > NIR] : 0.01042
##
## Kappa : 0.382
##
## Mcnemar's Test P-Value : 0.52478
##
## Sensitivity : 0.5596
## Specificity : 0.8161
## Pos Pred Value : 0.5980
## Neg Pred Value : 0.7913
## Prevalence : 0.3283
## Detection Rate : 0.1837
## Detection Prevalence : 0.3072
## Balanced Accuracy : 0.6879
##
## 'Positive' Class : Yes
##
Of course, there is nothing stopping you from having more than two classes in the data set. Below is an example of applying a tree to the iris dataset.