Probably the simplest supervised classification method is Linear Discriminant Analysis (LDA). It rests on the assumption that the variables in each class follow a multivariate Gaussian distribution and, moreover, that the covariance matrices of all classes are equal. It can then be proven that the two classes are best separated by a hyperplane (i.e., a straight line in two dimensions) given by the equation
\(\ln \frac{\pi_1}{\pi_2} - \frac{1}{2}(\mathbf{m}_1-\mathbf{m}_2)^T \mathbf{S}^{-1}(\mathbf{m}_1+\mathbf{m}_2) + (\mathbf{m}_1-\mathbf{m}_2)^T \mathbf{S}^{-1} \mathbf{x} = 0\),
where \(\pi_1,\pi_2\) are the a priori probabilities of the two classes, \(\mathbf{m}_1, \mathbf{m}_2\) are the means of the distributions, and \(\mathbf{S}\) is the common covariance matrix.
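The boundary coefficients can be computed directly from the discriminant functions; a minimal sketch in base R, using the same means, priors, and shared covariance matrix as the simulated example that follows:

```r
# Shared covariance matrix, class means, and a priori probabilities
S  <- matrix (c (3, 0, 0, 3), 2, 2)
m1 <- c (2, 2)
m2 <- c (-1, -1)
p1 <- 0.75
p2 <- 0.25

Sinv <- solve (S)
a  <- as.vector (Sinv %*% (m1 - m2))   # coefficients of x in the linear term
a0 <- log (p1 / p2) -
      0.5 * drop (t (m1 - m2) %*% Sinv %*% (m1 + m2))  # constant term
# The separating line is a0 + a[1]*x + a[2]*y = 0
```

Points with \(a_0 + \mathbf{a}^T\mathbf{x} > 0\) fall on the class-1 side; in particular, the class means themselves land on opposite sides of this line.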
We start with a simple example of two classes, each represented by points drawn from a two-dimensional normal distribution (the mvrnorm() function from the MASS library):
library(MASS)
library(ggplot2)
S <- matrix (c(3,0,0,3),2,2)
m1 <- c (2,2)
m2 <- c (-1,-1)
n1 <- 60
n2 <- 20
n <- n1 + n2
v1 <- mvrnorm (n1, m1, S)
v2 <- mvrnorm (n2, m2, S)
df <- data.frame (x = c (v1[,1],v2[,1]),
y = c (v1[,2],v2[,2]),
class = as.factor (rep (c ("1","2"), c (n1,n2))))
theme_set (theme_bw())
ggplot(df) + geom_point(aes(x = x, y = y, color = class), shape = 19, size = 3)
In R, LDA is implemented by the lda() function from the MASS library, to which we pass a formula describing the dependence between the columns of the data frame (e.g., z ~ x + y). As a result, we obtain the a priori probabilities (class frequencies), the expected values of the points belonging to each class, and the direction onto which the observations should be projected to achieve the best separation of the classes. Unfortunately, the optimal split point is not returned, and you have to determine it yourself by estimating the covariance matrix and using the formula above.
df.lda <- lda (class ~ x + y, data = df)
df.lda
## Call:
## lda(class ~ x + y, data = df)
##
## Prior probabilities of groups:
## 1 2
## 0.75 0.25
##
## Group means:
## x y
## 1 2.3803385 1.554330
## 2 -0.2358954 -1.283169
##
## Coefficients of linear discriminants:
## LD1
## x -0.3153548
## y -0.5088512
project <- as.matrix (df[,1:2]) %*% df.lda$scaling
df$project <- project[,1]
pp <- 0  # split point (placeholder; the exact value follows from the formula above)
ggplot (df) +
geom_point (aes (x = project, y = 0, color = class), shape = 21, size = 5) +
geom_vline (xintercept = pp)
Of course, the “best” separation of classes does not mean that we will obtain a division in which all points from class “1” are classified as belonging to that group, as can be seen in the figure above. Using the klaR library and the partimat() function implemented in it, you can obtain an explicit straight line separating the classes and quickly verify which points have been misclassified. We will quantitatively evaluate the classification accuracy using a confusion matrix.
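Before turning to caret, a confusion matrix can also be tabulated by hand with table(); a self-contained sketch that re-simulates the two-class data from above (the seed and the names sim, fit are illustrative assumptions):

```r
library (MASS)
set.seed (1)
S  <- matrix (c (3, 0, 0, 3), 2, 2)
v1 <- mvrnorm (60, c (2, 2), S)
v2 <- mvrnorm (20, c (-1, -1), S)
sim <- data.frame (x = c (v1[, 1], v2[, 1]),
                   y = c (v1[, 2], v2[, 2]),
                   class = as.factor (rep (c ("1", "2"), c (60, 20))))
fit  <- lda (class ~ x + y, data = sim)
pred <- predict (fit, sim)$class
cm   <- table (Predicted = pred, Reference = sim$class)  # confusion matrix
acc  <- sum (diag (cm)) / sum (cm)                       # overall accuracy
```

The diagonal of cm counts the correctly classified observations of each class; caret's confusionMatrix() adds the derived statistics (sensitivity, specificity, kappa, etc.) on top of the same table.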
library(caret)
df.pred <- predict (df.lda, df)
df.conf <- confusionMatrix (df.pred$class, df$class, positive = "1")
df.conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 55 6
## 2 5 14
##
## Accuracy : 0.8625
## 95% CI : (0.7673, 0.9293)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.01064
##
## Kappa : 0.6271
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.9167
## Specificity : 0.7000
## Pos Pred Value : 0.9016
## Neg Pred Value : 0.7368
## Prevalence : 0.7500
## Detection Rate : 0.6875
## Detection Prevalence : 0.7625
## Balanced Accuracy : 0.8083
##
## 'Positive' Class : 1
##
When the assumption of equal covariance matrices in all groups is not met, the effectiveness of LDA is much lower, and it is then better to use Quadratic Discriminant Analysis (QDA), which estimates a separate covariance matrix in each class. We will compare both methods on an artificial set containing observations from three groups with different covariance matrices.
S1 <- matrix (c (4,2,2,4), 2, 2)
S2 <- matrix (c (2, 0.5, 0.5, 1), 2, 2)    # covariance matrices must be symmetric
S3 <- matrix (c (1, -0.5, -0.5, 4), 2, 2)
m1 <- c (3,4)
m2 <- c (0,-1)
m3 <- c (4,-3)
n1 <- 120
n2 <- 80
n3 <- 100
n <- n1 + n2 + n3
v1 <- mvrnorm (n1, m1, S1)
v2 <- mvrnorm (n2, m2, S2)
v3 <- mvrnorm (n3, m3, S3)
df <- data.frame (x = c (v1[,1], v2[,1], v3[,1]),
y = c (v1[,2], v2[,2], v3[,2]),
class = as.factor (rep (c ("1", "2", "3"), c (n1, n2, n3))))
ggplot (df) + geom_point (aes(x = x, y = y, color = class), shape = 19, size = 3)
# Train LDA and QDA and test with 10-fold cross-validation
trContr <- trainControl (method = "cv", number = 10)
df.lda.cv <- train (class ~ x + y, data = df, method="lda", trControl = trContr)
df.qda.cv <- train (class ~ x + y, data = df, method="qda", trControl = trContr)
df.lda.cv
## Linear Discriminant Analysis
##
## 300 samples
## 2 predictor
## 3 classes: '1', '2', '3'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 270, 270, 270, 270, 270, 270, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9266667 0.8895769
df.qda.cv
## Quadratic Discriminant Analysis
##
## 300 samples
## 2 predictor
## 3 classes: '1', '2', '3'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 270, 270, 270, 270, 270, 270, ...
## Resampling results:
##
## Accuracy Kappa
## 0.94 0.9089515
# Observations projected onto separating lines found by LDA
df.lda <- lda (class ~ x + y, data = df)
df.lda
## Call:
## lda(class ~ x + y, data = df)
##
## Prior probabilities of groups:
## 1 2 3
## 0.4000000 0.2666667 0.3333333
##
## Group means:
## x y
## 1 2.9653630 3.839433
## 2 0.1190764 -1.019548
## 3 4.0480741 -2.970206
##
## Coefficients of linear discriminants:
## LD1 LD2
## x 0.3164014 0.60203401
## y -0.6065254 0.09468834
##
## Proportion of trace:
## LD1 LD2
## 0.797 0.203
project <- as.matrix (df[,1:2]) %*% df.lda$scaling
df.project <- data.frame (project, class = df$class)
g <- ggplot (df.project, aes (x = LD1, y = LD2))
g + geom_point (aes (color = class), size = 3)
Support Vector Machines (SVM) are a relatively modern (dating from the 1990s) supervised learning method. Their main feature is a new approach to the problem of the separating hyperplane: the task is to find the widest possible margin separating points belonging to different classes. The margin hyperplanes themselves must pass through specific elements of the training sample, called support vectors. If we denote an observation vector by \(\mathbf{x}_i\) and its class by \(y_i \in \{-1,1\}\), the normal vector of the separating hyperplane can be written as
\(\mathbf{w} = \sum_{i=1}^{n} y_i \alpha_i \mathbf{x}_i\),
where the coefficients \(\alpha_i\) are non-zero only for the support vectors. The intercept equals
\(\rho = \frac{1}{2}(\mathbf{w}\cdot\mathbf{x}_1^* + \mathbf{w}\cdot\mathbf{x}_{-1}^*)\),
where \(\mathbf{x}_1^*, \mathbf{x}_{-1}^*\) are support vectors of the two classes lying on the margin hyperplanes. As an illustration, we will use a simple example with 12 observations from two classes.
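As a quick numeric illustration of the intercept formula, with assumed values for \(\mathbf{w}\) and the two marginal support vectors (w_ex, sv_pos, sv_neg are made-up names and numbers, not taken from any fitted model):

```r
# Assumed normal vector and one support vector from each class,
# both lying on their respective margin hyperplanes
w_ex   <- c (-1.3, -1.1)
sv_pos <- c (3.5, 2.0)   # support vector of class  1
sv_neg <- c (2.0, 5.5)   # support vector of class -1
rho    <- 0.5 * (sum (w_ex * sv_pos) + sum (w_ex * sv_neg))  # intercept
```

The intercept is simply the average of the projections of the two marginal support vectors onto \(\mathbf{w}\), which places the separating hyperplane midway between the margin hyperplanes.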
x <- c (1.0, 1.5, 1.0, 1.5, 2.5, 3.5, 2.0, 3.0, 3.5, 4.5, 5.0, 5.0)
y <- c (2.0, 1.0, 4.0, 3.0, 2.0, 2.0, 5.5, 5.0, 4.0, 4.5, 2.0, 3.5)
df <- data.frame (x = x,
y = y,
class = as.factor (rep (c("-1", "1"), each=6)))
g <- ggplot (df) + geom_point (aes (x = x, y = y, color = class), shape = 19, size = 3)
g
For classification we will use the svm() function from the e1071 library. It has many parameters, including:
type - the type of machine we want to use; there are five options to choose from, three for classification and two for regression,
kernel - the kernel used for training and prediction; the available choices are linear, polynomial, radial and sigmoid,
scale - a logical variable deciding whether the data should be scaled (subtracting the mean and dividing by the standard deviation); the default is TRUE,
cost - a parameter regulating the penalty for exceeding the margin boundary.
library(e1071)
df.svm <- svm (class ~ x + y,
data = df,
type = "C-classification",
kernel = "linear",
cost = 10,
scale = FALSE)
# Support vectors
print (df.svm$SV)
## x y
## 6 3.5 2.0
## 7 2.0 5.5
## 11 5.0 2.0
# Alpha coefficients y_i * alpha_i (the sign encodes the class)
print (df.svm$coefs)
## [,1]
## [1,] 1.5416840
## [2,] -0.3264949
## [3,] -1.2151892
As one can see, only three points are support vectors (SV), and the sign of each \(\alpha\) coefficient immediately encodes the class of the corresponding support vector. We will determine the \(\mathbf{w}\) vector and use it to plot the separating hyperplane and the margin hyperplanes on the graph.
# Normal vector w of the separating hyperplane and the intercept
w <- t (df.svm$SV) %*% df.svm$coefs
print (w)
b <- df.svm$rho
print (b)
## [,1]
## x -1.333041
## y -1.142732
## [1] -7.950963
xx <- 0:6
hyper <- data.frame (xx = xx,
r = -w[1]/w[2]*xx + b/w[2],
m1 = -w[1]/w[2]*xx + (b+1)/w[2],
m2 = -w[1]/w[2]*xx + (b-1)/w[2])
# Separating hyperplane
r <- geom_line (data = hyper, aes (x=xx, y=r), color = "red")
# Margin hyperplanes
m1 <- geom_line (data = hyper, aes (x=xx, y=m1))
m2 <- geom_line (data = hyper, aes (x=xx, y=m2))
g + r + m1 + m2
Of course, perfect class separability is desirable, but not always achievable. In such cases, the value of the “penalty” for exceeding the margins plays a key role.
S <- matrix (c (3,1,1,3),2,2)
m1 <- c (3,3)
m2 <- c (-1,-1)
n1 <- 30
n2 <- 20
n <- n1 + n2
v1 <- mvrnorm (n1, m1, S)
v2 <- mvrnorm (n2, m2, S)
df <- data.frame (x = c (v1[,1], v2[,1]),
y = c (v1[,2], v2[,2]),
class = as.factor (rep (c("1","2"), c(n1,n2))))
g <- ggplot(df) + geom_point (aes (x = x, y = y, color = class), shape = 19, size = 3)
g
df.svm <- svm (class ~ x + y,
data = df,
type = "C-classification",
kernel = "linear",
cost = 10,
scale = FALSE)
# Support vectors and alpha coefficients
print (cbind (df.svm$SV, df.svm$coefs))
## x y
## 12 1.9287689 0.09670212 10.00000
## 15 0.4927069 3.12663755 4.61808
## 18 0.9622093 -1.02128873 10.00000
## 21 1.6090437 0.78932677 10.00000
## 28 -0.1890131 0.98394459 10.00000
## 33 -1.9260764 3.41839590 -6.28876
## 36 0.8761095 2.41351080 -10.00000
## 40 -0.4542006 2.04466948 -10.00000
## 45 2.3546441 -3.52691678 -8.32932
## 46 3.2773316 -1.43270609 -10.00000
# Normal vector w and intercept for the soft-margin case
w <- t (df.svm$SV) %*% df.svm$coefs
print (w)
b <- df.svm$rho
print (b)
## [,1]
## x 0.8930932
## y 0.5505165
## [1] 1.161434
xx <- -4:4
hyper <- data.frame (xx = xx,
r = -w[1]/w[2]*xx + b/w[2],
m1 = -w[1]/w[2]*xx + (b+1)/w[2],
m2 = -w[1]/w[2]*xx + (b-1)/w[2])
# Separating hyperplane
r <- geom_line (data = hyper, aes (x=xx, y=r), color = "red")
# Margin hyperplanes
m1 <- geom_line (data = hyper, aes (x=xx, y=m1))
m2 <- geom_line (data = hyper, aes (x=xx, y=m2))
g + r + m1 + m2
The choice of an appropriate kernel affects the shape of the margin hyperplanes. Below is an example of using a radial kernel.
n <- 50
s <- 0.25
df <- data.frame (x = rnorm(n, sd = s),
y = rnorm(n, sd = s))
df$class <- ifelse (sqrt (df$x^2+df$y^2) < s, "1", "2")
ggplot(df) + geom_point (aes (x = x, y = y, color = class), shape = 19, size = 3)
df.svm <- svm (class ~ x + y,
data = df,
type = "C-classification",
kernel = "radial",
cost = 100,
scale = FALSE)
# Support vectors and alpha coefficients
print (cbind (df.svm$SV, df.svm$coefs))
## x y
## 4 -0.165310306 0.24918957 100.00000
## 7 -0.187382595 0.23177701 100.00000
## 15 -0.004327054 -0.25178846 100.00000
## 19 0.226950408 -0.12589336 100.00000
## 20 -0.080222352 -0.24190730 100.00000
## 34 0.043514463 -0.25048192 100.00000
## 35 -0.257640065 0.07303283 100.00000
## 38 0.116533543 0.24184943 100.00000
## 41 -0.299725065 0.04736016 100.00000
## 43 0.031132833 0.29126741 100.00000
## 48 0.046874603 0.34197370 13.46842
## 50 0.001214229 -0.29315410 93.33992
## 2 -0.218180525 0.06082180 -100.00000
## 9 -0.209028560 0.10189657 -100.00000
## 11 -0.003372804 -0.21904507 -100.00000
## 14 0.048521603 0.16902783 -100.00000
## 16 -0.093232405 0.14712899 -100.00000
## 18 0.069430423 -0.17241849 -100.00000
## 22 0.056805165 -0.14045407 -100.00000
## 27 0.150049811 0.01558576 -28.41748
## 30 -0.209307956 0.02210783 -78.39086
## 42 -0.124198884 0.15875457 -100.00000
## 45 -0.121178744 -0.20260750 -100.00000
## 46 0.146837725 0.11333645 -100.00000
# Prediction for a dense point grid using a trained model
df.predict <- expand.grid (x = seq (-1, 1, 0.01), y = seq (-1, 1, 0.01))
df.predict$class <- predict (df.svm, df.predict)
ggplot (df.predict) +
geom_tile (aes (x = x, y = y, fill = class), alpha = 0.25) +
geom_point (data = df, aes (x = x, y = y, color = class), shape = 19, size = 3)
A classification tree is a very effective supervised classification algorithm that consists of two elements:
An example tree structure is shown below. We distinguish the following tree elements:
Conventionally, trees are drawn growing from top to bottom. There is exactly one path from the root to each leaf. The entire training sample (TS) is concentrated in the root, and its elements are moved down along the branches. The decision on which branch a TS element follows is made at the nodes using splitting rules. Below is a simple example of data that can be classified very easily using a tree with just five nodes.
set.seed(123)
n <- 30
m1 <- c (2, 2)
S1 <- matrix (c (1,0,0,1), 2, 2)
v1 <- mvrnorm (n, m1, S1)
r <- 6
x2 <- seq (0, r, length.out = n)
y2 <- sqrt(r^2-x2^2)+runif(n,-0.5,0.5)
df <- data.frame (x = c (v1[,1], x2),
y = c (v1[,2], y2),
class = as.factor (rep (c ("1","2"), each = n)))
ggplot (df, aes (x = x, y = y)) +
geom_point (aes (color = class), size = 2) +
geom_vline (xintercept = 4) +
geom_hline (yintercept = 4)
In the above chart it is very easy to see simple splitting rules based on the variables \(x\) and \(y\). A tree trained on such a training sample could look like this:
The partition of the training subsample is based only on the TS elements present in a given node and consists of splitting the subsample into the two parts that go to the child nodes. The best partition is the one that maximizes the difference in class diversity between the parent and the resulting subsets. To find it, one needs:
Let us consider a classification problem with classes \(k = 1,2,3,\dots,g\). Let \(n(m)\) be the sample size at node \(m\) and \(n_j(m)\) be the number of observations from class \(j\) at node \(m\). Then the fraction of observations from class \(j\) at node \(m\) can be written as \(\hat{p}_j(m) = n_j(m)/n(m)\). Observations in node \(m\) are assigned to the class \(k\) that satisfies the relationship
\(k(m) = \arg\max_{j = 1,2,\dots,g} \hat{p}_j(m)\).
If node \(m\) is a leaf, \(k(m)\) is the final result of the classification of observations in node \(m\), otherwise \(k(m)\) is information about which class is most represented in the node.
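The majority rule above is a one-liner in R; for instance, with assumed class counts \(n_j(m)\) at a node (the values below are made up for illustration):

```r
nj    <- c (30, 8, 2)        # assumed class counts n_j(m) at node m
p_hat <- nj / sum (nj)       # class fractions p_hat_j(m)
k_m   <- which.max (p_hat)   # majority class k(m)
```

Here the first class dominates the node, so \(k(m) = 1\).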
A reasonable diversity measure is one that takes the value 0 when all observations belong to one class and its maximum value when the classes are uniformly distributed in the node, i.e. \(\hat{p}_1(m) = \hat{p}_2(m) = \dots = \hat{p}_g(m) = 1/g\). Examples of diversity measures are the misclassification error, the Gini index, and the entropy.
Let us consider the case where nodes \(m_L\) and \(m_R\) are children of node \(m\), and \(\hat{p}_L = n(m_L)/n(m)\) is the fraction of TS elements that from node \(m\) passed to node \(m_L\). Accordingly, \(\hat{p}_R = 1-\hat{p}_L\) is the fraction of TS elements that passed from node \(m\) to node \(m_R\). The difference between the class diversity of parent node and child nodes is defined as
\(\Delta Q(m) = Q(m) - Q(m_L,m_R)\),
where \(Q(m_L,m_R) = \hat{p}_L Q(m_L) + \hat{p}_R Q(m_R)\) is the joint measure of class diversity of the children of node \(m\). It is generally accepted that the Gini index and entropy are more sensitive to changes in class diversity at the nodes. The algorithm checks in turn all possible monotonic splits (i.e. of the form \(x^{(l)} \leqslant c\), where \(c\) is some value of the variable observed in the TS) for each observation attribute and selects the split that maximizes \(\Delta Q(m)\).
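The split-selection rule can be sketched in a few lines of R; gini() and delta_q() below are illustrative helper names, not functions from any package:

```r
# Gini index of a vector of class labels
gini <- function (cls) {
  p <- as.numeric (table (cls)) / length (cls)
  sum (p * (1 - p))
}
# Decrease of diversity Delta Q(m) for a split given by a logical vector
delta_q <- function (cls, goes_left) {
  pL <- mean (goes_left)                     # fraction sent to the left child
  gini (cls) - (pL * gini (cls[goes_left]) +
                (1 - pL) * gini (cls[!goes_left]))
}
cls <- rep (c ("1", "2"), c (6, 4))
dq  <- delta_q (cls, cls == "1")   # a perfect split: both children are pure
```

For this perfect split the children have zero diversity, so \(\Delta Q(m)\) equals the full Gini index of the parent, \(1 - (0.6^2 + 0.4^2) = 0.48\).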
Even if you know the tree-growing algorithm, it is not obvious when to stop. One might naively think that the tree should be expanded until only observations from one class remain in the leaves (i.e. the class diversity in the leaves is zero). However, this approach leads to overfitting the tree to the training sample. The most common tree-construction scheme is:
We expand the tree as long as possible or until a certain natural rule of stopping construction is met, e.g.
We check the classification ability of the tree on a test or validation sample.
We prune the tree to obtain the highest classification ability.
There are various tree pruning algorithms, e.g. the cost-complexity algorithm, which seeks a compromise between the cost of misclassification and the cost of building the tree (proportional to the number of leaves). The goal of the algorithm is to minimize \(R_\alpha(\mathcal{T}) = \rho(\mathcal{T}) + \alpha N_l(\mathcal{T})\), where \(\mathcal{T}\) denotes the given tree, \(\rho(\mathcal{T}) = 1 - ACC\) is the fraction of misclassifications of the tree \(\mathcal{T}\), \(N_l(\mathcal{T})\) is the number of leaves in this tree, and \(\alpha\) is the complexity factor, which is a parameter of the algorithm.
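In rpart this trade-off is exposed through the complexity parameter cp; a sketch of pruning back an intentionally overgrown tree (the use of the built-in iris data here is purely illustrative):

```r
library (rpart)
# Grow a deliberately deep tree by relaxing the stopping rules
fit <- rpart (Species ~ ., data = iris, minsplit = 2, cp = 0.001)
# cptable lists, for each value of cp, the cross-validated error xerror
best_cp <- fit$cptable[which.min (fit$cptable[, "xerror"]), "CP"]
pruned  <- prune (fit, cp = best_cp)   # cut branches that do not pay off
```

Choosing cp by the minimum cross-validated error is a common heuristic; the pruned tree can only be smaller than (or equal to) the one it was cut from.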
In R, a classification tree can be created quickly with the rpart() function from the rpart library. The minsplit parameter specifies the minimum number of elements that must be present in a node for it to be split, while minbucket sets the minimum number of observations a node must contain to become a leaf. Other parameters include cp (the complexity parameter) and maxdepth (the largest distance between the root and a leaf).
library (rpart)
# Train a classification tree
tree <- rpart (class ~ x + y, df, minsplit = 2, minbucket = 1)
tree
## n= 60
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 60 30 1 (0.5000000 0.5000000)
## 2) y< 3.91716 38 8 1 (0.7894737 0.2105263)
## 4) x< 4.050238 30 0 1 (1.0000000 0.0000000) *
## 5) x>=4.050238 8 0 2 (0.0000000 1.0000000) *
## 3) y>=3.91716 22 0 2 (0.0000000 1.0000000) *
summary (tree)
## Call:
## rpart(formula = class ~ x + y, data = df, minsplit = 2, minbucket = 1)
## n= 60
##
## CP nsplit rel error xerror xstd
## 1 0.7333333 0 1.0000000 1.4000000 0.1183216
## 2 0.2666667 1 0.2666667 0.5000000 0.1118034
## 3 0.0100000 2 0.0000000 0.2333333 0.0828877
##
## Variable importance
## y x
## 55 45
##
## Node number 1: 60 observations, complexity param=0.7333333
## predicted class=1 expected loss=0.5 P(node) =1
## class counts: 30 30
## probabilities: 0.500 0.500
## left son=2 (38 obs) right son=3 (22 obs)
## Primary splits:
## y < 3.91716 to the left, improve=17.368420, (0 missing)
## x < 3.287871 to the left, improve= 7.511111, (0 missing)
## Surrogate splits:
## x < 0.4486612 to the right, agree=0.667, adj=0.091, (0 split)
##
## Node number 2: 38 observations, complexity param=0.2666667
## predicted class=1 expected loss=0.2105263 P(node) =0.6333333
## class counts: 30 8
## probabilities: 0.789 0.211
## left son=4 (30 obs) right son=5 (8 obs)
## Primary splits:
## x < 4.050238 to the left, improve=12.631580, (0 missing)
## y < 3.296093 to the left, improve= 1.194079, (0 missing)
##
## Node number 3: 22 observations
## predicted class=2 expected loss=0 P(node) =0.3666667
## class counts: 0 22
## probabilities: 0.000 1.000
##
## Node number 4: 30 observations
## predicted class=1 expected loss=0 P(node) =0.5
## class counts: 30 0
## probabilities: 1.000 0.000
##
## Node number 5: 8 observations
## predicted class=2 expected loss=0 P(node) =0.1333333
## class counts: 0 8
## probabilities: 0.000 1.000
The resulting drawing is not very neat. To improve it, we will use the rpart.plot() function from the rpart.plot library.
Let's test the classification tree on more demanding data, e.g. the Pima dataset from the MASS library.
pima.tree <- rpart (type ~ ., data = Pima.tr)
summary (pima.tree)
## Call:
## rpart(formula = type ~ ., data = Pima.tr)
## n= 200
##
## CP nsplit rel error xerror xstd
## 1 0.22058824 0 1.0000000 1.0000000 0.09851844
## 2 0.16176471 1 0.7794118 0.8529412 0.09437007
## 3 0.07352941 2 0.6176471 0.8088235 0.09286273
## 4 0.05882353 3 0.5441176 0.8235294 0.09337946
## 5 0.01470588 4 0.4852941 0.7647059 0.09122390
## 6 0.01000000 7 0.4411765 0.8823529 0.09530501
##
## Variable importance
## glu age bmi bp ped npreg skin
## 38 13 11 11 11 8 8
##
## Node number 1: 200 observations, complexity param=0.2205882
## predicted class=No expected loss=0.34 P(node) =1
## class counts: 132 68
## probabilities: 0.660 0.340
## left son=2 (109 obs) right son=3 (91 obs)
## Primary splits:
## glu < 123.5 to the left, improve=19.624700, (0 missing)
## age < 28.5 to the left, improve=15.016410, (0 missing)
## npreg < 6.5 to the left, improve=10.465630, (0 missing)
## bmi < 27.35 to the left, improve= 9.727105, (0 missing)
## skin < 22.5 to the left, improve= 8.201159, (0 missing)
## Surrogate splits:
## age < 30.5 to the left, agree=0.685, adj=0.308, (0 split)
## bp < 77 to the left, agree=0.650, adj=0.231, (0 split)
## npreg < 6.5 to the left, agree=0.640, adj=0.209, (0 split)
## skin < 32.5 to the left, agree=0.635, adj=0.198, (0 split)
## bmi < 30.85 to the left, agree=0.575, adj=0.066, (0 split)
##
## Node number 2: 109 observations, complexity param=0.01470588
## predicted class=No expected loss=0.1376147 P(node) =0.545
## class counts: 94 15
## probabilities: 0.862 0.138
## left son=4 (74 obs) right son=5 (35 obs)
## Primary splits:
## age < 28.5 to the left, improve=3.2182780, (0 missing)
## npreg < 6.5 to the left, improve=2.4578310, (0 missing)
## bmi < 33.5 to the left, improve=1.6403660, (0 missing)
## bp < 59 to the left, improve=0.9851960, (0 missing)
## skin < 24 to the left, improve=0.8342926, (0 missing)
## Surrogate splits:
## npreg < 4.5 to the left, agree=0.798, adj=0.371, (0 split)
## bp < 77 to the left, agree=0.734, adj=0.171, (0 split)
## skin < 36.5 to the left, agree=0.725, adj=0.143, (0 split)
## bmi < 38.85 to the left, agree=0.716, adj=0.114, (0 split)
## glu < 66 to the right, agree=0.688, adj=0.029, (0 split)
##
## Node number 3: 91 observations, complexity param=0.1617647
## predicted class=Yes expected loss=0.4175824 P(node) =0.455
## class counts: 38 53
## probabilities: 0.418 0.582
## left son=6 (35 obs) right son=7 (56 obs)
## Primary splits:
## ped < 0.3095 to the left, improve=6.528022, (0 missing)
## bmi < 28.65 to the left, improve=6.473260, (0 missing)
## skin < 19.5 to the left, improve=4.778504, (0 missing)
## glu < 166 to the left, improve=4.104532, (0 missing)
## age < 39.5 to the left, improve=3.607390, (0 missing)
## Surrogate splits:
## glu < 126.5 to the left, agree=0.670, adj=0.143, (0 split)
## bp < 93 to the right, agree=0.659, adj=0.114, (0 split)
## bmi < 27.45 to the left, agree=0.659, adj=0.114, (0 split)
## npreg < 9.5 to the right, agree=0.648, adj=0.086, (0 split)
## skin < 20.5 to the left, agree=0.637, adj=0.057, (0 split)
##
## Node number 4: 74 observations
## predicted class=No expected loss=0.05405405 P(node) =0.37
## class counts: 70 4
## probabilities: 0.946 0.054
##
## Node number 5: 35 observations, complexity param=0.01470588
## predicted class=No expected loss=0.3142857 P(node) =0.175
## class counts: 24 11
## probabilities: 0.686 0.314
## left son=10 (9 obs) right son=11 (26 obs)
## Primary splits:
## glu < 90 to the left, improve=2.3934070, (0 missing)
## bmi < 33.4 to the left, improve=1.3714290, (0 missing)
## bp < 68 to the right, improve=0.9657143, (0 missing)
## ped < 0.334 to the left, improve=0.9475564, (0 missing)
## skin < 39.5 to the right, improve=0.7958592, (0 missing)
## Surrogate splits:
## ped < 0.1795 to the left, agree=0.8, adj=0.222, (0 split)
##
## Node number 6: 35 observations, complexity param=0.05882353
## predicted class=No expected loss=0.3428571 P(node) =0.175
## class counts: 23 12
## probabilities: 0.657 0.343
## left son=12 (27 obs) right son=13 (8 obs)
## Primary splits:
## glu < 166 to the left, improve=3.438095, (0 missing)
## ped < 0.2545 to the right, improve=1.651429, (0 missing)
## skin < 25.5 to the left, improve=1.651429, (0 missing)
## npreg < 3.5 to the left, improve=1.078618, (0 missing)
## bp < 73 to the right, improve=1.078618, (0 missing)
## Surrogate splits:
## bp < 94.5 to the left, agree=0.8, adj=0.125, (0 split)
##
## Node number 7: 56 observations, complexity param=0.07352941
## predicted class=Yes expected loss=0.2678571 P(node) =0.28
## class counts: 15 41
## probabilities: 0.268 0.732
## left son=14 (11 obs) right son=15 (45 obs)
## Primary splits:
## bmi < 28.65 to the left, improve=5.778427, (0 missing)
## age < 39.5 to the left, improve=3.259524, (0 missing)
## npreg < 6.5 to the left, improve=2.133215, (0 missing)
## ped < 0.8295 to the left, improve=1.746894, (0 missing)
## skin < 22 to the left, improve=1.474490, (0 missing)
## Surrogate splits:
## skin < 19.5 to the left, agree=0.839, adj=0.182, (0 split)
##
## Node number 10: 9 observations
## predicted class=No expected loss=0 P(node) =0.045
## class counts: 9 0
## probabilities: 1.000 0.000
##
## Node number 11: 26 observations, complexity param=0.01470588
## predicted class=No expected loss=0.4230769 P(node) =0.13
## class counts: 15 11
## probabilities: 0.577 0.423
## left son=22 (19 obs) right son=23 (7 obs)
## Primary splits:
## bp < 68 to the right, improve=1.6246390, (0 missing)
## bmi < 33.4 to the left, improve=1.6173080, (0 missing)
## npreg < 6.5 to the left, improve=0.9423077, (0 missing)
## skin < 39.5 to the right, improve=0.6923077, (0 missing)
## ped < 0.334 to the left, improve=0.4923077, (0 missing)
## Surrogate splits:
## glu < 94.5 to the right, agree=0.808, adj=0.286, (0 split)
## ped < 0.2105 to the right, agree=0.808, adj=0.286, (0 split)
##
## Node number 12: 27 observations
## predicted class=No expected loss=0.2222222 P(node) =0.135
## class counts: 21 6
## probabilities: 0.778 0.222
##
## Node number 13: 8 observations
## predicted class=Yes expected loss=0.25 P(node) =0.04
## class counts: 2 6
## probabilities: 0.250 0.750
##
## Node number 14: 11 observations
## predicted class=No expected loss=0.2727273 P(node) =0.055
## class counts: 8 3
## probabilities: 0.727 0.273
##
## Node number 15: 45 observations
## predicted class=Yes expected loss=0.1555556 P(node) =0.225
## class counts: 7 38
## probabilities: 0.156 0.844
##
## Node number 22: 19 observations
## predicted class=No expected loss=0.3157895 P(node) =0.095
## class counts: 13 6
## probabilities: 0.684 0.316
##
## Node number 23: 7 observations
## predicted class=Yes expected loss=0.2857143 P(node) =0.035
## class counts: 2 5
## probabilities: 0.286 0.714
pima.predict <- predict (pima.tree, newdata = Pima.te, type = "class")
pima.true <- Pima.te[,8]
pima.cm <- confusionMatrix (pima.predict, pima.true, positive = "Yes")
pima.cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 182 48
## Yes 41 61
##
## Accuracy : 0.7319
## 95% CI : (0.6808, 0.7788)
## No Information Rate : 0.6717
## P-Value [Acc > NIR] : 0.01042
##
## Kappa : 0.382
##
## Mcnemar's Test P-Value : 0.52478
##
## Sensitivity : 0.5596
## Specificity : 0.8161
## Pos Pred Value : 0.5980
## Neg Pred Value : 0.7913
## Prevalence : 0.3283
## Detection Rate : 0.1837
## Detection Prevalence : 0.3072
## Balanced Accuracy : 0.6879
##
## 'Positive' Class : Yes
##
Of course, there is nothing stopping you from having more than two classes in the data set. Below is an example of applying a tree to the iris dataset.