Laboratory 3

DATA ANALYSIS AND VISUALIZATION IN R, WINTER 2025 EDITION

Creating plots
Creating histograms
Probability distributions
Sampling
Empirical distributions

Creating plots

The standard R command for creating a simple scatter plot is plot (x,y), where x and y are numerical vectors of equal length.

x <- 1:10
plot(x, x^2)

The plot() function has a substantial set of options, the most commonly used of which are:

xlab, ylab, main - axis (and main) titles,
xlim, ylim - axis ranges,
type - plot type: “p” - points (default), “l” - lines, “b” - both points and lines (see ?plot for more information),
pch - point character/symbol (see ?points for possible values),
lty - linetype (see ?par for possible values),
col - color of points or lines,
bg - filling/background color,
cex - size of points,
lwd - line thickness,
font - font type: 1 - normal, 2 - bold, 3 - italic, 4 - bold italic.

In addition to the parameters of the plot() function, there are also global options for plotting that can be controlled using the par() function, for example:

bg - the color to be used for the background of the device region,
mai, mar - the size of the margins (in inches and in lines of text, respectively),
mfrow, mfcol - the layout of multi-panel plots (panels filled by rows or by columns, respectively).
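Because par() changes the settings globally, it is often convenient to store the previous values and restore them afterwards. The chunk below is only a sketch of that pattern; the name old.par and the particular margin values are chosen for illustration.

```r
# par() returns the previous values of the parameters it changes,
# so they can be restored once the multi-panel figure is finished.
old.par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

x <- 1:10
plot(x, x^2, type = "b", main = "Panel 1")
plot(x, sqrt(x), type = "l", lty = 2, main = "Panel 2")

par(old.par)  # back to a single panel and the previous margins
```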

For axis titles we can use the expression() function, which encodes mathematical expressions and Greek letters (e.g. ^ gives a superscript, [..] gives a subscript).

plot (x,
     x^2,
     xlab = "x",
     ylab = expression(f(xi)==x^2),
     main = "The plot of f(x) function",
     col = "red",
     pch = 19,
     font = 2,
     font.lab = 4,
     font.main = 3,
     cex = 2)
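When a label should contain the value of a variable, expression() alone is not enough, since it does not evaluate its contents. One possible approach, sketched below, is bquote(), in which .(s) is replaced by the current value of s. The variable names s and main.lab are hypothetical and serve only this example.

```r
x <- seq(-2, 2, 0.1)
s <- 0.5  # hypothetical parameter value, used only for illustration

# bquote() builds a plotmath call; .(s) is replaced by the value of s
main.lab <- bquote(sigma == .(s))

plot(x, dnorm(x, 0, s), type = "l",
     xlab = expression(x[i]),     # subscript via [ ]
     ylab = expression(f(x[i])),
     main = main.lab)             # title rendered with the Greek letter sigma
```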

To add further series to the panel, use the points() or lines() function, depending on the type of the next series. Calling plot() again would start a new plot instead of adding another series. Remember that the axis ranges are determined when plot() is called and are not adjusted later when subsequent series are added. In the example below, the red series is truncated because it extends beyond the range of the blue series.

y <- 2*x+2
z <- 3*x-1
plot (x, y, type = "l", lwd = 2, col = "blue")
lines (x, z, lwd = 2, col = "red")
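One way to avoid the truncation seen above, given here as a sketch, is to compute a common range for all series first and pass it to plot() as ylim. The name y.range is introduced only for this example.

```r
x <- 1:10
y <- 2 * x + 2
z <- 3 * x - 1

# Common y-range covering both series
y.range <- range(y, z)

plot(x, y, type = "l", lwd = 2, col = "blue", ylim = y.range)
lines(x, z, lwd = 2, col = "red")  # now fully visible
```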

To save the plot to a graphics file, open a file device with one of the commands png(), jpeg() or tiff(), and after drawing the plot, close the device with the dev.off() command.

png ("fig2.png")
plot (x, y, type = "l", lwd = 2, col = "blue")
lines (x, z, lwd = 2, col = "red")
dev.off()

## png 
##   2
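The same open, draw, close pattern applies to the other file devices. The sketch below uses pdf(), a vector-graphics device from the same family, chosen here only because it is available in every R installation; its width and height are given in inches, whereas png() takes them in pixels by default. The output path is a temporary file generated for illustration.

```r
# Hypothetical output path; tempfile() keeps the example self-contained
out.file <- tempfile(fileext = ".pdf")

pdf(out.file, width = 6, height = 4)   # size in inches
plot(1:10, (1:10)^2, type = "b", col = "blue")
invisible(dev.off())                   # the file is complete only after the device is closed

file.exists(out.file)
```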

A legend can be added to the plot with the legend() function, whose most important parameters are:

x - the x coordinate of the left edge of the legend or one of the predefined keywords: "top", "bottom", "left", "right", "center", "topleft", "topright", "bottomleft", "bottomright",
y - the y coordinate of the top edge of the legend (NULL by default; it is used in conjunction with the x parameter),
legend - a character vector of names of legend elements,
pch - an integer vector of symbols to display next to the names of legend elements,
lty - an integer vector of types of lines to display next to the names of legend elements,
col - a vector with the (border) colors of symbols,
pt.bg - a vector with the background colors of symbols.

x <- 1:10
plot (x,
      x^2,
      xlab = "x",
      ylab = expression(f(x)),
      col = "red",
      bg = "red",
      pch = 21,
      cex = 2)
points (x,
        x^1.5,
        col = "blue",
        bg = "blue",
        pch = 22,
        cex = 2)
legend ("topleft",
        legend = c (expression(x^1.5),expression(x^2)),
        pch = c (22,21),
        col = c ("blue","red"),
        pt.bg = c("blue","red"))

There is also a dedicated function, curve(), for plotting a continuous function over an interval. It can be used instead of plot() (it then creates a fresh canvas) or to add a series to an existing plot by setting the parameter add = TRUE.

curve (x^2, from = 0, to = 10, col = "red")

plot (x,
      x^2,
      xlab = "x",
      ylab = expression(f(x)),
      col = "red",
      bg = "red",
      pch = 21,
      cex = 2)
curve (x^2, from = 0, to = 11, col = "red", add = TRUE)
points (x,
        x^1.5,
        col = "blue",
        bg = "blue",
        pch = 22,
        cex = 2)
curve (x^1.5, from = 0, to = 11, col = "blue", add = TRUE)
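Apart from an expression in x, curve() also accepts the name of a function, and the parameter n controls the number of evaluation points. The following sketch assumes a helper function f defined only for this illustration.

```r
# Hypothetical helper used only for this illustration
f <- function(x) sin(x) / x

# n sets the number of points at which f is evaluated (101 by default)
res <- curve(f, from = 0.1, to = 20, n = 500, col = "darkgreen", lwd = 2)

# The evaluated points are returned invisibly as a list with components x and y
str(res)
```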

Creating histograms

The most straightforward function for counting repeated values is table() which transforms numeric or character vectors into factors before counting. It can be used for one or two vectors. In the latter case, it builds a contingency table of the counts at each combination of factor levels.

v <- c(0, 0, 1, 2, 3, 1, 2, 3, 4)
table (v)

## v
## 0 1 2 3 4 
## 2 2 2 2 1

v <- c(0, 0, 1.1, 1.9, 2.1, 2,3)
table (v)

## v
##   0 1.1 1.9   2 2.1   3 
##   2   1   1   1   1   1

x <- c(1, 1, 2, 2, 3, 4, 5)
y <- c(2, 2, 3, 1, 5, 5, 5)
table (x, y)

##    y
## x   1 2 3 5
##   1 0 2 0 0
##   2 1 0 1 0
##   3 0 0 0 1
##   4 0 0 0 1
##   5 0 0 0 1

It is also possible to pass a data frame to the table() function instead of two separate vectors.

df <- data.frame (x, y)
df

##   x y
## 1 1 2
## 2 1 2
## 3 2 3
## 4 2 1
## 5 3 5
## 6 4 5
## 7 5 5

table (df)

##    y
## x   1 2 3 5
##   1 0 2 0 0
##   2 1 0 1 0
##   3 0 0 0 1
##   4 0 0 0 1
##   5 0 0 0 1
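The object returned by table() can be processed further. The sketch below shows a few operations that are often useful: reading the distinct values, extracting a single count by name, and converting the counts to relative frequencies or to a data frame.

```r
v <- c(0, 0, 1, 2, 3, 1, 2, 3, 4)
counts <- table(v)

names(counts)          # the distinct values, as character strings
counts[["0"]]          # count of a single value, indexed by its name
prop.table(counts)     # relative frequencies (they sum to 1)
as.data.frame(counts)  # long format: one row per value, columns v and Freq
```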

Although the table() command is very useful, the generic function for creating a histogram is hist(). By default, it plots the histogram immediately when called. The essential argument of the function is a numeric vector. The second most important is breaks, which can be a vector of breakpoints, a single number giving the number of bins (treated only as a suggestion, since the breakpoints are rounded to "pretty" values), or a function that returns one of these two.

x <- c (1,3,1,2,1,2,3,2,4,2,5,4,2)
hist (x)

hist (x, breaks = 2)

hist (x, breaks = 0.5:5.5)

The hist() function not only plots the histogram but also returns a list with all the information needed for further plotting, such as breaks, counts or mids.

h <- hist (x, breaks = 0.5:5.5)
h

## $breaks
## [1] 0.5 1.5 2.5 3.5 4.5 5.5
## 
## $counts
## [1] 3 5 2 2 1
## 
## $density
## [1] 0.23076923 0.38461538 0.15384615 0.15384615 0.07692308
## 
## $mids
## [1] 1 2 3 4 5
## 
## $xname
## [1] "x"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
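If only the numbers are needed, drawing can be switched off with plot = FALSE. The sketch below reuses the vector x from above and then redraws the counts with plot(); this is just one possible use of the returned list.

```r
x <- c(1, 3, 1, 2, 1, 2, 3, 2, 4, 2, 5, 4, 2)

# plot = FALSE computes the bins without drawing anything
h <- hist(x, breaks = 0.5:5.5, plot = FALSE)

h$counts                          # observations per bin
sum(h$density * diff(h$breaks))   # densities integrate to 1

# Redraw the same information as points connected by lines
plot(h$mids, h$counts, type = "b", xlab = "x", ylab = "count")
```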

Probability distributions

One of the base R packages, namely stats, implements a set of popular probability distributions (normal, binomial, exponential, etc.; see ?distributions for details). Each distribution has its own short name (like norm, unif, binom), which is used in combination with one of four letters denoting the type of function:

r - random generation (e.g. runif(5) generates five random numbers from uniform distribution),

runif (5)

## [1] 0.9850465 0.9748526 0.1752976 0.9008859 0.2586559

runif (5, -10, 10)

## [1]  5.481890  1.797279  6.756874  8.291918 -8.843545

p - the cumulative distribution function (e.g. pnorm (2, 0, 1.5) returns the value of the cumulative distribution function of the normal distribution with mean = 0 and sd = 1.5 at x = 2),
d - the probability density function (e.g. dexp (2, 0.1) gives the value of the probability density function of the exponential distribution with rate = 0.1 for x = 2),

x <- seq (-2, 2, 0.1)
sigmas <- c (0.5, 1, 0.3)

plot (x, dnorm(x, 0, sigmas[1]), ylim = c(0,1.5), pch = 19)
points (x, dnorm(x, 0, sigmas[2]), pch = 19, col = "blue")
points (x, dnorm(x, 0, sigmas[3]), pch = 19, col = "green", type = "o")
curve (pnorm(x, 0, sigmas[3]), from = -2, to = 2, col = "red", lwd = 2, add = TRUE)
legend ("topleft",
        legend = c (expression (sigma==0.5),
                    expression (sigma==1.0),
                    expression (sigma==0.3),
                    expression (sigma==0.3)),
        lty = c (0, 0, 1, 1),
        pch = c (19, 19, 19, NA),
        col = c ("black","blue", "green", "red")
        )

q - the quantile function (e.g. qnorm (0.95, 0, 1) returns the quantile of order p = 0.95 of the normal distribution with mean = 0 and sd = 1).

qnorm (0.95, 0, 1)

## [1] 1.644854

qnorm (0.5, 3, 0.5)

## [1] 3
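The four types of functions are closely related. As a quick numerical check (a sketch for the standard normal distribution only), the chunk below verifies that p and q are inverses of each other and that integrating the density d up to a quantile recovers the corresponding probability.

```r
p <- c(0.05, 0.5, 0.95)

# q and p are inverses of each other for continuous distributions
q <- qnorm(p, mean = 0, sd = 1)
pnorm(q, mean = 0, sd = 1)    # recovers p

# Integrating the density from -Inf up to the 0.95 quantile gives about 0.95
area <- integrate(dnorm, lower = -Inf, upper = q[3], mean = 0, sd = 1)$value
area
```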

Sampling

A very useful function in base R is sample(). Calling sample (x) returns a random permutation of a vector x. Calling sample (x, size) returns size randomly chosen elements of x (without replacement). Sampling with replacement is enabled by setting the parameter replace = TRUE. Finally, the parameter prob allows for sampling elements of x with unequal probabilities. It is worth noting that x does not have to be a numeric vector.

sample (1:10)

##  [1]  2  1  7 10  9  5  4  8  6  3

sample (1:10, 2)

## [1] 9 8

sample (1:10, 4)

## [1] 3 6 4 7

sample (1:10, 20, replace=TRUE)

##  [1]  5  1  8  5  5  9  1  7  1  1  2  9  2  3 10  7  3  6  7  9

sample (1:3, 10, replace=TRUE, prob=c(0.1,0.8,0.1))

##  [1] 2 2 2 2 2 2 1 2 2 3

sample (letters[1:3], 10, replace=TRUE)

##  [1] "a" "c" "b" "c" "a" "c" "c" "b" "a" "c"

sample (c(0.1, 0.2, 0.3), 10, replace=TRUE)

##  [1] 0.2 0.1 0.1 0.1 0.2 0.1 0.3 0.3 0.1 0.1
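The results above differ from run to run. When reproducibility is needed, the random number generator can be initialised with set.seed() before sampling. The sketch below uses an arbitrary seed and also checks that, for a large sample, the observed frequencies are close to the prob vector.

```r
set.seed(123)                 # arbitrary seed, chosen only for illustration
a <- sample(1:10, 5)
set.seed(123)
b <- sample(1:10, 5)
identical(a, b)               # same seed, same draw

# With many draws the observed frequencies approach the prob vector
draws <- sample(1:3, 10000, replace = TRUE, prob = c(0.1, 0.8, 0.1))
table(draws) / length(draws)
```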

Empirical distributions

The ecdf() function is a quick and straightforward method for obtaining the empirical cumulative distribution function, which is often a first step in identifying the proper distribution. The object returned by this function has its own plot() method, which means that it can be easily plotted.

x <- c(1,1,1,2,5,6,3,7,8,10)
x.ecdf <- ecdf (x)
plot (x.ecdf)
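The object returned by ecdf() is itself a function, so it can be evaluated at arbitrary points. A short sketch using the same vector x:

```r
x <- c(1, 1, 1, 2, 5, 6, 3, 7, 8, 10)
x.ecdf <- ecdf(x)

x.ecdf(3)             # fraction of observations <= 3
x.ecdf(c(0, 1, 10))   # vectorised over its argument
knots(x.ecdf)         # the distinct sorted values at which the function jumps
```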

It is interesting to see how the appearance of the empirical cumulative distribution function changes with the sample size. The function par() called in the chunk of code below is used for changing global plotting parameters such as the margins or the division of the canvas into panels.

make.plot <- function(N) {
  x <- rnorm (N, 0, 1)
  plot (ecdf(x), main = N)
  curve (pnorm(x, 0, 1), from = min(x), to = max(x), col = "red", lwd = 2, add=TRUE)  
}

par (mfrow = c(2,2))

N <- c(10, 20, 50, 100)

sapply (N, make.plot)

##   [,1]        [,2]        [,3]        [,4]       
## x numeric,101 numeric,101 numeric,101 numeric,101
## y numeric,101 numeric,101 numeric,101 numeric,101
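The matrix printed above is a side effect: sapply() collects the value returned by curve(), the last call in make.plot(). A possible variant, sketched below, returns nothing visible from the function, uses a plain for loop, and restores the previous panel layout at the end. The name old.par is introduced only for this example.

```r
make.plot <- function(N) {
  x <- rnorm(N, 0, 1)
  plot(ecdf(x), main = N)
  curve(pnorm(x, 0, 1), from = min(x), to = max(x),
        col = "red", lwd = 2, add = TRUE)
  invisible(NULL)   # return nothing visible instead of curve()'s result
}

old.par <- par(mfrow = c(2, 2))
for (N in c(10, 20, 50, 100)) make.plot(N)
par(old.par)        # restore the single-panel layout
```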