DATA ANALYSIS AND VISUALIZATION IN R, WINTER 2025 EDITION



Creating plots

The standard R command for creating a simple scatter plot is plot(x, y), where x and y are numeric vectors of equal length.

x <- 1:10
plot(x, x^2)

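The plot() function accepts a large number of options. Those used most often in this section are type (how the series is drawn), main, xlab and ylab (titles), col (color), pch (point symbol), cex (symbol size) and lwd (line width).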

In addition to the parameters of the plot() function, there are also global plotting options that can be controlled with the par() function, for example mfrow (the division of the canvas into panels) and mar (the margins).
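As a minimal sketch of how par() and plot() work together, the snippet below splits the canvas into a 2 x 2 grid, draws the same data with four values of the type parameter, and then restores the previous settings. The margin values and the selection of types are illustrative assumptions, not recommendations.

```r
x <- 1:10

# par() returns the old values of the parameters it changes,
# so they can be restored afterwards.
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

# One panel per drawing type: points, lines, both, vertical bars.
for (tp in c("p", "l", "b", "h")) {
  plot(x, x^2, type = tp, main = paste("type =", tp))
}

par(op)  # back to the previous layout and margins
```

Storing the result of par() and passing it back at the end keeps the change local, so later plots are not silently drawn into a grid.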

For axis titles we can use the expression() function, which encodes mathematical expressions and Greek letters (e.g. ^ produces a superscript and [ ] a subscript).

plot (x,
     x^2,
     xlab = "x",
     ylab = expression(f(xi)==x^2),
     main = "The plot of f(x) function",
     col = "red",
     pch = 19,
     font = 2,
     font.lab = 4,
     font.main = 3,
     cex = 2)
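The notation can be sketched in a slightly richer label. The example below combines a Greek letter, a subscript and a superscript, and uses bquote() to insert the value of a variable into the title. The variable s and the plotted values are made up for illustration.

```r
x <- 1:10
s <- 0.5  # hypothetical value, used only to show substitution

# alpha is rendered as a Greek letter, [i] as a subscript, ^2 as a superscript.
ylab.text <- expression(alpha * x[i]^2)

# bquote() evaluates whatever is wrapped in .() and leaves the rest symbolic.
main.text <- bquote(sigma == .(s))

plot(x, s * x^2, xlab = expression(x[i]), ylab = ylab.text, main = main.text)
```

expression() is enough for fixed labels; bquote() is one way to mix symbols with values that are only known at run time.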

To add further series to the panel, use the points() or lines() function, depending on how the next series should be drawn. Calling plot() again replaces the existing plot instead of adding another series. Remember that the axis ranges are determined when plot() is called and are not adjusted when further series are added. In the example below, the red series is truncated because it extends beyond the range of the blue series.

y <- 2*x+2
z <- 3*x-1
plot (x, y, type = "l", lwd = 2, col = "blue")
lines (x, z, lwd = 2, col = "red")
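One possible way to avoid the truncation is to pass the axis limits explicitly in the first plot() call. The sketch below computes them with range() over both series; the name y.limits is an assumption introduced for this example.

```r
x <- 1:10
y <- 2 * x + 2
z <- 3 * x - 1

# range() over both series gives limits that fit every point.
y.limits <- range(y, z)

plot(x, y, type = "l", lwd = 2, col = "blue", ylim = y.limits)
lines(x, z, lwd = 2, col = "red")  # no longer cut off at the top
```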

To save the plot to a graphics file, open a file device with one of the commands png(), jpeg() or tiff(), draw the plot, and then close the device with dev.off().

png ("fig2.png")
plot (x, y, type = "l", lwd = 2, col = "blue")
lines (x, z, lwd = 2, col = "red")
dev.off()
## png 
##   2
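The size and resolution of the image can be set when the device is opened. The following is a sketch only: the pixel dimensions, the resolution and the temporary output location are assumptions chosen for illustration.

```r
x <- 1:10
out.file <- file.path(tempdir(), "fig2.png")  # temporary location, an assumption

# Size in pixels; res sets the nominal resolution in pixels per inch.
png(out.file, width = 800, height = 600, res = 120)
plot(x, 2 * x + 2, type = "l", lwd = 2, col = "blue")
lines(x, 3 * x - 1, lwd = 2, col = "red")
invisible(dev.off())  # close the device; the file is complete only after this

file.exists(out.file)
```

Wrapping dev.off() in invisible() only hides the device number that it would otherwise print.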

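A legend can be added to the plot with the legend() function. Its most important parameters are the position (here the keyword "topleft"), legend (the labels), and the symbol descriptions pch, col, pt.bg and lty, each given as a vector with one entry per label.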

x <- 1:10
plot (x,
      x^2,
      xlab = "x",
      ylab = expression(f(x)),
      col = "red",
      bg = "red",
      pch = 21,
      cex = 2)
points (x,
        x^1.5,
        col = "blue",
        bg = "blue",
        pch = 22,
        cex = 2)
legend ("topleft",
        legend = c (expression(x^1.5),expression(x^2)),
        pch = c (22,21),
        col = c ("blue","red"),
        pt.bg = c("blue","red"))

There is also a dedicated function, curve(), for plotting a function over an interval. It can be used instead of plot() (it then creates a fresh canvas) or to add a further series to an existing plot by setting the parameter add = TRUE.

curve (x^2, from = 0, to = 10, col = "red")

plot (x,
      x^2,
      xlab = "x",
      ylab = expression(f(x)),
      col = "red",
      bg = "red",
      pch = 21,
      cex = 2)
curve (x^2, from = 0, to = 11, col = "red", add = TRUE)
points (x,
        x^1.5,
        col = "blue",
        bg = "blue",
        pch = 22,
        cex = 2)
curve (x^1.5, from = 0, to = 11, col = "blue", add = TRUE)
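curve() evaluates the expression on an evenly spaced grid between from and to and returns the computed coordinates invisibly. The sketch below stores them and shows the effect of the n parameter; the default of 101 points is taken from the function's documentation, and the name pts is an assumption.

```r
# Draw the curve and keep the returned coordinates.
pts <- curve(x^2, from = 0, to = 10, col = "red")

length(pts$x)   # number of evaluation points (101 unless n is changed)
range(pts$x)    # the interval given by from and to

# A coarser grid makes the individual straight segments visible.
curve(x^2, from = 0, to = 10, n = 5, col = "blue", add = TRUE)
```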


Creating histograms

The most straightforward function for counting repeated values is table(), which converts numeric or character vectors into factors before counting. It can be used with one or two vectors; in the latter case, it builds a contingency table of the counts for each combination of factor levels.

v <- c(0, 0, 1, 2, 3, 1, 2, 3, 4)
table (v)
## v
## 0 1 2 3 4 
## 2 2 2 2 1
v <- c(0, 0, 1.1, 1.9, 2.1, 2, 3)
table (v)
## v
##   0 1.1 1.9   2 2.1   3 
##   2   1   1   1   1   1
x <- c(1, 1, 2, 2, 3, 4, 5)
y <- c(2, 2, 3, 1, 5, 5, 5)
table (x, y)
##    y
## x   1 2 3 5
##   1 0 2 0 0
##   2 1 0 1 0
##   3 0 0 0 1
##   4 0 0 0 1
##   5 0 0 0 1

It is also possible to pass a data frame to the table() function instead of two separate vectors.

df <- data.frame (x, y)
df
##   x y
## 1 1 2
## 2 1 2
## 3 2 3
## 4 2 1
## 5 3 5
## 6 4 5
## 7 5 5
table (df)
##    y
## x   1 2 3 5
##   1 0 2 0 0
##   2 1 0 1 0
##   3 0 0 0 1
##   4 0 0 0 1
##   5 0 0 0 1
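The result of table() can be processed further like a named vector. The lines below are a small sketch of a few common follow-up steps, namely lookup by name, relative frequencies and conversion to a data frame; the variable name counts is an assumption.

```r
v <- c(0, 0, 1, 2, 3, 1, 2, 3, 4)
counts <- table(v)

names(counts)        # the distinct values, stored as character strings
counts["3"]          # count for the value 3, looked up by name
prop.table(counts)   # relative frequencies instead of counts

# One row per distinct value, convenient for further processing.
as.data.frame(counts)
```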

The table() command is useful for counting, but the function for creating a histogram is hist(). By default, it draws the histogram as soon as it is called. Its essential argument is a numeric vector. The second most important is breaks, which can be a vector of breakpoints, a single number giving the number of bins, or a function that returns one of these two. A single number is treated only as a suggestion, because the breakpoints are rounded to "pretty" values.

x <- c (1,3,1,2,1,2,3,2,4,2,5,4,2)
hist (x)

hist (x, breaks = 2)

hist (x, breaks = 0.5:5.5)
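According to the hist() documentation, a single number passed as breaks is only a suggestion, whereas a vector of breakpoints is used exactly as given. The sketch below compares the two without drawing anything; the names h.suggested and h.exact are assumptions.

```r
x <- c(1, 3, 1, 2, 1, 2, 3, 2, 4, 2, 5, 4, 2)

# plot = FALSE computes the bins without drawing anything.
h.suggested <- hist(x, breaks = 2, plot = FALSE)
h.exact <- hist(x, breaks = 0.5:5.5, plot = FALSE)

h.suggested$breaks          # chosen by R, may not give exactly 2 bins
length(h.suggested$counts)  # actual number of bins
h.exact$breaks              # exactly the breakpoints that were passed
```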

The hist() function not only plots the histogram but also returns a list with all the information needed for further plotting, such as breaks, counts or mids.

h <- hist (x, breaks = 0.5:5.5)

h
## $breaks
## [1] 0.5 1.5 2.5 3.5 4.5 5.5
## 
## $counts
## [1] 3 5 2 2 1
## 
## $density
## [1] 0.23076923 0.38461538 0.15384615 0.15384615 0.07692308
## 
## $mids
## [1] 1 2 3 4 5
## 
## $xname
## [1] "x"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
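The returned components can be used to annotate the histogram. As a sketch, the code below writes each count above its bar using mids and counts, and derives relative frequencies from the counts; the y-axis limit is an assumption chosen to leave room for the labels.

```r
x <- c(1, 3, 1, 2, 1, 2, 3, 2, 4, 2, 5, 4, 2)

# Extra headroom on the y axis so the labels above the bars fit.
h <- hist(x, breaks = 0.5:5.5, ylim = c(0, 6))

# Write each count just above the middle of its bar.
text(h$mids, h$counts, labels = h$counts, pos = 3)

# Relative frequencies follow directly from the counts.
h$counts / sum(h$counts)
```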


Probability distributions

One of the base R packages, stats, implements a set of popular probability distributions (normal, binomial, exponential, etc.; see ?distributions for details). Each distribution has a short name (such as norm, unif or binom), which is combined with one of four prefix letters denoting the type of function: d for the density (or probability mass) function, p for the cumulative distribution function, q for the quantile function, and r for random number generation.

runif (5)
## [1] 0.9850465 0.9748526 0.1752976 0.9008859 0.2586559
runif (5, -10, 10)
## [1]  5.481890  1.797279  6.756874  8.291918 -8.843545
x <- seq (-2, 2, 0.1)
sigmas <- c (0.5, 1, 0.3)

plot (x, dnorm(x, 0, sigmas[1]), ylim = c(0,1.5), pch = 19)
points (x, dnorm(x, 0, sigmas[2]), pch = 19, col = "blue")
points (x, dnorm(x, 0, sigmas[3]), pch = 19, col = "green", type = "o")
curve (pnorm(x, 0, sigmas[3]), from = -2, to = 2, col = "red", lwd = 2, add = TRUE)
legend ("topleft",
        legend = c (expression (sigma==0.5),
                    expression (sigma==1.0),
                    expression (sigma==0.3),
                    expression (sigma==0.3)),
        lty = c (0, 0, 1, 1),
        pch = c (19, 19, 19, NA),
        col = c ("black","blue", "green", "red")
        )

qnorm (0.95, 0, 1)
## [1] 1.644854
qnorm (0.5, 3, 0.5)
## [1] 3
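The naming scheme can be checked on the standard normal distribution. The sketch below calls the d, p, q and r variants in turn and shows that the p and q functions are inverses of each other; the seed value is arbitrary and serves only to make the random draws repeatable.

```r
dnorm(0, mean = 0, sd = 1)            # density at 0
pnorm(1.96, mean = 0, sd = 1)         # P(X <= 1.96), close to 0.975
q <- qnorm(0.975, mean = 0, sd = 1)   # the 97.5% quantile
pnorm(q)                              # p and q are inverses, so this returns 0.975

set.seed(1)                           # arbitrary seed, for reproducibility only
rnorm(3, mean = 0, sd = 1)            # three random draws
```

The same four prefixes apply to the other distributions, for example dunif(), punif(), qunif() and runif().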


Sampling

A very useful base R function is sample(). Calling sample (x) returns a random permutation of the vector x. Calling sample (x, size) returns size randomly chosen elements of x (without replacement). Sampling with replacement is enabled with the parameter replace = TRUE. Finally, the parameter prob allows elements of x to be sampled with unequal probabilities. Note that x does not have to be numeric.

sample (1:10)
##  [1]  2  1  7 10  9  5  4  8  6  3
sample (1:10, 2)
## [1] 9 8
sample (1:10, 4)
## [1] 3 6 4 7
sample (1:10, 20, replace=TRUE)
##  [1]  5  1  8  5  5  9  1  7  1  1  2  9  2  3 10  7  3  6  7  9
sample (1:3, 10, replace=TRUE, prob=c(0.1,0.8,0.1))
##  [1] 2 2 2 2 2 2 1 2 2 3
sample (letters[1:3], 10, replace=TRUE)
##  [1] "a" "c" "b" "c" "a" "c" "c" "b" "a" "c"
sample (c(0.1, 0.2, 0.3), 10, replace=TRUE)
##  [1] 0.2 0.1 0.1 0.1 0.2 0.1 0.3 0.3 0.1 0.1
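Because the results are random, they differ from run to run. A minimal sketch of making them repeatable with set.seed() follows, together with one documented peculiarity of sample() when it is given a single number. The seed value and the names first and second are arbitrary assumptions.

```r
# The same seed gives the same "random" result.
set.seed(2025)   # arbitrary seed value
first <- sample(1:10, 4)
set.seed(2025)
second <- sample(1:10, 4)
identical(first, second)

# Caution: a single number n >= 1 is interpreted as 1:n,
# not as a one-element vector.
sample(5)        # a permutation of 1, 2, 3, 4, 5
```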


Empirical distributions

The ecdf() function is a quick and straightforward way of obtaining the empirical cumulative distribution function, which is often a first step in identifying the proper distribution. The object returned by this function has its own plot() method, so it can be plotted directly.

x <- c(1,1,1,2,5,6,3,7,8,10)
x.ecdf <- ecdf (x)
plot (x.ecdf)
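The object returned by ecdf() is itself a function, so it can be evaluated at any point. A short sketch:

```r
x <- c(1, 1, 1, 2, 5, 6, 3, 7, 8, 10)
x.ecdf <- ecdf(x)

x.ecdf(2)             # fraction of observations <= 2 (4 of 10 here)
x.ecdf(c(0, 5, 10))   # works on vectors as well
knots(x.ecdf)         # the distinct values at which the function jumps
```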

It is interesting to see how the appearance of the empirical cumulative distribution changes with the sample size. The function par(), called in the chunk of code below, is used for changing global plotting parameters such as the margins or the division of the canvas.

make.plot <- function(N) {
  x <- rnorm (N, 0, 1)
  plot (ecdf(x), main = N)
  curve (pnorm(x, 0, 1), from = min(x), to = max(x), col = "red", lwd = 2, add=TRUE)  
}

par (mfrow = c(2,2))

N <- c(10, 20, 50, 100)

sapply (N, make.plot)

##   [,1]        [,2]        [,3]        [,4]       
## x numeric,101 numeric,101 numeric,101 numeric,101
## y numeric,101 numeric,101 numeric,101 numeric,101
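The visual impression can also be put into numbers. The sketch below approximates, on a fixed grid, the largest vertical gap between the empirical and the true cumulative distribution function for several sample sizes. The helper max.gap(), the grid, the extra sample size of 1000 and the seed are assumptions made for this illustration; the gap typically shrinks as N grows, although a single random run need not be strictly monotone.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Largest vertical gap between the ECDF of a standard normal sample
# and the true CDF, evaluated on a fixed grid (an approximation).
max.gap <- function(N, grid = seq(-4, 4, by = 0.01)) {
  x <- rnorm(N, 0, 1)
  max(abs(ecdf(x)(grid) - pnorm(grid, 0, 1)))
}

N <- c(10, 20, 50, 100, 1000)
gaps <- sapply(N, max.gap)
data.frame(N, gap = round(gaps, 3))
```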