DATA ANALYSIS AND VISUALIZATION IN R, WINTER 2025 EDITION



Packages installation

Installing additional packages in R is very simple: type the command install.packages("package_name") in the console, e.g.

install.packages ("readxl")
install.packages ("Hmisc")


Apply, lapply and sapply functions

The following functions have one common goal: to limit the use of loops in R.


apply

The apply (X, MARGIN, FUN) function computes marginal values of the matrix X (e.g. sums, means, etc.), where the MARGIN parameter determines whether the operation is applied to rows (MARGIN = 1) or columns (MARGIN = 2).

A <- matrix (1:16, 4, 4)
apply (A, 1, sum)
## [1] 28 32 36 40
apply (A, 2, sum)
## [1] 10 26 42 58
apply (A, 1, mean)
## [1]  7  8  9 10
apply (A, 2, sd)
## [1] 1.290994 1.290994 1.290994 1.290994
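
FUN does not have to be a built-in function. The sketch below, which reuses the matrix A from above, passes an anonymous function to compute the spread of each row, and checks that the dedicated base R helpers rowSums() and colMeans() agree with the corresponding apply() calls. The name row_spread is chosen here only for illustration.

```r
A <- matrix(1:16, 4, 4)

# Any function of a vector can be passed as FUN, including an anonymous one:
# here, the spread (max - min) of every row
row_spread <- apply(A, 1, function(v) max(v) - min(v))
row_spread
## [1] 12 12 12 12

# For plain sums and means, base R has dedicated helpers
# that give the same values as the corresponding apply() calls
all.equal(rowSums(A), apply(A, 1, sum))
## [1] TRUE
all.equal(colMeans(A), apply(A, 2, mean))
## [1] TRUE
```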


lapply

The lapply (X, FUN) function invokes FUN sequentially on each element of a vector or list X and returns a list of the same length as X. FUN can be a predefined function or an anonymous one defined in place.

L <- list (int = 1:10, x = 2.71, text = c("a", "b", "c"), logic = rep(TRUE, 5))
lapply (L, length)
## $int
## [1] 10
## 
## $x
## [1] 1
## 
## $text
## [1] 3
## 
## $logic
## [1] 5
m <- 1:5
n <- 10
r <- lapply (m, function(i) rnorm(n, mean=i, sd=0.5)); r
## [[1]]
##  [1] 0.005078587 0.660794570 1.245165357 0.887368930 0.842661773 0.716263767
##  [7] 1.915931213 0.915662164 1.873249717 0.986089078
## 
## [[2]]
##  [1] 2.7748339 1.7214926 1.8914082 1.5479707 2.6562717 2.3089072 2.6018279
##  [8] 0.8780944 2.1745133 2.3705904
## 
## [[3]]
##  [1] 2.329021 3.410441 2.573362 3.267112 3.686378 2.821386 2.971045 3.345516
##  [9] 2.996008 2.752659
## 
## [[4]]
##  [1] 4.002492 4.819009 4.074014 5.119237 4.900571 4.140094 3.664646 4.004025
##  [9] 4.400554 3.429842
## 
## [[5]]
##  [1] 5.636694 6.030051 4.392863 4.738748 4.743111 5.631820 4.691662 5.221882
##  [9] 5.272831 5.652140


sapply

The sapply (X, FUN) function is a convenient wrapper around lapply that, where possible, simplifies the result to a vector, matrix or array instead of returning a list.

sapply (r, mean)
## [1] 1.004827 2.092591 3.015293 4.255448 5.201180
x <- seq (0, 2*pi, pi/6)
sapply (x, function(xx) sin(xx)^2+cos(2*xx))
##  [1] 1.00 0.75 0.25 0.00 0.25 0.75 1.00 0.75 0.25 0.00 0.25 0.75 1.00
a <- 1:10
sapply (a, function(x) {sapply (rev(a), function(y) x * y)})
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]   10   20   30   40   50   60   70   80   90   100
##  [2,]    9   18   27   36   45   54   63   72   81    90
##  [3,]    8   16   24   32   40   48   56   64   72    80
##  [4,]    7   14   21   28   35   42   49   56   63    70
##  [5,]    6   12   18   24   30   36   42   48   54    60
##  [6,]    5   10   15   20   25   30   35   40   45    50
##  [7,]    4    8   12   16   20   24   28   32   36    40
##  [8,]    3    6    9   12   15   18   21   24   27    30
##  [9,]    2    4    6    8   10   12   14   16   18    20
## [10,]    1    2    3    4    5    6    7    8    9    10
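
Because sapply() decides the shape of its result from the values it gets, the output type can change with the input. A related base R function, vapply(), takes a template of the expected result (FUN.VALUE) and fails if a result does not match it. The short sketch below uses a small list made up for this illustration.

```r
r <- list(1:10, c(2.5, 3.5), numeric(0))

# sapply() guesses the output shape from the results
sapply(r, length)
## [1] 10  2  0

# vapply() takes a template: each result must be a single integer here
lens <- vapply(r, length, FUN.VALUE = integer(1))
lens
## [1] 10  2  0

# With an empty input sapply() returns an empty list, vapply() keeps the type
sapply(list(), length)
## list()
vapply(list(), length, FUN.VALUE = integer(1))
## integer(0)
```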


BONUS: mclapply (does not work on Windows)

The mclapply (X, FUN) function can be used to distribute computation over multiple cores. It relies on forking, so it cannot run in parallel on Windows (there, mc.cores must be 1).

library (parallel)

random_matrix_det <- function (n, lower, upper) {
  det (matrix (runif (n*n, lower, upper), n, n))
}

system.time (
  l1 <- lapply (1:100, function(x) random_matrix_det (500, 0, 0.01))
)
##    user  system elapsed 
##   2.733   0.017   2.753
system.time (
  l2 <- mclapply (1:100, function(x) random_matrix_det (500, 0, 0.01), mc.cores = 4)
)
##    user  system elapsed 
##   0.003   0.003   0.747
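
A possible portable alternative is parLapply() from the same parallel package, which sends the work to separate worker processes instead of forking. The sketch below assumes that two local worker processes can be started on your machine; the matrix size and the number of tasks are arbitrary and much smaller than in the timing example above.

```r
library(parallel)

# Start two worker processes (a socket cluster, available on all platforms)
cl <- makeCluster(2)

# Each task builds a random n x n matrix and returns its determinant;
# n is passed as an extra argument, so nothing has to be exported to the workers
dets <- parLapply(cl, 1:8, function(i, n) det(matrix(runif(n * n), n, n)), n = 50)

# Always release the workers when done
stopCluster(cl)

length(dets)
## [1] 8
```

Starting the workers has a fixed cost, so this approach pays off only when each task is reasonably expensive.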


Built-in datasets

The standard R distribution includes the datasets package, which contains several dozen ready-to-use data sets. Calling data() with no arguments lists the available sets, and a specific set is loaded into memory by passing its name, as in data("set_name").

# List datasets from package datasets
data ()

# Load specific dataset
data ("ChickWeight")

# List datasets from all packages installed
data (package = .packages(all.available = TRUE))
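
Once a data set is loaded, it behaves like any other object in the workspace. As a quick check, the lines below load ChickWeight and inspect its size and column names.

```r
data("ChickWeight")

# Number of rows and columns
dim(ChickWeight)
## [1] 578   4

# Column names
names(ChickWeight)
## [1] "weight" "Time"   "Chick"  "Diet"
```

The help page of a data set (e.g. ?ChickWeight) describes what each column means.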


Reading data


Reading data from the console

Using the scan() function, you can read data directly from the console in R. Separate consecutive elements with spaces or the Enter key, and end the input by pressing Enter on an empty line. You can also assign the data entered in this way to a variable.

> scan()
1: 5 3
3: 7 8 9 10
7: 
Read 6 items
[1]  5  3  7  8  9 10
> x <- scan()
1: 6 3 9 10 23 -9
7: 
Read 6 items
> x
[1]  6  3  9 10 23 -9

Inputting other types of variables is also supported by specifying the what parameter.

> x <- scan (what="character")
1: a b c g 
5: cos
6: 
Read 5 items
> x
[1] "a"   "b"   "c"   "g"   "cos"
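
The same parsing can be tried without typing anything interactively, because scan() also accepts the input as a string through its text argument. This is only a sketch for experimenting in scripts; quiet = TRUE suppresses the "Read n items" message.

```r
# Numbers are the default type
x <- scan(text = "5 3 7 8 9 10", quiet = TRUE)
x
## [1]  5  3  7  8  9 10

# Other types are requested with the what parameter, as in the console
w <- scan(text = "a b c g cos", what = "character", quiet = TRUE)
w
## [1] "a"   "b"   "c"   "g"   "cos"
```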


Reading data from text files

One of the most frequently used functions for reading data from a file is read.table(), which creates a data frame from the loaded file. This means that each line in the file should contain the same number of fields, and each column must contain the same data type.

# "df1.dat"

1 "Patient 1" 2 0.5
20 "Patient 10" 10 0.11111
30 "No name" 1 0.99

df <- read.table("df1.dat"); df
##   V1         V2 V3      V4
## 1  1  Patient 1  2 0.50000
## 2 20 Patient 10 10 0.11111
## 3 30    No name  1 0.99000

If we want to give names to individual columns of the frame, we provide the col.names parameter. Spaces in the names will be replaced with dots.

df <- read.table("df1.dat", col.names=c("id", "name", "degree", "clust coeff")); df
##   id       name degree clust.coeff
## 1  1  Patient 1      2     0.50000
## 2 20 Patient 10     10     0.11111
## 3 30    No name      1     0.99000
df$name
## [1] "Patient 1"  "Patient 10" "No name"

It is also possible to provide column names in the file itself: if the first row has one field fewer than the remaining rows, the function automatically treats the first line as column names and the first column as row names. If the file contains column names but no row names (i.e. the first row has the same number of fields as the others), use the header = TRUE option in read.table().

# "df2.dat"

name degree cluster
1 "Patient 1" 2 0.5
20 "Patient 10" 10 0.11111
30 "No name" 1 0.99

# "df3.dat"

name age weight
"Kowalski" 38 94.3
"Nowak" 25 67.5
"Malinowski" 49 84.7

df2 <- read.table ("df2.dat"); df2
##          name degree cluster
## 1   Patient 1      2 0.50000
## 20 Patient 10     10 0.11111
## 30    No name      1 0.99000
colnames(df2)
## [1] "name"    "degree"  "cluster"
rownames(df2)
## [1] "1"  "20" "30"
df3 <- read.table ("df3.dat", header = TRUE); df3
##         name age weight
## 1   Kowalski  38   94.3
## 2      Nowak  25   67.5
## 3 Malinowski  49   84.7
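
To try both header rules without preparing files by hand, the sketch below writes two tiny files to a temporary location and reads them back. The file contents are shortened versions of the examples above, and the names f_short, f_full, t1 and t2 are made up for this illustration.

```r
# Header row is one field shorter than the data rows
f_short <- tempfile(fileext = ".dat")
writeLines(c('name degree',
             '1 "Patient 1" 2',
             '20 "Patient 10" 10'), f_short)

# Header row has as many fields as the data rows
f_full <- tempfile(fileext = ".dat")
writeLines(c('name age',
             '"Kowalski" 38',
             '"Nowak" 25'), f_full)

# Detected automatically: first line -> column names, first column -> row names
t1 <- read.table(f_short)
rownames(t1)
## [1] "1"  "20"

# Here header = TRUE is required
t2 <- read.table(f_full, header = TRUE)
colnames(t2)
## [1] "name" "age"
```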

If we know the URL of the file, we can pass it directly to read.table() instead of downloading the file and loading it locally.

df2 <- read.table ("https://rpaluch.fens.org.pl/daavir/data/table2.dat")
df2$name
## [1] "Patient 1"  "Patient 10" "No name"
typeof (df2$name)
## [1] "character"

It is also worth mentioning other useful functions, such as read.csv() and read.delim(), which are simple wrappers around read.table() with defaults suited to comma-separated and tab-separated files, respectively.
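
As a minimal illustration of read.csv(), the lines below parse comma-separated text given directly as a string through the text argument, so no file is needed. The data are a shortened, comma-separated version of the df3.dat example, and the names csv_text and people are chosen only for this sketch.

```r
csv_text <- "name,age,weight
Kowalski,38,94.3
Nowak,25,67.5"

# header = TRUE and sep = "," are the defaults of read.csv()
people <- read.csv(text = csv_text)
people$age
## [1] 38 25
```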


Reading non-tabular data

To read non-tabular data from text files or other sources (e.g. URLs or socket connections), the readLines() function can be used. It reads the file line by line and returns a character vector with one element per line read.

gen <- readLines ("https://rpaluch.fens.org.pl/daavir/data/genesis.txt")
gen[1:5]
## [1] "1:1: In the beginning God created the heaven and the earth."                                                                                         
## [2] "1:2: And the earth was without form, and void; and darkness was upon the face of the deep.  And the Spirit of God moved upon the face of the waters."
## [3] "1:3: And God said, Let there be light: and there was light."                                                                                         
## [4] "1:4: And God saw the light, that it was good: and God divided the light from the darkness."                                                          
## [5] "1:5: And God called the light Day, and the darkness he called Night.  And the evening and the morning were the first day."
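
The sketch below shows the same idea on a small temporary file, so that it runs without network access. The three lines of text are placeholders made up for this illustration; they only imitate the "chapter:verse:" layout of the file above.

```r
f <- tempfile(fileext = ".txt")
writeLines(c("1:1: first verse", "1:2: second verse", "1:3: third verse"), f)

# One element of the character vector per line
lines <- readLines(f)
length(lines)
## [1] 3

# The n parameter limits the number of lines read
readLines(f, n = 2)
## [1] "1:1: first verse"  "1:2: second verse"

# Strip the "chapter:verse: " prefix with a regular expression
sub("^[0-9]+:[0-9]+: ", "", lines)
## [1] "first verse"  "second verse" "third verse"
```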


Reading data from xls and xlsx files

To import an xls or xlsx file into R, an additional package is needed, e.g. readxl. It provides the functions read_xls() and read_xlsx(), and the more general read_excel(), which chooses between the two when we don’t know the file format in advance.

library (readxl)

filename <- "POL_NIOT_nov16.xlsx"
sheet <- "NIOTS"
range <- "E1:BO1801"

spreadsheet <- read_xlsx (path = filename,
                          sheet = sheet,
                          range = range,
                          col_names = TRUE)

head (spreadsheet)
## # A tibble: 6 × 63
##        A01     A02    A03        B `C10-C12` `C13-C15`      C16     C17    C18
##      <dbl>   <dbl>  <dbl>    <dbl>     <dbl>     <dbl>    <dbl>   <dbl>  <dbl>
## 1 2343.      6.75  4.88     2.87      4131.     16.7     3.40    1.58   0.633 
## 2   16.1   245.    0.259    4.18        28.5     3.43  363.     83.8    0.715 
## 3    0.632   0.216 8.55     0.0425      90.7     0.135   0.0815  0.0463 0.0252
## 4   29.5     1.92  0.0942 115.          27.1     5.84    5.99    5.74   2.12  
## 5 1059.      8.48  7.84    11.1       2989.     41.3    10.2    11.6    5.51  
## 6   10.8     0.699 0.132    2.33        21.1     8.35    3.05    2.91   1.92  
## # ℹ 54 more variables: C19 <dbl>, C20 <dbl>, C21 <dbl>, C22 <dbl>, C23 <dbl>,
## #   C24 <dbl>, C25 <dbl>, C26 <dbl>, C27 <dbl>, C28 <dbl>, C29 <dbl>,
## #   C30 <dbl>, C31_C32 <dbl>, C33 <dbl>, D35 <dbl>, E36 <dbl>, `E37-E39` <dbl>,
## #   F <dbl>, G45 <dbl>, G46 <dbl>, G47 <dbl>, H49 <dbl>, H50 <dbl>, H51 <dbl>,
## #   H52 <dbl>, H53 <dbl>, I <dbl>, J58 <dbl>, J59_J60 <dbl>, J61 <dbl>,
## #   J62_J63 <dbl>, K64 <dbl>, K65 <dbl>, K66 <dbl>, L68 <dbl>, M69_M70 <dbl>,
## #   M71 <dbl>, M72 <dbl>, M73 <dbl>, M74_M75 <dbl>, N <dbl>, O84 <dbl>, …
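
If the spreadsheet used above is not at hand, the same calls can be tried on an example workbook that ships with readxl. The sketch assumes that readxl is installed and that its bundled file datasets.xlsx is available through readxl_example(); the cell range is arbitrary.

```r
library(readxl)

# Path to an example workbook distributed with the package
path <- readxl_example("datasets.xlsx")

# List the sheet names before deciding what to read
excel_sheets(path)

# read_excel() picks the xls or xlsx reader on its own;
# range limits the import to a rectangle of cells (header row + 3 data rows)
block <- read_excel(path, sheet = 1, range = "A1:C4")
dim(block)
## [1] 3 3
```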


Writing data

In the case of matrices and data frames, the appropriate way to write data is to use the write.table() function. One can drop column and row names by setting col.names=FALSE, row.names=FALSE parameters.

df4 <- data.frame(x=1:3, names=c("Aaaa", "Bbbb", "Ccc")); df4
##   x names
## 1 1  Aaaa
## 2 2  Bbbb
## 3 3   Ccc
write.table (df4, "df4.dat")
A <- matrix(1:10, 2, 5)
write.table (A, "matrix1.dat")
write.table (A, "matrix2.dat", row.names=FALSE, col.names=FALSE)

A useful function when writing data to a file is format(), whose digits argument sets the number of significant digits used for the saved real numbers.

df5 <- data.frame (degrees=seq(0,90,15), radians=seq(0,90,15)*pi/180); df5
##   degrees   radians
## 1       0 0.0000000
## 2      15 0.2617994
## 3      30 0.5235988
## 4      45 0.7853982
## 5      60 1.0471976
## 6      75 1.3089969
## 7      90 1.5707963
write.table (df5, "df5a.dat", row.names=FALSE, quote = FALSE)
write.table (format (df5, digits = 3), "df5b.dat", row.names=FALSE, quote = FALSE)
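
Note that the digits argument of format() counts significant digits rather than decimal places. When a fixed number of decimal places is needed, one option is formatC() with format = "f". The numbers below are arbitrary and serve only to show the difference.

```r
# digits = 3 keeps three significant digits
format(123.456789, digits = 3)
## [1] "123"
format(0.0123456, digits = 3)
## [1] "0.0123"

# formatC() with format = "f" keeps a fixed number of decimal places
fixed <- formatC(c(123.456789, 0.0123456), format = "f", digits = 3)
fixed
## [1] "123.457" "0.012"
```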

Finally, the most direct method is to save a variable to a file with the save() function and restore it later with load(). The object comes back under its original name.

save(df5, file="df")
ls()
##  [1] "a"                 "A"                 "df"               
##  [4] "df2"               "df3"               "df4"              
##  [7] "df5"               "filename"          "gen"              
## [10] "L"                 "l1"                "l2"               
## [13] "m"                 "n"                 "r"                
## [16] "random_matrix_det" "range"             "sheet"            
## [19] "spreadsheet"       "x"
rm(df5)
ls()
##  [1] "a"                 "A"                 "df"               
##  [4] "df2"               "df3"               "df4"              
##  [7] "filename"          "gen"               "L"                
## [10] "l1"                "l2"                "m"                
## [13] "n"                 "r"                 "random_matrix_det"
## [16] "range"             "sheet"             "spreadsheet"      
## [19] "x"
df5
## Error: object 'df5' not found
load("df")
df5
##   degrees   radians
## 1       0 0.0000000
## 2      15 0.2617994
## 3      30 0.5235988
## 4      45 0.7853982
## 5      60 1.0471976
## 6      75 1.3089969
## 7      90 1.5707963
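
A related pair of base R functions is saveRDS() and readRDS(). They store a single object without its name, so the result of reading can be assigned to any variable. A minimal sketch, using a temporary file and a cut-down version of df5:

```r
df5 <- data.frame(degrees = seq(0, 90, 15))
path <- tempfile(fileext = ".rds")

saveRDS(df5, path)         # stores the object itself, without its name
restored <- readRDS(path)  # the caller chooses the new name

identical(df5, restored)
## [1] TRUE
```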


Piping with magrittr

R is a functional language, so code tends to consist of nested function calls, and the resulting layers of parentheses obscure what is actually being done. When the code is complex and functions are nested within each other, it becomes hard both to read and to maintain.

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

round(exp(diff(log(x))), 1)
## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

Instead of such a notation, it is convenient to use the magrittr package, which offers the pipeline operator %>%:

library(magrittr)

x %>%
  log() %>%
  diff() %>%
  exp() %>%
  round(1)
## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

The general idea is that the pipeline operator takes the value on its left side and passes it as the first argument to the function on its right side. The most common case is writing the operation f(x) as x %>% f, e.g.

log (2)
## [1] 0.6931472
2 %>% log
## [1] 0.6931472

It is also possible to replace the operation f(x,y) with x %>% f(y)

round (pi, 6)
## [1] 3.141593
pi %>% round(6)
## [1] 3.141593

If the piped value should not become the first argument, we mark its position with the dot placeholder, as in x %>% f(y, .), which is equivalent to f(y, x):

6 %>% round (pi, digits=.)
## [1] 3.141593

The %>% operator is not the only pipe operator in the magrittr package. If we want to change a variable, i.e. perform an operation on the right side of the operator and then assign the result to the left side, the %<>% operator comes in handy.

x <- 2
x %<>% log()
x
## [1] 0.6931472

The next issue is extracting a specific variable from the larger object on the left and passing it for use on the right. The %$% operator is then used.

df <- data.frame (a = 1:10, b = runif(10,-2,2), c = sample (1:10))

# Full correlation matrix
df %>% cor
##           a          b          c
## a 1.0000000  0.1216815  0.2484848
## b 0.1216815  1.0000000 -0.3528192
## c 0.2484848 -0.3528192  1.0000000
# Now we want to correlate only columns a and b
df %>% cor (a,b)
## Error: object 'b' not found
# But we need %$% operator to do that
df %$% cor (a,b)
## [1] 0.1216815
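
For comparison, base R offers with(), which also evaluates an expression using the columns of a data frame as variables. The sketch below uses a separate data frame d and an arbitrary seed, so the numbers differ from those above.

```r
set.seed(1)  # arbitrary seed, only to make the example reproducible
d <- data.frame(a = 1:10, b = runif(10, -2, 2))

# with() evaluates cor(a, b) with a and b looked up inside d
r_with <- with(d, cor(a, b))

# Same value as spelling out the columns explicitly
all.equal(r_with, cor(d$a, d$b))
## [1] TRUE
```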

It should be mentioned that since version 4.1, R also has a native, built-in pipe operator |>. It is more restrictive than %>%: in lhs |> rhs, the right-hand side must be an explicit function call such as f() or f(y), and the value x of lhs is inserted as its first argument, giving f(x) or f(x, y).

2 |> sqrt()
## [1] 1.414214
pi |> round(6)
## [1] 3.141593

You can read more about the |> operator by typing ?pipeOp in the R console.
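
As a sketch of how the earlier magrittr chain looks with the native pipe (assuming R 4.1 or newer), every step is written as an explicit call. When the value has to land somewhere other than the first argument, one option is an anonymous function called in place; the names res and rounded_pi are chosen only for this illustration.

```r
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Same chain as before, written with the native pipe
res <- x |> log() |> diff() |> exp() |> round(1)
res
## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

# The piped value 6 is used as the digits argument
rounded_pi <- 6 |> (\(d) round(pi, digits = d))()
rounded_pi
## [1] 3.141593
```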


Transforming data with dplyr

Subsetting a variable (i.e. extracting the parts of an object that meet specific conditions), especially the columns of a data frame, can be cumbersome in base R. The same applies to seemingly simple operations such as sorting or selecting specific columns. The dplyr package handles such cases conveniently. Let’s look at a simple example: the data frame df, which we would like to sort by its second column. In base R, we would have to use the following construction:

df[order (df$b),]
##     a          b  c
## 7   7 -1.9022940  6
## 1   1 -1.0842583  4
## 9   9 -0.6998884  9
## 4   4  0.1427654  5
## 2   2  0.4451878  7
## 8   8  0.7853572  8
## 6   6  0.9968008 10
## 5   5  1.0549868  3
## 3   3  1.7279978  1
## 10 10  1.9300347  2

When using the dplyr package, we use the arrange() function specifying the variable we want to use.

library (dplyr)
arrange (df, b)
##     a          b  c
## 1   7 -1.9022940  6
## 2   1 -1.0842583  4
## 3   9 -0.6998884  9
## 4   4  0.1427654  5
## 5   2  0.4451878  7
## 6   8  0.7853572  8
## 7   6  0.9968008 10
## 8   5  1.0549868  3
## 9   3  1.7279978  1
## 10 10  1.9300347  2

In turn, the select() function selects columns from the frame.

df[,c("b","a")]
##             b  a
## 1  -1.0842583  1
## 2   0.4451878  2
## 3   1.7279978  3
## 4   0.1427654  4
## 5   1.0549868  5
## 6   0.9968008  6
## 7  -1.9022940  7
## 8   0.7853572  8
## 9  -0.6998884  9
## 10  1.9300347 10
select (df, b, a)
##             b  a
## 1  -1.0842583  1
## 2   0.4451878  2
## 3   1.7279978  3
## 4   0.1427654  4
## 5   1.0549868  5
## 6   0.9968008  6
## 7  -1.9022940  7
## 8   0.7853572  8
## 9  -0.6998884  9
## 10  1.9300347 10

The filter() function, in turn, selects the rows (records) that satisfy the given conditions.

df[df$a < 6 & df$b < 0,]
##   a         b c
## 1 1 -1.084258 4
filter(df, a < 6, b < 0)
##   a         b c
## 1 1 -1.084258 4

Of course, nothing stops you from using the known pipeline mechanism here:

df %>%
  filter(a < 6, b < 0) %>% 
  select(c, b) %>%
  arrange(desc(b), c)
##   c         b
## 1 4 -1.084258

Another useful function is mutate(), which adds a new column to the frame.

df %<>% mutate(d=a*b+c)
df
##     a          b  c         d
## 1   1 -1.0842583  4  2.915742
## 2   2  0.4451878  7  7.890376
## 3   3  1.7279978  1  6.183993
## 4   4  0.1427654  5  5.571061
## 5   5  1.0549868  3  8.274934
## 6   6  0.9968008 10 15.980805
## 7   7 -1.9022940  6 -7.316058
## 8   8  0.7853572  8 14.282858
## 9   9 -0.6998884  9  2.701005
## 10 10  1.9300347  2 21.300347
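
A single mutate() call can also create several columns at once, and later columns may refer to ones created earlier in the same call. The sketch below assumes dplyr is installed and uses a small data frame d made up for this illustration.

```r
library(dplyr)

d <- data.frame(a = 1:3, b = c(0.5, -1, 2))

d2 <- d %>%
  mutate(ab = a * b,      # new column from existing ones
         ab_sq = ab^2)    # refers to the column created just above
d2$ab_sq
## [1]  0.25  4.00 36.00
```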

A very frequently used dplyr function is summarise(), usually in combination with group_by(). The first one computes summary statistics (e.g. the mean or standard deviation) over the values of a column, and the second one splits the records into groups according to the values in another column, so that the summary is computed separately for each group.

# Load the dataset about flowers
data ("iris")
# Present the data
head (iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary (iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
# Calculating the average length and width of the flower sepal for different iris species
iris %>%
  group_by(Species) %>%
  summarise(mean.sepal.length = mean(Sepal.Length), mean.sepal.width = mean(Sepal.Width))
## # A tibble: 3 × 3
##   Species    mean.sepal.length mean.sepal.width
##   <fct>                  <dbl>            <dbl>
## 1 setosa                  5.01             3.43
## 2 versicolor              5.94             2.77
## 3 virginica               6.59             2.97
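
As a cross-check, the same group means can be obtained in base R, for instance with tapply() for a single column or aggregate() for several columns at once. This is only an alternative route to the same numbers, not a replacement for the dplyr approach.

```r
# Mean sepal length per species
sepal_means <- tapply(iris$Sepal.Length, iris$Species, mean)
round(sepal_means, 2)
##     setosa versicolor  virginica 
##       5.01       5.94       6.59

# Both sepal columns at once, one row per species
sepal_table <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                         data = iris, FUN = mean)
```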


Tidying data with tidyr

The dplyr package expects the data to be tidy, i.e. organized according to a certain pattern: each variable has its own column and each observation or case has its own row. A data set organized in this way is sometimes referred to as a long table, as opposed to a wide table. The example below shows the same dataset in both forms.

# Wide table
rank.wide <- read.table("rank_vs_rate.txt", header=TRUE)
rank.wide
##   id rate0.1 rate0.2 rate0.3
## 1  1      25      10       2
## 2  2      32      21       5
## 3  3      18       8       1
## 4  4      22      14       1
## 5  5      47      28       9
## 6  6      31      12       3
# Long table - tidy data
rank.long <- read.table("rank.txt", header=TRUE)
rank.long
##    id rate rank
## 1   1  0.1   25
## 2   2  0.1   32
## 3   3  0.1   18
## 4   4  0.1   22
## 5   5  0.1   47
## 6   6  0.1   31
## 7   1  0.2   10
## 8   2  0.2   21
## 9   3  0.2    8
## 10  4  0.2   14
## 11  5  0.2   28
## 12  6  0.2   12
## 13  1  0.3    2
## 14  2  0.3    5
## 15  3  0.3    1
## 16  4  0.3    1
## 17  5  0.3    9
## 18  6  0.3    3

The tidyr package makes it easy to switch from one form to the other. The pivot_longer() and pivot_wider() functions are used for this purpose.

library(tidyr)
rank.longer <- pivot_longer(rank.wide, 
                            cols=2:4, 
                            names_to = "rate", 
                            values_to = "rank", 
                            names_prefix = "rate", 
                            names_transform = list(rate=as.numeric))
rank.longer
## # A tibble: 18 × 3
##       id  rate  rank
##    <int> <dbl> <int>
##  1     1   0.1    25
##  2     1   0.2    10
##  3     1   0.3     2
##  4     2   0.1    32
##  5     2   0.2    21
##  6     2   0.3     5
##  7     3   0.1    18
##  8     3   0.2     8
##  9     3   0.3     1
## 10     4   0.1    22
## 11     4   0.2    14
## 12     4   0.3     1
## 13     5   0.1    47
## 14     5   0.2    28
## 15     5   0.3     9
## 16     6   0.1    31
## 17     6   0.2    12
## 18     6   0.3     3
rank.wider <- pivot_wider(rank.long,
                          names_from = rate,
                          names_prefix = "rate",
                          values_from = rank)
rank.wider
## # A tibble: 6 × 4
##      id rate0.1 rate0.2 rate0.3
##   <int>   <int>   <int>   <int>
## 1     1      25      10       2
## 2     2      32      21       5
## 3     3      18       8       1
## 4     4      22      14       1
## 5     5      47      28       9
## 6     6      31      12       3
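
The two functions are inverses of each other, which can be checked with a round trip. The sketch below assumes tidyr is installed and uses a two-row data frame built inline (a cut-down version of the table above), so it runs without the rank files.

```r
library(tidyr)

wide <- data.frame(id = 1:2, rate0.1 = c(25L, 32L), rate0.2 = c(10L, 21L))

# Wide -> long: all columns except id are stacked into rate/rank pairs
long <- pivot_longer(wide, cols = -id,
                     names_to = "rate", names_prefix = "rate",
                     values_to = "rank")

# Long -> wide: the rate values become column names again
back <- pivot_wider(long, names_from = rate, names_prefix = "rate",
                    values_from = rank)

# The round trip restores the original columns and values
all.equal(as.data.frame(back), wide)
## [1] TRUE
```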