Installing additional packages in R is very simple and can be done by
typing the appropriate command in the console:
install.packages("package_name").
The following functions have one common goal: to limit the use of loops in R.
The apply(X, MARGIN, FUN) function computes marginal
summaries (e.g. sum, mean, etc.) of the matrix X:
the MARGIN parameter determines whether FUN is
applied to each row (MARGIN = 1) or to each column
(MARGIN = 2).
## [1] 28 32 36 40
## [1] 10 26 42 58
## [1] 7 8 9 10
## [1] 1.290994 1.290994 1.290994 1.290994
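The four results above are consistent with row sums, column sums, row means and column standard deviations of a 4 x 4 matrix filled with the numbers 1 to 16. The sketch below assumes exactly that matrix; the variable names are illustrative.

```r
# Assumption: a 4 x 4 matrix filled column-wise with 1:16
m <- matrix(1:16, nrow = 4, ncol = 4)

row_sums  <- apply(m, 1, sum)   # MARGIN = 1: one value per row
col_sums  <- apply(m, 2, sum)   # MARGIN = 2: one value per column
row_means <- apply(m, 1, mean)
col_sds   <- apply(m, 2, sd)

print(row_sums)
print(col_sums)
print(row_means)
print(col_sds)
```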
The lapply(X, FUN) function calls FUN
sequentially on each element of a vector or a list
X. As a result it returns a list of the same length as
X. FUN can be a predefined function or an
anonymous one defined in place.
## $int
## [1] 10
##
## $x
## [1] 1
##
## $text
## [1] 3
##
## $logic
## [1] 5
## [[1]]
## [1] 0.005078587 0.660794570 1.245165357 0.887368930 0.842661773 0.716263767
## [7] 1.915931213 0.915662164 1.873249717 0.986089078
##
## [[2]]
## [1] 2.7748339 1.7214926 1.8914082 1.5479707 2.6562717 2.3089072 2.6018279
## [8] 0.8780944 2.1745133 2.3705904
##
## [[3]]
## [1] 2.329021 3.410441 2.573362 3.267112 3.686378 2.821386 2.971045 3.345516
## [9] 2.996008 2.752659
##
## [[4]]
## [1] 4.002492 4.819009 4.074014 5.119237 4.900571 4.140094 3.664646 4.004025
## [9] 4.400554 3.429842
##
## [[5]]
## [1] 5.636694 6.030051 4.392863 4.738748 4.743111 5.631820 4.691662 5.221882
## [9] 5.272831 5.652140
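A minimal sketch of both calling styles follows. The contents of the list L are an assumption, chosen only so that the element names and lengths match the first output above. The second call draws random numbers, so its values will differ from those shown.

```r
# Hypothetical list; only the names and lengths mirror the output above
L <- list(int   = 1:10,
          x     = 3.14,
          text  = c("a", "b", "c"),
          logic = c(TRUE, FALSE, TRUE, TRUE, FALSE))

# Predefined function: length of every element, returned as a list
lengths_list <- lapply(L, length)

# Anonymous function: 10 normal draws with mean 1, 2, ..., 5
set.seed(1)
samples <- lapply(1:5, function(mu) rnorm(10, mean = mu))

str(lengths_list)
str(samples)
```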
The sapply(X, FUN) function is a convenient wrapper around
lapply that simplifies the result, where possible, to a
vector, matrix or array instead of returning a list.
## [1] 1.004827 2.092591 3.015293 4.255448 5.201180
## [1] 1.00 0.75 0.25 0.00 0.25 0.75 1.00 0.75 0.25 0.00 0.25 0.75 1.00
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]   10   20   30   40   50   60   70   80   90   100
##  [2,]    9   18   27   36   45   54   63   72   81    90
##  [3,]    8   16   24   32   40   48   56   64   72    80
##  [4,]    7   14   21   28   35   42   49   56   63    70
##  [5,]    6   12   18   24   30   36   42   48   54    60
##  [6,]    5   10   15   20   25   30   35   40   45    50
##  [7,]    4    8   12   16   20   24   28   32   36    40
##  [8,]    3    6    9   12   15   18   21   24   27    30
##  [9,]    2    4    6    8   10   12   14   16   18    20
## [10,]    1    2    3    4    5    6    7    8    9    10
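The simplification can be sketched as follows: when FUN returns a single number, sapply yields a vector; when it returns a vector of fixed length, the results become the columns of a matrix. The inputs are assumptions. The second call happens to reproduce the 10 x 10 matrix above, but it is not necessarily the call that produced it.

```r
set.seed(1)
samples <- lapply(1:5, function(mu) rnorm(10, mean = mu))

# One number per element -> numeric vector of length 5
means <- sapply(samples, mean)

# One vector of length 10 per element -> 10 x 10 matrix (one column per i)
mult <- sapply(1:10, function(i) i * (10:1))

print(means)
print(dim(mult))
```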
The mclapply(X, FUN) function from the parallel package
can be used to distribute computation over multiple cores. It relies on
forking and hence cannot run in parallel on Windows, where only
mc.cores = 1 is supported.
library(parallel)

random_matrix_det <- function(n, lower, upper) {
  det(matrix(runif(n * n, lower, upper), n, n))
}

system.time(
  l1 <- lapply(1:100, function(x) random_matrix_det(500, 0, 0.01))
)
##    user  system elapsed
##   2.733   0.017   2.753
## user system elapsed
## 0.003 0.003 0.747
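The parallel counterpart of the lapply call can be sketched as below. The number of cores and the smaller problem size are assumptions made so that the example runs quickly. They are not the settings behind the timings above.

```r
library(parallel)

random_matrix_det <- function(n, lower, upper) {
  det(matrix(runif(n * n, lower, upper), n, n))
}

# Assumption: 2 worker processes; forking is unavailable on Windows,
# so fall back to a single core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Same interface as lapply, plus the mc.cores argument
l2 <- mclapply(1:20, function(x) random_matrix_det(50, 0, 1),
               mc.cores = cores)

length(l2)
```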
Among the set of basic R libraries is the
datasets package, which contains several dozen ready-to-use
data sets. The data() command returns a list of these sets,
and loading specific data into memory is done by passing the set name as
an argument to the data("set_name") function.
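As a short illustration, the iris data set, which ships with the datasets package, can be loaded and inspected like this:

```r
# Load a built-in data set by name and inspect it
data("iris")

dim(iris)       # number of rows and columns
head(iris, 3)   # first three records
```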
Using the scan() function, you can read data directly
from the R console. Successive elements are separated with
spaces or the Enter key. The entry is ended by pressing the
Enter key on an empty line. The data entered in this way
can also be assigned to a variable.
> scan()
1: 5 3
3: 7 8 9 10
7:
Read 6 items
[1] 5 3 7 8 9 10
> x <- scan()
1: 6 3 9 10 23 -9
7:
Read 6 items
> x
[1] 6 3 9 10 23 -9
Inputting other types of variables is also supported by specifying
the what parameter.
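A non-interactive sketch of the what parameter is given below. It uses the text argument of scan() in place of console input so that it can be run as a script. The input strings are made up for illustration.

```r
# Read character data instead of the default numeric type
words <- scan(text = "alpha beta gamma", what = character(), quiet = TRUE)

# A list in 'what' reads records with fields of different types
mixed <- scan(text = "1 a 2 b",
              what = list(id = integer(), label = character()),
              quiet = TRUE)

print(words)
str(mixed)
```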
One of the most frequently used functions for reading data from a
file is read.table(), which creates a data frame from the
loaded file. This means that each line in the file should contain the
same number of fields, and all values in a column must be of the same
data type.
##   V1         V2 V3      V4
## 1  1  Patient 1  2 0.50000
## 2 20 Patient 10 10 0.11111
## 3 30    No name  1 0.99000
If we want to give names to individual columns of the frame, we
provide the col.names parameter. Spaces in the names will
be replaced with dots.
##   id       name degree clust.coeff
## 1  1  Patient 1      2     0.50000
## 2 20 Patient 10     10     0.11111
## 3 30    No name      1     0.99000
## [1] "Patient 1" "Patient 10" "No name"
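The two calls can be sketched in a self-contained way by first writing a small file to a temporary location. The file contents are an assumption reconstructed from the output above, and the fourth column name is written with a space to show the replacement with a dot.

```r
# Assumed file contents, written to a temporary file
path <- tempfile(fileext = ".dat")
writeLines(c('1 "Patient 1" 2 0.5',
             '20 "Patient 10" 10 0.11111',
             '30 "No name" 1 0.99'), path)

# Default column names V1, V2, ...
df_default <- read.table(path)

# Custom column names; "clust coeff" becomes "clust.coeff"
df_named <- read.table(path,
                       col.names = c("id", "name", "degree", "clust coeff"))

names(df_default)
names(df_named)
df_named$name
```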
It is also possible to provide column names in the file itself: if
the first row has one field fewer than the following rows, the function
automatically treats the first line as column names and the first column
as row names. If the file contains column names but no row names
(i.e. the first row contains the same number of fields as the others),
then use the header = TRUE option of the read.table()
function.
# "df2.dat"
name degree cluster
1 "Patient 1" 2 0.5
20 "Patient 10" 10 0.11111
30 "No name" 1 0.99

# "df3.dat"
name age weight
"Kowalski" 38 94.3
"Nowak" 25 67.5
"Malinowski" 49 84.7

##          name degree cluster
## 1   Patient 1      2 0.50000
## 20 Patient 10     10 0.11111
## 30    No name      1 0.99000
## [1] "name"    "degree"  "cluster"
## [1] "1"  "20" "30"
##         name age weight
## 1   Kowalski  38   94.3
## 2      Nowak  25   67.5
## 3 Malinowski  49   84.7
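A sketch of both header conventions follows, with the two files recreated in temporary locations, so the paths differ from df2.dat and df3.dat. The variable names are illustrative.

```r
# First row has one field fewer: header and row names are detected automatically
f2 <- tempfile(fileext = ".dat")
writeLines(c('name degree cluster',
             '1 "Patient 1" 2 0.5',
             '20 "Patient 10" 10 0.11111',
             '30 "No name" 1 0.99'), f2)
df2 <- read.table(f2)

# Same number of fields in every row: the header must be requested explicitly
f3 <- tempfile(fileext = ".dat")
writeLines(c('name age weight',
             '"Kowalski" 38 94.3',
             '"Nowak" 25 67.5',
             '"Malinowski" 49 84.7'), f3)
df3 <- read.table(f3, header = TRUE)

names(df2)
row.names(df2)
df3
```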
If we know the URL of the file, we can pass it directly to read.table() instead of downloading the file and loading it locally.
## [1] "Patient 1" "Patient 10" "No name"
## [1] "character"
It is also worth mentioning other useful functions, such as
read.csv() or read.delim(), which are simple
wrappers of read.table() with defaults suited to comma-separated
and tab-separated files, respectively.
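A brief sketch of the two wrappers is shown below. To stay self-contained it reads from in-memory strings through the text argument instead of from files, and the sample records are illustrative.

```r
# Comma-separated input with a header line
csv_text <- "name,age,weight\nKowalski,38,94.3\nNowak,25,67.5"
people <- read.csv(text = csv_text)

# Tab-separated input with a header line
tab_text <- "name\tage\nKowalski\t38"
people_tab <- read.delim(text = tab_text)

people
people_tab
```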
To read non-tabular data from text files or other sources
(e.g. URLs or socket connections), the function readLines()
can be used. This function reads the input line by line and returns a
character vector whose length equals the number of lines read.
## [1] "1:1: In the beginning God created the heaven and the earth."
## [2] "1:2: And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters."
## [3] "1:3: And God said, Let there be light: and there was light."
## [4] "1:4: And God saw the light, that it was good: and God divided the light from the darkness."
## [5] "1:5: And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day."
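A minimal sketch of readLines() on a local file is given below. The file is created in a temporary location and its contents are placeholders. The n argument limits how many lines are read.

```r
# Write a small text file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), path)

all_lines <- readLines(path)          # whole file
first_two <- readLines(path, n = 2)   # only the first two lines

length(all_lines)
first_two
```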
To properly import an xls or xlsx file into R, it is necessary to use
an additional library, e.g. readxl. This package contains
the functions read_xls() and read_xlsx(), and
the more general function read_excel(), which detects the format
itself when we don’t know which of these two functions we should use.
library(readxl)

filename <- "POL_NIOT_nov16.xlsx"
sheet <- "NIOTS"
range <- "E1:BO1801"

spreadsheet <- read_xlsx(path = filename,
                         sheet = sheet,
                         range = range,
                         col_names = TRUE)
head(spreadsheet)
## # A tibble: 6 × 63
## A01 A02 A03 B `C10-C12` `C13-C15` C16 C17 C18
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2343. 6.75 4.88 2.87 4131. 16.7 3.40 1.58 0.633
## 2 16.1 245. 0.259 4.18 28.5 3.43 363. 83.8 0.715
## 3 0.632 0.216 8.55 0.0425 90.7 0.135 0.0815 0.0463 0.0252
## 4 29.5 1.92 0.0942 115. 27.1 5.84 5.99 5.74 2.12
## 5 1059. 8.48 7.84 11.1 2989. 41.3 10.2 11.6 5.51
## 6 10.8 0.699 0.132 2.33 21.1 8.35 3.05 2.91 1.92
## # ℹ 54 more variables: C19 <dbl>, C20 <dbl>, C21 <dbl>, C22 <dbl>, C23 <dbl>,
## # C24 <dbl>, C25 <dbl>, C26 <dbl>, C27 <dbl>, C28 <dbl>, C29 <dbl>,
## # C30 <dbl>, C31_C32 <dbl>, C33 <dbl>, D35 <dbl>, E36 <dbl>, `E37-E39` <dbl>,
## # F <dbl>, G45 <dbl>, G46 <dbl>, G47 <dbl>, H49 <dbl>, H50 <dbl>, H51 <dbl>,
## # H52 <dbl>, H53 <dbl>, I <dbl>, J58 <dbl>, J59_J60 <dbl>, J61 <dbl>,
## # J62_J63 <dbl>, K64 <dbl>, K65 <dbl>, K66 <dbl>, L68 <dbl>, M69_M70 <dbl>,
## # M71 <dbl>, M72 <dbl>, M73 <dbl>, M74_M75 <dbl>, N <dbl>, O84 <dbl>, …
In the case of matrices and data frames, the appropriate way to write
data is to use the write.table() function. Column and row names
can be dropped by setting the parameters
col.names = FALSE and row.names = FALSE.
## x names
## 1 1 Aaaa
## 2 2 Bbbb
## 3 3 Ccc
write.table(df4, "df4.dat")
A <- matrix(1:10, 2, 5)
write.table(A, "matrix1.dat")
write.table(A, "matrix2.dat", row.names = FALSE, col.names = FALSE)
A useful function when writing data to a file is
format(), whose digits argument controls the number of
significant digits of the saved real numbers.
## degrees radians
## 1 0 0.0000000
## 2 15 0.2617994
## 3 30 0.5235988
## 4 45 0.7853982
## 5 60 1.0471976
## 6 75 1.3089969
## 7 90 1.5707963
write.table(df5, "df5a.dat", row.names = FALSE, quote = FALSE)
write.table(format(df5, digits = 3), "df5b.dat", row.names = FALSE, quote = FALSE)
Finally, there is the most direct method, which is to save the
variable to a file using the save() function and restore it later with load().
## [1] "a" "A" "df"
## [4] "df2" "df3" "df4"
## [7] "df5" "filename" "gen"
## [10] "L" "l1" "l2"
## [13] "m" "n" "r"
## [16] "random_matrix_det" "range" "sheet"
## [19] "spreadsheet" "x"
## [1] "a" "A" "df"
## [4] "df2" "df3" "df4"
## [7] "filename" "gen" "L"
## [10] "l1" "l2" "m"
## [13] "n" "r" "random_matrix_det"
## [16] "range" "sheet" "spreadsheet"
## [19] "x"
## Error: object 'df5' not found
## degrees radians
## 1 0 0.0000000
## 2 15 0.2617994
## 3 30 0.5235988
## 4 45 0.7853982
## 5 60 1.0471976
## 6 75 1.3089969
## 7 90 1.5707963
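The sequence behind the output above (save the object, remove it, then restore it) can be sketched as follows. The definition of df5 is an assumption reconstructed from the printed table, and the file is written to a temporary path.

```r
# Assumed definition of df5: angles in degrees and radians
df5 <- data.frame(degrees = seq(0, 90, by = 15))
df5$radians <- df5$degrees * pi / 180

path <- tempfile(fileext = ".RData")
save(df5, file = path)   # write the object, together with its name, to disk

rm(df5)                  # df5 no longer exists in the workspace
exists_after_rm <- exists("df5", inherits = FALSE)

load(path)               # restores the object under its original name
df5
```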
R is a functional language, which means that the code is
often peppered with nested parentheses that obscure
what is actually being done. Especially when the code is complex and
function calls are nested within each other, it becomes hard to read
and error-prone to modify.
## [1] 3.3 1.8 1.6 0.5 0.3 0.1 48.8 1.1
Instead of such a notation, it is convenient to use the
magrittr package, which offers the pipeline operator
%>%:
## [1] 3.3 1.8 1.6 0.5 0.3 0.1 48.8 1.1
The general idea is that the pipe operator takes the value on
the left side and passes it as the first argument to the function (or
operation) on the right side. The most common case is to write the
operation f(x) as x %>% f, e.g.
## [1] 0.6931472
## [1] 0.6931472
It is also possible to replace the operation f(x,y) with
x %>% f(y)
## [1] 3.141593
## [1] 3.141593
If the piped value should not become the first argument, we mark its
position with the dot placeholder: y %>% f(x, .) is evaluated as f(x, y).
## [1] 3.141593
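The three forms can be sketched together as below. The concrete calls (log of 2, rounding pi to 6 digits) are assumptions that happen to match the printed values, and the magrittr package is assumed to be installed.

```r
library(magrittr)

x <- 2

# f(x) written as x %>% f
plain_log <- log(x)
piped_log <- x %>% log

# f(x, y) written as x %>% f(y)
rounded_first <- pi %>% round(6)

# Dot placeholder: the piped value becomes the second argument, round(pi, 6)
rounded_dot <- 6 %>% round(pi, .)

print(piped_log)
print(rounded_first)
print(rounded_dot)
```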
The %>% operator is not the only pipe operator in the
magrittr package. If we want to change a variable,
i.e. perform an operation on the right side of the operator and then
assign the result to the left side, the %<>% operator
comes in handy.
## [1] 0.6931472
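A minimal sketch of the assignment pipe is shown below, assuming the starting value 2 (which is consistent with the result above).

```r
library(magrittr)

x <- 2
x %<>% log   # shorthand for x <- x %>% log
x
```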
The next issue is extracting a specific variable from the larger
object on the left and passing it for use on the right. The
%$% operator is then used.
df <- data.frame (a = 1:10, b = runif(10,-2,2), c = sample (1:10))
# Full correlation matrix
df %>% cor
##           a          b          c
## a 1.0000000 0.1216815 0.2484848
## b 0.1216815 1.0000000 -0.3528192
## c 0.2484848 -0.3528192 1.0000000
## Error: object 'b' not found
## [1] 0.1216815
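The exposition operator can be sketched as follows. The seed is an assumption added for reproducibility, so the numbers will differ from those above.

```r
library(magrittr)

set.seed(42)
df <- data.frame(a = 1:10, b = runif(10, -2, 2), c = sample(1:10))

# %$% exposes the columns of df by name to the expression on the right
r_ab <- df %$% cor(a, b)

# Equivalent base R call for comparison
r_ab_base <- cor(df$a, df$b)

print(r_ab)
```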
It should be mentioned that since version 4.1 R also has its native,
built-in pipe operator |>. However, it is more limited than
%>%: in an expression lhs |> rhs,
lhs is an expression that returns some value
x, and rhs must be a function call, written as
f() or f(y), which is then evaluated as f(x) or f(x, y).
## [1] 1.414214
## [1] 3.141593
You can read more about the |> operator by typing
?pipeOp in the R console.
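A short sketch of the native pipe is given below, requiring R 4.1 or newer. The two calls are assumptions that match the values printed above.

```r
# The right-hand side must be a call with parentheses: 2 |> sqrt is a syntax error
root_two <- 2 |> sqrt()

# Additional arguments follow the piped value: evaluated as round(pi, 6)
rounded_pi <- pi |> round(6)

print(root_two)
print(rounded_pi)
```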
Subsetting (i.e. extracting the parts of an object that meet
certain conditions), especially for the columns of a data frame,
is often cumbersome in base R. The same applies to seemingly
simple operations such as sorting or even selecting specific columns.
The dplyr package is very convenient to handle such cases.
Let’s look at a simple example of a data frame df1 that we
would like to sort by its second column. Normally, we would have to use
the following construction:
## a b c
## 7 7 -1.9022940 6
## 1 1 -1.0842583 4
## 9 9 -0.6998884 9
## 4 4 0.1427654 5
## 2 2 0.4451878 7
## 8 8 0.7853572 8
## 6 6 0.9968008 10
## 5 5 1.0549868 3
## 3 3 1.7279978 1
## 10 10 1.9300347 2
When using the dplyr package, we use the
arrange() function specifying the variable we want to
use.
## a b c
## 1 7 -1.9022940 6
## 2 1 -1.0842583 4
## 3 9 -0.6998884 9
## 4 4 0.1427654 5
## 5 2 0.4451878 7
## 6 8 0.7853572 8
## 7 6 0.9968008 10
## 8 5 1.0549868 3
## 9 3 1.7279978 1
## 10 10 1.9300347 2
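Both ways of sorting can be sketched side by side. The data frame is an assumption rebuilt from the values printed above, and the dplyr package is assumed to be installed.

```r
library(dplyr)

# df1 reconstructed from the printed output (assumption)
df1 <- data.frame(
  a = 1:10,
  b = c(-1.0842583, 0.4451878, 1.7279978, 0.1427654, 1.0549868,
        0.9968008, -1.9022940, 0.7853572, -0.6998884, 1.9300347),
  c = c(4L, 7L, 1L, 5L, 3L, 10L, 6L, 8L, 9L, 2L)
)

# Base R: index the rows with the permutation returned by order()
sorted_base <- df1[order(df1$b), ]

# dplyr: name the column to sort by
sorted_dplyr <- arrange(df1, b)

sorted_dplyr
```

Note that the base R version keeps the original row names, while arrange() renumbers the rows, as in the two outputs above.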
In turn, the select() function selects columns from the
frame.
## b a
## 1 -1.0842583 1
## 2 0.4451878 2
## 3 1.7279978 3
## 4 0.1427654 4
## 5 1.0549868 5
## 6 0.9968008 6
## 7 -1.9022940 7
## 8 0.7853572 8
## 9 -0.6998884 9
## 10 1.9300347 10
## b a
## 1 -1.0842583 1
## 2 0.4451878 2
## 3 1.7279978 3
## 4 0.1427654 4
## 5 1.0549868 5
## 6 0.9968008 6
## 7 -1.9022940 7
## 8 0.7853572 8
## 9 -0.6998884 9
## 10 1.9300347 10
The filter() function, in turn, selects the records (rows)
that satisfy the given conditions.
## a b c
## 1 1 -1.084258 4
## a b c
## 1 1 -1.084258 4
Of course, nothing stops you from using the known pipeline mechanism here:
## c b
## 1 4 -1.084258
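The calls behind these outputs can be sketched as below. The data frame is rebuilt from the printed values, and the filter condition is a hypothetical one that happens to return the same single row. The original condition is not shown in the text.

```r
library(dplyr)

# df1 reconstructed from the printed output (assumption)
df1 <- data.frame(
  a = 1:10,
  b = c(-1.0842583, 0.4451878, 1.7279978, 0.1427654, 1.0549868,
        0.9968008, -1.9022940, 0.7853572, -0.6998884, 1.9300347),
  c = c(4L, 7L, 1L, 5L, 3L, 10L, 6L, 8L, 9L, 2L)
)

# Pick and reorder columns
sel <- select(df1, b, a)

# Keep rows meeting all conditions (hypothetical condition)
flt <- filter(df1, b < 0, c < 5)

# The same steps chained with the pipe
piped <- df1 %>%
  filter(b < 0, c < 5) %>%
  select(c, b)

piped
```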
Another useful function is mutate, which adds a new
column to the frame.
## a b c d
## 1 1 -1.0842583 4 2.915742
## 2 2 0.4451878 7 7.890376
## 3 3 1.7279978 1 6.183993
## 4 4 0.1427654 5 5.571061
## 5 5 1.0549868 3 8.274934
## 6 6 0.9968008 10 15.980805
## 7 7 -1.9022940 6 -7.316058
## 8 8 0.7853572 8 14.282858
## 9 9 -0.6998884 9 2.701005
## 10 10 1.9300347 2 21.300347
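The printed column d is consistent with the formula a * b + c, so a sketch under that assumption could look like this (df1 is again rebuilt from the values above):

```r
library(dplyr)

# df1 reconstructed from the printed output (assumption)
df1 <- data.frame(
  a = 1:10,
  b = c(-1.0842583, 0.4451878, 1.7279978, 0.1427654, 1.0549868,
        0.9968008, -1.9022940, 0.7853572, -0.6998884, 1.9300347),
  c = c(4L, 7L, 1L, 5L, 3L, 10L, 6L, 8L, 9L, 2L)
)

# Add a column computed from the existing ones (assumed formula)
with_d <- mutate(df1, d = a * b + c)

with_d
```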
A very frequently used function from the dplyr package
is summarise() in combination with the
group_by() function. The first one reduces the values of a
specific column to summary statistics (e.g. mean or standard deviation),
and the second one groups the records according to the values in
another column, so that the summary is computed separately for each group.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# Calculating the average length and width of the flower sepal for different iris species
iris %>%
  group_by(Species) %>%
  summarise(mean.sepal.length = mean(Sepal.Length),
            mean.sepal.width = mean(Sepal.Width))
## # A tibble: 3 × 3
## Species mean.sepal.length mean.sepal.width
## <fct> <dbl> <dbl>
## 1 setosa 5.01 3.43
## 2 versicolor 5.94 2.77
## 3 virginica 6.59 2.97
The dplyr package expects the data to be tidy, i.e. organized
according to a certain pattern. This scheme assumes that each variable
has its own column and each observation or case is in a separate row. A
set of data organized in this way is sometimes referred to as a long
table, as opposed to a wide table. The example below
shows the same dataset in these two forms.
## id rate0.1 rate0.2 rate0.3
## 1 1 25 10 2
## 2 2 32 21 5
## 3 3 18 8 1
## 4 4 22 14 1
## 5 5 47 28 9
## 6 6 31 12 3
## id rate rank
## 1 1 0.1 25
## 2 2 0.1 32
## 3 3 0.1 18
## 4 4 0.1 22
## 5 5 0.1 47
## 6 6 0.1 31
## 7 1 0.2 10
## 8 2 0.2 21
## 9 3 0.2 8
## 10 4 0.2 14
## 11 5 0.2 28
## 12 6 0.2 12
## 13 1 0.3 2
## 14 2 0.3 5
## 15 3 0.3 1
## 16 4 0.3 1
## 17 5 0.3 9
## 18 6 0.3 3
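The two tables above can be recreated in base R as sketched below. The names rank.wide and rank.long are assumptions taken from the tidyr calls used later in the text. The construction is illustrative and not necessarily how the tables were originally built.

```r
# Wide form: one row per id, one column per rate
rank.wide <- data.frame(
  id      = 1:6,
  rate0.1 = c(25L, 32L, 18L, 22L, 47L, 31L),
  rate0.2 = c(10L, 21L, 8L, 14L, 28L, 12L),
  rate0.3 = c(2L, 5L, 1L, 1L, 9L, 3L)
)

# Long form: one row per (id, rate) pair, stacking the rate columns
rank.long <- data.frame(
  id   = rep(rank.wide$id, times = 3),
  rate = rep(c(0.1, 0.2, 0.3), each = nrow(rank.wide)),
  rank = c(rank.wide$rate0.1, rank.wide$rate0.2, rank.wide$rate0.3)
)

rank.wide
rank.long
```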
The tidyr package allows you to easily switch from one
form to the other. The pivot_longer() and
pivot_wider() functions are used for this purpose.
library(tidyr)

rank.longer <- pivot_longer(rank.wide,
                            cols = 2:4,
                            names_to = "rate",
                            values_to = "rank",
                            names_prefix = "rate",
                            names_transform = list(rate = as.numeric))
rank.longer
## # A tibble: 18 × 3
## id rate rank
## <int> <dbl> <int>
## 1 1 0.1 25
## 2 1 0.2 10
## 3 1 0.3 2
## 4 2 0.1 32
## 5 2 0.2 21
## 6 2 0.3 5
## 7 3 0.1 18
## 8 3 0.2 8
## 9 3 0.3 1
## 10 4 0.1 22
## 11 4 0.2 14
## 12 4 0.3 1
## 13 5 0.1 47
## 14 5 0.2 28
## 15 5 0.3 9
## 16 6 0.1 31
## 17 6 0.2 12
## 18 6 0.3 3
rank.wider <- pivot_wider(rank.long,
                          names_from = rate,
                          names_prefix = "rate",
                          values_from = rank)
rank.wider
## # A tibble: 6 × 4
## id rate0.1 rate0.2 rate0.3
## <int> <int> <int> <int>
## 1 1 25 10 2
## 2 2 32 21 5
## 3 3 18 8 1
## 4 4 22 14 1
## 5 5 47 28 9
## 6 6 31 12 3