DATA ANALYSIS AND VISUALIZATION IN R, WINTER 2024 EDITION



Working in the console and in RStudio


Console
  • We start a new R session in the console/terminal by entering the R command.
  • To close the session, type quit() or q() in the console.
  • A command is executed after pressing the Enter key.
  • Several commands can be written on one line if they are separated by semicolons. A semicolon at the end of a line is redundant and is ignored.
2*pi;cos(1)
## [1] 6.283185
## [1] 0.5403023
  • The basic assignment operators are =, <- and <<- (the rightward forms -> and ->> also exist):
    • <- is the standard assignment operator in R and has higher precedence than =,
    • <<- is used inside a function to overwrite a variable defined outside it (typically a global variable),
    • = can also assign at the top level, but it is mainly used for passing argument values to a function.
a <- 5
2 -> b
c = 3
a; b; c
## [1] 5
## [1] 2
## [1] 3
b <- a = 10 # gives error
## Error in b <- a = 10: could not find function "<-<-"
a; b
## [1] 5
## [1] 2
b = a <- 10
a; b
## [1] 10
## [1] 10
increment <- function(x){
  x <<- x+1
}
x <- 1; increment (x); x
## [1] 2
rnorm (n = 3, mean = 2, sd = 0.5)
## [1] 2.698830 2.821433 1.984869
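The difference between = used for argument passing and <- used for assignment can be seen in a small sketch. The helper assign_demo() and the names inside it are hypothetical and serve only to keep the demonstration variables local.

```r
# Hypothetical helper: keeps the demonstration variables out of the global workspace.
assign_demo <- function() {
  here <- environment()                      # the function's own environment
  m1 <- mean(x = c(1, 2, 3))                 # `=` only names the argument x of mean()
  x_after_eq <- exists("x", envir = here, inherits = FALSE)
  m2 <- mean(x <- c(4, 5, 6))                # `<-` creates x here and passes its value on
  x_after_arrow <- exists("x", envir = here, inherits = FALSE)
  list(m1 = m1, x_after_eq = x_after_eq,
       m2 = m2, x_after_arrow = x_after_arrow)
}

res <- assign_demo()
str(res)
```

Only the second call leaves a variable x behind, which is why = is the usual choice inside function calls.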
  • Writing and executing code line by line in the console is not an efficient way of programming in R. It is better to put the code in a script and run it with a single command, source().
########################## test_script.R ##############################

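# Function
f <- function(x, y) {
  x <- 2*x    # changes only the local copy of x
  y <<- 2*y   # overwrites the global variable y
}

# Main part of the script

x <- 2
y <- 2

print(x)
print(y)
x   # not printed: source() does not auto-print values by default

f(2,2)

cat("x =",x,"\n")
cat("y =",y,"\n")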

########################## Script execution ###############################
source("test_script.R")
## [1] 2
## [1] 2
## x = 2 
## y = 4


RStudio
  • A comment in the script begins with the # character.
  • To run a script opened in RStudio, press the Source button.
  • You can run a selected chunk of code by highlighting it and pressing Ctrl+Enter or the Run button.
  • The print() and cat() functions are used to print variable values to the screen.
x <- 2
y <- x+1
z <- x^2

cat("y =",y,"\n")
## y = 3
print(z)
## [1] 4


Data types


Atomic
Numeric
  • By default, numbers in R are stored as real (double-precision) values.
a <- 10
a; typeof(a)
## [1] 10
## [1] "double"
  • To create an integer value, add the letter L after the number.
b <- 10L
typeof(b)
## [1] "integer"
typeof(b+1)
## [1] "double"
typeof(b+1L)
## [1] "integer"
  • You can create a complex number using the format a+bi.
d <- 2+3i
d; typeof(d)
## [1] 2+3i
## [1] "complex"
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN
sqrt(-1+0i)
## [1] 0+1i
  • Scientific notation is allowed.
a <- 2.3e3
a
## [1] 2300
  • The special values are NaN (Not a Number) and the infinities Inf and -Inf.
1/0; exp(-Inf); 0 * Inf
## [1] Inf
## [1] 0
## [1] NaN
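Because the default numeric type is a double, a few consequences are worth checking in practice. The following is a minimal sketch with made-up variable names; it uses only base R functions.

```r
# Doubles are binary floating-point numbers, so some decimal sums are inexact.
exact_equal <- (0.1 + 0.2 == 0.3)                  # FALSE
near_equal  <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance

# Converting a double to an integer truncates toward zero.
truncated <- as.integer(3.9)                       # 3L

# Testing for the special values.
vals <- c(1, NaN, Inf, -Inf)
is.nan(vals)      # TRUE only for NaN
is.finite(vals)   # TRUE only for ordinary numbers
```

For comparing doubles, all.equal() is usually a safer choice than ==.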
Character
  • A text value (string) starts and ends with a single quote ' or a double quote ".
string <- "Alice has a cat"; string
## [1] "Alice has a cat"
string <- 'R package'; string; typeof (string)
## [1] "R package"
## [1] "character"
  • The paste() function is used to combine strings.
word1 <- "I"
word2 <- "like"
word3 <- "trains"
paste (word1, word2, word3, sep = " ")
## [1] "I like trains"
  • A character value is a single string, not a vector of individual characters, so its length is 1.
length (word3)
## [1] 1
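To work with the individual characters of a string, base R provides nchar() and strsplit(). A short sketch (the variable names are illustrative):

```r
word <- "trains"
length(word)                       # 1: a single string
nchar(word)                        # 6: number of characters in the string
chars <- strsplit(word, "")[[1]]   # split into a vector of single characters
chars
length(chars)                      # 6
shout <- paste0(toupper(word), "!")   # paste0() is paste() with sep = ""
shout
```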
Logical
  • Represents logical true (TRUE or T) and false (FALSE or F).
  • When used in an arithmetic expression, TRUE is automatically converted to 1 and FALSE to 0.
1 == 7
## [1] FALSE
z <- 1 == 1
z; typeof(z)
## [1] TRUE
## [1] "logical"
y <- (1 == 1) + 1
y; typeof(y) 
## [1] 2
## [1] "double"
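The conversion of logical values to 0 and 1 is handy for counting. The sketch below uses made-up temperature values:

```r
temps <- c(12, 25, 31, 18, 27)   # made-up example values
hot <- temps > 20                # logical vector: FALSE TRUE TRUE FALSE TRUE
n_hot <- sum(hot)                # number of TRUE values
share_hot <- mean(hot)           # proportion of TRUE values
n_hot; share_hot
any(hot); all(hot)               # is at least one / is every element TRUE?
```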


Data structures
Vector
  • An ordered set of objects of the same type (except for NA - Not Available).
  • Primary data structure in R: operations performed on vectors are the most efficient.
  • The c() function creates a vector from individual elements of the same type.
v <- c(-1,2,5)
v
## [1] -1  2  5
  • Arithmetic sequences can be easily generated by the seq() function or with a simple colon.
u <- 1:10
u
##  [1]  1  2  3  4  5  6  7  8  9 10
w <- seq(-10,10,2)
w
##  [1] -10  -8  -6  -4  -2   0   2   4   6   8  10
  • The rep() function is used to generate vectors with repetitions.
x <- rep(TRUE, 5)
x
## [1] TRUE TRUE TRUE TRUE TRUE
y <- rep(c(1,2,3),3)
y
## [1] 1 2 3 1 2 3 1 2 3
z <- rep(c(1,2,3), each=3)
z
## [1] 1 1 1 2 2 2 3 3 3
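Since all elements of a vector must share one type, c() silently converts mixed input to the most general type. A minimal sketch (variable names are illustrative):

```r
mixed_num <- c(1L, 2.5, TRUE)    # integer and logical are promoted to double
typeof(mixed_num)                # "double"
mixed_chr <- c(1, "a", TRUE)     # everything becomes character
mixed_chr
with_na <- c(1, NA, 3)           # NA adapts to the type of the vector
typeof(with_na)                  # "double"
grid <- seq(0, 1, length.out = 5)   # seq() can also fix the number of elements
grid
```

Checking typeof() after building a vector from mixed sources is a cheap way to catch unintended conversions.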
Factor
  • Useful for storing vectors of values occurring at several levels.
  • It is used to represent categorical and qualitative data.
  • Created using the factor() function.
  • The levels() function returns the levels.
education <- factor (c ("primary", "tertiary", "secondary", "secondary", "tertiary", "secondary"))
education
## [1] primary   tertiary  secondary secondary tertiary  secondary
## Levels: primary secondary tertiary
levels (education)
## [1] "primary"   "secondary" "tertiary"
  • It can take up less memory than the corresponding character vector, because it is stored as integer codes (1, 2, 3, ...) that point to the levels. Arithmetic operations cannot be performed on a factor.
typeof (education)
## [1] "integer"
education+1
## Warning in Ops.factor(education, 1): '+' not meaningful for factors
## [1] NA NA NA NA NA NA
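By default the levels are sorted alphabetically. When the categories have a natural order, it can be given explicitly. The sketch below assumes the same education categories as above; the variable name edu is illustrative.

```r
edu <- factor(c("primary", "tertiary", "secondary", "secondary"),
              levels = c("primary", "secondary", "tertiary"),
              ordered = TRUE)
codes <- as.integer(edu)          # the underlying integer codes
codes
counts <- table(edu)              # number of observations per level
counts
below <- edu < "tertiary"         # comparisons are meaningful for ordered factors
below
as.character(edu)                 # back to plain strings
```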
List
  • An ordered set of objects (e.g. vectors) that can be of various types and any length.
  • Created using the list() function.
L <- list (int = 1:10, x = 2.71, text = c("a", "b", "c"), logic = rep(T, 5))
L
## $int
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $x
## [1] 2.71
## 
## $text
## [1] "a" "b" "c"
## 
## $logic
## [1] TRUE TRUE TRUE TRUE TRUE
Matrix
  • The two-dimensional matrix is created by the matrix() function.
A <- matrix (0, 2, 3); A
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
A <- matrix (1:8, 4, 2); A
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8
A <- matrix (c ("a", "b", "c", "d"),2 , 2); A
##      [,1] [,2]
## [1,] "a"  "c" 
## [2,] "b"  "d"
  • The matrix is filled column by column by default; this can be changed by setting the parameter byrow = TRUE.
A <- matrix (1:8, 4, 2, byrow = TRUE); A
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
## [4,]    7    8
  • In the case of multidimensional arrays (D > 2), we use the array() function.
B <- array (1:27, dim = c (3, 3, 3)); B 
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   10   13   16
## [2,]   11   14   17
## [3,]   12   15   18
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   19   22   25
## [2,]   20   23   26
## [3,]   21   24   27
Data frame
  • A list of vectors of the same length.
  • Elements in each column are of the same type.
  • Elements in different columns can be of different types.
  • Very often used as a fundamental type in various R packages (e.g. ggplot2).
  • Created using the data.frame() function.
frame <- data.frame (numbers = 5:1, logic = T, text = letters[1:5]); frame
##   numbers logic text
## 1       5  TRUE    a
## 2       4  TRUE    b
## 3       3  TRUE    c
## 4       2  TRUE    d
## 5       1  TRUE    e
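A few common first steps with a data frame are sketched below: inspecting its structure, adding a column and filtering rows. The name demo_frame and the new column squared are illustrative.

```r
demo_frame <- data.frame(numbers = 5:1, logic = TRUE, text = letters[1:5])
str(demo_frame)                               # column types at a glance
dim(demo_frame)                               # rows and columns
demo_frame$squared <- demo_frame$numbers^2    # add a new column
big <- demo_frame[demo_frame$numbers > 2, ]   # keep rows where numbers > 2
big
sapply(demo_frame, class)                     # class of each column
```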


Indexing in R

w <- 11:20
w[1:5]
## [1] 11 12 13 14 15
w[-1]
## [1] 12 13 14 15 16 17 18 19 20
w[c(1:4,8)]
## [1] 11 12 13 14 18
w[c(-2,-5)]
## [1] 11 13 14 16 17 18 19 20
M <- matrix(1:9, 3, 3)
M
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
# First row
M[1,]
## [1] 1 4 7
# First column
M[,1]
## [1] 1 2 3
# Two first rows
M[1:2,] 
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
# Omit third column
M[,-3]
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
# Omit second row and second column
M[-2,-2] 
##      [,1] [,2]
## [1,]    1    7
## [2,]    3    9
named.vec <- c (v1 = 1, v2 = 0, v3 = 2, v4 = -1)
named.vec
## v1 v2 v3 v4 
##  1  0  2 -1
names (named.vec)
## [1] "v1" "v2" "v3" "v4"
named.vec["v1"]
## v1 
##  1
named.vec["v3"]
## v3 
##  2
colnames (M) <- c ("col1", "col2", "col3") 
rownames (M) <- c ("row1", "row2", "row3")
M
##      col1 col2 col3
## row1    1    4    7
## row2    2    5    8
## row3    3    6    9
names (M)
## NULL
M["row2",]
## col1 col2 col3 
##    2    5    8
M[,"col3"]
## row1 row2 row3 
##    7    8    9
M["row1","col2"]
## [1] 4
L$int
##  [1]  1  2  3  4  5  6  7  8  9 10
L[1]
## $int
##  [1]  1  2  3  4  5  6  7  8  9 10
L[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
frame
##   numbers logic text
## 1       5  TRUE    a
## 2       4  TRUE    b
## 3       3  TRUE    c
## 4       2  TRUE    d
## 5       1  TRUE    e
# First three rows
frame[1:3,]
##   numbers logic text
## 1       5  TRUE    a
## 2       4  TRUE    b
## 3       3  TRUE    c
# Second column
frame[,2]
## [1] TRUE TRUE TRUE TRUE TRUE
# First column
frame$numbers
## [1] 5 4 3 2 1
w[6:10][1:2]
## [1] 16 17
L[[3]][2:3]
## [1] "b" "c"
frame$numbers[1:3]
## [1] 5 4 3
frame[1:3,][1]
##   numbers
## 1       5
## 2       4
## 3       3
frame[1:3,]["numbers"]
##   numbers
## 1       5
## 2       4
## 3       3
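Besides positions and names, a logical vector can be used as an index, and for lists single and double brackets behave differently. A short sketch with illustrative names:

```r
w_demo <- 11:20
above <- w_demo[w_demo > 15]          # keep elements greater than 15
above
evens <- w_demo[w_demo %% 2 == 0]     # keep even elements
evens

lst <- list(int = 1:3, text = c("a", "b"))
class(lst["int"])       # "list": single brackets return a sub-list
class(lst[["int"]])     # "integer": double brackets return the element itself
lst[["text"]][2]        # "b"
```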


Operations on vectors and matrices

Let us define the following vectors w, v and matrices A, B:

w <- c (1,2)
v <- c (3,4)
A <- matrix (1:4, 2, 2)
B <- matrix (4:1, 2, 2)
w; v; A; B
## [1] 1 2
## [1] 3 4
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##      [,1] [,2]
## [1,]    4    2
## [2,]    3    1

We can perform the following operations on vectors:

w + v
## [1] 4 6
w + 5
## [1] 6 7
2 * w
## [1] 2 4
w %*% v
##      [,1]
## [1,]   11
w * v
## [1] 3 8

Similar operations can be performed on matrices. In addition, several other very useful functions are available:

A + B
##      [,1] [,2]
## [1,]    5    5
## [2,]    5    5
A + 1
##      [,1] [,2]
## [1,]    2    4
## [2,]    3    5
A * 2
##      [,1] [,2]
## [1,]    2    6
## [2,]    4    8
t (A)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
det (A)
## [1] -2
A %*% B
##      [,1] [,2]
## [1,]   13    5
## [2,]   20    8
eigen (A)
## eigen() decomposition
## $values
## [1]  5.3722813 -0.3722813
## 
## $vectors
##            [,1]       [,2]
## [1,] -0.5657675 -0.9093767
## [2,] -0.8245648  0.4159736
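Two related points are worth a quick sketch: solve() computes a matrix inverse or solves a linear system, and * is element-wise while %*% is the matrix product. The names A_demo, A_inv, rhs and sol are illustrative.

```r
A_demo <- matrix(1:4, 2, 2)
A_inv <- solve(A_demo)            # inverse of A_demo
A_inv %*% A_demo                  # identity matrix, up to rounding error
rhs <- c(1, 2)
sol <- solve(A_demo, rhs)         # solves A_demo %*% sol = rhs
sol
A_demo * A_demo                   # element-wise product
A_demo %*% A_demo                 # matrix product
recycled <- c(1, 2, 3, 4) + c(10, 20)   # the shorter vector is recycled
recycled
```

Recycling is also what makes expressions such as w + 5 work: the single number is reused for every element.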


Loops and conditional statements


FOR and WHILE loops
x <- 1:10
for(i in x) print(i)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
x <- 1
while(x < 5) {
  print(x)
  x <- x + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
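When a loop needs the position of each element, seq_along() is a safe way to generate the indexes. In many cases a vectorized expression gives the same result without a loop. A minimal sketch with made-up values:

```r
values <- c(3, 8, 1)
squares <- numeric(length(values))   # preallocate the result vector
for (i in seq_along(values)) {
  squares[i] <- values[i]^2
}
squares
vectorized <- values^2               # same result, no loop needed
vectorized

empty <- numeric(0)
for (i in seq_along(empty)) print(i)   # body never runs; 1:length(empty) would yield 1, 0
```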


IF… ELSE… conditional statement
x <- 5
if(x < 5) print(x) else print(x ^ 2)
## [1] 25

The condition must have length one. Since R 4.2.0 a longer condition raises an error; earlier versions used only the first element and issued a warning.

x <- 1:10
if(x %% 3) {
  print("It is not divisible by 3")
} else {
  print("It is divisible by 3")
}
## Error in if (x%%3) {: the condition has length > 1
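One way to use a vector in an if statement is to reduce it to a single logical value with any() or all(). A sketch (the message texts are made up):

```r
x <- 1:10
if (any(x %% 3 == 0)) {
  msg <- "At least one element is divisible by 3"
} else {
  msg <- "No element is divisible by 3"
}
print(msg)

all_positive <- all(x > 0)    # TRUE only if every element is positive
if (all_positive) print("All elements are positive")
```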


IFELSE(…,…,…) function

Conditions of length greater than one can be used in the ifelse() function, which checks the condition for each element of a given vector.

x <- 1:10
ifelse(x %% 3, "It is not divisible by 3", "It is divisible by 3")
##  [1] "It is not divisible by 3" "It is not divisible by 3"
##  [3] "It is divisible by 3"     "It is not divisible by 3"
##  [5] "It is not divisible by 3" "It is divisible by 3"    
##  [7] "It is not divisible by 3" "It is not divisible by 3"
##  [9] "It is divisible by 3"     "It is not divisible by 3"
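The example above relies on non-zero numbers being treated as TRUE. Writing the comparison explicitly is often easier to read, and ifelse() calls can be nested for more than two outcomes. A sketch with illustrative labels:

```r
x <- 1:10
parity <- ifelse(x %% 2 == 0, "even", "odd")
parity
size <- ifelse(x < 4, "small",
               ifelse(x < 8, "medium", "large"))   # nested: three categories
size
```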


Selected basic functions


Vector handling functions

The following functions are very useful when processing data in vector format:

  • number of vector elements,
x <- c (2,-1,0,3,-5)
length (x)
## [1] 5
  • average value of vector elements,
mean (x)
## [1] -0.2
  • standard deviation of vector elements,
sd (x)
## [1] 3.114482
  • reversing the order of vector elements,
rev (x)
## [1] -5  3  0 -1  2
  • sum of vector elements,
sum (x)
## [1] -1
  • cumulative sum of vector elements,
cumsum (x)
## [1]  2  1  1  4 -1
  • product of vector elements,
prod (x)
## [1] 0
  • cumulative product of vector elements,
cumprod (x)
## [1]  2 -2  0  0  0
  • smallest element of a vector,
min (x)
## [1] -5
  • index of the smallest element of a vector,
which.min (x)
## [1] 5
  • largest element of a vector,
max (x)
## [1] 3
  • index of the largest element of a vector,
which.max (x)
## [1] 4
  • a function that arranges elements in ascending or descending order,
sort (x)
## [1] -5 -1  0  2  3
sort (x, decreasing = TRUE)
## [1]  3  2  0 -1 -5
sort (x, index.return = TRUE)
## $x
## [1] -5 -1  0  2  3
## 
## $ix
## [1] 5 2 3 1 4
  • Many of the above functions (e.g. sum(), mean(), sd(), min(), max()) return NA if the vector contains NA (Not Available) elements, unless the parameter na.rm = TRUE is used.
y <- c(1, NA, 2, 5, 7)
sum(y)
## [1] NA
mean(y)
## [1] NA
sum(y, na.rm = TRUE)
## [1] 15
mean(y, na.rm = TRUE)
## [1] 3.75
  • The which() function returns the indexes of the elements that meet a given condition.
which(y > 2)
## [1] 4 5
which(y == 2)
## [1] 3
which(y == NA)
## integer(0)
  • Calling which(y == NA) will not give the expected result, because any comparison with NA yields NA rather than TRUE. To find the indexes of NA, NaN or infinite elements, use the is.na(), is.nan() and is.infinite() functions inside which(); is.finite() identifies the ordinary finite values.
z <- c(0/0, NA, 1/0, -1/0, 10, 15); z 
## [1]  NaN   NA  Inf -Inf   10   15
is.na(z)
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
is.nan(z)
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE
is.infinite(z)
## [1] FALSE FALSE  TRUE  TRUE FALSE FALSE
which(is.na(z))
## [1] 1 2
which(is.nan(z))
## [1] 1
which(is.infinite(z))
## [1] 3 4
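The same predicates can be used directly as a logical index, for example to keep only the ordinary values before computing a statistic. A sketch, using a copy of the vector z from above under the illustrative name z_demo:

```r
z_demo <- c(0/0, NA, 1/0, -1/0, 10, 15)
NA == NA                            # NA: a comparison with NA is never TRUE
clean <- z_demo[is.finite(z_demo)]  # drops NaN, NA, Inf and -Inf
clean
n_missing <- sum(is.na(z_demo))     # counts NA and NaN
n_missing
mean(clean)
anyNA(z_demo)                       # quick check for any missing value
```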


Matrix handling functions

Some of the above functions can also be applied to matrices. However, which.min() and which.max() return only the position in the underlying vector (counted column by column); to obtain the row and column indexes, additionally use the arrayInd() function.

A <- matrix(1:16, 4, 4)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
length (A)
## [1] 16
sum (A)
## [1] 136
mean (A)
## [1] 8.5
sd (A)
## [1] 4.760952
min (A)
## [1] 1
max (A)
## [1] 16
which.min (A)
## [1] 1
which.max (A)
## [1] 16
arrayInd (which.min (A), dim (A))
##      [,1] [,2]
## [1,]    1    1
arrayInd (which.max (A), dim (A))
##      [,1] [,2]
## [1,]    4    4
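As an alternative sketch, which() with arr.ind = TRUE returns row and column indexes directly, and rowSums(), colMeans() and apply() summarize a matrix by rows or columns. The names A_demo and pos are illustrative.

```r
A_demo <- matrix(1:16, 4, 4)
pos <- which(A_demo == max(A_demo), arr.ind = TRUE)   # row and column of the maximum
pos
row_totals <- rowSums(A_demo)       # sum of each row
row_totals
col_means <- colMeans(A_demo)       # mean of each column
col_means
row_max <- apply(A_demo, 1, max)    # apply max() to each row (1 = rows, 2 = columns)
row_max
```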


Writing your own functions

As in any programming language, creating your own functions is a very important part of working in R. The syntax for creating a function is as follows.

function_name <- function(x, y, ...) {
  ...
  ...
  return (...)
}

The function parameters (arguments) can be any data type or data structure, e.g. vectors, matrices or lists.

multiplication_table <- function (range1, range2) {
  return (range1 %o% range2)
}
multiplication_table (1:10, 1:10)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    2    3    4    5    6    7    8    9    10
##  [2,]    2    4    6    8   10   12   14   16   18    20
##  [3,]    3    6    9   12   15   18   21   24   27    30
##  [4,]    4    8   12   16   20   24   28   32   36    40
##  [5,]    5   10   15   20   25   30   35   40   45    50
##  [6,]    6   12   18   24   30   36   42   48   54    60
##  [7,]    7   14   21   28   35   42   49   56   63    70
##  [8,]    8   16   24   32   40   48   56   64   72    80
##  [9,]    9   18   27   36   45   54   63   72   81    90
## [10,]   10   20   30   40   50   60   70   80   90   100

The return() call is not obligatory - the value of the function is the value of the last expression evaluated in its body. The dot character . is an ordinary character in R names (it is not a member-access operator as in many other languages), so it can be used in the names of variables and functions.

add.two <- function (x, y) {
  x*y
  cos(x)
  x+y
}
add.two (2,5)
## [1] 7
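Function arguments can have default values, and a function that needs to return several results can wrap them in a list. The function describe() below is a hypothetical example, not part of base R.

```r
# Hypothetical example function: summary statistics with default arguments.
describe <- function(x, digits = 2, na.rm = TRUE) {
  if (!is.numeric(x)) stop("x must be numeric")
  # The list is the last expression, so it is the return value.
  list(mean = round(mean(x, na.rm = na.rm), digits),
       sd   = round(sd(x, na.rm = na.rm), digits))
}

res <- describe(c(2, -1, 0, 3, -5))   # defaults: digits = 2, na.rm = TRUE
res$mean
res$sd
m <- describe(c(1, NA, 3), digits = 1)$mean   # override one default by name
m
```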

Arguments passed to a function behave like local copies, so changes made to them inside the function are not visible outside it. If you need to change the value of a global variable from inside a function, use the assignment operator <<-.

f <- function (x, y) {
  x <- x * 2
  y <<- y * 2
}
x <- 2
y <- 2
f(2,2)
x; y
## [1] 2
## [1] 4
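Modifying global variables from inside a function makes code harder to follow, so a common alternative is to return the new value and assign it at the call site. The operator <<- is still useful inside a function that is defined within another function, where it updates a variable of the enclosing function. Both ideas are sketched below; double_it() and make_counter() are hypothetical names.

```r
# Alternative to <<-: return the result and assign it where the function is called.
double_it <- function(value) {
  value * 2
}
total <- 2
total <- double_it(total)
total

# <<- inside a nested function updates `count` in the enclosing function,
# not a global variable.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
counter <- make_counter()
first <- counter()
second <- counter()
first; second
```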


Cheat sheets

Cheat sheets are a popular form of assistance in data analysis in R. They can be found at https://www.rstudio.com/resources/cheatsheets/. Alternatively, open the Help menu in RStudio and select Cheat Sheets from the drop-down list.