ADVANCED R PROGRAMMING, SUMMER 2025 EDITION

Built-in functions for processing text data
The stringr package
Simple text search
Regular expressions
Web scraping
- HTML and CSS
- The rvest package

Built-in functions for processing text data

String is an appropriately encoded sequence of bytes, i.e. small integers in the range {0,1….255}.

text <- "Advanced R programming"
hex <- charToRaw (text) # in the hexadecimal system (HEX)
as.integer (hex) # in the decimal system (DEC)

##  [1]  65 100 118  97 110  99 101 100  32  82  32 112 114 111 103 114  97 109 109
## [20] 105 110 103

print (rawToChar (hex))

## [1] "Advanced R programming"

The R language has many built-in functions for working with text data, e.g. paste(), substr(), tolower(), toupper().

text <- paste ("Advanced", "R", "programming")
text

## [1] "Advanced R programming"

substr (text, 12, 14)

## [1] "pro"

substr (text, 12, 14) <- "PRO"
text

## [1] "Advanced R PROgramming"

tolower (text)

## [1] "advanced r programming"

toupper (text)

## [1] "ADVANCED R PROGRAMMING"

The sprintf() function, well known to C users, is used to conveniently combine text and numeric values.

sprintf ("Test #%d of %d", 1:3, 3)

## [1] "Test #1 of 3" "Test #2 of 3" "Test #3 of 3"

sprintf ("%-20s%.4f", "data science", pi)

## [1] "data science        3.1416"

sprintf ("SELECT * FROM %s WHERE id IN (%s)", "myTable", paste (sample (1:9, 3), collapse = ", "))

## [1] "SELECT * FROM myTable WHERE id IN (7, 9, 1)"

The functions match (x,table) and pmatch (x,table) return the positions of the elements of the vector x in the vector table. The pmatch() function is especially useful for strings because it allows partial matching. The match() function has a %in% operator associated with it, which returns a logical value instead of an index.

veggies <- c ("carrot", "cauliflower", "potato", "broccoli")
v <- c ("carrot", "car", "broccoli")
match (v, veggies)

## [1]  1 NA  4

v %in% veggies

## [1]  TRUE FALSE  TRUE

pmatch ("car", veggies)

## [1] 1

The stringr package

Some of the built-in text processing functions are not vectorized correctly or do not handle the NA values.

paste (c ("Advanced", "R", "programming", NA))

## [1] "Advanced"    "R"           "programming" "NA"

The above problems were solved in the ‘stringi’ package, co-authored by Marek Gągolewski from the Faculty of MiNI, Warsaw University of Technology. This package has gained huge popularity around the world and has been appreciated by the creators of tidyverse, who have written functions that simplify the use of the package. These functions, called wrappers, are placed in a package called stringr.

Combining strings

str_c() combines multiple character vectors into a single character vector. It’s very similar to paste0() but uses tidyverse recycling and NA rules.

library (stringr)
str_c ("Advanced", "R", "programming", "2025")

## [1] "AdvancedRprogramming2025"

str_c ("Advanced", "R", "programming", "2025", sep = " ")

## [1] "Advanced R programming 2025"

text <- str_c (c ("Advanced", "programming"), c("R", "2025"), sep = " ", collapse = " ")
print (text)

## [1] "Advanced R programming 2025"

length (text)

## [1] 1

str_c (c ("a", "b"), 1:2, sep = ",", collapse = ";")

## [1] "a,1;b,2"

str_flatten() reduces a character vector to a single string. This is a summary function because regardless of the length of the input x, it always returns a single string.

text <- str_flatten (c ("Advanced", "R", "programming", "2025"), collapse = " ")
print (text)

## [1] "Advanced R programming 2025"

length (text)

## [1] 1

Duplicating strings

str_dup ("a", 1:5)

## [1] "a"     "aa"    "aaa"   "aaaa"  "aaaaa"

Substrings

x <- "abc123ąęś"
str_sub (x, 1, 3)

## [1] "abc"

str_sub (x, -3)

## [1] "ąęś"

str_sub (x, -3, -2)

## [1] "ąę"

str_sub (x, -3) <- "AES"
x

## [1] "abc123AES"

Trimming and justification

str_trim ("\t abc\n ", side = "both")

## [1] "abc"

str_squish ("   This   text includes   too much     spaces.   ")

## [1] "This text includes too much spaces."

cat (str_pad (c ("abc", "defghij"), 10, side = "left"), sep = "\n")

##        abc
##    defghij

Simple text search

The stringr package offers the following types (engines) of text searches:

regex - regular expressions, default mode,
fixed - very fast, coding-independent substring search,
coll - coding-dependent (specific language) pattern search (for natural language processing, works slowly),
boundary - searches for fragments constituting boundaries along which the text can be divided into smaller units.

Basic search functions in the stringr package:

str_detect,
str_count,
str_extract, str_extract_all,
str_locate, str_locate_all,
str_replace, str_replace_all,
str_match, str_match_all.

The pattern of using the above functions is very similar. The first argument is the character vector to be searched, and the second is the pattern (pattern), which is a regular expression by default (regex() function), but can be changed to fixed, coll or boundary.

The str_detect() function checks whether a given pattern occurs in a given string:

str_detect ("science", pattern = fixed ("en"))

## [1] TRUE

str_detect (c ("data", "science"), pattern = fixed ("en"))

## [1] FALSE  TRUE

str_detect ("science", fixed (c ("en", "em", NA)))

## [1]  TRUE FALSE    NA

str_detect (c ("data", "science"), fixed (c ("at", "en")))

## [1] TRUE TRUE

The str_count() function calculates the number of times a given pattern appears in a given string:

str_count ("a1a2a3", fixed ("a"))

## [1] 3

fruit

##  [1] "apple"             "apricot"           "avocado"          
##  [4] "banana"            "bell pepper"       "bilberry"         
##  [7] "blackberry"        "blackcurrant"      "blood orange"     
## [10] "blueberry"         "boysenberry"       "breadfruit"       
## [13] "canary melon"      "cantaloupe"        "cherimoya"        
## [16] "cherry"            "chili pepper"      "clementine"       
## [19] "cloudberry"        "coconut"           "cranberry"        
## [22] "cucumber"          "currant"           "damson"           
## [25] "date"              "dragonfruit"       "durian"           
## [28] "eggplant"          "elderberry"        "feijoa"           
## [31] "fig"               "goji berry"        "gooseberry"       
## [34] "grape"             "grapefruit"        "guava"            
## [37] "honeydew"          "huckleberry"       "jackfruit"        
## [40] "jambul"            "jujube"            "kiwi fruit"       
## [43] "kumquat"           "lemon"             "lime"             
## [46] "loquat"            "lychee"            "mandarine"        
## [49] "mango"             "mulberry"          "nectarine"        
## [52] "nut"               "olive"             "orange"           
## [55] "pamelo"            "papaya"            "passionfruit"     
## [58] "peach"             "pear"              "persimmon"        
## [61] "physalis"          "pineapple"         "plum"             
## [64] "pomegranate"       "pomelo"            "purple mangosteen"
## [67] "quince"            "raisin"            "rambutan"         
## [70] "raspberry"         "redcurrant"        "rock melon"       
## [73] "salal berry"       "satsuma"           "star fruit"       
## [76] "strawberry"        "tamarillo"         "tangerine"        
## [79] "ugli fruit"        "watermelon"

str_count (fruit[1:10], fixed ("a"))

##  [1] 1 1 2 3 0 0 1 2 1 0

str_count (fruit[1:10], fixed ("b"))

##  [1] 0 0 0 1 1 2 2 1 1 2

str_count (fruit[1:10], "c")

##  [1] 0 1 1 0 0 0 1 2 0 0

str_count (fruit[1:10], "e")

##  [1] 1 0 0 0 3 1 1 0 1 2

str_count (fruit[1:4], c ("p", "p", "a", "a"))

## [1] 2 1 2 3

The str_extract() function searches for substrings matching a given pattern:

str_extract (str_flatten(fruit, collapse = " "), fixed ("berry"))

## [1] "berry"

str_extract_all (str_flatten(fruit, collapse = " "), fixed ("berry"))

## [[1]]
##  [1] "berry" "berry" "berry" "berry" "berry" "berry" "berry" "berry" "berry"
## [10] "berry" "berry" "berry" "berry" "berry"

The str_locate() function returns an integer matrix containing the location of the pattern in the searched string. The first column contains the location of the start of the pattern and the second column contains the location of the end.

str_locate ("Where is a cat? Kitty, kitty, kitty! Where are you?", fixed ("cat"))

##      start end
## [1,]    12  14

str_locate_all ("One cat, two cats, three cats, four cats... Too much cats!", fixed ("cats"))

## [[1]]
##      start end
## [1,]    14  17
## [2,]    26  29
## [3,]    37  40
## [4,]    54  57

str_replace wyszukuje fragment tekstu i zastępuje go innym.

str_replace ("I love cats! I think cats are the best pets!", fixed ("cats"), "dogs")

## [1] "I love dogs! I think cats are the best pets!"

str_replace_all ("I love cats! I think cats are the best pets!", fixed ("cats"), "dogs")

## [1] "I love dogs! I think dogs are the best pets!"

The str_split() function splits the text into smaller pieces. It is useful to use the boundary() pattern for this function.

text <- "First sentence. Second sentence. Third sentence. Fourth sentence."
str_split (text, boundary (type = "sentence"))

## [[1]]
## [1] "First sentence. "  "Second sentence. " "Third sentence. " 
## [4] "Fourth sentence."

str_split (text, boundary (type = "word"))

## [[1]]
## [1] "First"    "sentence" "Second"   "sentence" "Third"    "sentence" "Fourth"  
## [8] "sentence"

str_split (text, boundary (type = "character"))

## [[1]]
##  [1] "F" "i" "r" "s" "t" " " "s" "e" "n" "t" "e" "n" "c" "e" "." " " "S" "e" "c"
## [20] "o" "n" "d" " " "s" "e" "n" "t" "e" "n" "c" "e" "." " " "T" "h" "i" "r" "d"
## [39] " " "s" "e" "n" "t" "e" "n" "c" "e" "." " " "F" "o" "u" "r" "t" "h" " " "s"
## [58] "e" "n" "t" "e" "n" "c" "e" "."

text <- c ("apples and oranges and pears and bananas", "pineapples and mangos and guavas")
str_split (text, " and ")

## [[1]]
## [1] "apples"  "oranges" "pears"   "bananas"
## 
## [[2]]
## [1] "pineapples" "mangos"     "guavas"

str_split (text, " and ", simplify = TRUE)

##      [,1]         [,2]      [,3]     [,4]     
## [1,] "apples"     "oranges" "pears"  "bananas"
## [2,] "pineapples" "mangos"  "guavas" ""

Regular expressions

Regular expressions are used to encode patterns in text. They are universal, that is, they have the same syntax regardless of the programming language.

str_count ("a1b2c3d4", "[ac]")

## [1] 2

str_count ("a1b2c3d4", "[a-z]")

## [1] 4

str_count ("a1b2ß3ą4", "[0-9]")

## [1] 4

str_count ("a1b2ß3ą4", "\\p{L}") #unicode letters

## [1] 4

str_count ("a1b2ß3ą4", "[^\\p{L}]") #non-letters

## [1] 4

str_count ("a1b2ß3ą4", "[^a-z]")

## [1] 6

str_extract_all ("abdwzAWZ12! @","[adz]")

## [[1]]
## [1] "a" "d" "z"

str_extract_all ("abdwzAWZ12! @","[\\p{Ll}]")

## [[1]]
## [1] "a" "b" "d" "w" "z"

str_extract_all ("abdwzAWZ12! @","\\p{Ll}\\p{Lu}")

## [[1]]
## [1] "zA"

shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract_all(shopping_list, "[a-z]+")

## [[1]]
## [1] "apples" "x"     
## 
## [[2]]
## [1] "bag"   "of"    "flour"
## 
## [[3]]
## [1] "bag"   "of"    "sugar"
## 
## [[4]]
## [1] "milk" "x"

str_extract_all(shopping_list, "\\b[a-z]+\\b")

## [[1]]
## [1] "apples"
## 
## [[2]]
## [1] "bag"   "of"    "flour"
## 
## [[3]]
## [1] "bag"   "of"    "sugar"
## 
## [[4]]
## [1] "milk"

str_extract_all(shopping_list, "\\d")

## [[1]]
## [1] "4"
## 
## [[2]]
## character(0)
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "2"

strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569",
  "387 287 6718", "apple", "233.398.9187  ", "482 952 3315",
  "239 923 8115 and 842 566 4692", "Work: 579-499-7527", "$1000",
  "Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
str_match (strings, phone)

##       [,1]           [,2]  [,3]  [,4]  
##  [1,] "219 733 8965" "219" "733" "8965"
##  [2,] "329-293-8753" "329" "293" "8753"
##  [3,] NA             NA    NA    NA    
##  [4,] "595 794 7569" "595" "794" "7569"
##  [5,] "387 287 6718" "387" "287" "6718"
##  [6,] NA             NA    NA    NA    
##  [7,] "233.398.9187" "233" "398" "9187"
##  [8,] "482 952 3315" "482" "952" "3315"
##  [9,] "239 923 8115" "239" "923" "8115"
## [10,] "579-499-7527" "579" "499" "7527"
## [11,] NA             NA    NA    NA    
## [12,] "543.355.3679" "543" "355" "3679"

Web scraping

Web scraping extracts information from HTML, CSS, and Javascript code lines. The term usually refers to an automated process, which is less error-prone and faster than gathering data by hand.

It is important to note that web scraping can raise ethical concerns, as it involves accessing and using data from websites without the explicit permission of the website owner. It is a good practice to respect a website’s terms of use and seek written permission before scraping large amounts of data.

HTML and CSS

Before starting it is important to have a basic knowledge of HTML and CSS. This section aims to briefly explain how HTML and CSS work.

Starting from HTML, an HTML file looks like the following piece of code.

<!DOCTYPE html>
<html lang="en">
<body>

<h1 href="https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss"> Carl Friedrich Gauss</h1>
<h2> Biography </h2>
<p> Johann Carl Friedrich Gauss was born on 30 April 1777 in Brunswick. </p>
<h2> Profession </h2>
<p> Gauss is considered as one of the greatest mathematician, statistician and physicist of all time. </p>

</body>
</html>

Those instructions produce the following:

Carl Friedrich Gauss

Biography

Johann Carl Friedrich Gauss was born on 30 April 1777 in Brunswick.

Profession

Gauss is considered as one of the greatest mathematician, statistician and physicist of all time.

As you read above, HTML is used to describe the infrastructure of a web page, for example we may want to define the headings, the paragraphs, etc.

This infrastructure is represented by what are called tags (for example <h1>...<\h1> or <p>...<\p> are tags). Tags are the core of an HTML document as they represent the nature of what is inside the tag (for example h1 stands for heading 1). It is important to observe that there are two types of tags:

starting tags (e.g. <h1>)
ending tags (e.g. <\h1>)

This is what allows to nest different tags. Tags can also have attributes, for example in <h1 href="https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss"Carl Friedrich Gauss</h1>, href is an attribute of the tag h1 that specifies an URL.

As the output of the above HTML code is not super elegant, CSS is used to style the final website. For example CSS is used to define the font, the color, the size, the spacing and many more features of a website.

What is important for this article are CSS selectors, which are patterns used to select elements. The most important is the .class selector, which selects all elements with the same class. For example the .xyz selector selects all elements with class="xyz".

The rvest package

Inspired by beautiful soup and RoboBrowser (two Python libraries for web scraping), rvest has a similar syntax, making it the most eligible package for those from Python.

The rvest package provides functions to access a web page and specific elements using CSS selectors and XPath. The library is a part of the tidyverse collection of packages, i.e., it shares some coding conventions (e.g., the pipes) with other libraries such as tibble and ggplot2.

library (rvest)

The web scraping operation is usually made in 3 steps:

HTTP GET request
Parsing HTML content
Getting HTML element attributes

Step 1: HTTP GET request

The HTTP GET method is a method used to send a server a question to get specific data and information. It is essential to notice that this method does not change the server’s state. To send a GET request, we need the link (as a character) to the page we want to scrape:

link <- "https://www.nytimes.com/"

Sending the request to the page is simple, rvest provides the read_html function, which returns an object of html_document type:

NYT_page <- read_html(link)
NYT_page

## {html_document}
## <html lang="en" class=" nytapp-vi-homepage " xmlns:og="http://opengraphprotocol.org/schema/">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <div id="app">\n<a class="css-wsvg60" href="#site-content">Sk ...

Step 2: Parsing HTML content

As we saw in the last chunk of code, NYT_page contains the raw HTML code, which is not so easily readable. In order to make it readable from R it has to be parsed, which means generating a Document Object Model (DOM) from the raw HTML. DOM is what connects scripts and web pages by representing the structure of a document in memory. The rvest package provides 2 ways to select HTML elements:

XPath
CSS selectors

Selecting elements with rvest is simple, for XPath we use the following syntax:

NYT_page %>%
  html_elements (xpath = "")

while for CSS selector we need:

NYT_page %>%
  html_elements (css = "")

Suppose that for a project you need the summaries of the articles of the NYT (note that what is in the following picture is not what you see now in the New York Times web page).

$Printscreen of the New York Times webpage, accessed 16/04/2024.$

Printscreen of the New York Times webpage, accessed 16/04/2024.

Searching in the HTML code, it is not that complex to find <p class="summary-class">, which is the markup of what we are looking for. To parse the HTML using this selector we use the html_element() function:

summaries_css <- NYT_page %>%
  html_elements (css = ".summary-class")

head (summaries_css)

## {xml_nodeset (6)}
## [1] <p class="summary-class css-ofqxyv">Makers of a vast array of American pr ...
## [2] <p class="summary-class css-1l5zmz6">Japanese automakers, initially optim ...
## [3] <p class="summary-class css-1l5zmz6">Moscow sees economic and geopolitica ...
## [4] <p class="summary-class css-1l5zmz6">In more than three years of full-sca ...
## [5] <p class="summary-class css-1l5zmz6">President Trump is seeking to block  ...
## [6] <p class="summary-class css-1l5zmz6">The Greenlandic government is callin ...

The easiest way to obtain a CSS selector is opening the inspect mode in your browser.

Step 3: Getting HTML element attributes

Since the chunk of code above collect all the elements p with the class summary, we render all the elements of NYT_summary as a text using the html_text() function:

NYT_summary <- html_text (summaries_css)
head (NYT_summary)

## [1] "Makers of a vast array of American products are weighing the risks, and potential payoffs, of the sweeping tariffs President Trump promised on April 2."     
## [2] "Japanese automakers, initially optimistic about some of President Trump’s policies, are reckoning with potentially devastating tariffs on foreign-made cars."
## [3] "Moscow sees economic and geopolitical benefits in humoring President Trump’s push for a cease-fire in Ukraine. But the Kremlin’s war aims haven’t shifted."  
## [4] "In more than three years of full-scale war, Ukrainian families of children with long-term illnesses have had to overcome countless challenges."              
## [5] "President Trump is seeking to block a judge’s ruling ordering his administration to rehire thousands of federal workers who were on probationary status."    
## [6] "The Greenlandic government is calling President Trump’s decision to send a delegation there “aggressive,” pushing the island further from the U.S."

Another important functions to extract data from a HTML elements are html_text2() and html_table(). The first one works similar to html_text() but it simulates how text looks in a browser. Roughly speaking, it converts <br /> to \n, add blank lines around <p> tags, and lightly formats tabular data. However, it is much slower than html_text().

The html_table() function parses an HTML table into a data frame. As an example we will scrape data from the Wikipedia’s page of Formula 1 drivers.

https://en.wikipedia.org/wiki/List_of_Formula_One_drivers, accessed 16/04/2024.

link <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
page <- read_html(link)
drivers_F1_raw <- html_element(page, "table.sortable")
drivers_F1_raw

## {html_node}
## <table class="wikitable sortable sticky-header" style="font-size: 85%; text-align:center">
## [1] <tbody>\n<tr>\n<th scope="col">Driver name\n</th>\n<th scope="col">\n<a h ...

drivers_F1_table <- html_table (drivers_F1_raw)
head (drivers_F1_table)

## # A tibble: 6 × 11
##   `Driver name`     Nationality    `Seasons competed` `Drivers' Championships`
##   <chr>             <chr>          <chr>              <chr>                   
## 1 Carlo Abate       Italy          1962–1963          0                       
## 2 George Abecassis  United Kingdom 1951–1952          0                       
## 3 Kenny Acheson     United Kingdom 1983, 1985         0                       
## 4 Andrea de Adamich Italy          1968, 1970–1973    0                       
## 5 Philippe Adams    Belgium        1994               0                       
## 6 Walt Ader         United States  1950               0                       
## # ℹ 7 more variables: `Race entries` <chr>, `Race starts` <chr>,
## #   `Pole positions` <chr>, `Race wins` <chr>, Podiums <chr>,
## #   `Fastest laps` <chr>, `Points[a]` <chr>

str (drivers_F1_table)

## tibble [874 × 11] (S3: tbl_df/tbl/data.frame)
##  $ Driver name           : chr [1:874] "Carlo Abate" "George Abecassis" "Kenny Acheson" "Andrea de Adamich" ...
##  $ Nationality           : chr [1:874] "Italy" "United Kingdom" "United Kingdom" "Italy" ...
##  $ Seasons competed      : chr [1:874] "1962–1963" "1951–1952" "1983, 1985" "1968, 1970–1973" ...
##  $ Drivers' Championships: chr [1:874] "0" "0" "0" "0" ...
##  $ Race entries          : chr [1:874] "3" "2" "10" "36" ...
##  $ Race starts           : chr [1:874] "0" "2" "3" "30" ...
##  $ Pole positions        : chr [1:874] "0" "0" "0" "0" ...
##  $ Race wins             : chr [1:874] "0" "0" "0" "0" ...
##  $ Podiums               : chr [1:874] "0" "0" "0" "0" ...
##  $ Fastest laps          : chr [1:874] "0" "0" "0" "0" ...
##  $ Points[a]             : chr [1:874] "0" "0" "0" "6" ...

Laboratory 3