A string is an appropriately encoded sequence of bytes, i.e. small integers in the range {0, 1, …, 255}.
text <- "Advanced R programming"
hex <- charToRaw(text) # in the hexadecimal system (HEX)
as.integer(hex) # in the decimal system (DEC)
## [1] 65 100 118 97 110 99 101 100 32 82 32 112 114 111 103 114 97 109 109
## [20] 105 110 103
## [1] "Advanced R programming"
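The last output above is consistent with converting the bytes back into text. The original call is not shown, so the following is only a minimal sketch, assuming the round trip was done with rawToChar(); the name round_trip is introduced here for illustration.

```r
text <- "Advanced R programming"
hex <- charToRaw(text)        # raw bytes, printed in hexadecimal
round_trip <- rawToChar(hex)  # bytes back to a character string
round_trip
```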
The R language has many built-in functions for working with text data, e.g. paste(), substr(), tolower(), toupper().
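The calls behind the following outputs are not shown. A sketch that would produce them, assuming the text was built with paste() and modified with the replacement form of substr() (the variable name modified is hypothetical):

```r
text <- paste("Advanced", "R", "programming")  # joins with a single space
text
substr(text, 12, 14)                # characters 12 to 14
modified <- text
substr(modified, 12, 14) <- "PRO"   # replacement form overwrites in place
modified
tolower(text)
toupper(text)
```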
## [1] "Advanced R programming"
## [1] "pro"
## [1] "Advanced R PROgramming"
## [1] "advanced r programming"
## [1] "ADVANCED R PROGRAMMING"
The sprintf() function, well known to C users, conveniently combines text and numeric values.
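The original calls are missing. The sketch below uses format strings chosen to reproduce the outputs that follow; the variable names (tests, label, ids, query) are assumptions made for this example.

```r
# %d formats integers; the vector 1:3 is recycled against the single 3L
tests <- sprintf("Test #%d of %d", 1:3, 3L)
tests

# %s inserts a string, %.4f a number rounded to 4 decimal places
label <- sprintf("%s %.4f", "data science", pi)
label

# Build the IN (...) list with paste() before inserting it with %s
ids <- c(7, 9, 1)
query <- sprintf("SELECT * FROM myTable WHERE id IN (%s)",
                 paste(ids, collapse = ", "))
query
```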
## [1] "Test #1 of 3" "Test #2 of 3" "Test #3 of 3"
## [1] "data science 3.1416"
## [1] "SELECT * FROM myTable WHERE id IN (7, 9, 1)"
The functions match(x, table) and pmatch(x, table) return the positions of the elements of the vector x in the vector table. The pmatch() function is especially useful for strings because it allows partial matching. The match() function has an associated %in% operator, which returns a logical value instead of an index.
veggies <- c("carrot", "cauliflower", "potato", "broccoli")
v <- c("carrot", "car", "broccoli")
match(v, veggies)
## [1] 1 NA 4
## [1] TRUE FALSE TRUE
## [1] 1
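Only the match() call is shown above. The last two outputs are consistent with calls like the ones below; this is an assumption, and the names in_table and partial are introduced for illustration.

```r
veggies <- c("carrot", "cauliflower", "potato", "broccoli")
v <- c("carrot", "car", "broccoli")

in_table <- v %in% veggies         # exact membership: "car" is not in the table
in_table

partial <- pmatch("car", veggies)  # "car" is a unique prefix of "carrot"
partial
```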
Some of the built-in text processing functions are not vectorized correctly or do not handle NA values.
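One way to see the NA problem, assuming the output below came from paste() (the original call is not shown; words and pasted are hypothetical names):

```r
words <- c("Advanced", "R", "programming", NA)
pasted <- paste(words)  # the missing value is converted to the text "NA"
anyNA(pasted)           # FALSE: the information that a value was missing is lost
pasted
```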
## [1] "Advanced" "R" "programming" "NA"
The above problems were solved in the ‘stringi’ package, co-authored by Marek Gągolewski from the Faculty of MiNI, Warsaw University of Technology. This package has gained huge popularity around the world and has been appreciated by the creators of tidyverse, who have written functions that simplify the use of the package. These functions, called wrappers, are placed in a package called stringr.
str_c() combines multiple character vectors into a single character vector. It’s very similar to paste0() but uses tidyverse recycling and NA rules.
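A short sketch of str_c(), with inputs chosen to match the first two outputs below (the original calls are not shown; the variable names are assumptions). The last call illustrates the NA rule: a missing input makes the result missing rather than producing the text "NA".

```r
library(stringr)

no_sep <- str_c("Advanced", "R", "programming", "2025")
no_sep

with_sep <- str_c("Advanced", "R", "programming", "2025", sep = " ")
with_sep

with_na <- str_c("Advanced", NA)  # NA propagates instead of becoming "NA"
```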
## [1] "AdvancedRprogramming2025"
## [1] "Advanced R programming 2025"
text <- str_c(c("Advanced", "programming"), c("R", "2025"), sep = " ", collapse = " ")
print(text)
## [1] "Advanced R programming 2025"
## [1] 1
## [1] "a,1;b,2"
str_flatten() reduces a character vector to a single string. This is a summary function because regardless of the length of the input x, it always returns a single string.
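A minimal sketch with inputs chosen to match the first three outputs below; the calls behind the remaining outputs are not shown and are not reconstructed here. The names words and flat are hypothetical.

```r
library(stringr)

words <- c("Advanced", "R", "programming", "2025")
flat <- str_flatten(words, collapse = " ")
flat
length(flat)                     # always 1, whatever the input length
str_flatten(c("a", "b", "c"))    # default collapse is the empty string
```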
## [1] "Advanced R programming 2025"
## [1] 1
## [1] "abc"
## [1] "ąęś"
## [1] "ąę"
## [1] "abc123AES"
The stringr package offers the following types (engines) of text searches:
regex - regular expressions, the default mode,
fixed - very fast, locale-independent substring search that compares raw bytes,
coll - locale-dependent pattern search that follows the collation rules of a specific language (useful for natural language processing, but slow),
boundary - matches the boundaries along which the text can be divided into smaller units (characters, words, lines, sentences).
Basic search functions in the stringr package:
str_detect,
str_count,
str_extract, str_extract_all,
str_locate, str_locate_all,
str_replace, str_replace_all,
str_match, str_match_all.
These functions all follow a similar pattern. The first argument is the character vector to be searched, and the second is the pattern (pattern), which is a regular expression by default (regex() function), but can be changed to fixed(), coll() or boundary().
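To make the difference between the engines concrete, here is a small sketch; the input strings are made up for this example.

```r
library(stringr)

x <- c("a.b", "abc")
str_detect(x, ".")          # regex: "." matches any character -> TRUE TRUE
str_detect(x, fixed("."))   # fixed: "." is a literal dot      -> TRUE FALSE

# coll: comparison follows collation rules, here ignoring case
str_detect("ABC", coll("abc", ignore_case = TRUE))     # TRUE

# boundary: count the words in a string
str_count("Advanced R programming", boundary("word"))  # 3
```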
The str_detect() function checks whether a given pattern occurs in a given string:
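The original inputs are not shown. The calls below use made-up inputs chosen so that they return the same values as the outputs that follow; note how NA in the input gives NA in the result.

```r
library(stringr)

str_detect("Advanced R programming", "pro")
str_detect(c("apple", "banana"), "an")
str_detect(c("apple", "banana", NA), "pp")  # NA stays NA
str_detect(c("apple", "banana"), "a")
```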
## [1] TRUE
## [1] FALSE TRUE
## [1] TRUE FALSE NA
## [1] TRUE TRUE
The str_count() function calculates the number of times a given pattern appears in a given string:
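The calls are missing from the text. The sketch below assumes the long vector printed next is the fruit dataset shipped with stringr and that the four count vectors after it refer to its first ten elements; the first call and the name first_ten are made up for illustration. The final output in this group comes from a call that is not reconstructed here.

```r
library(stringr)

str_count("banana", "a")   # three occurrences of "a"

fruit                      # example dataset included in stringr
first_ten <- head(fruit, 10)
str_count(first_ten, "a")
str_count(first_ten, "b")
str_count(first_ten, "c")
str_count(first_ten, "e")
```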
## [1] 3
## [1] "apple" "apricot" "avocado"
## [4] "banana" "bell pepper" "bilberry"
## [7] "blackberry" "blackcurrant" "blood orange"
## [10] "blueberry" "boysenberry" "breadfruit"
## [13] "canary melon" "cantaloupe" "cherimoya"
## [16] "cherry" "chili pepper" "clementine"
## [19] "cloudberry" "coconut" "cranberry"
## [22] "cucumber" "currant" "damson"
## [25] "date" "dragonfruit" "durian"
## [28] "eggplant" "elderberry" "feijoa"
## [31] "fig" "goji berry" "gooseberry"
## [34] "grape" "grapefruit" "guava"
## [37] "honeydew" "huckleberry" "jackfruit"
## [40] "jambul" "jujube" "kiwi fruit"
## [43] "kumquat" "lemon" "lime"
## [46] "loquat" "lychee" "mandarine"
## [49] "mango" "mulberry" "nectarine"
## [52] "nut" "olive" "orange"
## [55] "pamelo" "papaya" "passionfruit"
## [58] "peach" "pear" "persimmon"
## [61] "physalis" "pineapple" "plum"
## [64] "pomegranate" "pomelo" "purple mangosteen"
## [67] "quince" "raisin" "rambutan"
## [70] "raspberry" "redcurrant" "rock melon"
## [73] "salal berry" "satsuma" "star fruit"
## [76] "strawberry" "tamarillo" "tangerine"
## [79] "ugli fruit" "watermelon"
## [1] 1 1 2 3 0 0 1 2 1 0
## [1] 0 0 0 1 1 2 2 1 1 2
## [1] 0 1 1 0 0 0 1 2 0 0
## [1] 1 0 0 0 3 1 1 0 1 2
## [1] 2 1 2 3
The str_extract() function searches for substrings matching a given pattern:
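A sketch consistent with the outputs below, assuming the fruit names were first flattened into one string (the original call is not shown; all_fruit, first_hit and all_hits are hypothetical names). str_extract() returns the first match only, while str_extract_all() returns every match as a list.

```r
library(stringr)

all_fruit <- str_flatten(fruit, collapse = " ")

first_hit <- str_extract(all_fruit, "berry")     # first match only
first_hit

all_hits <- str_extract_all(all_fruit, "berry")  # list with all matches
all_hits
```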
## [1] "berry"
## [[1]]
## [1] "berry" "berry" "berry" "berry" "berry" "berry" "berry" "berry" "berry"
## [10] "berry" "berry" "berry" "berry" "berry"
The str_locate() function returns an integer matrix containing the location of the pattern in the searched string. The first column contains the location of the start of the pattern and the second column contains the location of the end.
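A minimal sketch. The first call is an assumption that happens to reproduce the first output below; the input behind the second output is not given in the text, so str_locate_all() is shown on a made-up string instead.

```r
library(stringr)

loc <- str_locate("Advanced R programming", "pro")
loc   # one row: start 12, end 14

# str_locate_all() returns a list with one matrix per input string
loc_all <- str_locate_all("banana", "an")
loc_all[[1]]   # two rows: (2, 3) and (4, 5)
```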
## start end
## [1,] 12 14
## [[1]]
## start end
## [1,] 14 17
## [2,] 26 29
## [3,] 37 40
## [4,] 54 57
The str_replace() function searches for a fragment of text and replaces it with another.
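The input is not shown, but the two outputs below suggest a sentence in which "cats" occurs twice. A sketch under that assumption:

```r
library(stringr)

text <- "I love cats! I think cats are the best pets!"
str_replace(text, "cats", "dogs")      # replaces the first match only
str_replace_all(text, "cats", "dogs")  # replaces every match
```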
## [1] "I love dogs! I think cats are the best pets!"
## [1] "I love dogs! I think dogs are the best pets!"
The str_split() function splits the text into smaller pieces. It is useful to use the boundary() pattern for this function.
text <- "First sentence. Second sentence. Third sentence. Fourth sentence."
str_split(text, boundary(type = "sentence"))
## [[1]]
## [1] "First sentence. " "Second sentence. " "Third sentence. "
## [4] "Fourth sentence."
## [[1]]
## [1] "First" "sentence" "Second" "sentence" "Third" "sentence" "Fourth"
## [8] "sentence"
## [[1]]
## [1] "F" "i" "r" "s" "t" " " "s" "e" "n" "t" "e" "n" "c" "e" "." " " "S" "e" "c"
## [20] "o" "n" "d" " " "s" "e" "n" "t" "e" "n" "c" "e" "." " " "T" "h" "i" "r" "d"
## [39] " " "s" "e" "n" "t" "e" "n" "c" "e" "." " " "F" "o" "u" "r" "t" "h" " " "s"
## [58] "e" "n" "t" "e" "n" "c" "e" "."
text <- c("apples and oranges and pears and bananas", "pineapples and mangos and guavas")
str_split(text, " and ")
## [[1]]
## [1] "apples" "oranges" "pears" "bananas"
##
## [[2]]
## [1] "pineapples" "mangos" "guavas"
## [,1] [,2] [,3] [,4]
## [1,] "apples" "oranges" "pears" "bananas"
## [2,] "pineapples" "mangos" "guavas" ""
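Some of the outputs above have no visible call. They are consistent with the variants sketched below (an assumption; the names sentences, fruits, words, chars and mat are introduced here): splitting on word and character boundaries, and asking for a matrix with simplify = TRUE.

```r
library(stringr)

sentences <- "First sentence. Second sentence. Third sentence. Fourth sentence."
words <- str_split(sentences, boundary("word"))       # punctuation and spaces dropped
chars <- str_split(sentences, boundary("character"))  # one element per character

fruits <- c("apples and oranges and pears and bananas",
            "pineapples and mangos and guavas")
mat <- str_split(fruits, " and ", simplify = TRUE)    # matrix padded with ""
mat
```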
Regular expressions are used to encode patterns in text. They are largely universal: the core syntax is shared across programming languages, although individual engines differ in details (stringr uses the ICU engine through stringi).
## [1] 2
## [1] 4
## [1] 4
## [1] 4
## [1] 4
## [1] 6
## [[1]]
## [1] "a" "d" "z"
## [[1]]
## [1] "a" "b" "d" "w" "z"
## [[1]]
## [1] "zA"
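The calls that produced the outputs above are not shown and are not reconstructed here. As a separate illustration of the basic building blocks (character classes, quantifiers and anchors), here is a small sketch on a made-up string:

```r
library(stringr)

x <- "abc 123 def 45"
str_count(x, "[0-9]")          # character class: single digits      -> 5
str_count(x, "[0-9]+")         # quantifier +: runs of digits        -> 2
str_extract_all(x, "[a-z]+")   # runs of lowercase letters: "abc" "def"

# Anchors: ^ marks the start and $ the end of the string
str_detect(c("cat", "concat"), "^cat$")   # TRUE FALSE
```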
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract_all(shopping_list, "[a-z]+")
## [[1]]
## [1] "apples" "x"
##
## [[2]]
## [1] "bag" "of" "flour"
##
## [[3]]
## [1] "bag" "of" "sugar"
##
## [[4]]
## [1] "milk" "x"
## [[1]]
## [1] "apples"
##
## [[2]]
## [1] "bag" "of" "flour"
##
## [[3]]
## [1] "bag" "of" "sugar"
##
## [[4]]
## [1] "milk"
## [[1]]
## [1] "4"
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
##
## [[4]]
## [1] "2"
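Only the first call is shown above. The second and third outputs are consistent with the patterns below (an assumption; whole_words and digits are hypothetical names): \\b restricts matches to whole words, so the "x" in "x4" is dropped, and \\d matches a single digit.

```r
library(stringr)

shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")

whole_words <- str_extract_all(shopping_list, "\\b[a-z]+\\b")
whole_words

digits <- str_extract_all(shopping_list, "\\d")
digits
```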
strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569",
"387 287 6718", "apple", "233.398.9187 ", "482 952 3315",
"239 923 8115 and 842 566 4692", "Work: 579-499-7527", "$1000",
"Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
str_match(strings, phone)
## [,1] [,2] [,3] [,4]
## [1,] "219 733 8965" "219" "733" "8965"
## [2,] "329-293-8753" "329" "293" "8753"
## [3,] NA NA NA NA
## [4,] "595 794 7569" "595" "794" "7569"
## [5,] "387 287 6718" "387" "287" "6718"
## [6,] NA NA NA NA
## [7,] "233.398.9187" "233" "398" "9187"
## [8,] "482 952 3315" "482" "952" "3315"
## [9,] "239 923 8115" "239" "923" "8115"
## [10,] "579-499-7527" "579" "499" "7527"
## [11,] NA NA NA NA
## [12,] "543.355.3679" "543" "355" "3679"
Web scraping extracts information from the HTML, CSS, and JavaScript code of web pages. The term usually refers to an automated process, which is less error-prone and faster than gathering data by hand.
It is important to note that web scraping can raise ethical concerns, as it involves accessing and using data from websites without the explicit permission of the website owner. It is a good practice to respect a website’s terms of use and seek written permission before scraping large amounts of data.
Before starting it is important to have a basic knowledge of HTML and CSS. This section aims to briefly explain how HTML and CSS work.
Starting from HTML, an HTML file looks like the following piece of code.
<!DOCTYPE html>
<html lang="en">
<body>
<h1 href="https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss"> Carl Friedrich Gauss</h1>
<h2> Biography </h2>
<p> Johann Carl Friedrich Gauss was born on 30 April 1777 in Brunswick. </p>
<h2> Profession </h2>
<p> Gauss is considered as one of the greatest mathematician, statistician and physicist of all time. </p>
</body>
</html>
Those instructions produce the following:
Carl Friedrich Gauss
Biography
Johann Carl Friedrich Gauss was born on 30 April 1777 in Brunswick.
Profession
Gauss is considered as one of the greatest mathematician, statistician and physicist of all time.
As you read above, HTML is used to describe the structure of a web page; for example, we may want to define the headings, the paragraphs, etc.
This structure is represented by what are called tags (for example <h1>...</h1> or <p>...</p>). Tags are the core of an HTML document, as they describe the nature of what is inside them (for example, h1 stands for heading 1). It is important to observe that there are two types of tags:
opening tags (e.g. <h1>)
closing tags (e.g. </h1>)
This is what allows tags to be nested. Tags can also have attributes: for example, in <h1 href="https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss">Carl Friedrich Gauss</h1>, href is an attribute of the tag h1 that specifies a URL.
As the output of the above HTML code is not super elegant, CSS is used to style the final website. For example CSS is used to define the font, the color, the size, the spacing and many more features of a website.
What is important for this article are CSS selectors, which are patterns used to select elements. The most important is the .class selector, which selects all elements with the same class. For example, the .xyz selector selects all elements with class="xyz".
Inspired by Beautiful Soup and RoboBrowser (two Python libraries for web scraping), rvest has a similar syntax, which makes it a natural choice for those coming from Python.
The rvest package provides functions to access a web page and select specific elements using CSS selectors and XPath. The library is part of the tidyverse collection of packages, i.e., it shares some coding conventions (e.g., pipes) with other libraries such as tibble and ggplot2.
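Web scraping is usually done in 3 steps: sending an HTTP GET request to the page, parsing the returned HTML to select the elements of interest, and extracting their content.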
The HTTP GET method is used to request specific data from a server. It is essential to note that this method does not change the server’s state. To send a GET request, we need the link (as a character string) to the page we want to scrape:
Sending the request to the page is simple: rvest provides the read_html() function, which returns an object of html_document type:
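A minimal sketch of this step. The live URL is an assumption inferred from the output shown below (the original link is not given), so that call is left commented out. The inline HTML string and the name demo_page are hypothetical stand-ins that let the example run offline; they rely on read_html() also accepting a literal HTML string.

```r
library(rvest)

# Live request (assumed URL, needs network access):
# link <- "https://www.nytimes.com/"
# NYT_page <- read_html(link)

# Offline stand-in: read_html() also parses a literal HTML string
demo_page <- read_html(
  "<html><body><p class='summary-class'>First summary.</p></body></html>"
)
class(demo_page)
```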
## {html_document}
## <html lang="en" class=" nytapp-vi-homepage " xmlns:og="http://opengraphprotocol.org/schema/">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <div id="app">\n<a class="css-wsvg60" href="#site-content">Sk ...
As the output above shows, NYT_page contains the raw HTML code, which is not easy to read. To work with it from R, it has to be parsed, which means generating a Document Object Model (DOM) from the raw HTML. The DOM is what connects scripts and web pages by representing the structure of a document in memory. The rvest package provides 2 ways to select HTML elements:
XPath
CSS selectors
Selecting elements with rvest is simple: html_elements() takes an xpath argument for XPath expressions and a css argument for CSS selectors.
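Both selection styles can be sketched on a small inline document. The HTML and the class name xyz are made up for this example; html_elements() and its css and xpath arguments are part of rvest.

```r
library(rvest)

page <- read_html(paste0(
  "<html><body>",
  "<p class='xyz'>one</p><p>two</p><p class='xyz'>three</p>",
  "</body></html>"
))

# XPath: all <p> elements whose class attribute equals "xyz"
by_xpath <- html_elements(page, xpath = "//p[@class='xyz']")

# CSS selector: the same elements, written as tag.class
by_css <- html_elements(page, css = "p.xyz")

html_text(by_xpath)
html_text(by_css)
```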
Suppose that for a project you need the summaries of the articles of the NYT (note that what is in the following picture is not what you see now in the New York Times web page).
Searching the HTML code, it is not hard to find <p class="summary-class">, which is the markup of what we are looking for. To select all matching elements we use the html_elements() function (html_element() would return only the first match):
## {xml_nodeset (6)}
## [1] <p class="summary-class css-ofqxyv">Makers of a vast array of American pr ...
## [2] <p class="summary-class css-1l5zmz6">Japanese automakers, initially optim ...
## [3] <p class="summary-class css-1l5zmz6">Moscow sees economic and geopolitica ...
## [4] <p class="summary-class css-1l5zmz6">In more than three years of full-sca ...
## [5] <p class="summary-class css-1l5zmz6">President Trump is seeking to block ...
## [6] <p class="summary-class css-1l5zmz6">The Greenlandic government is callin ...
The easiest way to obtain a CSS selector is to open the inspect mode in your browser.
Since the chunk of code above collects all the p elements with the class summary-class, we render all the elements of NYT_summary as text using the html_text() function:
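The selection and extraction steps can be tried offline on a small stand-in page. The HTML below and the names demo_page and demo_summary are hypothetical; only the class name summary-class is taken from the text.

```r
library(rvest)

demo_page <- read_html(paste0(
  "<html><body>",
  "<p class='summary-class css-1'>First summary.</p>",
  "<p class='other'>Not a summary.</p>",
  "<p class='summary-class css-2'>Second summary.</p>",
  "</body></html>"
))

# Select every <p> that carries the class "summary-class"
demo_summary <- html_elements(demo_page, "p.summary-class")

# Extract the text content of each selected element
html_text(demo_summary)
```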
## [1] "Makers of a vast array of American products are weighing the risks, and potential payoffs, of the sweeping tariffs President Trump promised on April 2."
## [2] "Japanese automakers, initially optimistic about some of President Trump’s policies, are reckoning with potentially devastating tariffs on foreign-made cars."
## [3] "Moscow sees economic and geopolitical benefits in humoring President Trump’s push for a cease-fire in Ukraine. But the Kremlin’s war aims haven’t shifted."
## [4] "In more than three years of full-scale war, Ukrainian families of children with long-term illnesses have had to overcome countless challenges."
## [5] "President Trump is seeking to block a judge’s ruling ordering his administration to rehire thousands of federal workers who were on probationary status."
## [6] "The Greenlandic government is calling President Trump’s decision to send a delegation there “aggressive,” pushing the island further from the U.S."
Other important functions for extracting data from HTML elements are html_text2() and html_table(). The first one works similarly to html_text(), but it simulates how text looks in a browser. Roughly speaking, it converts <br /> to \n, adds blank lines around <p> tags, and lightly formats tabular data. However, it is much slower than html_text().
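The difference between the two functions is easiest to see on a paragraph that contains a line break. The HTML snippet and variable names are made up for this sketch.

```r
library(rvest)

node <- html_element(read_html("<p>First line<br>second line</p>"), "p")

raw_text <- html_text(node)       # the <br> is ignored
browser_text <- html_text2(node)  # the <br> becomes a newline

raw_text
browser_text
```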
The html_table() function parses an HTML table into a data frame. As an example, we will scrape data from Wikipedia’s page listing Formula 1 drivers.
link <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
page <- read_html(link)
drivers_F1_raw <- html_element(page, "table.sortable")
drivers_F1_raw
## {html_node}
## <table class="wikitable sortable sticky-header" style="font-size: 85%; text-align:center">
## [1] <tbody>\n<tr>\n<th scope="col">Driver name\n</th>\n<th scope="col">\n<a h ...
## # A tibble: 6 × 11
## `Driver name` Nationality `Seasons competed` `Drivers' Championships`
## <chr> <chr> <chr> <chr>
## 1 Carlo Abate Italy 1962–1963 0
## 2 George Abecassis United Kingdom 1951–1952 0
## 3 Kenny Acheson United Kingdom 1983, 1985 0
## 4 Andrea de Adamich Italy 1968, 1970–1973 0
## 5 Philippe Adams Belgium 1994 0
## 6 Walt Ader United States 1950 0
## # ℹ 7 more variables: `Race entries` <chr>, `Race starts` <chr>,
## # `Pole positions` <chr>, `Race wins` <chr>, Podiums <chr>,
## # `Fastest laps` <chr>, `Points[a]` <chr>
## tibble [874 × 11] (S3: tbl_df/tbl/data.frame)
## $ Driver name : chr [1:874] "Carlo Abate" "George Abecassis" "Kenny Acheson" "Andrea de Adamich" ...
## $ Nationality : chr [1:874] "Italy" "United Kingdom" "United Kingdom" "Italy" ...
## $ Seasons competed : chr [1:874] "1962–1963" "1951–1952" "1983, 1985" "1968, 1970–1973" ...
## $ Drivers' Championships: chr [1:874] "0" "0" "0" "0" ...
## $ Race entries : chr [1:874] "3" "2" "10" "36" ...
## $ Race starts : chr [1:874] "0" "2" "3" "30" ...
## $ Pole positions : chr [1:874] "0" "0" "0" "0" ...
## $ Race wins : chr [1:874] "0" "0" "0" "0" ...
## $ Podiums : chr [1:874] "0" "0" "0" "0" ...
## $ Fastest laps : chr [1:874] "0" "0" "0" "0" ...
## $ Points[a] : chr [1:874] "0" "0" "0" "6" ...
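The two outputs above look like the result of converting the selected node with html_table() and inspecting it with head() and str(); the calls themselves are not shown, so this is an assumption. The sketch below applies the same steps to a tiny inline table with made-up rows (demo_table and drivers_demo are hypothetical names), so it runs without network access.

```r
library(rvest)

demo_table <- read_html(paste0(
  "<table>",
  "<tr><th>Driver name</th><th>Nationality</th></tr>",
  "<tr><td>Driver A</td><td>Italy</td></tr>",
  "<tr><td>Driver B</td><td>Belgium</td></tr>",
  "</table>"
))

# Header cells (<th>) become column names, each further <tr> becomes a row
drivers_demo <- html_table(html_element(demo_table, "table"))

head(drivers_demo)
str(drivers_demo)
```

On the real page, numeric-looking columns such as the race counts are read as character (as the str() output above shows), so they would need an explicit conversion, for example with as.integer(), before any calculation.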