ADVANCED R PROGRAMMING, SUMMER 2025 EDITION



The knitr package

The knitr package is used to generate documents written in the Markdown markup language. With this package you can easily combine R code, its output, and text descriptions. To create a report, you need a file with the .Rmd extension, whose name stands for a combination of R and Markdown.


YAML header

The Rmd file begins with a header in the YAML format (YAML Ain’t Markup Language), which configures the conversion process. The lines describing the author, title and date are optional. The last line specifies the output format; in this case it is output: html_document, so the file will be converted to an HTML file. Other popular choices are pdf_document and word_document.

---
author: "Robert Paluch"
title: "Exemplary report"
date: "March 26, 2024"
output: html_document
---


Chunks

A chunk is a piece of code placed between two lines of triple backticks (three ` characters). The curly brackets following the opening backticks indicate what language should be used to run the chunk (in our case r) and various additional options, the most frequently used of which are:

  • message=FALSE, warning=FALSE: by default, all messages and warnings produced by the code are pasted into the resulting document. They usually just take up space, so it is worth turning them off.

  • cache=TRUE: by default, each chunk is re-executed every time the Rmd document is converted. If a chunk takes a long time to compute and its content has not changed, then with caching enabled the chunk will not be re-executed and its result will be loaded from the local cache.

  • eval=FALSE: disables chunk evaluation. The chunk will not be executed, although its source code will be included in the report. This is useful when we want to present a fragment of code but do not need to execute it (e.g. we illustrate how to download data from the Internet, but in fact we load it from a local copy).

  • echo=FALSE: only the output of the R code appears in the report, not the source. This way the report contains only the results, and we can send it to readers for whom the R code means nothing and may be distracting.

  • include=FALSE causes both the chunk itself and the result of its execution not to be shown in the document at the place where the chunk was placed (but the chunk is executed and its result can be used later).

  • fig.width=, fig.height=: these parameters set the size (in inches) of the figure pasted into the report.

More options are described in the knitr documentation on chunk options.


Plots

If R code placed in a chunk draws a plot (e.g. with the plot or hist function), the plot will appear in the document, unless eval=FALSE is set in the chunk options.

data ("mtcars")
plot (mtcars$hp, mtcars$mpg, xlab = "power [hp]", ylab = "fuel economy [mpg]")


Including graphics

It is possible to insert graphics from a file using the include_graphics ("file path") function. You can also use the chunk options to change the size or position of the graphics (below, out.width="50%" and fig.align="center").

knitr::include_graphics("rstudio.png")


Data frames

Data frames can be placed in a document in two ways. The first one is an ordinary listing, analogous to displaying a frame in the console.

head (mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The second method is more elegant and uses the kable (data, caption) function and allows you to easily add a caption to the table.

knitr::kable (head (mtcars), caption = "First six rows from the mtcars dataset")
First six rows from the mtcars dataset

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1


Global options

It is possible to set common options for all chunks using the knitr::opts_chunk$set(...) command. For the options to apply to the whole document, the command should be placed in a chunk at the top of the document, e.g.

knitr::opts_chunk$set (message = FALSE, warning = FALSE)
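
As a sketch of how several of the chunk options described earlier can be combined, a setup chunk might contain code like the following. The particular values (figure size, centering) are illustrative assumptions, not settings prescribed by this course.

```r
# Illustrative defaults for every chunk in the document (values are assumptions)
knitr::opts_chunk$set(
  message    = FALSE,     # hide messages
  warning    = FALSE,     # hide warnings
  echo       = TRUE,      # show the source code
  fig.width  = 6,         # figure width in inches
  fig.height = 4,         # figure height in inches
  fig.align  = "center"   # center figures on the page
)

# Inspect one of the current defaults
knitr::opts_chunk$get("fig.width")
```

Options given in the header of an individual chunk still override these defaults for that chunk.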


LaTeX and HTML

The knitr package also allows you to place mathematical formulas using LaTeX syntax. The formula should be placed between the $ symbols. For example, writing $f(x) = \frac{x^2}{x-1}$ will result in \(f(x) = \frac{x^2}{x-1}\).

If the target document format is HTML, it is possible to place HTML code directly in the Rmd document. This way you can easily change the text style, e.g. <p style="color: green; font-weight: bold;">green bold text</p> gives

green bold text


R Markdown

The Markdown markup language was created to make writing and formatting text as simple as possible. Its original implementation was written in Perl, and it was later reimplemented in many other languages. It is distributed under a BSD-style license and is available as a plug-in for several content management systems. It is now widely used on GitHub.


Paragraphs, headers, quotes

A paragraph is one or more lines of text separated from the rest by one or more blank lines. To create a header, insert from one to six # characters at the beginning of the line, e.g.

# Header 1

Header 1

## Header 2

Header 2

### Header 3

Header 3

Blocks of quotes are marked with > characters borrowed from e-mails.

> This is a block quote.
>
> This is a second paragraph in a block quote.
>
> ## This is a H2 header in a block quote.

This is a block quote.

This is a second paragraph in a block quote.

This is a H2 header in a block quote.


Italics, bold, verbatim code

Markdown uses asterisks and underscores to highlight text. In a regular paragraph, you can format the code by surrounding the text with the grave accent (`).

Some of this text *is italic*.
Part of this text _is also italic_.

Use two asterisks for **bold**.
Or, if you prefer, __use two underscores__.

Part of this line is formatted LaTeX code `$a_{ij} = b_i \cdot c_j^2$`.

Some of this text is italic. Part of this text is also italic.

Use two asterisks for bold. Or, if you prefer, use two underscores.

Part of this line is formatted LaTeX code $a_{ij} = b_i \cdot c_j^2$.


Lists

Bulleted lists use asterisks, pluses, and minuses (*, +, and -) as list markers. These three tags are interchangeable:

* The first item in the list.
+ The second list item.
- The third item on the list.
  • The first item in the list.
  • The second list item.
  • The third item on the list.

A numbered list uses numbers ending with a dot as list markers:

1. The first item on the list.
2. The second item on the list.
3. Third item on the list.
  1. The first item on the list.
  2. The second item on the list.
  3. Third item on the list.

Bullet lists can be nested using four spaces or a tab:

- The first item on the list.
    - The first item of the first list item.
    - The second item of the first list item.
    - The third item of the first list item.
- The second item on the list.
- The third item on the list.
  • The first item on the list.
    • The first item of the first list item.
    • The second item of the first list item.
    • The third item of the first list item.
  • The second item on the list.
  • The third item on the list.


Pictures

The basic command for inserting external graphics into a document is ![Caption](<file path>). This command differs only in the exclamation mark from the command that inserts a link.

![R Markdown logo](rmarkdown.png)
R Markdown logo


It is important to leave a blank line below the command; otherwise, the image will be inserted inline with the text and will look unattractive.


If the image is too large, it can be made smaller by adding {width="value"} at the end, where the value can be given in pixels or as a percentage of the page width. The original image is 368 pixels wide; let’s reduce it by half.

![](rmarkdown.png){width="184"}


Tables

Creating a table in Markdown is also possible, although it is somewhat cumbersome. The first row contains the column headings separated by vertical bars. In the second row, dashes are placed between the vertical bars separating the columns. The second row also specifies the alignment of the elements in each column by means of colons (:-- for left, :-: for center, --: for right), and it may be different in each column. The remaining rows contain the cell contents, separated by vertical bars. A table made in Markdown will always take up the entire width of the page.

| Default alignment | Align left | Center | Align right |
|---|:--|:-:|--:|
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
Default alignment  Align left  Center  Align right
1                  2             3               4
5                  6             7               8
9                  10            11             12
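
Because typing such tables by hand is tedious, one option is to let R generate the Markdown source. The sketch below assumes that the knitr package is installed and uses kable() with format = "pipe"; the data frame tab_data and its column names are made up for illustration.

```r
# A small hypothetical data frame to turn into a Markdown table
tab_data <- data.frame(
  left   = c(1, 5, 9),
  center = c(2, 6, 10),
  right  = c(3, 7, 11)
)

# align takes one letter per column: l = left, c = center, r = right
md_table <- knitr::kable(tab_data, format = "pipe", align = "lcr")

# Printing shows the generated pipe-table source, one line per row
print(md_table)
```

The second line of the printed result is the separator row, in which the colons encode the alignment requested through align.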


Developing your own R package

There are several reasons to create your own packages. First, it forces you to better organize and document your code. Second, it makes it easier to use the same functions in different projects without having to copy code. Third, it allows you to easily share your code with others. To create your own package you need two packages:

install.packages ("devtools")
install.packages ("roxygen2")
# Optionally to check the names of existing packages
install.packages("available")

There are several ways to establish a new package. In this manual we will use RStudio for this purpose. Additionally, we will learn how to publish a package to the GitHub repository.


Step 1: create a new project

At the top of the menu, select File > New Project... > New Directory > R Package. A dialog box will appear in which we must provide the package name. We can use the suggest() and available() functions from the available package to choose the package name.

available::available ("testr", browse = FALSE)
## ── testr ───────────────────────────────────────────────────────────────────────
## Name valid: ✔
## Available on CRAN: ✔ 
## Available on Bioconductor: ✔
## Available on GitHub:  ✖ 
## Abbreviations: http://www.abbreviations.com/test
## Wikipedia: https://en.wikipedia.org/wiki/test
## Wiktionary: https://en.wiktionary.org/wiki/test
## Sentiment:???

If we intend to make the package available in an external repository, e.g. GitHub, we should check the Create a git repository box. Done, the package project already exists. Let’s take a look at what it contains.

  • .gitignore and .Rbuildignore contain lists of files to be ignored by git or R,

  • DESCRIPTION and NAMESPACE contain metadata about the package. The former contains information about the package name, its author, version, and dependencies on other packages. The latter specifies which functions will be available to the user after loading the package and which functions are imported from other packages. The NAMESPACE file does not need to be edited manually; the devtools package will generate it for us. For this to happen, we first need to delete the automatically created NAMESPACE file, because roxygen2 will not overwrite a file it did not generate itself. More information about DESCRIPTION can be found in the Writing R Extensions manual.

Package: testr
Type: Package
Title: What the Package Does (Title Case)
Version: 0.1.0
Author: Who wrote it
Maintainer: The package maintainer <yourself@somewhere.net>
Description: More about what it does (maybe more than one line)
    Use four spaces when indenting paragraphs within the Description.
License: What license is it under?
Encoding: UTF-8
LazyData: true

  • man will contain the help pages for our functions. We don’t have to look there either if we use devtools.

  • The R folder will contain all the functions that make up the package. Functions can be collected in one file or scattered among many.

  • <projectname>.Rproj refers to the project created in RStudio, not the package itself. When creating a package without using RStudio, this file will not appear.


Step 2: connect the project to GitHub

This step is optional. Currently our project uses a local git repository (which is also optional). To connect your project to GitHub, log in to the site and create a new repository with the same name as your local project. When creating, do not check the Add a README file box. In the RStudio terminal (not the R console), enter the following commands (make sure you are in the project directory):

git remote add origin https://github.com/user-name/project-name.git
git branch -M main

Now you can easily create commits and update the repository from the Tools > Version Control menu or using the button at the top of the window. Attention! If the Push option is grayed out after the first commit, perform this operation from the terminal by typing

git push -u origin main

Another solution is to simply make the first commit from the terminal, but then remember to add the files first with the git add --all command (or, instead of --all, list the files one by one).


Step 3: add functions to your package

To make work on automatic documentation and code organization run smoothly, it is recommended that each function in the package be located in a separate .R file with the same name as the function. Each function should be documented according to the roxygen schema. Detailed instructions can be found by typing vignette("rd") in the R console. Below is a simple example.

#' Add together two numbers
#'
#' @param x A number.
#' @param y A number.
#' @return A number.
#' @examples
#' add(1, 1)
#' add(10, 1)
add <- function(x, y) {
  x + y
}

A roxygen comment starts with the #' characters. The full description of a function consists of four parts, of which only the first (the title) is strictly required.

  1. The first line contains the title (it can additionally be marked with the @title tag).
  2. The paragraph following the title is a short description of the function (it may additionally be marked with the @description tag).
  3. The third section contains the three most important tags: @param, @return and @examples:
    • @param describes the function parameters,
    • @return explains what the function returns,
    • @examples demonstrates how to use the function.
  4. The last section, which may consist of many paragraphs, contains details of the function’s behavior (@details).

It is also worth adding the @export tag because it indicates that the name of the function is to be placed in the NAMESPACE file. Once the function file is ready, we invoke the devtools::document() command to update the contents of man and NAMESPACE.
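
Putting the four parts and the @export tag together, a documented function could look like the sketch below. The subtract() function is a hypothetical companion to add(); its title, description and @details text are illustrative only.

```r
#' Subtract one number from another
#'
#' Computes the difference between two numbers.
#'
#' @param x A number.
#' @param y A number to subtract from \code{x}.
#' @return A number, the value of \code{x - y}.
#' @examples
#' subtract(10, 1)
#' @details
#' This is a thin wrapper around the \code{-} operator, so it is
#' vectorized in the same way.
#' @export
subtract <- function(x, y) {
  x - y
}
```

After saving such a file (for example as R/subtract.R), running devtools::document() would regenerate the help pages in man and the NAMESPACE file.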


Use of external packages

Our package can, of course, use other packages whose installation is necessary for it to work properly. Let’s consider the following example.

#' Read many data frames with the same columns from many text files.
#'
#' @param prefix A string with the first part of the filename.
#' @param range Numeric vector with the numbers which differ from file to file.
#' @param suffix A string with the last part of the filename. Typically consists of a file extension.
#' @param is.header A logical value indicating whether the file contains the names of the variables
#'                  as its first line. The default value is TRUE.
#' @return A single data frame.
#' @examples
#' read.many.tables ("N1000_k8_O0.2_beta0.5_part", 1:10, ".txt")
#' read.many.tables ("precision_N", c(100,200,500,1000,2000,5000), ".txt")
read.many.tables <- function (prefix, range, suffix, is.header=TRUE) {
  dataframes <- lapply (range,
                        function(ii) read.table (paste (prefix,ii,suffix,sep=""), header=is.header))
  data.table::rbindlist(dataframes)
}

In the above example we use the rbindlist() function from the data.table package. It is very important that we do this by specifying the package name and the :: operator, and not by calling the library function. Calling library() or require() is absolutely not allowed inside functions included in a package. Always use the full function name preceded by the package name and the :: operator. Moreover, the DESCRIPTION file should state that our package requires the data.table package:

Imports: data.table

It is also possible to require a minimum version of a package in Imports. Different package names should be separated by commas.

Imports: data.table (>= 1.9.4), dplyr

However, if our package uses the data.table package, but only to a small extent and its absence does not affect the main functionality of our package, we can only suggest its installation.

Suggests: data.table, dplyr, tidyr
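
When a package is only listed under Suggests, the code should check that it is installed before calling it. Below is a minimal sketch, assuming a hypothetical helper bind_tables() that falls back to base R when data.table is missing.

```r
# Hypothetical helper: bind a list of data frames row-wise.
bind_tables <- function(dataframes) {
  if (requireNamespace("data.table", quietly = TRUE)) {
    # data.table is installed: use the fast implementation
    # (the result is a data.table, which is also a data.frame)
    data.table::rbindlist(dataframes)
  } else {
    # Fallback that needs no extra packages
    do.call(rbind, dataframes)
  }
}

parts <- list(data.frame(x = 1:2, y = c("a", "b")),
              data.frame(x = 3:4, y = c("c", "d")))
combined <- bind_tables(parts)
nrow(combined)
```

Unlike library(), requireNamespace() only checks for and loads the package without attaching it, so the user's search path is left unchanged.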


Attaching data sets to our own package

Data sets are often added to R packages so that the package’s functionality can be tested conveniently. To add a data set, save it in the .RData or .rda format in the data/ directory of the project folder using the save function. It is best if the name of the data file is the same as the variable’s name in R.

earnings <- data.frame (name = c("Kowalski", "Nowak", "Wiśniewska"),
                        salary = c(4100, 5700, 6300))
save (earnings, file = "path-to-project/data/earnings.RData")
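
To check that the file was written correctly, it can be loaded back into a separate environment. The sketch below writes to a temporary directory (an assumption made so that it runs anywhere) instead of the project's data/ folder; the names data_dir, rda_file and check_env are illustrative.

```r
earnings <- data.frame(name = c("Kowalski", "Nowak", "Wiśniewska"),
                       salary = c(4100, 5700, 6300))

# A temporary directory stands in for path-to-project/data
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
rda_file <- file.path(data_dir, "earnings.rda")

# The file argument must be named; unnamed arguments are taken as object names
save(earnings, file = rda_file)

# Load into a fresh environment so the original object is not overwritten
check_env <- new.env()
loaded <- load(rda_file, envir = check_env)
loaded                                       # names of the restored objects
all.equal(check_env$earnings, earnings)      # compare with the original
```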

Then, we create a .R file with the data set description written in roxygen2 syntax and place it together with the remaining .R files in the R folder. The file name should be the same as the data set name, e.g., earnings.R. Inside, we place the following elements:

  • The first line contains the title of the data set.
  • The following line begins a short paragraph with a data set description.
  • @docType specifies the document type (enter data).
  • @usage tells you how to load the data set (just specify data (set_name)).
  • @format specifies the data format, most often data.frame or matrix, possibly array or list.
  • The \describe{} macro allows you to add a detailed description of the set, in which we describe each variable (column) separately.
  • The @references tag is essential if the data is not synthetic and its source must be provided.
  • @source - similar to @references.
  • @keywords - keywords, often simply datasets.
  • @examples - examples of using the set by other functions, e.g. plot.

#' Earnings of employees of XYZ company
#'
#' The data set contains the monthly salary of employees of company XYZ.
#'
#' @docType data
#' @usage data (earnings)
#' @format A data frame
#' @keywords datasets earnings
#' @examples
#' data (earnings)
#' barplot (earnings$salary, names.arg = earnings$name)
"earnings"  # roxygen attaches the documentation above to this data set name
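
The example above does not use \describe{}. As a sketch of how the per-column description might be written for the earnings data set, the @format tag could be extended as follows; the wording of the column descriptions is an assumption, since the handout does not document them.

```r
#' Earnings of employees of XYZ company
#'
#' The data set contains the monthly salary of employees of company XYZ.
#'
#' @format A data frame with 3 rows and 2 variables:
#' \describe{
#'   \item{name}{Surname of the employee (character).}
#'   \item{salary}{Monthly salary (numeric); the currency is not stated.}
#' }
"earnings"
```

Each \item{}{} pair names one column and describes it, so the help page of the data set lists the variables one by one.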


Step 4: Installing your own package

Installing the package is very simple. If we have a local copy of the source code, just use the install function from the devtools package pointing to the directory containing our package.

devtools::install ("/home/rpaluch/git/testr")
library (testr)
add (2, 5)
## [1] 7
subtract (2, 10)
## [1] -8

However, if our package is in a repository, e.g. GitHub, CRAN, or Bioconductor, we can use the corresponding functions install_github, install_cran, or install_bioc.

devtools::install_github (repo = "robert-paluch/testr")

If we want to install a package from a private repository, we must provide the appropriate token.

devtools::install_github (repo = "robert-paluch/testr", auth_token = "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")


The Rcpp package

The Rcpp package allows us to write R functions directly in C++, which may speed up our code many times over. To use it, we will need the following tools:

  • Windows users: Rtools
  • macOS users: Xcode
  • Linux users: gcc


Introduction

C++ is a compiled (not interpreted), general-purpose programming language. It is portable, object-oriented, generic, and provides low-level memory manipulation facilities. It has been developed by Bjarne Stroustrup since 1979 as an extension of the C programming language. It was first standardized in 1998 as ISO/IEC 14882:1998. C++ is characterized by high performance, efficiency, and flexibility.

R is implemented in C (and R and Fortran). As a result, everything we can do in R may be implemented in C/C++. For example, the following functions call some compiled code directly:

sum
## function (..., na.rm = FALSE)  .Primitive("sum")
c
## function (...)  .Primitive("c")

The R/C API provides the fastest way (in terms of performance) to communicate with R from the compiled code. However, it is definitely not the most convenient one. All the R objects are handled with the type SEXP, which is a pointer to the SEXPREC structure. For example, this is a C/C++ code equivalent to an R call to c(123.45, 67.89):

SEXP createVectorOfLength2 () { //R/C API
  SEXP result ;
  result = PROTECT (allocVector (REALSXP, 2));
  REAL (result)[0] = 123.45;
  REAL (result)[1] = 67.89;
  UNPROTECT (1);
  return result ;
}

The R/C API may be complicated for many R users to learn and use. Thus, during this lesson, we will discuss the Rcpp package. It simplifies writing compiled code. Just compare the above to:

NumericVector createVectorOfLength2 () { // Rcpp
  return NumericVector::create (123.45, 67.89);
}

Rcpp is a set of convenient C++ wrappers for the whole R/C API. We may use it to remove performance bottlenecks in our R code, implement code that is difficult to vectorize, or whenever we need advanced algorithms, recursion, or data structures.


Rcpp usage

Let us compare two implementations of the function which computes the n-th Fibonacci number:

\(F_0 = 1\)

\(F_1 = 1\)

\(F_n = F_{n-1} + F_{n-2}, \quad n \geq 2\)

Here is an R solution:

fib1 <- function (n) {
  if (n <= 1) return(1)
  last12 <- c(1, 1)
  for (i in 2:n)
    last12 <- c(last12[1]+last12[2], last12[1])
  last12[1]
}
sapply (0:7, fib1)
## [1]  1  1  2  3  5  8 13 21

And this is how we may implement it in Rcpp:

Rcpp::cppFunction ("
  int fib2(int n) {
    if (n <= 1) return 1;
    int last1 = 1;
    int last2 = 1;
    for (int i=2; i<=n; ++i) {
      int last3 = last2;
      last2 = last1;
      last1 = last2+last3;
    }
    return last1;
  }
")

Note that Rcpp::cppFunction compiles, links, and loads a dynamic library.

print (fib2) # a library has been built
## function (n) 
## .Call(<pointer: 0x7f29d0e59fe0>, n)
sapply (0:7, fib2)
## [1]  1  1  2  3  5  8 13 21

Importantly, Rcpp automatically checks and converts the types of the function’s arguments. Note that a non-integer number such as 1.5 is silently truncated to an integer:

fib2 (1L)
## [1] 1
fib2 (1.5)
## [1] 1
fib2 ("1")
## Error in eval(expr, envir, enclos): Not compatible with requested type: [type=character; target=integer].
fib2 (c(1, 2, 3))
## Error in eval(expr, envir, enclos): Expecting a single value: [extent=3].

Here are some benchmarks:

microbenchmark::microbenchmark (fib1(40), fib2(40), times = 200)
## Unit: nanoseconds
##      expr  min   lq     mean median     uq   max neval
##  fib1(40) 7797 8100 9388.350 8292.5 8684.5 41826   200
##  fib2(40)  441  466  779.315  528.0  589.0 26687   200

Typically, we may get a speed gain of roughly 2–50x or more; in the benchmark above, the median time drops by a factor of about 16. However, the time needed to write the code increases significantly. Thus, Rcpp makes sense if writing time << execution time, or in code re-used by others, e.g. whenever we would like to guarantee the best possible performance.
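
The Fibonacci function works on single integers, but cppFunction() handles vectors just as well. As a further illustration (not part of the benchmark above), here is a sketch of simple exponential smoothing, a recurrence that is awkward to vectorize in plain R. The function name exp_smooth and its parameters are hypothetical; running the code requires the Rcpp package and a C++ compiler.

```r
Rcpp::cppFunction("
  NumericVector exp_smooth(NumericVector x, double alpha) {
    int n = x.size();
    NumericVector s(n);              // result vector, initialized with zeros
    if (n == 0) return s;            // nothing to smooth
    s[0] = x[0];                     // the first value is copied unchanged
    for (int i = 1; i < n; ++i) {
      // each element depends on the previous smoothed value
      s[i] = alpha * x[i] + (1 - alpha) * s[i - 1];
    }
    return s;
  }
")

smoothed <- exp_smooth(c(1, 2, 3, 4), alpha = 0.5)
smoothed
## [1] 1.000 1.500 2.250 3.125
```

Here NumericVector plays the role of an R numeric vector on the C++ side, so the function can be called with ordinary R vectors.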

Apart from the inline usage of Rcpp, we may put our C++ code in separate source files. Here are the contents of the test.cpp file:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int fib3 (int n) {
    if (n <= 1) return 1;
    int last1 = 1;
    int last2 = 1;
    for ( int i=2; i <= n; ++i) {
        int last3 = last2;
        last2 = last1;
        last1 = last2 + last3;
    }
    return last1;
}

Call the sourceCpp() function to compile the source file.

Rcpp::sourceCpp ("test.cpp")

The fib3() function may now be called in R.

print (fib3) # a library has been built
## function (n) 
## .Call(<pointer: 0x7f29d0e48fe0>, n)
sapply (0:7, fib3)
## [1]  1  1  2  3  5  8 13 21