.Rmd
documents
With RMarkdown it is easy to create documents or webpages that include text, code and plots. During the practical sessions we will work with RMarkdown, and not with R
scripts. RMarkdown allows you to combine the code required to produce an analysis, its results, visualizations and descriptive text all in one document.
For a detailed guide on RMarkdown see R Markdown: The Definitive Guide. Most of the contents discussed here come from that source.
Compiling an R Markdown file
An .Rmd file is just a plain text file that stores all the text and code needed to produce your final report. Although RStudio lets you preview the output of your code, that output is not itself part of the .Rmd file. When you save and close your Rmd file and open it again later, or send it to someone, the output will be gone (although RStudio has some tricks to keep the output here as well). The proper way to generate the output version of your Rmd file is to compile it. This can be done in RStudio by pressing the Knit
button in the toolbar, or by pressing Shift + Ctrl + K
(Mac: Shift + Cmd + K
). This will produce an HTML file with the same name as your Rmd and located in the same folder. You can open this file in any internet browser (you don’t need an internet connection since it’s a local file) and marvel at the nicely formatted output of your hard work. Another advantage of working with RStudio is that it will open this file automatically, either in a new window or in the “Viewer” pane (you can change this in the RStudio settings).
Important: saving your Rmd file does not update the corresponding HTML file automatically. In contrast, when you compile (“Knit”), your Rmd file will be saved first.
While editing you generally want to save often (you don’t want to lose stuff if something goes wrong) but you don’t need to compile every time you make a minor change, unless you want to see what the output will look like.
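The Knit button is a front end to `rmarkdown::render()`, so the same compilation can also be started from the R console. Below is a minimal sketch, assuming the rmarkdown package and pandoc are installed. The file name and its contents are made up for illustration.

```r
# Write a tiny, hypothetical .Rmd file to a temporary location
fence <- strrep("`", 3)  # three backticks, the chunk delimiter
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Demo report\"",
  "output: html_document",
  "---",
  "",
  "Some text, followed by a code chunk.",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
), rmd_path)

# Compile only if the required tooling is available on this machine
html_path <- NULL
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  # render() returns the path of the generated output file
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
}
html_path
```

As with the Knit button, the HTML file ends up next to the Rmd file, here in the temporary directory.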
Formatting text
Bold and italic text
Text in an R Markdown document (i.e. everything that is not considered code) is written in Markdown syntax. This enables you to format text by surrounding it with special characters.
Italic text is generated by surrounding the text with a pair of either underscores (_text_
) or asterisks (*text*
). Bold text uses a pair of double underscores (__text__
) or asterisks (**text**
). Bold italic text can be achieved by using a combination of the two (__*text*__
).
Lists
Unordered list items start with *
, -
, or +
, and you can nest one list within another list by indenting the sub-list, for example:
- one item
- one item
- one item
    * one more item
    * one more item
    * one more item
The output of the above syntax would be:
- one item
- one item
- one item
    - one more item
    - one more item
    - one more item
A numbered list can be created by starting each item with a number:
1. the first item
2. the second item
3. the third item
Output:
1. the first item
2. the second item
3. the third item
Note: make sure to leave an empty line between text and a list for correct formatting.
# Main section title: first level
## Subsection: second level
### Sub-subsection: third level
#### Sub-sub-subsection: fourth level
In general it’s not recommended to go further than 4 sub-levels of sections.
Note that you can also include a table of contents at the top of your file by specifying toc: true
in the YAML header.
Integrating text and R code
You can insert chunks of R code in your RMarkdown file by wrapping a code block inside ```{r}
and ```
.
In RStudio this is easily done by using the Insert
button in the toolbar or the keyboard shortcut Ctrl + Alt + I
(Cmd + Option + I
on macOS).
For more information, see section 2.6 of The Definitive Guide.
We will demonstrate the use of R code with various examples in the sections below so that you can familiarise yourself with the syntax.
Basic R stuff
Simple calculations
R can be used as a simple calculator. Executing code can be done in various ways. To execute the entire chunk of code:
- In RStudio: click the green arrow in the upper right corner of the code chunk
- Place your cursor inside the chunk and press
Shift + Ctrl + Enter
(Mac: Shift + Cmd + Enter
)
Code can also be run line by line by selecting a line with your cursor and pressing Ctrl + Enter
(Mac: Cmd + Enter
).
Try executing the code below using both ways: line-by-line and the entire chunk.
## [1] 2
## [1] 37
The output should be displayed either right below the code chunk or in the R console. You can change this behaviour by selecting the settings gear in the RStudio toolbar and selecting Chunk Output Inline
or Chunk Output in Console
.
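As an illustration of calculator-style code to try both ways of running, here is a small sketch. The expressions are examples chosen for this purpose and are not necessarily the ones from the original chunk.

```r
# Basic arithmetic; each line prints its result when run
1 + 1        # addition
7 * 5 + 2    # multiplication is evaluated before addition
2^10         # exponentiation
17 %/% 5     # integer division
17 %% 5      # remainder (modulo)

# Store a result in a variable to reuse it later
total <- 7 * 5 + 2
total / 2
```

Running line by line shows one result at a time, while running the whole chunk prints all results in order.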
You can also run all chunks in your Rmd file by pressing the Run
button in the toolbar (upper right) and then selecting Run All
(or using the keyboard shortcut Ctrl + Alt + R
on Windows or Cmd + Option + R
on Mac). Note that this menu also provides additional ways of running R chunks, the most useful ones being:
Run All Chunks Above
: this executes all code chunks preceding your cursor location, starting at the top and then running down.
Restart R and Run All Chunks
: this will restart R and re-run your entire document. Note that this removes all current objects and data from memory and restarts R in a blank state. This means that the results of any code that was run in the console and not saved in a code chunk will be lost. This might seem scary at first, but there are very good reasons to do this often. You can read more about this workflow here.
Functions
Functions are commands in R that perform certain tasks. They take inputs in the form of arguments and return their results as output. They are called using their name, followed by parentheses ()
in which the arguments are specified. There are many built-in functions in R. In addition, you can write your own functions or load other functions through the use of packages (more on that later).
As an example, we can use the function rnorm
to sample 100 draws from a Normal distribution. We assign the result (a vector of 100 random numbers) to a new variable draws
and display the first 6 values by calling the function head
on this new variable.
# sample 100 numbers from a standard normal distribution
# (mean = 0, standard deviation = 1)
draws <- rnorm(100)
head(draws)
## [1] 0.2821262847 -0.1542399503 -1.1655695370 1.2951836232 0.2953700467
## [6] -0.0005475743
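Writing your own function follows the same pattern of arguments in, result out. The sketch below defines a small helper; the name `standardize` is hypothetical and chosen only for this illustration.

```r
# Hypothetical helper: rescale a numeric vector to mean 0 and sd 1
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

set.seed(123)        # make the random draws reproducible
raw <- rnorm(100)    # call a built-in function
z <- standardize(raw) # call our own function in exactly the same way

mean(z)  # approximately 0, up to floating-point error
sd(z)    # approximately 1, up to floating-point error
```

The value of the last expression in the function body is what the function returns.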
Getting help in R: ?
What if instead of drawing from the standard normal distribution we want to sample from a normal with a mean of 5 and a standard deviation of 2? Knowing that rnorm
is the function to generate normally distributed numbers, we can get more information on it by executing ?rnorm
(usually you would do this in the Console, but for demonstration, we run it inside a code chunk here). This will open up the help page for “The Normal Distribution” and all its related functions inside R, one of which is rnorm
.
Read through the documentation. Can you figure out how to change the mean and standard deviation?
The solution is to specify the mean
and sd
arguments.
draws2 <- rnorm(100, mean = 5, sd = 2)
head(draws2)
## [1] 2.244972 6.682070 1.659189 2.248990 4.843019 6.949700
Every built-in R function, and every function loaded through a package, has a help page. In addition, there is a wealth of information available through the wonderful magic of Google. Another great place to look for answers or ask questions yourself is StackOverflow.
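Besides reading the help page, a function's arguments and defaults can be inspected directly from the console. A short sketch using base R tools only:

```r
# Open the help page by name; equivalent to ?rnorm
if (interactive()) {
  help("rnorm")
}

# Show the argument list of a function
args(rnorm)

# Extract the arguments and their default values as a list
rnorm_defaults <- formals(rnorm)
names(rnorm_defaults)   # argument names
rnorm_defaults$mean     # default value of the mean argument
rnorm_defaults$sd       # default value of the sd argument
```

This is a quick way to recall an argument name without leaving the code you are working on.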
Tidyverse
The tidyverse is a set of packages designed to make data science in R more user-friendly. The packages share a common philosophy and grammar for doing data science, which can differ somewhat from how base R works (though that shouldn’t be an excuse not to learn base R!).
Instead of installing each tidyverse package individually, you can install all of them simultaneously by simply calling
install.packages("tidyverse")
Note that this code won’t be executed when compiling this report, because I specified eval=FALSE
in the chunk options. Instead it should be called manually, for example in an R console.
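One way to avoid reinstalling packages on every run is to install only what is missing. The following is a minimal sketch; `missing_packages` is a hypothetical helper defined here, not a function from any package.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages("tidyverse")

# Only contact CRAN in an interactive session, and only if something is missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Because of the `interactive()` check, nothing is installed while the document is being knitted.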
For more information see
Visualizations
Base R
A picture often says more than words (or lines of code output) and R has a rich visualization toolbox that allows us to make powerful visualizations of our data.
For example, to visually verify that the random numbers we generated earlier (which are still stored in memory under the variables draws
and draws2
) are indeed normally distributed, we could make a histogram of them, using the hist
function.
# draws comes from a standard normal (mean = 0, sd = 1)
hist(draws)
# draws2 comes from a normal with mean = 5, sd = 2
hist(draws2)
Note that you can customize these plots in a lot of ways. Looking at the help page of hist
or doing a quick internet search will take you a long way!
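As an illustration of such customization, the sketch below sets a few common arguments of hist. The variable name `demo_draws`, the colour and the labels are arbitrary choices for this example.

```r
set.seed(42)               # reproducible draws for this illustration
demo_draws <- rnorm(100)

# hist() returns the binning invisibly, so it can be stored and inspected
h <- hist(
  demo_draws,
  breaks = 20,                        # suggested number of bins
  col    = "steelblue",               # fill colour of the bars
  main   = "100 draws from N(0, 1)",  # plot title
  xlab   = "Value"                    # x-axis label
)

# Add a dashed vertical line at the sample mean
abline(v = mean(demo_draws), lwd = 2, lty = 2)

sum(h$counts)  # every draw falls in exactly one bin
```

Note that breaks is only a suggestion: R picks "pretty" bin boundaries close to the requested number.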
ggplot2
Although powerful, the base R visualization framework can be somewhat challenging to work with. A good alternative is the ggplot2
package, part of the larger tidyverse
which includes more useful packages for data manipulation and analysis. ggplot2
uses a visualization framework based on the grammar of graphics philosophy, which we won’t get further into here but which can be quite an intuitive way to think about data visualizations.
ggplot2
works best with a data.frame
as input. As an example, we’ll use the mpg
data, which ships with the ggplot2 package (for more info on this data set, see ?mpg
). With just a few lines, we can already create quite elegant visualizations:
library(ggplot2)
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()
You could perfectly recreate this plot with base R, but it’s going to take more lines of code and potential headaches. Still, both frameworks have their strengths and weaknesses and they are complementary in many ways, so it pays to learn both of them rather than stubbornly sticking to one.
There are tons of documentation and tutorials on ggplot2
to be found online. A good place to start is https://ggplot2.tidyverse.org/.
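As a further illustration, here is a sketch of a slightly more customised scatter plot built from the mtcars data that comes with R. The choice of variables, labels and theme is only an example, and the code assumes ggplot2 is installed.

```r
# Only build the plot if ggplot2 is installed
p <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
    geom_point(size = 2) +            # one point per car
    labs(
      x = "Weight (1000 lbs)",
      y = "Miles per gallon",
      colour = "Cylinders"
    ) +
    theme_minimal()                   # lighter background than the default
  print(p)
}
```

Each `+` adds a layer or setting to the plot, which is the core idea behind the grammar of graphics.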
Importing data
See https://r4ds.had.co.nz/data-import.html.
Note that both the base read.csv
and the tidyverse equivalent readr::read_csv()
can read data directly from a URL. That way you don’t have to download the data to your machine or worry about different versions.
## Example from HDA2020 course (https://statomics.github.io/HDA2020/)
uk_foods <- readr::read_csv(
  file = "https://raw.githubusercontent.com/statOmics/HDA2020/data/ukFoods.csv",
  col_names = TRUE
)
## New names:
## * `` -> ...1
## Rows: 17 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ...1
## dbl (4): England, Wales, Scotland, N.Ireland
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
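To try out both functions without a network connection, the sketch below first writes a small CSV file and then reads it back. The file contents are made up for illustration, and the readr part only runs if that package is installed.

```r
# Write a small, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "item,group_a,group_b",
  "first,1,2",
  "second,3,4"
), csv_path)

# read.csv() accepts a local path or a URL as its first argument
foods <- read.csv(csv_path)
str(foods)

# readr::read_csv() is called the same way and returns a tibble
if (requireNamespace("readr", quietly = TRUE)) {
  foods_tbl <- readr::read_csv(csv_path, show_col_types = FALSE)
  print(foods_tbl)
}
```

Replacing `csv_path` with a URL string is all that is needed to read a remote file instead.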
Summary of data
To get a quick summary of the data, you can use the summary
function. This will return some summary statistics for the columns present in the data.
## iris is a default data set available in R
## To get more info, use `?iris`
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
## [1] "data.frame"
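A few other base R functions are handy for a first look at a data set. A short sketch using iris:

```r
class(iris)     # type of the object
dim(iris)       # number of rows and columns
str(iris)       # compact overview of every column
head(iris, 3)   # first three rows

# summary() also works on a single column
summary(iris$Sepal.Length)

# Counts per level for a categorical column
species_counts <- table(iris$Species)
species_counts
```

Together these answer the usual first questions: what kind of object is this, how big is it, and what is in each column.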
Subsetting data
You can select subsets of a data.frame
by using square brackets []
and specifying the number(s) of the row(s) or column(s) you want to select. Alternatively, you can use the dollar sign $
to select a column using its name.
To reduce the output printed out, we will first make a subset of the iris
data containing just the first 10 rows.
# make subset of data to prevent long outputs
iris_sub <- iris[1:10, ]
iris_sub
# Select first column; single brackets return a data.frame
iris_sub[1]
# double brackets return a vector
iris_sub[[1]]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris_sub[["Sepal.Length"]]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
# using the dollar sign, also returns vector
iris_sub$Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
# Select all columns except the first one
iris_sub[-1]
# Selecting rows
iris_sub[1, ] # Select the first row
iris_sub[1:5, ] # Select the first five rows
iris_sub[c(2, 4), ] # select the second and fourth rows
# columns and rows
iris_sub[1:5, "Sepal.Length"] # first 5 rows of "Sepal.Length" column
## [1] 5.1 4.9 4.7 4.6 5.0
iris_sub[3, 2] # third row, second column
## [1] 3.2
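Square brackets also accept a logical vector, which makes it possible to select rows by a condition rather than by position. A minimal sketch on the same 10-row subset; the thresholds are arbitrary example values.

```r
iris_sub <- iris[1:10, ]

# Build a logical vector with one TRUE/FALSE per row
is_long <- iris_sub$Sepal.Length > 4.9
is_long

# Use it in the row position to keep only the matching rows
long_rows <- iris_sub[is_long, ]
long_rows

# Conditions can be combined, and columns selected by name at the same time
iris_sub[
  iris_sub$Sepal.Length > 4.9 & iris_sub$Sepal.Width < 3.6,
  c("Sepal.Length", "Sepal.Width")
]

# subset() expresses the same row filter without repeating the data name
long_rows2 <- subset(iris_sub, Sepal.Length > 4.9)
nrow(long_rows2)
```

Keep the comma in `iris_sub[is_long, ]`: without it, the logical vector would be applied to columns instead of rows.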
Including R object values or results in the text
You can add inline R code results by wrapping them inside `r `
. This is useful for discussing the value of a result in your text. Instead of copying the value of a result by hand (which is error-prone and not robust to changes), you can just call it inside the text. For example, we could calculate the mean sepal length of the iris flowers as follows:
## [1] 5.843333
We could copy the value inside our text, but a better way is by just running the code inline as mean(iris$Sepal.Length)
surrounded by `r
and `
. So we could say that the iris flowers in our data have a mean length of 5.8433333.
Note: to actually see the value of the inline code, place your cursor inside the inline `r ` expression
and press Ctrl + Enter
or Cmd + Enter
. When you knit your Rmd file and build the HTML output, the inline R code will be replaced by its output value.
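Inline results are often easier to read when rounded. The sketch below prepares a rounded value that could then be used inline, for example as `r mean_len_rounded`. The variable names are chosen for this illustration.

```r
mean_len <- mean(iris$Sepal.Length)

# Round once and reuse the rounded value in the text
mean_len_rounded <- round(mean_len, 2)
mean_len_rounded

# The sentence as it would read after knitting
sentence <- paste0(
  "The mean sepal length is ",
  format(mean_len_rounded, nsmall = 2),
  " cm."
)
sentence
```

Because the number is computed from the data, the sentence updates automatically when the data change.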
Including equations
You can also include equations in your text by writing them in LaTeX syntax and surrounding them with a pair of double dollar signs $$
. This is useful to specify models. For example, we can write the equation of a linear model as
$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$
Which will be converted to the following output:
\[
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
\]
Note that RStudio will give you a preview of what your equation will look like in the final document.
You can also include inline LaTeX equations by using a pair of single $
signs. For example, the following sentence:
The sample mean of $y$ is given by $\bar{y}=\sum\limits_{i=1}^{n}\frac{y_i}{n}$
Will be converted to:
The sample mean of \(y\) is given by \(\bar{y}=\sum\limits_{i=1}^{n}\frac{y_i}{n}\)
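To connect the formula to code, the sketch below computes the sample mean exactly as written in the equation and compares it with the built-in mean function. Using the sepal lengths from iris as y is an arbitrary choice for this example.

```r
y <- iris$Sepal.Length
n <- length(y)

# Sample mean written out as in the formula: sum over i of y_i / n
y_bar_manual <- sum(y / n)

# Built-in equivalent
y_bar_builtin <- mean(y)

# The two agree up to floating-point error
all.equal(y_bar_manual, y_bar_builtin)
```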
Session Info
Finally, it’s always good practice to include the Session Info for your R session in your document. That way, other people (including your future self) looking at your document can see which versions of R and of the loaded packages were used, which can be essential for reproducibility.
You can also include the date with Sys.time()
so you have a time stamp of when the report was compiled.
## [1] "2022-01-25 11:16:58 UTC"
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] ggplot2_3.3.5
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.25 bslib_0.3.0
## [4] purrr_0.3.4 colorspace_2.0-2 vctrs_0.3.8
## [7] generics_0.1.0 htmltools_0.5.2 yaml_2.2.1
## [10] utf8_1.2.2 rlang_0.4.11 jquerylib_0.1.4
## [13] pillar_1.6.3 glue_1.4.2 withr_2.4.2
## [16] DBI_1.1.1 bit64_4.0.5 lifecycle_1.0.1
## [19] stringr_1.4.0 munsell_0.5.0 gtable_0.3.0
## [22] evaluate_0.14 labeling_0.4.2 knitr_1.33
## [25] tzdb_0.1.2 fastmap_1.1.0 parallel_4.1.2
## [28] curl_4.3.2 fansi_0.5.0 highr_0.9
## [31] readr_2.0.1 renv_0.15.2 scales_1.1.1
## [34] BiocManager_1.30.16 vroom_1.5.4 jsonlite_1.7.2
## [37] farver_2.1.0 bit_4.0.4 hms_1.1.1
## [40] digest_0.6.28 stringi_1.7.4 dplyr_1.0.7
## [43] grid_4.1.2 cli_3.1.1 tools_4.1.2
## [46] magrittr_2.0.1 sass_0.4.0 tibble_3.1.5
## [49] crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.2
## [52] assertthat_0.2.1 rmarkdown_2.10 R6_2.5.1
## [55] compiler_4.1.2
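Output like the above can be produced with a chunk along the following lines. This is a sketch using base R functions only, and the variable names are chosen for illustration.

```r
# Time stamp of when the document was compiled
compiled_at <- Sys.time()
compiled_at

# R version, platform and package versions used for this run
info <- sessionInfo()
info
```

Placing this chunk at the very end of the document ensures that all packages loaded along the way are listed.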
