1 Aim of this exercise

An exploratory data analysis is a crucial step in a data analysis to gain insight into the nature and distribution of the data, and to assess the assumptions of the downstream data analysis.

In this exercise you will acquire the skills to conduct a data exploration for a two-group comparison in R and to interpret the results.

2 Background

Researchers wanted to study the immune response to pertussis. They set up an experiment with 40 rats: 16 rats were infected with pertussis and 24 rats received a control treatment. The researchers measured the white blood cell concentration (WBC) in each rat (count per mm\(^3\)).

The data consist of two variables:

  • WBC: white blood cell count (counts/mm\(^3\)).

  • trt: treatment

    • control: rat received control treatment
    • pertussis: rat was infected with pertussis

Load the libraries

library(tidyverse)

3 Import the dataset

Data path:

https://raw.githubusercontent.com/statOmics/PSLSData/main/wbcon.csv

wbcon <- read_csv("https://raw.githubusercontent.com/statOmics/PSLSData/main/wbcon.csv")
## Rows: 40 Columns: 2
## ── Column specification ──────────────────────────────────────────────
## Delimiter: ","
## chr (1): trt
## dbl (1): WBC
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(wbcon)
## Rows: 40
## Columns: 2
## $ WBC <dbl> 10252, 10467, 10601, 10638, 10901, 11071, 11092, 11371, …
## $ trt <chr> "control", "control", "control", "control", "control", "…
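
Before going further, it can help to confirm that the imported treatment labels match the design described in the Background section (24 control rats and 16 infected rats). The sketch below is one way to do that. The vector `trt_design` is a hypothetical stand-in built only from those stated group sizes, not the real data.

```r
# Hypothetical stand-in for wbcon$trt, built only from the design
# stated in the Background section (24 control, 16 pertussis)
trt_design <- rep(c("control", "pertussis"), times = c(24, 16))

# Tabulate the number of rats per treatment group
group_sizes <- table(trt_design)
group_sizes

# Storing the treatment as a factor fixes the group order
# (control first), which keeps plots and summaries consistent
trt_factor <- factor(trt_design, levels = c("control", "pertussis"))
levels(trt_factor)
```

With the real data, `table(wbcon$trt)` or `wbcon %>% count(trt)` should give the same kind of tabulation.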

3.1 Aim of the study

The overarching goal of this study was to assess if the white blood cell count changes upon pertussis infection. To this end, researchers randomized 40 rats to two treatments: a control treatment and a treatment in which the rat was infected with pertussis.

We will explore the data to gain insight into the impact of the pertussis infection on the white blood cell count.

A secondary goal of the data exploration is to assess assumptions that will be required to use a formal statistical test to assess if the white blood cell count is on average different between infected and control rats (see later exercises).

For this test to be valid, we have to assess the following assumptions:

  1. The data in each treatment group are normally distributed.

  2. The data from the two treatment groups have the same variance.

4 Data visualization

A crucial first step in a data analysis is to visualize and to explore the raw data.

4.1 Histogram

First, make a histogram of the data. Fill in the missing parts in the chunk of code below:

wbcon %>%
  ggplot() +
  geom_histogram(aes(x = WBC, fill = trt), color = "black") +
  facet_grid(rows = vars(trt)) +
  theme_bw() +
  xlab("WBC (count/mm3)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
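
The message above suggests choosing the bin width explicitly. One common rule of thumb, which is not necessarily the choice the authors would make, is the Freedman-Diaconis rule: bin width = 2 · IQR / n^(1/3). A minimal sketch follows, where `fd_binwidth` is a hypothetical helper name and the toy vector is made up for illustration.

```r
# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
# fd_binwidth is a hypothetical helper name, not part of ggplot2
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]                 # ignore missing values
  2 * IQR(x) / length(x)^(1 / 3)
}

# Made-up toy values, only to show the call
toy <- 1:8
bw <- fd_binwidth(toy)
bw
```

For the exercise data, the result could be passed on as `geom_histogram(aes(x = WBC, fill = trt), color = "black", binwidth = fd_binwidth(wbcon$WBC))`. Note that this computes a single width from the pooled data, so both facets share the same bins.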

Based on this plot, it seems that the white blood cell counts are higher for rats that were infected with pertussis than for rats that received the control treatment.

4.2 Boxplots

However, given the relatively small sample size, boxplots are a better option to visualize these data. Histograms become useful in larger datasets; 30 observations per group is a bare minimum.

wbcon %>% ggplot(aes(x = trt, y = WBC, fill = trt)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = "jitter") +
  ylab("WBC (count/mm3)") +
  xlab("treatment") +
  stat_summary(
    fun = mean, geom = "point",
    shape = 5, size = 3, color = "black"
  )

What do you observe?

Both the mean and the variance of the data seem to differ between control rats and rats infected with pertussis.
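
The visual impression from the boxplots can be backed up with a quick numeric check of the group means and variances. The sketch below uses base R on a made-up toy data frame that only borrows the column names `WBC` and `trt`; the values are not from the study.

```r
# Made-up toy data with the same column names as wbcon
toy <- data.frame(
  WBC = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10),
  trt = rep(c("control", "pertussis"), each = 5)
)

# Mean and variance of WBC within each treatment group
group_means <- tapply(toy$WBC, toy$trt, mean)
group_vars <- tapply(toy$WBC, toy$trt, var)

# Ratio of the larger to the smaller group variance;
# values far from 1 suggest unequal variances
var_ratio <- max(group_vars) / min(group_vars)
var_ratio
```

Replacing `toy` with `wbcon` gives the corresponding numbers for the rats. With small groups the ratio is itself quite variable, so it is best read alongside the boxplots rather than on its own.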

4.3 QQ-plots

To assess the assumption that the data are normally distributed in each treatment group, we will use QQ plots.

wbcon %>%
  ggplot(aes(sample = WBC)) +
  geom_qq() +
  geom_qq_line() +
  facet_grid(cols = vars(trt))

What do you observe?

The white blood cell counts appear to be normally distributed in both treatment groups.
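
To make clear what a normal QQ plot shows, the sketch below computes its coordinates by hand, in the spirit of base R's `qqnorm()`: the ordered observations are paired with the standard normal quantiles expected at the same ranks. `qq_coords` is a hypothetical helper name and the toy sample is constructed, not measured.

```r
# Coordinates of a normal QQ plot, computed by hand
# qq_coords is a hypothetical helper name
qq_coords <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  data.frame(
    theoretical = qnorm(ppoints(n)),  # expected standard normal quantiles
    sample = x                        # ordered observations
  )
}

# Constructed toy sample that lies exactly on a line:
# 5 + 2 * standard normal quantiles
toy <- 5 + 2 * qnorm(ppoints(10))
qq <- qq_coords(toy)

# A correlation near 1 means the points follow a straight line
cor(qq$theoretical, qq$sample)
```

If the data are normally distributed, the points scatter around a straight line whose intercept and slope reflect the mean and standard deviation. Systematic curvature points to skewness or heavy tails.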

5 Descriptive statistics

Here, we will generate some informative descriptive statistics for the dataset.

We first summarize the data by calculating the mean, standard deviation, number of observations and standard error, and store the result in an object wbcSum via wbcSum <-

  1. We pipe the wbcon data frame to the group_by function to group the data by treatment group: group_by(trt)
  2. We pipe the result to the summarize() function to summarize the “WBC” variable and calculate the mean, standard deviation and the number of observations
  3. We pipe the result to the mutate function to make a new variable in the data frame, named se, for which we calculate the standard error (\(\sigma / \sqrt{n}\))

wbcSum <- wbcon %>%
  group_by(trt) %>%
  summarize(
      mean = mean(WBC, na.rm = TRUE),
      sd = sd(WBC, na.rm = TRUE),
      n = n()
  ) %>%
  mutate(se = sd / sqrt(n))
wbcSum
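
The standard error computed in the pipeline is the standard deviation divided by the square root of the group size. One detail worth knowing: `n()` counts all rows, including rows with a missing WBC, whereas `mean()` and `sd()` with `na.rm = TRUE` drop them. Whether wbcon contains missing values is not checked here, so this only matters as a precaution. The sketch below uses a hypothetical helper `std_error` and made-up toy values to show the difference.

```r
# std_error is a hypothetical helper:
# sd / sqrt(number of non-missing values)
std_error <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Made-up toy values, one of them missing
toy <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)

se_nonmissing <- std_error(toy)                            # uses n = 8
se_all_rows <- sd(toy, na.rm = TRUE) / sqrt(length(toy))   # uses n = 9

c(se_nonmissing = se_nonmissing, se_all_rows = se_all_rows)
```

In the pipeline, the equivalent precaution would be `n = sum(!is.na(WBC))` inside `summarize()`. Without missing values, both versions give the same result.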

This concludes the data exploration. In the next exercise sessions, we will learn how to formally test if the observed difference in WBC between rats that were infected with pertussis and those receiving the control treatment is statistically significant.
