Aims of this exercise
In this exercise, you will learn how data exploration and plots can help you to discover confounding in a real datasets.
You will also improve your data wrangling skills by importing, tidying, wrangling and visualizing data yourself.
The FEV dataset
The forced expiratory volume
(FEV)
is a measure of how much air a person can exhale (in liters)
during a forced breath. In this dataset, the FEV of 606 children,
between the ages of 6 and 17, were measured. The dataset
also provides additional information on these children:
their age
, their height
, their gender
and, most
importantly, whether the child is a smoker or a non-smoker.
The goal of this experiment was to find out whether or not
smoking has an effect on the FEV of children.
Note: to analyse this dataset properly, we will need some
relatively advanced modeling techniques. At the end of this
week, you will have seen all three required steps to analyse
such a dataset! For now, we will limit ourselves to exploring
the data.
Load libraries
If you do not have these libraries installed, make sure to install them first
by using the install.packages()
function with missing the package name inside
the parentheses (and using quotation marks, like install.packages("car")
)
library(readr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(car)
Import the data
Note: fev.txt
is a tab-separated file: make sure to select the correct readr
function!
fev <- read_tsv("https://raw.githubusercontent.com/statOmics/PSLSData/main/fev.txt")
## Rows: 606 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): gender
## dbl (4): age, fev, height, smoking
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data wrangling
There are a few things in the formatting of the
data that can be improved:
Both gender
and smoking
can be transformed to factors.
The height
variable is written in inches. Assuming that
this audience is mainly Portuguese/Belgian, inches are hard to
interpret. Let’s add a new column, height_cm
, with the values
converted to centimeters.
fev <- fev %>%
mutate(gender = as.factor(gender)) %>%
mutate(smoking = as.factor(smoking)) %>%
mutate(height_cm = height * 2.54)
head(fev)
That’s better!
Data exploration
Now, let’s make a first explorative plot, showing
only the FEV for both smoking categories.
Which type of plot do you suggest?
fev %>%
ggplot(aes(x = smoking, y = fev, fill = smoking)) +
scale_fill_manual(values = c("dimgrey", "firebrick")) +
theme_bw() +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, size = 0.1) +
ggtitle("Boxplot of FEV versus smoking") +
ylab("FEV (l)") +
xlab("smoking status")

Did you expect these results?
It appears that children that smoke have a higher
median FEV than children that do not smoke.
fev %>%
ggplot(aes(x = age, y = fev, color = smoking)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The trends in the plot seems to match much more with our expectations.
First, it seems
that we do not have any smoking children of ages 6, 7 or 8.
Second, it no longer seems that smokers have a higher FEV
than non-smokers; on the contrary smoking seems to lead to a lower lung capacity for the elder pupils.
This shows that accounting for important confounders
(in this case age) is crucial! If we simply analysed based on
the smoking status and FEV values only, we would have made wrong conclusions.
Can we provide additional insights in the data, e.g. by accounting for other useful explanatory variables?
fev %>%
ggplot(aes(x = age, y = fev, color = smoking)) +
geom_point() +
geom_smooth() +
facet_grid(cols = vars(gender))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This plot holds one extra level of information, the gender
of the child. Especially for higher ages, the FEV
is higher for males as compared to females.
The only source of information that is lacking is height
.
We could simply make a
scatterplot displaying the FEV in function of a child’s
height (in cm). Additionally, we could color the dots based
on gender and making two plots according to the smoking status.
fev %>%
ggplot(aes(x = height_cm, y = fev, color = gender)) +
geom_point() +
scale_color_manual(values = c("darkorchid", "olivedrab4")) +
theme_bw() +
ggtitle("Boxplot of FEV versus height") +
ylab("FEV (l)") +
xlab("Height (cm)") +
facet_grid(cols = vars(smoking))

There is a clear relationship between height and FEV.
In addition, we see that for the large height values
(>175cm), we mainly find male subjects.
LS0tCnRpdGxlOiAiRXhlcmNpc2UgNC40OiBFeHBsb3JpbmcgdGhlIEZFViBkYXRhc2V0IC0gc29sdXRpb24iCmF1dGhvcjogIkxpZXZlbiBDbGVtZW50LCBKZXJvZW4gR2lsaXMgYW5kIE1pbGFuIE1hbGZhaXQiCmRhdGU6ICJzdGF0T21pY3MsIEdoZW50IFVuaXZlcnNpdHkgKGh0dHBzOi8vc3RhdG9taWNzLmdpdGh1Yi5pbykiCi0tLQoKIyBBaW1zIG9mIHRoaXMgZXhlcmNpc2UKCkluIHRoaXMgZXhlcmNpc2UsIHlvdSB3aWxsIGxlYXJuIGhvdyBkYXRhIGV4cGxvcmF0aW9uIGFuZCBwbG90cyBjYW4gaGVscCB5b3UgdG8gZGlzY292ZXIgY29uZm91bmRpbmcgaW4gYSByZWFsIGRhdGFzZXRzLgoKWW91IHdpbGwgYWxzbyBpbXByb3ZlIHlvdXIgZGF0YSB3cmFuZ2xpbmcgc2tpbGxzIGJ5IGltcG9ydGluZywgdGlkeWluZywgd3JhbmdsaW5nIGFuZCB2aXN1YWxpemluZyBkYXRhIHlvdXJzZWxmLgoKCiMgVGhlIEZFViBkYXRhc2V0CgpUaGUgYGZvcmNlZCBleHBpcmF0b3J5IHZvbHVtZWAgKEZFVikKaXMgYSBtZWFzdXJlIG9mIGhvdyBtdWNoIGFpciBhIHBlcnNvbiBjYW4gZXhoYWxlIChpbiBsaXRlcnMpCmR1cmluZyAgYSBmb3JjZWQgYnJlYXRoLiBJbiB0aGlzIGRhdGFzZXQsIHRoZSBGRVYgb2YgNjA2IGNoaWxkcmVuLApiZXR3ZWVuIHRoZSBhZ2VzIG9mIDYgYW5kIDE3LCB3ZXJlIG1lYXN1cmVkLiBUaGUgZGF0YXNldAphbHNvIHByb3ZpZGVzIGFkZGl0aW9uYWwgaW5mb3JtYXRpb24gb24gdGhlc2UgY2hpbGRyZW46CnRoZWlyIGBhZ2VgLCB0aGVpciBgaGVpZ2h0YCwgdGhlaXIgYGdlbmRlcmAgYW5kLCBtb3N0CmltcG9ydGFudGx5LCB3aGV0aGVyIHRoZSBjaGlsZCBpcyBhIHNtb2tlciBvciBhIG5vbi1zbW9rZXIuCgpUaGUgZ29hbCBvZiB0aGlzIGV4cGVyaW1lbnQgd2FzIHRvIGZpbmQgb3V0IHdoZXRoZXIgb3Igbm90CnNtb2tpbmcgaGFzIGFuIGVmZmVjdCBvbiB0aGUgRkVWIG9mIGNoaWxkcmVuLgoKTm90ZTogdG8gYW5hbHlzZSB0aGlzIGRhdGFzZXQgcHJvcGVybHksIHdlIHdpbGwgbmVlZCBzb21lCnJlbGF0aXZlbHkgYWR2YW5jZWQgbW9kZWxpbmcgdGVjaG5pcXVlcy4gQXQgdGhlIGVuZCBvZiB0aGlzCndlZWssIHlvdSB3aWxsIGhhdmUgc2VlbiBhbGwgdGhyZWUgcmVxdWlyZWQgc3RlcHMgdG8gYW5hbHlzZQpzdWNoIGEgZGF0YXNldCEgRm9yIG5vdywgd2Ugd2lsbCBsaW1pdCBvdXJzZWx2ZXMgdG8gZXhwbG9yaW5nCnRoZSBkYXRhLgoKIyBMb2FkIGxpYnJhcmllcwoKSWYgeW91IGRvIG5vdCBoYXZlIHRoZXNlIGxpYnJhcmllcyBpbnN0YWxsZWQsIG1ha2Ugc3VyZSB0byBpbnN0YWxsIHRoZW0gZmlyc3QKYnkgdXNpbmcgdGhlIGBpbnN0YWxsLnBhY2thZ2VzKClgIGZ1bmN0aW9uIHdpdGggbWlzc2luZyB0aGUgcGFja2FnZSBuYW1lIGluc2lkZQp0aGUgcGFyZW50aGVzZXMgKGFuZCB1c2luZyBxdW90YXRpb24gbWFya3MsIGxpa2UgYGluc3RhbGwucGFja2FnZXMoImNhciIpYCkKCmBgYHtyLCBtZXNzYWdlID0gRkFMU0UsIHdhcm5pbmc9RkFMU0V9CmxpYnJhcnkocmVhZHIpCmxpYnJhcnkoZHBseXIpCmxpYnJhcnkodGlkeXZlcnNlKQpsaWJyYXJ5KGdncGxvdDIpCmxpYnJhcnkoY2FyKQpgYGAKCiMgSW1wb3J0IHRoZSBkYXRhCgpOb3RlOiBgZmV2LnR4dGAgaXMgYSB0YWItc2VwYXJhdGVkIGZpbGU6IG1ha2Ugc3VyZSB0byBzZWxlY3QgdGhlIGNvcnJlY3QgYHJlYWRyYApmdW5jdGlvbiEKCmBgYHtyfQpmZXYgPC0gcmVhZF90c3YoImh0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9zdGF0T21pY3MvUFNMU0RhdGEvbWFpbi9mZXYudHh0IikKaGVhZChmZXYpCmBgYAoKIyBEYXRhIHdyYW5nbGluZwoKVGhlcmUgYXJlIGEgZmV3IHRoaW5ncyBpbiB0aGUgZm9ybWF0dGluZyBvZiB0aGUKZGF0YSB0aGF0IGNhbiBiZSBpbXByb3ZlZDoKCjEuIEJvdGggYGdlbmRlcmAgYW5kIGBzbW9raW5nYCBjYW4gYmUgdHJhbnNmb3JtZWQgdG8gZmFjdG9ycy4KCjIuIFRoZSBgaGVpZ2h0YCB2YXJpYWJsZSBpcyB3cml0dGVuIGluIGluY2hlcy4gQXNzdW1pbmcgdGhhdAp0aGlzIGF1ZGllbmNlIGlzIG1haW5seSBQb3J0dWd1ZXNlL0JlbGdpYW4sIGluY2hlcyBhcmUgaGFyZCB0bwppbnRlcnByZXQuIExldCdzIGFkZCBhIG5ldyBjb2x1bW4sIGBoZWlnaHRfY21gLCB3aXRoIHRoZSB2YWx1ZXMKY29udmVydGVkIHRvIGNlbnRpbWV0ZXJzLgoKYGBge3J9CmZldiA8LSBmZXYgJT4lCiAgbXV0YXRlKGdlbmRlciA9IGFzLmZhY3RvcihnZW5kZXIpKSAlPiUKICBtdXRhdGUoc21va2luZyA9IGFzLmZhY3RvcihzbW9raW5nKSkgJT4lCiAgbXV0YXRlKGhlaWdodF9jbSA9IGhlaWdodCAqIDIuNTQpCgpoZWFkKGZldikKYGBgCgpUaGF0J3MgYmV0dGVyIQoKIyBEYXRhIGV4cGxvcmF0aW9uCgpOb3csIGxldCdzIG1ha2UgYSBmaXJzdCBleHBsb3JhdGl2ZSBwbG90LCBzaG93aW5nCm9ubHkgdGhlIEZFViBmb3IgYm90aCBzbW9raW5nIGNhdGVnb3JpZXMuCgpXaGljaCB0eXBlIG9mIHBsb3QgZG8geW91IHN1Z2dlc3Q/CgpgYGB7cn0KZmV2ICU+JQogIGdncGxvdChhZXMoeCA9IHNtb2tpbmcsIHkgPSBmZXYsIGZpbGwgPSBzbW9raW5nKSkgKwogIHNjYWxlX2ZpbGxfbWFudWFsKHZhbHVlcyA9IGMoImRpbWdyZXkiLCAiZmlyZWJyaWNrIikpICsKICB0aGVtZV9idygpICsKICBnZW9tX2JveHBsb3Qob3V0bGllci5zaGFwZSA9IE5BKSArCiAgZ2VvbV9qaXR0ZXIod2lkdGggPSAwLjIsIHNpemUgPSAwLjEpICsKICBnZ3RpdGxlKCJCb3hwbG90IG9mIEZFViB2ZXJzdXMgc21va2luZyIpICsKICB5bGFiKCJGRVYgKGwpIikgKwogIHhsYWIoInNtb2tpbmcgc3RhdHVzIikKYGBgCgpEaWQgeW91IGV4cGVjdCB0aGVzZSByZXN1bHRzPwoKSXQgYXBwZWFycyB0aGF0IGNoaWxkcmVuIHRoYXQgc21va2UgaGF2ZSBhIGhpZ2hlcgptZWRpYW4gRkVWIHRoYW4gY2hpbGRyZW4gdGhhdCBkbyBub3Qgc21va2UuCgpgYGB7cn0KZmV2ICU+JQogIGdncGxvdChhZXMoeCA9IGFnZSwgeSA9IGZldiwgY29sb3IgPSBzbW9raW5nKSkgKwogICAgZ2VvbV9wb2ludCgpICsKICAgIGdlb21fc21vb3RoKCkKYGBgCgpUaGUgdHJlbmRzIGluIHRoZSBwbG90IHNlZW1zIHRvIG1hdGNoIG11Y2ggbW9yZSB3aXRoIG91ciBleHBlY3RhdGlvbnMuCgpGaXJzdCwgaXQgc2VlbXMKdGhhdCB3ZSBkbyBub3QgaGF2ZSBhbnkgc21va2luZyBjaGlsZHJlbiBvZiBhZ2VzIDYsIDcgb3IgOC4KU2Vjb25kLCBpdCBubyBsb25nZXIgc2VlbXMgdGhhdCBzbW9rZXJzIGhhdmUgYSBoaWdoZXIgRkVWCnRoYW4gbm9uLXNtb2tlcnM7IG9uIHRoZSBjb250cmFyeSBzbW9raW5nIHNlZW1zIHRvIGxlYWQgdG8gYSBsb3dlciBsdW5nIGNhcGFjaXR5IGZvciB0aGUgZWxkZXIgcHVwaWxzLgoKVGhpcyBzaG93cyB0aGF0IGFjY291bnRpbmcgZm9yIGltcG9ydGFudCBjb25mb3VuZGVycwooaW4gdGhpcyBjYXNlIGFnZSkgaXMgY3J1Y2lhbCEgSWYgd2Ugc2ltcGx5IGFuYWx5c2VkIGJhc2VkIG9uCnRoZSBzbW9raW5nIHN0YXR1cyBhbmQgRkVWIHZhbHVlcyBvbmx5LCB3ZSB3b3VsZCBoYXZlIG1hZGUgd3JvbmcgY29uY2x1c2lvbnMuCgpDYW4gd2UgcHJvdmlkZSBhZGRpdGlvbmFsIGluc2lnaHRzIGluIHRoZSBkYXRhLCBlLmcuIGJ5IGFjY291bnRpbmcgZm9yIG90aGVyIHVzZWZ1bCBleHBsYW5hdG9yeSB2YXJpYWJsZXM/CgpgYGB7cn0KZmV2ICU+JQogIGdncGxvdChhZXMoeCA9IGFnZSwgeSA9IGZldiwgY29sb3IgPSBzbW9raW5nKSkgKwogICAgZ2VvbV9wb2ludCgpICsKICAgIGdlb21fc21vb3RoKCkgKwogICAgZmFjZXRfZ3JpZChjb2xzID0gdmFycyhnZW5kZXIpKQpgYGAKClRoaXMgcGxvdCBob2xkcyBvbmUgZXh0cmEgbGV2ZWwgb2YgaW5mb3JtYXRpb24sIHRoZSBnZW5kZXIKb2YgdGhlIGNoaWxkLiBFc3BlY2lhbGx5IGZvciBoaWdoZXIgYWdlcywgdGhlIEZFVgppcyBoaWdoZXIgZm9yIG1hbGVzIGFzIGNvbXBhcmVkIHRvIGZlbWFsZXMuCgpUaGUgb25seSBzb3VyY2Ugb2YgaW5mb3JtYXRpb24gdGhhdCBpcyBsYWNraW5nIGlzIGBoZWlnaHRgLgpXZSBjb3VsZCBzaW1wbHkgbWFrZSBhCnNjYXR0ZXJwbG90IGRpc3BsYXlpbmcgdGhlIEZFViBpbiBmdW5jdGlvbiBvZiBhIGNoaWxkJ3MKaGVpZ2h0IChpbiBjbSkuIEFkZGl0aW9uYWxseSwgd2UgY291bGQgY29sb3IgdGhlIGRvdHMgYmFzZWQKb24gZ2VuZGVyIGFuZCBtYWtpbmcgdHdvIHBsb3RzIGFjY29yZGluZyB0byB0aGUgc21va2luZyBzdGF0dXMuCgpgYGB7cn0KZmV2ICU+JQogIGdncGxvdChhZXMoeCA9IGhlaWdodF9jbSwgeSA9IGZldiwgY29sb3IgPSBnZW5kZXIpKSArCiAgZ2VvbV9wb2ludCgpICsKICBzY2FsZV9jb2xvcl9tYW51YWwodmFsdWVzID0gYygiZGFya29yY2hpZCIsICJvbGl2ZWRyYWI0IikpICsKICB0aGVtZV9idygpICsKICBnZ3RpdGxlKCJCb3hwbG90IG9mIEZFViB2ZXJzdXMgaGVpZ2h0IikgKwogIHlsYWIoIkZFViAobCkiKSArCiAgeGxhYigiSGVpZ2h0IChjbSkiKSArCiAgZmFjZXRfZ3JpZChjb2xzID0gdmFycyhzbW9raW5nKSkKYGBgCgpUaGVyZSBpcyBhIGNsZWFyIHJlbGF0aW9uc2hpcCBiZXR3ZWVuIGhlaWdodCBhbmQgRkVWLgpJbiBhZGRpdGlvbiwgd2Ugc2VlIHRoYXQgZm9yIHRoZSBsYXJnZSBoZWlnaHQgdmFsdWVzCig+MTc1Y20pLCB3ZSBtYWlubHkgZmluZCBtYWxlIHN1YmplY3RzLgo=