In this exercise, we will see some basics in hypothesis testing and more specifically on t-tests. As an example, we will work with the captopril dataset that we already explored in the exercise on data exploration.
The goal is to answer these two research questions;
Is the average systolic baseline blood pressure of the patients is higher than the treshold for hypertension of 140 mmHg?
Is the average SBP before captopril treatment different from the average SBP after captopril treatment?
First, load the required R libraries:
Import the data
(captopril <- read.table("https://raw.githubusercontent.com/statOmics/PSLS21/data/captopril.txt",
header = TRUE,
sep = ","))
Data exploration
Before we start delving into the data in order to solve our research hypothesis, we have to explore our data. Our dataset looks like this;
We have 15 patients, for which the systolic blood pressures and diastolyic blood pressures were measured, before and after treatment with the captopril drug.
Note, that the dataset is not in the tidy format. We could tidy the data using the following code.
captopril %>%
gather(type,bp,-id) %>%
filter(type%in%c("SBPa","SBPb")) %>%
mutate(id = as.factor(id))
# we are only interested in the systolic blood pressure
We can visualize the dataframe using boxplots;
captopril %>%
gather(type,bp,-id) %>%
filter(type%in%c("SBPa","SBPb")) %>%
ggplot(aes(x=type, y=bp, fill=type)) +
scale_fill_brewer(palette="RdGy") +
theme_bw() +
geom_boxplot(outlier.shape=NA) +
geom_jitter(width = 0.2) +
ggtitle("Boxplot of blood pressure measures before and after treatment") +
ylab("blood pressure (mmHg)") +
stat_summary(fun=mean, geom="point", shape=5, size=3, color="black", fill="black")
Clearly, it seems that on average the measurements after treatment are lower than those before treatment.
Note that the boxplot is useful in light of the first question where we have to assess if the average systolic blood pressure of patients is above 140mm Hg, the threshold for hypertension.
However, to assess the question if there is an affect of administering captopril it is essential to account for the fact that the blood pressure before and after administering captopril was measured on the same patients.
captopril %>%
gather(type,bp,-id) %>%
filter(type%in%c("SBPa","SBPb")) %>%
ggplot(aes(x=type, y=bp, group=id)) +
geom_line() +
geom_point() +
ylab("Systolic blood pressure (mmHg)")
The key question is to know is if there is significant effect of the treatment? To answer this question, we will need to perform a hypothesis test.
Let’s start of with question 1.
Question 1
Is the average baseline systolic blood pressure (SBP) higher than the threshold for hypertension of 140 mmHg?
To test the effect of the captopril on subjects with hypertension (patients), we need to select patients with hypertension, hence they should have elevated SBP levels, that are on average higher than 140 mmHg. We can assess this hypothesis using a one sample t-test.
Assess the assumptions
Before we can perform a t-test, we must check that the required assumptions are met!
- The observations are independent
- The data (SBPb) must be normally distributed
The first assumption requires us to think about how the data were collected. Are there any underlying correlation structures (that we know of) in the data? For instance, if all the 15 subjects are members of the same family, we expect that the data will give us a good representation of the underlying population of interest, i.e., all past, present and future patients with elevated SBP levels.
Here, we have no reason to believe that this assumption was violated; we may assume 15 unrelated patients with elevated SBP levels were selected at random from the population.
We can assess the second assumption with a quantile-quantile plot.
captopril %>%
ggplot(aes(sample=SBPa)) +
geom_qq() +
geom_qq_line()
captopril %>%
ggplot(aes(sample=SBPb)) +
geom_qq() +
geom_qq_line()
# or, equivalently
captoprilTidy <- captopril %>% gather(type,bp,-id)
captoprilTidy %>%
filter(type%in%c("SBPa","SBPb")) %>%
ggplot(aes(sample=bp)) +
geom_qq() +
geom_qq_line() +
facet_wrap(~type)
We can see that all of the data lies nicely around the quantile-quantile line (black line). As such, we may conclude that our data is normally distributed.
Hypothesis test
Here, we will test if mean systolic blood pressure at baseline is significantly higher than 140 mmHg. More specifically, we will test the null hypothesis;
\(H_0:\) the mean SBPb is equal to 140 mmHg
versus the alternative hypothesis;
\(H_1:\) the mean SBPb is greater than 140 mmHg
output1 <- t.test(captopril$SBPb,
mu=140,
alternative = "greater",
conf.level = 0.95)
output1
##
## One Sample t-test
##
## data: captopril$SBPb
## t = 6.9556, df = 14, p-value = 3.352e-06
## alternative hypothesis: true mean is greater than 140
## 95 percent confidence interval:
## 167.581 Inf
## sample estimates:
## mean of x
## 176.9333
Conclusion
When writing a conclusion on your research hypothesis, it is very important to be precise, concise, and complete.
An example of such a conclusion for our research question is given below:
We can conclude that the mean systolic baseline blood pressure of patients in the captopril study is extremely significantly higher than the threshold for hypertension of 140 mmHg (p << 0.001). The mean SBPb equals 176.93 mmHg with a 95% confidence interval of [167.58, 167.58, +\(\infty\)]).
As we have seen in the theory class, the 95% confidence interval can be interpreted as;
With 95% confidence we can conclude that the true average of baseline systolic blood pressure of diseased patients in the population is above 167.58.
Note, that we only report a one sided confidence interval because we only test against the alternative hypothesis that the blood pressure is on average larger than the threshold for hypertension of 140 mmHg.
Question 2
Is the average SBP before captopril treatment different from the average SBP after captopril treatment?
As the data are paired, there will be a strong correlation between the BP values before and after treatment of each individual patient. We can show this with a scatterplot.
captopril %>%
ggplot(aes(x=SBPb,y=SBPa)) +
geom_point() +
ggtitle("correlation between SBPb and SBPa") +
ylab("SBPa (mmHg)") +
xlab("SBPb (mmHg)")
We clearly see that if a patient’s SBPb value is high, its SBPa value will be comparatively high as well.
We can immediately calculate the impact of the treatment for every patient by calculating the difference between the blood pressure after and before the treatment.
captopril <-
captopril %>% mutate(deltaSBP = SBPa-SBPb)
captopril%>%
ggplot(aes(x="Systolic blood pressure",y=deltaSBP)) +
geom_boxplot(outlier.shape=NA) +
geom_point(position="jitter")+
ylab("Difference (mm mercury)") +
xlab("")
Check the assumptions
The paired t-test has 2 assumptions:
The blood pressure differences are independent of each other.
The blood pressure differences are normally distributed.
The first assumption is met given the experimental design.
Secondly, we assess if the data are normally distributed.
captopril %>%
ggplot(aes(sample=deltaSBP)) +
geom_qq() +
geom_qq_line()
We can see that all of the data lies nicely around the quantile-quantile line. As such, we may assume that our data are normally distributed. As our assumptions are met we may continue with performing the unpaired t-test.
Hypothesis test
As such, we will perform a paired
t-test.
- The null hypothesis of the test is that the blood pressure before and after the treatment with captopril is on average equal.
- Which will be tested against the alternative that the blood pressure before and after the treatment with captopril is on average different.
output2 <- t.test(captopril$SBPb, captopril$SBPa, paired = TRUE)
output2
##
## Paired t-test
##
## data: captopril$SBPb and captopril$SBPa
## t = 8.1228, df = 14, p-value = 1.146e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 13.93409 23.93258
## sample estimates:
## mean of the differences
## 18.93333
Conclusion
There is on average an extremely significant blood pressure drop upon administering captopril to patients with hypertension (p << 0.001). The systolic blood pressure decreases on average with 18.9 mmHg upon the treatment with captopril (95% CI [13.9, 23.9]).
Alternative solution: One-sample t-test on the difference
Performing a paired t-test is analogous to performing a one-sample t-test on the difference between both groups.
This can be easily seen from the output of the paired two-sample t-test. The alternative hypothesis \(HA\) there states that the “true difference in means is not equal to 0”. So internally, R will actually perform a one-sample t-test on the difference, and check whether or not the true mean difference is equal to 0. We can also set this up manually.
t.test(captopril$deltaSBP, mu=0)
##
## One Sample t-test
##
## data: captopril$deltaSBP
## t = -8.1228, df = 14, p-value = 1.146e-06
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -23.93258 -13.93409
## sample estimates:
## mean of x
## -18.93333
Indeed, the output is equivalent to that of the paired t-test.
---
title: "Exercise 5.1: Hypothesis testing on the captopril dataset - solution"   
author: "Lieven Clement and Jeroen Gilis"
date: "statOmics, Ghent University (https://statomics.github.io)"  
output:
    html_document:
      code_download: true    
      theme: cosmo
      toc: true
      toc_float: true
      highlight: tango
      number_sections: true
---

In this exercise, we will see some basics in hypothesis testing and more 
specifically on t-tests. As an example, we will work with the
captopril dataset that we already explored in the exercise on data exploration.

The goal is to answer these two research questions;

1. Is the average systolic baseline blood pressure of the patients is higher than the treshold for hypertension of 140 mmHg? 

2. Is the average SBP before captopril treatment 
different from the average SBP after
captopril treatment?

First, load the required R libraries:

```{r, message=FALSE, warning=FALSE}
library(tidyverse)
```

# Import the data

```{r}
(captopril <- read.table("https://raw.githubusercontent.com/statOmics/PSLS21/data/captopril.txt", 
                         header = TRUE, 
                         sep = ","))
```

# Data exploration

Before we start delving into the data in order to solve 
our research hypothesis, we have to explore our data. Our dataset looks like this;

```{r}
head(captopril)
```

We have 15 patients, for which the systolic 
blood pressures and diastolyic blood pressures were measured, before and after
treatment with the captopril drug.

Note, that the dataset is not in the tidy format. We could tidy the data using
the following code.

```{r}
captopril %>% 
    gather(type,bp,-id) %>%
    filter(type%in%c("SBPa","SBPb")) %>%
    mutate(id = as.factor(id))
    # we are only interested in the systolic blood pressure
```

We can visualize the dataframe using boxplots;

```{r}
captopril %>% 
  gather(type,bp,-id) %>%
  filter(type%in%c("SBPa","SBPb")) %>%
  ggplot(aes(x=type, y=bp, fill=type)) + 
      scale_fill_brewer(palette="RdGy") +
      theme_bw() +
      geom_boxplot(outlier.shape=NA) + 
      geom_jitter(width = 0.2) +
      ggtitle("Boxplot of blood pressure measures before and after treatment") +
      ylab("blood pressure (mmHg)") +
      stat_summary(fun=mean, geom="point", shape=5, size=3, color="black", fill="black")
```

Clearly, it seems that on average the measurements
after treatment are lower than those before treatment.

Note that the boxplot is useful in light of the first question where we have to assess if the average systolic blood pressure of patients is above 140mm Hg, the threshold for hypertension. 

However, to assess the question if there is an affect of administering captopril it is essential to account for the fact that the blood pressure before and after administering captopril was measured on the same patients. 

```{r}
captopril %>% 
  gather(type,bp,-id) %>%
  filter(type%in%c("SBPa","SBPb")) %>%
  ggplot(aes(x=type, y=bp, group=id)) +
  geom_line() +
  geom_point() + 
  ylab("Systolic blood pressure (mmHg)")
```

The key question is to know is if there is **significant** effect of the treatment? To answer this
question, we will need to perform a hypothesis test.

Let's start of with question 1.

# Question 1

Is the average baseline systolic blood pressure (SBP) higher than the threshold for hypertension of 140 mmHg?

To test the effect of the captopril on subjects with hypertension
(patients), we need to select patients with hypertension, hence they should have 
elevated SBP levels, that are on average higher than 140 mmHg. 
We can assess this hypothesis using a one sample t-test.

## Assess the assumptions

Before we can perform a t-test, we must check that the required
assumptions are met!

1. The observations are independent
2. The data (SBPb) must be normally distributed

The first assumption requires us to think about how the data were collected. 
Are there any underlying correlation structures (that we know of)
in the data? For instance, if all the 15 subjects are members of
the same family, we expect that the data will give us a good 
representation of the underlying population of interest, i.e., 
all past, present and future patients with elevated SBP levels.

Here, we have no reason to believe that this 
assumption was violated; we may assume 15 unrelated patients with elevated SBP levels were selected at random from the population.

We can assess the second assumption with a quantile-quantile plot.

```{r}
captopril %>%
  ggplot(aes(sample=SBPa)) +
  geom_qq() +
  geom_qq_line()

captopril %>%
  ggplot(aes(sample=SBPb)) +
  geom_qq() +
  geom_qq_line()

# or, equivalently
captoprilTidy <- captopril %>% gather(type,bp,-id) 
captoprilTidy %>%
  filter(type%in%c("SBPa","SBPb")) %>%
  ggplot(aes(sample=bp)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~type)
```

We can see that all of the data lies nicely around the quantile-quantile
line (black line). As such, we may conclude that our data is normally 
distributed.

## Hypothesis test

Here, we will test if mean systolic blood pressure at baseline is significantly higher than 140 mmHg. More 
specifically, we will test the null hypothesis;

$H_0:$ the mean SBPb is equal to 140 mmHg

versus the alternative hypothesis;

$H_1:$ the mean SBPb is greater than 140 mmHg

```{r}
output1 <- t.test(captopril$SBPb, 
                  mu=140,
                  alternative = "greater",
                  conf.level = 0.95)
output1
```

## Conclusion

When writing a conclusion on your research hypothesis,
it is very important to be precise, concise, and complete.

An example of such a conclusion for our research question
is given below:

We can conclude that the mean systolic baseline blood pressure of patients in the captopril study is extremely significantly higher  than the threshold for hypertension of 140 mmHg (p << 0.001). 
The mean SBPb equals `r round(unname(output1$estimate),2)` mmHg with 
a 95% confidence interval of [`r round(output1$conf.int[c(1,1)],2)`, +$\infty$]).

As we have seen in the theory class, the 95% confidence 
interval can be interpreted as;

With 95% confidence we can conclude that the true average of baseline systolic blood pressure of diseased patients in the population is above `r round(output1$conf.int[1],2)`.

Note, that we only report a one sided confidence interval because we only test against the alternative hypothesis that the blood pressure is on average larger than the threshold for hypertension of 140 mmHg. 

# Question 2

Is the average SBP before captopril treatment different from the average 
SBP after captopril treatment?

As the data are paired, there will be a strong correlation between the BP values 
before and after treatment of each individual patient. We can show this
with a scatterplot.

```{r}
captopril %>% 
  ggplot(aes(x=SBPb,y=SBPa)) +
    geom_point() +
    ggtitle("correlation between SBPb and SBPa") +
    ylab("SBPa (mmHg)") +
    xlab("SBPb (mmHg)")
```

We clearly see that if a patient's SBPb value is high, its
SBPa value will be comparatively high as well.

We can immediately calculate the impact of the treatment for every patient by calculating the difference between the blood pressure after and before the treatment. 

```{r}
captopril <- 
  captopril %>% mutate(deltaSBP = SBPa-SBPb)

captopril%>%
  ggplot(aes(x="Systolic blood pressure",y=deltaSBP)) +
  geom_boxplot(outlier.shape=NA) + 
  geom_point(position="jitter")+
  ylab("Difference (mm mercury)") +
  xlab("") 
```

## Check the assumptions

The paired t-test has 2 assumptions:

1. The blood pressure differences are independent of each other.

2. The blood pressure differences are normally distributed. 

The first assumption is met given the experimental design.

Secondly, we assess if the data are normally distributed.

```{r}
captopril %>%
  ggplot(aes(sample=deltaSBP)) +
  geom_qq() +
  geom_qq_line()
```

We can see that all of the data lies nicely around the quantile-quantile
line. As such, we may assume that our data are normally distributed.
As our assumptions are met we may continue with
performing the unpaired t-test.

## Hypothesis test

As such, we will perform a `paired` t-test.

- The null hypothesis of the test is that the blood pressure before and after 
the treatment with captopril is on average equal. 
- Which will be tested against the alternative that the blood pressure before
and after the treatment with captopril is on average different. 

```{r}
output2 <- t.test(captopril$SBPb, captopril$SBPa, paired = TRUE)
output2
```


## Conclusion

There is on average an extremely significant blood pressure drop upon administering captopril to patients with hypertension (p << 0.001). The systolic blood pressure decreases on average with `r round(unname(output2$estimate),1)` mmHg upon the treatment with captopril (95% CI [`r round(output2$conf.int[c(1,2)],1)`]).

# Alternative solution: One-sample t-test on the difference

Performing a  paired t-test is
analogous to performing a one-sample t-test on the difference
between both groups.

This can be easily seen from the output of the paired two-sample
t-test. The alternative hypothesis $HA$ there states that 
the "true difference in means is not equal to 0". So internally,
R will actually perform a one-sample t-test on the difference, and
check whether or not the true mean difference is equal to 0.
We can also set this up manually.

```{r}
t.test(captopril$deltaSBP, mu=0)
```

Indeed, the output is equivalent to that of the  paired t-test.







