```r
library(tidyverse)
```

```
## ── Attaching packages ───────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.6 ✔ dplyr 1.0.8
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
```
Two-pager on replication in Nature Methods: [PDF]

Consider a single-cell RNA-seq experiment where researchers want to assess the effect of a drug treatment on gene expression.
Potential research questions
Table 1 (Nature Methods 11(9), September 2014, 879–880)

| | Replicate type | Replicate category\(^\text{a}\) |
|---|---|---|
| Animal study subjects | Colonies | B |
| | Strains | B |
| | Cohoused groups | B |
| | Gender | B |
| | Individuals | B |
| | Organs from sacrificed animals | B |
| Sample preparation | Methods for dissociating cells from tissue | T |
| | Dissociation runs from given tissue sample | T |
| | Individual cells | B |
| | RNA-seq library construction | T |
| | Runs from the library of a given cell | T |
| Sequencing | Reads from different transcript molecules | V\(^\text{b}\) |
| | Reads with unique molecular identifier (UMI) from a given transcript molecule | T |

\(^\text{a}\) B: biological replicate; T: technical replicate.
Take-home messages from Figure 1 (Nature Methods 11(9), September 2014, 879–880):

\(\rightarrow\) commit resources to sampling biologically relevant variables;

\(\rightarrow\) increasing the number of technical measurements is only valuable when measures of technical variability are themselves of interest.
Good experimental design practice includes planning for replication.
Reading materials: Nature Methods (2013), 10(12), 1139–1140
Minimize the residual sum of squares \[\begin{eqnarray*} RSS(\boldsymbol{\beta})&=&\sum\limits_{i=1}^n e^2_i\\ &=&\sum\limits_{i=1}^n \left(y_i-\beta_0-\sum\limits_{j=1}^p x_{ij}\beta_j\right)^2 \end{eqnarray*}\]
or in matrix notation
\[ \text{RSS}(\boldsymbol{\beta})=(\mathbf{Y}-\mathbf{X\beta})^T(\mathbf{Y}-\mathbf{X\beta}) \]
\[\rightarrow \hat{\boldsymbol{\beta}}=\text{argmin}_\beta \text{ RSS}(\boldsymbol{\beta})\]
\[ \begin{array}{rcl} \frac{\partial \text{RSS}}{\partial \boldsymbol{\beta}}&=&\mathbf{0}\\\\ \frac{\partial (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}&=&\mathbf{0}\\\\ -2\mathbf{X}^T(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})&=&\mathbf{0}\\\\ \mathbf{X}^T\mathbf{X}\boldsymbol{\beta}&=&\mathbf{X}^T\mathbf{Y}\\\\ \hat{\boldsymbol{\beta}}&=&(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \end{array} \]
\[ \begin{array}{rcl} \hat{\boldsymbol{\Sigma}}_{\hat{\boldsymbol{\beta}}} &=&\text{var}\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\right]\\\\ &=&(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\,\text{var}\left[\mathbf{Y}\right]\,\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\\\\ &=&(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{I}\sigma^2)\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\\\ &=&(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\\\\ &=&(\mathbf{X}^T\mathbf{X})^{-1}\sigma^2, \end{array} \]
where we used that \(\text{var}[\mathbf{Y}]=\mathbf{I}\sigma^2\) because the errors are assumed to be independent with constant variance.
The uncertainty on the model parameters thus depends on the residual variability and the design!
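The estimator and its covariance matrix are easy to verify numerically. A minimal sketch on simulated data (all objects below are illustrative), comparing the hand-computed \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\Sigma}}_{\hat{\boldsymbol{\beta}}}\) with the output of `lm()`:

```r
set.seed(42)
n <- 20
x <- runif(n)
X <- cbind(1, x)                              # design matrix with intercept
y <- 1 + 2 * x + rnorm(n, sd = 0.5)           # simulated response

betaHat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X^T X)^{-1} X^T Y
sigma2 <- sum((y - X %*% betaHat)^2) / (n - ncol(X)) # residual variance estimate
SigmaBetaHat <- solve(t(X) %*% X) * sigma2    # (X^T X)^{-1} sigma^2

fit <- lm(y ~ x)
all.equal(c(betaHat), unname(coef(fit)))
all.equal(unname(SigmaBetaHat), unname(vcov(fit)))
```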
The effect size of interest is typically a linear combination of the model parameters, i.e. \[ l_0 \times \beta_0 + l_1 \times \beta_1 + ... + l_{p-1} \times \beta_{p-1} = \mathbf{L}^T\boldsymbol{\beta}. \]
The null hypothesis of our test statistic is then formulated as
\[ H_0: \mathbf{L}^T\boldsymbol{\beta} = 0 \]
vs the alternative hypothesis
\[ H_1: \mathbf{L}^T\boldsymbol{\beta} \neq 0, \] which can be assessed using a t-test statistic:
\[ t=\frac{\mathbf{L}^T\hat{\boldsymbol{\beta}} - 0}{\text{se}_{\mathbf{L}^T\hat{\boldsymbol{\beta}}}} \] which follows a t-distribution with n-p degrees of freedom under the null hypothesis when all assumptions for the linear model are valid.
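To make the contrast machinery concrete, here is a sketch on mock data (the contrast \(\mathbf{L} = (0, 1)^T\) picks out \(\beta_1\)); the hand-computed statistic matches `summary(fit)`:

```r
set.seed(7)
x <- rep(0:1, each = 6)                        # two groups of 6 observations
y <- 10 + 1.5 * x + rnorm(12, sd = 2)          # mock data
fit <- lm(y ~ x)

L <- c(0, 1)                                   # L^T beta = beta_1
est <- sum(L * coef(fit))                      # L^T beta-hat
se <- sqrt(t(L) %*% vcov(fit) %*% L)           # standard error of the contrast
tstat <- est / se
pval <- 2 * pt(abs(tstat), df = fit$df.residual, lower.tail = FALSE)
c(t = tstat, p = pval)                         # matches summary(fit)$coef[2, 3:4]
```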
So the power
\[P(p < 0.05 \mid H_1)\]
will typically depend on the effect size, the variability of the data, the sample size, and the significance level \(\alpha\).
Similar to the introductory example, we can use simulations to assess the power.
In 2021, Choa et al. published that the cytokine thymic stromal lymphopoietin (TSLP) induced fat loss through sebum secretion. [html] [PDF]
Suppose that you would like to set up a similar study to test if the cytokine interleukin 25 (IL) also has a beneficial effect.
You plan to set up a study with a control group of high fat diet (HFD) fed mice and a treatment group that receives the HFD and IL.
What sample size do you need to pick up the effect of the treatment?
\(H_0\): The average weight difference is equal to zero
\(H_1\): The average weight difference is different from zero
Two sample t-test or a t-test on the slope of a linear model with one dummy variable.
\[ Y_i = \beta_0 + \beta_1 X_\text{iIL} + \epsilon_i\]
\[ \text{with } X_\text{iIL}=\begin{cases}0 & \text{HFD}\\1 & \text{HFD + IL}\end{cases}. \]
Estimated effect size \[\hat\delta = \bar X_{IL} - \bar X_{c} = \hat \beta_1\]
Test statistic \[ T = \frac{\bar{X}_{IL}-\bar{X}_{c}}{SD_\text{pooled} \times \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{\hat\beta_1}{\text{SE}_{\hat\beta_1}} \]
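This equivalence is easy to verify on mock data: a pooled-variance two-sample t-test and the t-test on the slope of the dummy-variable regression give the same statistic (a sketch; group labels and numbers below are illustrative):

```r
set.seed(11)
weight <- c(rnorm(8, 26, 4), rnorm(8, 29, 4))   # mock weights: HFD, then HFD + IL
group <- rep(c("HFD", "HFD_IL"), each = 8)

t.test(weight ~ group, var.equal = TRUE)$statistic
summary(lm(weight ~ group))$coef[2, "t value"]  # same magnitude; sign flips with the coding
```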
\(\hat \beta_1\) is an unbiased estimator of the real weight difference (\(\delta\)) that would occur in the population of mice fed with HFD and mice fed with HFD + IL.
\[ P[p < \alpha \vert \beta_1 \neq 0] \]
We can estimate the power if the assumptions of the model hold, and we know the group sizes \(n_1\) and \(n_2\), the standard deviation \(\sigma\) (e.g. from a pilot experiment), the effect size \(\delta\), and the significance level \(\alpha\).
Suppose that we have access to the data of a preliminary experiment (e.g. provided by Karen Svenson via Gary Churchill and Dan Gatti and partially funded by P50 GM070683 on PH525x)
```r
mice <- read.csv("https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv")
```
```r
mice %>%
  ggplot(aes(x = Diet, y = Bodyweight)) +
  geom_boxplot(outlier.shape = NA) + # suppress outliers: they are plotted by geom_jitter()
  geom_jitter()
```
```r
mice %>%
  ggplot(aes(sample = Bodyweight)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~Diet)
```
```r
mice <- mice %>% mutate(Diet = as.factor(Diet))
miceSum <- mice %>%
  group_by(Diet) %>%
  summarize(
    mean = mean(Bodyweight, na.rm = TRUE),
    sd = sd(Bodyweight, na.rm = TRUE),
    n = n()
  ) %>%
  mutate(se = sd / sqrt(n))
miceSum
```
In the experiment we have data from two diets: regular chow (`c`) and high fat (`hf`).
We can use the hf mice as input for our power analysis.
Effect size?
The alternative hypothesis is complex.
It includes all possible effects!
In order to do the power analysis we will have to choose a minimum effect size that we would like to detect.
Suppose that we would like to detect a weight difference \(\delta\) of at least 10% of the average weight of the mice on the high fat diet.
```r
delta <- -round(miceSum$mean[2] * .1, 1)
delta
```

```
## [1] -2.7
```
Note that the average weight of the treated mice would then get close to that of the mice in our pilot experiment that were fed the chow diet.
We can set up a simulation study to assess the power of an experiment with 8 mice in each group:
```r
set.seed(1423)
n1 <- n2 <- 8                       # mice per group
b0 <- round(miceSum$mean[2], 1)     # intercept: mean weight on the hf diet
b1 <- -delta                        # treatment effect we would like to detect
sd <- round(miceSum$sd[2], 1)       # standard deviation from the pilot data
alpha <- 0.05

x <- rep(0:1, c(n1, n2))            # dummy variable for the treatment
y <- b0 + b1 * x + rnorm(n1 + n2, 0, sd = sd)
fit <- lm(y ~ x)
bhat <- fit$coef
stat <- summary(fit)
summary(fit)$coef[2, ]
```

```
##  Estimate Std. Error    t value   Pr(>|t|) 
## 2.5972480  1.7919658  1.4493848  0.1692595
```
For this simulated experiment we would not have been able to pick up the effect of the treatment at the significance level \(\alpha\)=0.05 (p = 0.17)!
```r
library(multcomp)
```

```
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
## 
##     geyser
```
```r
n1 <- n2 <- 8

b0 <- round(miceSum$mean[2], 1)
b1 <- -delta

sd <- round(miceSum$sd[2], 1)
predictorData <- data.frame(Diet = rep(c("c", "hf"), c(n1, n2)) %>% as.factor())

alpha <- 0.05

simLm <- function(form, data, betas, sd, contrasts, simIndex = NA) {
  dataSim <- data
  X <- model.matrix(form, dataSim)
  dataSim$ySim <- X %*% betas + rnorm(nrow(dataSim), 0, sd) # simulate the response
  form <- formula(paste("ySim ~", form[[2]]))
  fitSim <- lm(form, dataSim)
  mcp <- glht(fitSim, linfct = contrasts)                   # test the contrast(s)
  return(summary(mcp)$test[c("coefficients", "sigma", "tstat", "pvalues")] %>% unlist())
}

simLm(
  form = ~Diet,
  data = predictorData,
  betas = c(b0, b1),
  sd = sd,
  contrasts = "Diethf = 0"
)
```

```
## coefficients.Diethf        sigma.Diethf        tstat.Diethf 
##           1.9444623           2.0931302           0.9289734 
##             pvalues 
##           0.3686439
```
We now have a generic function that can simulate data from a normal distribution for every design. The function has arguments:

- `form`: one-sided formula including the structure of the predictors in the model
- `data`: data frame with the predictor values for the design
- `betas`: a vector with values for all mean model parameters
- `sd`: the standard deviation of the error term
- `contrasts`: a scalar or a vector with the null hypotheses that we would like to assess
- `simIndex`: an arbitrary argument that is not used by the function itself, but that allows it to be used in an `sapply` loop running from 1 up to the number of simulations

Simulate nSim = 1000 repeated experiments:
```r
set.seed(1425)
nSim <- 1000

simResults <- t(sapply(1:nSim, simLm, form = ~Diet, data = predictorData, betas = c(b0, b1), sd = sd, contrasts = "Diethf = 0"))
power <- mean(simResults[, "pvalues"] < alpha)
power
```

```
## [1] 0.221
```
We have a power of 22.1% to pick up the treatment effect when using 8 bioreps in each treatment group.
We can now calculate the power for multiple sample sizes.
```r
power <- data.frame(n = c(3, 5, 10, 25, 50, 75, 100), power = NA)

for (i in 1:nrow(power)) {
  n1 <- n2 <- power$n[i]
  predictorData <- data.frame(Diet = rep(c("c", "hf"), c(n1, n2)) %>% as.factor())
  simResults <- t(sapply(1:nSim, simLm, form = ~Diet, data = predictorData, betas = c(b0, b1), sd = sd, contrasts = "Diethf = 0"))
  power$power[i] <- mean(simResults[, "pvalues"] < alpha)
}
power
```
```r
power %>%
  ggplot(aes(x = n, y = power)) +
  geom_line()
```
Note that

- the sign of the delta is arbitrary because we test two-sided,
- the intercept is arbitrary because we only assess \(\beta_1\).

We therefore typically set \(\beta_0 = 0\).
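A quick sketch with shared errors illustrates why: adding a constant to all observations changes only the intercept, so the inference on \(\beta_1\) is untouched (the values below are illustrative):

```r
set.seed(13)
x <- rep(0:1, each = 8)
eps <- rnorm(16, sd = 4.1)
y0 <- 0 + 2.7 * x + eps      # beta_0 = 0
y1 <- 26.8 + 2.7 * x + eps   # beta_0 = 26.8, identical errors

summary(lm(y0 ~ x))$coef[2, ] # slope estimate, se, t and p ...
summary(lm(y1 ~ x))$coef[2, ] # ... are identical in both fits
```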
```r
nSim <- 1000
b0 <- 0
deltas <- c(1, 2, 3, 5, 10)
sd <- round(miceSum$sd[2], 1)
ns <- c(3, 5, 10, 20, 25, 50, 75, 100)

power <- data.frame(
  b1 = rep(deltas, each = length(ns)),
  n = rep(ns, length(deltas)),
  power = NA
)
for (i in 1:nrow(power)) {
  b1 <- power$b1[i]
  n1 <- n2 <- power$n[i]
  predictorData <- data.frame(Diet = rep(c("c", "hf"), c(n1, n2)) %>% as.factor())
  simResults <- t(sapply(1:nSim, simLm, form = ~Diet, data = predictorData, betas = c(b0, b1), sd = sd, contrasts = "Diethf = 0"))
  power$power[i] <- mean(simResults[, "pvalues"] < alpha)
}

power %>%
  ggplot(aes(x = n, y = power, col = b1 %>% as.factor())) +
  geom_line()
```
Note that the power curves are still a bit choppy because we evaluated only a limited number of sample sizes and because nSim = 1000 simulations is not enough to get a precise power estimate, especially when the power is low.
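The Monte Carlo error of a simulation-based power estimate is binomial, so we can quantify how choppy we expect the curves to be. A sketch for an estimated power of 22% with nSim = 1000:

```r
# standard error of a power estimate based on 1000 simulated experiments
powerHat <- 0.22
sqrt(powerHat * (1 - powerHat) / 1000) # about 0.013, i.e. +/- 2.6% at two SEs
```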
The code can easily be extended towards other designs by altering the model formula `form`, the predictor data, the parameter vector `betas`, and the contrasts of interest.
The simulations start to take long when you evaluate many scenarios. The vectorized implementation below, which fits all simulated experiments at once with `limma::lmFit`, runs much faster: we now simulate 20 times more experiments in a much shorter time!
```r
simFast <- function(form, data, betas, sd, contrasts, alpha = .05, nSim = 10000) {
  ySim <- rnorm(nrow(data) * nSim, sd = sd)  # errors for all simulations at once
  dim(ySim) <- c(nrow(data), nSim)
  design <- model.matrix(form, data)
  ySim <- ySim + c(design %*% betas)         # add the mean model
  ySim <- t(ySim)                            # one simulated experiment per row

  ### Fitting
  fitAll <- limma::lmFit(ySim, design)       # fit all simulated experiments in one sweep

  ### Inference
  varUnscaled <- c(t(contrasts) %*% fitAll$cov.coefficients %*% contrasts)
  contrasts <- fitAll$coefficients %*% contrasts
  seContrasts <- varUnscaled^.5 * fitAll$sigma
  tstats <- contrasts / seContrasts
  pvals <- pt(abs(tstats), fitAll$df.residual, lower.tail = FALSE) * 2
  return(mean(pvals < alpha))
}
```
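As a sanity check, one scenario with the pilot-based values from above should reproduce the power of roughly 22% for n = 8 mice per group (a sketch; it assumes `simFast` and limma are available, and `contrastChk`/`predChk` are illustrative names):

```r
contrastChk <- limma::makeContrasts("Diethf", levels = c("(Intercept)", "Diethf"))
predChk <- data.frame(Diet = as.factor(rep(c("c", "hf"), each = 8)))
simFast(~Diet, predChk, betas = c(0, 2.7), sd = 4.1, contrasts = contrastChk, nSim = 10000)
```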
```r
nSim <- 20000
b0 <- 0
sd <- round(miceSum$sd[2], 1)
ns <- c(3, 5, 10, 20, 25, 50, 75, 100)
deltas <- c(1, 2, 3, 5, 10)

contrast <- limma::makeContrasts("Diethf", levels = c("(Intercept)", "Diethf"))
```

```
## Warning in limma::makeContrasts("Diethf", levels = c("(Intercept)",
## "Diethf")): Renaming (Intercept) to Intercept
```
```r
powerFast <- matrix(NA, nrow = length(ns) * length(deltas), ncol = 3) %>% as.data.frame()
names(powerFast) <- c("b1", "n", "power")
form <- ~Diet

i <- 0

for (n in ns) {
  n1 <- n2 <- n

  ### Simulation
  predictorData <- data.frame(Diet = rep(c("c", "hf"), c(n1, n2)) %>% as.factor())

  for (b1 in deltas) {
    i <- i + 1
    betas <- c(b0, b1)
    powerFast[i, ] <- c(b1, n, simFast(form, predictorData, betas, sd, contrasts = contrast, alpha = alpha, nSim = nSim))
  }
}
```
```r
powerFast %>%
  ggplot(aes(x = n, y = power, col = b1 %>% as.factor())) +
  geom_line()
```
For the two-sample t-test a closed-form power calculation exists. In this context Cohen's effect size is typically used:
\(D = \frac{\delta}{SD}\), here \(D = 2.7 / 4.1 \approx 0.66\).
```r
power.t.test(n = 8, delta = delta, sd = sd, type = "two.sample")
```

```
## 
##      Two-sample t test power calculation 
## 
##               n = 8
##           delta = 2.7
##              sd = 4.1
##       sig.level = 0.05
##           power = 0.2320512
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
```
Note that this is very similar to the power that we calculated using the simulations!
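`power.t.test` can also be run in reverse to answer the design question we started from: fix the desired power and solve for the per-group sample size (a sketch, using the same delta and sd):

```r
# solve for n: how many mice per group do we need for 80% power at alpha = 0.05?
power.t.test(power = 0.8, delta = delta, sd = sd, type = "two.sample")
```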
```r
b0 <- 0
sd <- round(miceSum$sd[2], 1)
ns <- c(3, 5, 10, 20, 25, 50, 75, 100)
deltas <- c(1, 2, 3, 5, 10)

powerTheo <- data.frame(
  deltas = rep(deltas, each = length(ns)),
  n = rep(ns, length(deltas)),
  power = NA
)
powerTheo$power <- apply(powerTheo[, 1:2], 1, function(x) power.t.test(delta = x[1], n = x[2], sd = sd, type = "two.sample")$power)
```
```r
powerTheo %>%
  ggplot(aes(x = n, y = power, col = deltas %>% as.factor())) +
  geom_line()
```
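Finally, a sketch overlaying the fast-simulation estimates (points) on the closed-form curves shows that the two approaches agree (it assumes the `powerFast` and `powerTheo` data frames from above):

```r
powerTheo %>%
  ggplot(aes(x = n, y = power, col = as.factor(deltas))) +
  geom_line() +
  geom_point(data = powerFast, aes(x = n, y = power, col = as.factor(b1))) +
  labs(col = "delta")
```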