library(tidyverse)
Histologic grade in breast cancer provides clinically important prognostic information. Researchers examined whether histologic grade was associated with gene expression profiles of breast cancers and whether such profiles could be used to improve histologic grading. In this tutorial we will assess the association between histologic grade and the expression of the KPNA2 gene, which is known to be associated with poor breast cancer prognosis. The patients, however, differ not only in histologic grade but also in lymph node status: the lymph nodes were either not affected (0) or surgically removed (1).
kpna2 <- read.table("https://raw.githubusercontent.com/statOmics/SGA21/master/data/kpna2.txt",header=TRUE)
kpna2
Because histologic grade and lymph node status are both categorical variables, we model them both as factors.
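A minimal sketch of the factor conversion, shown on a small stand-in data frame so that it runs on its own. The values in `toy` are made up for illustration; only the column names `grade`, `node` and `gene` are taken from the plotting code in this tutorial. The same two conversion lines would apply to `kpna2`.

```r
# Toy stand-in for kpna2 (hypothetical values, same column names)
toy <- data.frame(
  grade = c(1, 1, 3, 3),
  node  = c(0, 1, 0, 1),
  gene  = c(200, 350, 900, 750)
)

# Convert the numeric codes to factors so that R treats them as categories
toy$grade <- as.factor(toy$grade)
toy$node  <- as.factor(toy$node)

str(toy)
```

Once both variables are factors, the expression `node:grade` in the plot creates one level per combination of lymph node status and grade.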
kpna2 %>%
  ggplot(aes(x = node:grade, y = gene, fill = node:grade)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter()
As discussed in a previous exercise, it seems that there is both an effect of histologic grade and lymph node status on the gene expression. There also seems to be a different effect of lymph node status on the gene expression for the different histologic grades.
As we saw before, we can capture this with a model that contains histologic grade, lymph node status, and the interaction between the two. When checking the assumptions of the linear model, we see that the variance is not equal across groups. We therefore model the log2-transformed gene expression, for which all assumptions of the linear model are satisfied.
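The model described above could be fitted along the following lines. This is a sketch on simulated data, not the tutorial's own fit: the data frame `toy`, its effect sizes and its standard deviation are hypothetical and chosen only so that the block runs on its own.

```r
set.seed(1)

# Hypothetical balanced 2x2 design with 6 patients per group
toy <- expand.grid(rep = 1:6, grade = factor(c(1, 3)), node = factor(c(0, 1)))

# Made-up group means on the log2 scale
mu <- 8 + 1.5 * (toy$grade == "3") + 0.8 * (toy$node == "1") -
  0.7 * (toy$grade == "3" & toy$node == "1")
toy$gene <- 2^rnorm(nrow(toy), mean = mu, sd = 0.3)

# Main effects plus interaction, on the log2 scale
fit <- lm(log2(gene) ~ grade * node, data = toy)
coef(fit)

# Diagnostic plots for the model assumptions could be drawn with:
# par(mfrow = c(2, 2)); plot(fit)
```

With R's default treatment coding, the coefficients are named `(Intercept)`, `grade3`, `node1` and `grade3:node1`.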
Check if the interaction term is significant:
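One way to carry out such a check is an F-test that compares the model with and without the interaction term. The sketch below uses base R's `anova()` on simulated data. This is an illustrative choice and may differ from the test used in the original analysis, and the data frame `toy` and its effect sizes are hypothetical.

```r
set.seed(2)

# Hypothetical balanced 2x2 design, response already on the log2 scale
toy <- expand.grid(rep = 1:6, grade = factor(c(1, 3)), node = factor(c(0, 1)))
toy$logGene <- rnorm(
  nrow(toy),
  mean = 8 + 1.5 * (toy$grade == "3") - 1 * (toy$grade == "3" & toy$node == "1"),
  sd = 0.3
)

fitAdd <- lm(logGene ~ grade + node, data = toy)  # main effects only
fitInt <- lm(logGene ~ grade * node, data = toy)  # main effects + interaction

# F-test for the single extra interaction parameter
interactionTest <- anova(fitAdd, fitInt)
interactionTest
```

The second row of the table holds the F-statistic and p-value for the interaction.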
As we are dealing with a factorial design, we can calculate the mean gene expression for each group by the following parameter summations.
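Assuming R's default treatment coding with grade 1 and node 0 as reference levels, and writing \(\beta_0\) for the intercept and \(\mu\) for the mean log2 expression of a group (both symbols are introduced here for illustration), the four group means would be:

\[\mu_{g1n0} = \beta_0\]
\[\mu_{g3n0} = \beta_0 + \beta_{g3}\]
\[\mu_{g1n1} = \beta_0 + \beta_{n1}\]
\[\mu_{g3n1} = \beta_0 + \beta_{g3} + \beta_{n1} + \beta_{g3n1}\]

Differences between these means give the log2 fold changes between groups.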
The researchers want to know the power for testing the following hypotheses (note that we will have to adjust for multiple testing):
\[H_0: \log_2{FC}_{g3n0-g1n0} = \beta_{g3} = 0\]
\[H_0: \log_2{FC}_{g3n1-g1n1} = \beta_{g3} + \beta_{g3n1} = 0\]
\[H_0: \log_2{FC}_{g1n1-g1n0} = \beta_{n1} = 0\]
\[H_0: \log_2{FC}_{g3n1-g3n0} = \beta_{n1} + \beta_{g3n1} = 0\]
\[H_0: \log_2{FC}_{g3n1-g1n1} - \log_2{FC}_{g3n0-g1n0} = \beta_{g3n1} = 0\]
which is equivalent to the hypothesis
\[H_0: \log_2{FC}_{g3n1-g3n0} - \log_2{FC}_{g1n1-g1n0} = \beta_{g3n1} = 0\]
We can test this using multcomp, which controls for multiple testing.
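The five hypotheses can be encoded as a contrast matrix with one column per hypothesis and one row per model coefficient. The sketch below builds such a matrix on simulated data and, if the multcomp package is installed, passes its transpose to `multcomp::glht()`. The data frame `toy` and the contrast names are hypothetical, apart from `nodeg3`, which this tutorial uses for the fourth contrast.

```r
set.seed(3)

# Hypothetical balanced 2x2 design, response already on the log2 scale
toy <- expand.grid(rep = 1:6, grade = factor(c(1, 3)), node = factor(c(0, 1)))
toy$logGene <- rnorm(
  nrow(toy),
  mean = 8 + 1.5 * (toy$grade == "3") + 0.8 * (toy$node == "1") -
    0.7 * (toy$grade == "3" & toy$node == "1"),
  sd = 0.3
)
fit <- lm(logGene ~ grade * node, data = toy)

# Rows follow coef(fit): (Intercept), grade3, node1, grade3:node1
contrasts <- cbind(
  graden0     = c(0, 1, 0, 0),  # grade effect for unaffected nodes
  graden1     = c(0, 1, 0, 1),  # grade effect for removed nodes
  nodeg1      = c(0, 0, 1, 0),  # node effect in grade 1
  nodeg3      = c(0, 0, 1, 1),  # node effect in grade 3
  interaction = c(0, 0, 0, 1)   # difference between the two grade effects
)
rownames(contrasts) <- names(coef(fit))

# Point estimates of the five contrasts
estimates <- drop(coef(fit) %*% contrasts)
estimates

# glht() expects one row per hypothesis, hence the transpose
if (requireNamespace("multcomp", quietly = TRUE)) {
  mc <- multcomp::glht(fit, linfct = t(contrasts))
  print(summary(mc))  # p-values adjusted for multiple testing
}
```

A matrix with this orientation (coefficients in rows, contrasts in columns) is also the shape that the `contrasts` argument of the simulation function in this tutorial expects.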
We get a significant p-value for the first, second, third and fifth hypothesis. The fourth hypothesis is not significant at the overall 5% significance level.
The function below simulates data similar to that of our experiment under our model assumptions and returns the estimated power for each contrast.
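simFastMultipleContrasts <- function(form, data, betas, sd, contrasts, alpha = .05, nSim = 10000, adjust = "bonferroni")
{
  # Simulate nSim response vectors: one column per simulated experiment
  ySim <- rnorm(nrow(data) * nSim, sd = sd)
  dim(ySim) <- c(nrow(data), nSim)
  design <- model.matrix(form, data)
  ySim <- ySim + c(design %*% betas)
  # lmFit expects one row per response vector
  ySim <- t(ySim)
  ### Fitting
  fitAll <- limma::lmFit(ySim, design)
  ### Inference
  varUnscaled <- t(contrasts) %*% fitAll$cov.coefficients %*% contrasts
  estimates <- fitAll$coefficients %*% contrasts
  # Standard errors: unscaled sd of each contrast times the residual sd of each simulation
  seContrasts <- matrix(diag(varUnscaled)^.5, nrow = nSim, ncol = ncol(contrasts), byrow = TRUE) * fitAll$sigma
  tstats <- estimates / seContrasts
  pvals <- pt(abs(tstats), fitAll$df.residual, lower.tail = FALSE) * 2
  # Adjust for multiple testing within each simulated experiment
  pvals <- t(apply(pvals, 1, p.adjust, method = adjust))
  # Power: fraction of simulations in which each contrast is significant
  return(colMeans(pvals < alpha))
}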
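To make the logic of the simulation function easier to follow, here is a slower base-R sketch of the same idea: simulate one experiment at a time, fit the model, test the contrasts with Bonferroni adjustment, and average the rejections. All numbers (`betas`, `sigma`, three observations per group, 200 simulations) and the two example contrasts in `L` are hypothetical and are not the values estimated in this tutorial.

```r
set.seed(4)

# Hypothetical design: 3 observations in each of the 4 groups
designData <- expand.grid(rep = 1:3, grade = factor(c(1, 3)), node = factor(c(0, 1)))
design <- model.matrix(~ grade * node, designData)

betas <- c(8, 1.5, 0.8, -0.7)  # made-up log2-scale coefficients
sigma <- 0.3                   # made-up residual standard deviation
L <- cbind(grade = c(0, 1, 0, 0), interaction = c(0, 0, 0, 1))
nSim <- 200

XtXinv <- solve(crossprod(design))                # unscaled covariance of the coefficients
seUnscaled <- sqrt(diag(t(L) %*% XtXinv %*% L))   # unscaled sd of each contrast

signif <- replicate(nSim, {
  y <- drop(design %*% betas) + rnorm(nrow(design), sd = sigma)
  fit <- lm.fit(design, y)
  s <- sqrt(sum(fit$residuals^2) / fit$df.residual)  # residual sd of this simulation
  est <- drop(crossprod(L, fit$coefficients))
  p <- 2 * pt(abs(est / (s * seUnscaled)), fit$df.residual, lower.tail = FALSE)
  p.adjust(p, method = "bonferroni") < 0.05
})

# Estimated power: proportion of simulations in which each contrast is rejected
powerSlow <- rowMeans(signif)
powerSlow
```

The function above reaches the same kind of estimate much faster, because `limma::lmFit` fits all simulated experiments in one call instead of looping.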
power1 <- simFastMultipleContrasts(form = ...,
data = ...,
betas = ...,
sd = ...,
contrasts = ...,
alpha = ...,
nSim = ...)
power1
We observe high power for all contrasts except nodeg3, which has a small effect size. Next, we estimate the power for 2 to 10 observations per group.
powers <- matrix(NA,nrow=9, ncol=6)
colnames(powers) <- c("n",colnames(contrasts))
powers[,1] <- 2:10
# Provide one design point (or observation) per group here. In the for loop we replicate these design points according to the number of observations.
dataAllComb <- data.frame(grade = ...,
node = ...)
for (i in 1:nrow(powers))
{
predData <- data.frame(grade = rep(dataAllComb$grade, powers[i,1]),
node = rep(dataAllComb$node, powers[i,1]))
powers[i,-1] <- simFastMultipleContrasts(form = ...,
data = predData,
betas = ...,
sd = ...,
contrasts = ...,
alpha = ...,
nSim = ...)
}
powers