This is part of the online course Proteomics Data Analysis (PDA)
Note that the R code is included for learners who aim to develop R markdown scripts to automate their quantitative proteomics data analyses. Depending on the target audience of the course, we either work with a graphical user interface (GUI) in an R/shiny app, msqrob2gui (e.g. the Proteomics Bioinformatics course of the EBI and the Proteomics Data Analysis course at the Gulbenkian Institute), or with R markdown scripts (e.g. the Bioinformatics Summer School at UCLouvain or the Statistical Genomics course at Ghent University).
library(tidyverse)
library(limma)
library(QFeatures)
library(msqrob2)
library(plotly)
library(ggplot2)
<- "https://raw.githubusercontent.com/statOmics/PDA/data/quantification/francisella/peptides.txt" peptidesFile
<- grep("Intensity\\.", names(read.delim(peptidesFile))) ecols
<- readQFeatures(
pe table = peptidesFile,
fnames = 1,
ecol = ecols,
name = "peptideRaw", sep="\t")
colData(pe)$genotype <- pe[[1]] %>%
  colnames %>%
  substr(12, 13) %>%
  as.factor %>%
  relevel("WT")

pe %>% colData
## DataFrame with 6 rows and 1 column
## genotype
## <factor>
## Intensity.1WT_20_2h_n3_3 WT
## Intensity.1WT_20_2h_n4_3 WT
## Intensity.1WT_20_2h_n5_3 WT
## Intensity.3D8_20_2h_n3_3 D8
## Intensity.3D8_20_2h_n4_3 D8
## Intensity.3D8_20_2h_n5_3 D8
rowData(pe[["peptideRaw"]])$nNonZero <- rowSums(assay(pe[["peptideRaw"]]) > 0)
Peptides with zero intensities are missing and should be represented with an NA value rather than 0.
pe <- zeroIsNA(pe, "peptideRaw") # convert 0 to NA
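Encoding the zeros as NA matters because a zero intensity would otherwise become -Inf after log-transformation. A minimal illustration on a hypothetical toy vector (not part of the dataset):
x <- c(1024, 0, 2048)   # hypothetical peptide intensities, 0 = not detected
log2(x)                 # 10 -Inf 11: -Inf would distort all downstream summaries
x[x == 0] <- NA         # what zeroIsNA() does for the whole assay
log2(x)                 # 10 NA 11: the missing value can be handled explicitly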
pe <- logTransform(pe, base = 2, i = "peptideRaw", name = "peptideLog")

pe <- filterFeatures(pe, ~ Proteins %in% smallestUniqueGroups(rowData(pe[["peptideLog"]])$Proteins))
pe <- filterFeatures(pe, ~ Reverse != "+")
pe <- filterFeatures(pe, ~ Contaminant != "+")
pe <- filterFeatures(pe, ~ nNonZero >= 2)

nrow(pe[["peptideLog"]])
## [1] 6525
We keep 6525 peptides after filtering.
pe <- normalize(pe,
  i = "peptideLog",
  name = "peptideNorm",
  method = "center.median")
pe <- aggregateFeatures(pe,
  i = "peptideNorm",
  fcol = "Proteins",
  na.rm = TRUE,
  name = "protein")
## Your quantitative and row data contain missing values. Please read the
## relevant section(s) in the aggregateFeatures manual page regarding the
## effects of missing values on data aggregation.
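As a quick check on the aggregation (a sketch, not part of the original workflow), we can count how many peptides map to each protein group; proteins summarised from a single peptide rely on one noisy measurement only:
# Number of peptides per protein group in the normalised peptide-level data
nPepPerProt <- table(rowData(pe[["peptideNorm"]])$Proteins)
summary(as.vector(nPepPerProt))
sum(nPepPerProt == 1)   # proteins quantified by a single peptide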
Plot of preprocessed data
"peptideNorm"]] %>%
pe[[%>%
assay as.data.frame() %>%
gather(sample, intensity) %>%
mutate(genotype = colData(pe)[sample,"genotype"]) %>%
ggplot(aes(x = intensity,group = sample,color = genotype)) +
geom_density() +
ggtitle("Peptide-level")
## Warning: Removed 7561 rows containing non-finite values (stat_density).
"protein"]] %>%
pe[[%>%
assay as.data.frame() %>%
gather(sample, intensity) %>%
mutate(genotype = colData(pe)[sample,"genotype"]) %>%
ggplot(aes(x = intensity,group = sample,color = genotype)) +
geom_density() +
ggtitle("Protein-level")
## Warning: Removed 428 rows containing non-finite values (stat_density).
pe %>%
  colData %>%
  knitr::kable()
|                          | genotype |
|--------------------------|----------|
| Intensity.1WT_20_2h_n3_3 | WT       |
| Intensity.1WT_20_2h_n4_3 | WT       |
| Intensity.1WT_20_2h_n5_3 | WT       |
| Intensity.3D8_20_2h_n3_3 | D8       |
| Intensity.3D8_20_2h_n4_3 | D8       |
| Intensity.3D8_20_2h_n5_3 | D8       |
"protein"]] %>% assay() %>% head() %>% knitr::kable() pe[[
|              | Intensity.1WT_20_2h_n3_3 | Intensity.1WT_20_2h_n4_3 | Intensity.1WT_20_2h_n5_3 | Intensity.3D8_20_2h_n3_3 | Intensity.3D8_20_2h_n4_3 | Intensity.3D8_20_2h_n5_3 |
|--------------|---|---|---|---|---|---|
| WP_003013731 | -0.2748775 | -0.0856247 | 0.1595370 | -0.2809009 | 0.0035526 | 0.0567110 |
| WP_003013860 | NA | NA | -0.2512039 | NA | NA | -0.4865646 |
| WP_003013909 | -0.6851118 | -0.8161658 | -0.7557906 | -0.4591476 | -0.5449424 | -0.4962482 |
| WP_003014068 | 0.6495386 | 0.8522239 | 1.1344852 | 0.5459176 | 0.9187714 | 0.5974741 |
| WP_003014122 | -0.7630863 | -1.0430741 | -0.8091715 | -1.1743951 | -1.1924725 | -1.2565893 |
| WP_003014123 | -0.2051672 | -0.3361704 | -0.2151930 | -0.3855747 | -0.2802011 | -0.5801771 |
\[ \log_2 \text{FC} = \bar{y}_{p1}-\bar{y}_{p2} \]
\[ T_g=\frac{\log_2 \text{FC}}{\text{se}_{\log_2 \text{FC}}} \]
\[ T_g=\frac{\widehat{\text{signal}}}{\widehat{\text{Noise}}} \]
If we can assume equal variance in both treatment groups:
\[ \text{se}_{\log_2 \text{FC}}=\text{SD}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \]
WP_003023392 <- data.frame(
  intensity = assay(pe[["protein"]]["WP_003023392", ]) %>% c(),
  genotype = colData(pe)[, 1])

WP_003023392 %>%
  ggplot(aes(x = genotype, y = intensity)) +
  geom_point() +
  ggtitle("Protein WP_003023392")
\[ t=\frac{\log_2\widehat{\text{FC}}}{\text{se}_{\log_2\widehat{\text{FC}}}}=\frac{-1.43}{0.0577}=-24.7 \]
Does t = -24.7 indicate that there is an effect?
How likely is it to observe t = -24.7 when there is no effect of the argP KO on the protein expression?
With data we can never prove a hypothesis (falsification principle of Popper)
With data we can only reject a hypothesis
In general we start from the alternative hypothesis \(H_1\): we want to show an effect of the KO on a protein
t.test(intensity ~ genotype, data = WP_003023392, var.equal=TRUE)
##
## Two Sample t-test
##
## data: intensity by genotype
## t = 24.747, df = 4, p-value = 1.582e-05
## alternative hypothesis: true difference in means between group WT and group D8 is not equal to 0
## 95 percent confidence interval:
## 1.267666 1.588058
## sample estimates:
## mean in group WT mean in group D8
## -0.1821147 -1.6099769
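The same numbers can be reproduced by hand from the formulas above (a minimal sketch; here the difference is taken as D8 minus WT, so the sign of the t-statistic is flipped compared to the t.test output):
y  <- assay(pe[["protein"]])["WP_003023392", ]
g  <- colData(pe)$genotype
y1 <- y[g == "D8"]   # knockout samples
y2 <- y[g == "WT"]   # wild-type samples

log2FC <- mean(y1) - mean(y2)

# Pooled standard deviation and standard error of the log2 FC (equal variances)
n1 <- length(y1); n2 <- length(y2)
sdPooled <- sqrt(((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2))
se <- sdPooled * sqrt(1 / n1 + 1 / n2)

tStat <- log2FC / se
pVal  <- 2 * pt(-abs(tStat), df = n1 + n2 - 2)
signif(c(log2FC = log2FC, se = se, t = tStat, p = pVal), 3)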
How likely is it to observe an equal or more extreme effect than the one observed in the sample when the null hypothesis is true?
When we make assumptions about the distribution of our test statistic we can quantify this probability: p-value. The p-value will only be calculated correctly if the underlying assumptions hold!
When we repeat the experiment, the probability to observe a fold change for this protein that is more extreme than a 2.69-fold (\(\log_2 \text{FC}=-1.43\)) down- or up-regulation by random chance (if \(H_0\) is true) is 16 out of 1 000 000.
If the p-value is below a significance threshold \(\alpha\), we reject the null hypothesis. We thereby control the probability of a false positive result at the \(\alpha\)-level (type I error).
Note that p-values are uniformly distributed under the null hypothesis, i.e. when \(H_0\) is true all p-values are equally likely.
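A quick simulation (a sketch, not part of the original analysis) illustrates this: when H0 holds for every test, the p-values are uniformly spread over [0, 1] and about 5% of them fall below 0.05:
set.seed(123)
# 10,000 two-sample t-tests in which the null hypothesis is true
pNull <- replicate(10000, t.test(rnorm(3), rnorm(3), var.equal = TRUE)$p.value)
hist(pNull, breaks = 20, main = "p-values under H0", xlab = "p-value")  # flat histogram
mean(pNull < 0.05)   # close to 0.05: the type I error of a single test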
Consider testing DA for all \(m=1066\) proteins simultaneously
What if we assess each individual test at level \(\alpha\)? \(\rightarrow\) The probability to have at least one false positive (FP) among all \(m\) simultaneous tests is \(>>> \alpha= 0.05\).
Indeed, for each non-DA protein we have a probability of 5% to return a FP.
In a typical experiment the majority of the proteins are non-DA.
So an upper bound on the expected number of FPs is \(m \times \alpha\), or \(1066 \times 0.05 \approx 53\).
\(\rightarrow\) Hence, we are bound to call many false positive proteins each time we run the experiment.
The family-wise error rate (FWER) addresses the multiple testing issue by no longer controlling the individual type I error for each protein; instead, it controls:
\[ \text{FWER} = \text{P}\left[FP \geq 1 \right]. \]
The Bonferroni method is widely used to control the FWER: each test is assessed at level \(\alpha/m\) (equivalently, every p-value is multiplied by \(m\) and capped at 1).
Problem: the method is very conservative!
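In practice the Bonferroni correction amounts to multiplying every p-value by the number of tests m (capped at 1), or equivalently comparing the raw p-values to alpha/m. A small sketch with p.adjust on hypothetical p-values:
p <- c(0.0001, 0.004, 0.02, 0.03, 0.8)   # hypothetical raw p-values, m = 5 tests
m <- length(p)
alpha <- 0.05

p.adjust(p, method = "bonferroni")   # pmin(p * m, 1): controls the FWER at alpha
p < alpha / m                        # only the very smallest p-values survive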
The False Discovery Proportion (FDP) is the fraction of false positives among the features that are returned (rejected), i.e.
\[ FDP = \frac{FP}{R} \]
Therefore, Benjamini and Hochberg, 1995, defined The False Discovery Rate (FDR) as \[ \text{FDR} = \text{E}\left[\frac{FP}{R}\right] =\text{E}\left[\text{FDP}\right] \] the expected FDP.
Consider \(m = 1000\) tests
Suppose that a researcher rejects all null hypotheses for which \(p < 0.01\).
If we use \(p < 0.01\), we expect \(0.01 \times m_0\) tests to return false positives.
A conservative estimate of the number of false positives that we can expect can be obtained by considering that the null hypotheses are true for all features, \(m_0 = m = 1000\).
We then would expect \(0.01 \times 1000 = 10\) false positives (\(FP=10\)).
Suppose that the researcher found 200 genes with \(p<0.01\) (\(R=200\)).
The proportion of false positive results (FDP = false positive proportion) among the list of \(R=200\) genes can then be estimated as \[ \widehat{\text{FDP}}=\frac{FP}{R}=\frac{10}{200}=\frac{0.01 \times 1000}{200} = 0.05. \]
Let \(p_{(1)}\leq \ldots \leq p_{(m)}\) denote the ordered \(p\)-values.
Find the largest integer \(k\) so that \[ \frac{p_{(k)} \times m}{k} \leq \alpha \] \[\text{or}\] \[ p_{(k)} \leq k \times \alpha/m \]
If such a \(k\) exists, reject the \(k\) null hypotheses associated with \(p_{(1)}, \ldots, p_{(k)}\). Otherwise none of the null hypotheses is rejected.
The adjusted \(p\)-value (also known as the \(q\)-value in FDR literature): \[ q_{(i)}=\tilde{p}_{(i)} = \min\left[\min_{j=i,\ldots, m}\left(m p_{(j)}/j\right), 1 \right]. \] In the hypothetical example above: \(k=200\), \(p_{(k)}=0.01\), \(m=1000\) and \(\alpha=0.05\).
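Base R implements this procedure as p.adjust with method "BH" (or "fdr", as used below). A sketch that applies the step-up rule by hand to hypothetical, already ordered p-values and checks it against p.adjust:
p <- c(0.0001, 0.004, 0.02, 0.03, 0.8)   # hypothetical ordered p-values, m = 5
m <- length(p)
alpha <- 0.05

# Step-up rule: largest k with p_(k) <= k * alpha / m
k <- max(which(p <= (1:m) * alpha / m))
k                                         # reject the k smallest p-values

# Adjusted p-values: running minimum of m * p_(j) / j, taken from the largest p down
qManual <- rev(cummin(rev(pmin(p * m / (1:m), 1))))
qManual
p.adjust(p, method = "BH")                # same result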
ttestMx <- function(y, group) {
  # two-sample t-test with equal variances; return NAs if the test fails
  test <- try(t.test(y[group], y[!group], var.equal = TRUE), silent = TRUE)
  if (is(test, "try-error")) {
    return(c(log2FC = NA, se = NA, tstat = NA, pval = NA))
  } else {
    return(c(log2FC = (test$estimate %*% c(1, -1)), se = test$stderr,
             tstat = test$statistic, pval = test$p.value))
  }
}
res <- apply(
  assay(pe[["protein"]]),
  1,
  ttestMx, group = colData(pe)$genotype == "D8") %>%
  t
colnames(res) <- c("logFC", "se", "tstat", "pval")
res <- res %>% as.data.frame %>% na.exclude %>% arrange(pval)
res$adjPval <- p.adjust(res$pval, "fdr")
alpha <- 0.05
res$adjAlphaForm <- paste0(1:nrow(res), " x ", alpha, "/", nrow(res))
res$adjAlpha <- alpha * (1:nrow(res)) / nrow(res)
res$"pval < adjAlpha" <- res$pval < res$adjAlpha
res$"adjPval < alpha" <- res$adjPval < alpha
res
FWER: Bonferroni method: \(\alpha_\text{adj} = \alpha/m = 0.05 / 1066 \approx 4.7\times 10^{-5}\)
|              | logFC | pval | adjPval | adjAlphaForm | adjAlpha | pval < adjAlpha | adjPval < alpha |
|--------------|---|---|---|---|---|---|---|
| WP_003038940 | -0.2876290 | 0.0000146 | 0.0084347 | 1 x 0.05/1066 | 0.0000469 | TRUE | TRUE |
| WP_003023392 | -1.4278622 | 0.0000158 | 0.0084347 | 2 x 0.05/1066 | 0.0000938 | TRUE | TRUE |
| WP_003039212 | -0.2658247 | 0.0000820 | 0.0291520 | 3 x 0.05/1066 | 0.0001407 | TRUE | TRUE |
| WP_003026016 | -1.0800305 | 0.0001395 | 0.0346124 | 4 x 0.05/1066 | 0.0001876 | TRUE | TRUE |
| WP_003039615 | -0.3992190 | 0.0001623 | 0.0346124 | 5 x 0.05/1066 | 0.0002345 | TRUE | TRUE |
| WP_011733588 | -0.4323262 | 0.0002291 | 0.0407034 | 6 x 0.05/1066 | 0.0002814 | TRUE | TRUE |
| WP_003014552 | -0.9843865 | 0.0003224 | 0.0440266 | 7 x 0.05/1066 | 0.0003283 | TRUE | TRUE |
| WP_003040849 | -1.2780743 | 0.0003304 | 0.0440266 | 8 x 0.05/1066 | 0.0003752 | TRUE | TRUE |
| WP_003038430 | -0.4331987 | 0.0004505 | 0.0489078 | 9 x 0.05/1066 | 0.0004221 | FALSE | TRUE |
| WP_003033975 | -0.2949061 | 0.0005047 | 0.0489078 | 10 x 0.05/1066 | 0.0004690 | FALSE | TRUE |
| WP_011733645 | 0.3531405 | 0.0005171 | 0.0489078 | 11 x 0.05/1066 | 0.0005159 | FALSE | TRUE |
| WP_011733723 | -0.3935768 | 0.0005506 | 0.0489078 | 12 x 0.05/1066 | 0.0005629 | TRUE | TRUE |
| WP_003038679 | -0.3909725 | 0.0007083 | 0.0580821 | 13 x 0.05/1066 | 0.0006098 | FALSE | FALSE |
| WP_003033719 | -1.1865453 | 0.0008426 | 0.0603810 | 14 x 0.05/1066 | 0.0006567 | FALSE | FALSE |
| … | … | … | … | … | … | … | … |
| WP_003040562 | 0.0039480 | 0.9976429 | 0.9985797 | 1065 x 0.05/1066 | 0.0499531 | FALSE | FALSE |
| WP_003041130 | 0.0002941 | 0.9992812 | 0.9992812 | 1066 x 0.05/1066 | 0.05 | FALSE | FALSE |
volcanoT <- res %>%
  ggplot(aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal()
volcanoT
Problems with ordinary t-test
problemPlots <- list()
problemPlots[[1]] <- res %>%
  ggplot(aes(x = logFC, y = se, color = adjPval < 0.05)) +
  geom_point(cex = 2.5) +
  scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
  theme_minimal()

for (i in 2:3)
{
  problemPlots[[i]] <- colData(pe) %>%
    as.data.frame %>%
    mutate(intensity = pe[["protein"]][rownames(res)[i], ] %>%
             assay %>%
             c) %>%
    ggplot(aes(x = genotype, y = intensity)) +
    geom_point() +
    ylim(-3, 0) +
    ggtitle(rownames(res)[i])
}
problemPlots
## [[1]]
##
## [[2]]
##
## [[3]]
A general class of moderated test statistics is given by \[ T_g^{mod} = \frac{\bar{Y}_{g1} - \bar{Y}_{g2}}{C \quad \tilde{S}_g} , \] where \(\tilde{S}_g\) is a moderated standard deviation estimate.
# Simulate protein intensities under H0 (no effect) using the observed standard errors
simI <- sapply(res$se / sqrt(1/3 + 1/3), function(n, mean, sd) rnorm(n, mean, sd), n = 6, mean = 0) %>% t

resSim <- apply(
  simI,
  1,
  ttestMx, group = colData(pe)$genotype == "D8") %>%
  t
colnames(resSim) <- c("logFC", "se", "tstat", "pval")
resSim <- as.data.frame(resSim)

tstatSimPlot <- resSim %>%
  ggplot(aes(x = tstat)) +
  geom_histogram(aes(y = ..density.., fill = ..count..), bins = 30) +
  stat_function(fun = dt,
                color = "red",
                args = list(df = 4)) +
  ylim(0, .6) +
  ggtitle("t-statistic")

resSim$C <- sqrt(1/3 + 1/3)
resSim$sd <- resSim$se / resSim$C

# Perseus-style statistic: a small constant (s0 = 0.1) is added to the standard deviation
tstatSimPerseus <- resSim %>%
  ggplot(aes(x = logFC / ((sd + .1) * C))) +
  geom_histogram(aes(y = ..density.., fill = ..count..), bins = 30) +
  stat_function(fun = dt,
                color = "red",
                args = list(df = 4)) +
  ylim(0, .6) +
  ggtitle("Perseus")

gridExtra::grid.arrange(tstatSimPlot, tstatSimPerseus, nrow = 1)
Figure courtesy of Rafael Irizarry
\[ T_g^{mod} = \frac{\bar{Y}_{g1} - \bar{Y}_{g2}}{C \quad \tilde{S}_g} , \]
\[ \tilde{S}_g=\sqrt{\frac{d_gS_g^2+d_0S_0^2}{d_g+d_0}}, \]
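This is the same shrinkage idea implemented in limma's squeezeVar() (limma is already loaded above): d0 and S0^2 are estimated from all proteins and every residual variance is replaced by the moderated variance of the formula above. A minimal sketch on simulated variances, not the exact msqrob2 implementation:
set.seed(42)
# Simulate residual variances for 1,000 proteins: true variances follow a scaled
# inverse chi-square prior (S0^2 = 0.25, d0 = 4); each observed S_g^2 has d_g = 4 df
d0 <- 4; S0sq <- 0.25; dg <- 4
trueVar <- S0sq * d0 / rchisq(1000, df = d0)
sg2 <- trueVar * rchisq(1000, df = dg) / dg

sq <- limma::squeezeVar(sg2, df = dg)     # empirical Bayes estimates of d0 and S0^2
c(df.prior = sq$df.prior, var.prior = sq$var.prior)

# The moderated variances are pulled toward S0^2 and therefore vary much less
c(sd.raw = sd(sg2), sd.moderated = sd(sq$var.post))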
We model the protein-level expression values with the msqrob function. By default, msqrob2 estimates the model parameters using robust regression. We will model the data with a different group mean for every genotype. The group is encoded in the variable genotype of the colData. We can specify this model by using a formula with the factor genotype as its predictor: formula = ~genotype.
Note that a formula always starts with the symbol '~'.
pe <- msqrob(object = pe, i = "protein", formula = ~genotype)
We first explore the design of the model that we specified using the package ExploreModelMatrix.
library(ExploreModelMatrix)
VisualizeDesign(colData(pe),~genotype)$plotlist[[1]]
We have two model parameters, the (Intercept) and genotypeD8. This results in a model with two group means:
\[\text{E}[Y\vert \text{genotype}=\text{WT}] = \text{(Intercept)}\]
\[\text{E}[Y\vert \text{genotype}=\text{D8}] = \text{(Intercept)} + \text{genotypeD8}\]
The average log2FC between D8 and WT is thus \[\log_2\text{FC}_{D8-WT}= \text{E}[Y\vert \text{genotype}=\text{D8}] - \text{E}[Y\vert \text{genotype}=\text{WT}] = \text{genotypeD8} \]
Hence, assessing the null hypothesis that there is no differential abundance between D8 and WT can be reformulated as
\[H_0: \text{genotypeD8}=0\] We can implement a hypothesis test for each protein in msqrob2 using the code below:
L <- makeContrast("genotypeD8 = 0", parameterNames = c("genotypeD8"))
pe <- hypothesisTest(object = pe, i = "protein", contrast = L)
We can show the list with all significant DA proteins at the 5% FDR level using
rowData(pe[["protein"]])$genotypeD8 %>%
arrange(pval) %>%
filter(adjPval<0.05)
We can also visualise the results using a volcano plot
volcano <- ggplot(
  rowData(pe[["protein"]])$genotypeD8,
  aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)
) +
  geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal() +
ggtitle("msqrob2")
gridExtra::grid.arrange(
  volcanoT +
    xlim(-3, 3) +
    ggtitle("ordinary t-test"),
  volcano +
    xlim(-3, 3),
  nrow = 2)
## Warning: Removed 109 rows containing missing values (geom_point).
The volcano plot opens up when using the EB variance estimator
Borrowing strength to estimate the variance using empirical Bayes solves the issue of returning proteins with a low fold change as significant due to a low variance.
qplot(
sapply(rowData(pe[["protein"]])$msqrobModels,getSigma),
sapply(rowData(pe[["protein"]])$msqrobModels,getSigmaPosterior)) +
xlab("SD") +
ylab("moderated SD") +
geom_abline(intercept = 0,slope = 1) +
geom_hline(yintercept = )
## Warning: Removed 109 rows containing missing values (geom_point).
sigNames <- rowData(pe[["protein"]])$genotypeD8 %>%
  rownames_to_column("protein") %>%
filter(adjPval < 0.05) %>%
pull(protein)
heatmap(assay(pe[["protein"]])[sigNames, ])
for (protName in sigNames)
{
  pePlot <- pe[protName, , c("peptideNorm", "protein")]
  pePlotDf <- data.frame(longFormat(pePlot))
  pePlotDf$assay <- factor(pePlotDf$assay,
    levels = c("peptideNorm", "protein")
  )
  pePlotDf$genotype <- as.factor(colData(pePlot)[pePlotDf$colname, "genotype"])

  # plotting
  p1 <- ggplot(
    data = pePlotDf,
    aes(x = colname, y = value, group = rowname)
  ) +
    geom_line() +
    geom_point() +
    facet_grid(~assay) +
    theme(axis.text.x = element_text(angle = 70, hjust = 1, vjust = 0.5)) +
    ggtitle(protName)
  print(p1)

  # plotting 2
  p2 <- ggplot(pePlotDf, aes(x = colname, y = value, fill = genotype)) +
    geom_boxplot(outlier.shape = NA) +
    geom_point(
      position = position_jitter(width = .1),
      aes(shape = rowname)
    ) +
    scale_shape_manual(values = 1:nrow(pePlotDf)) +
    labs(title = protName, x = "sample", y = "peptide intensity (log2)") +
    theme(axis.text.x = element_text(angle = 70, hjust = 1, vjust = 0.5)) +
    facet_grid(~assay)
  print(p2)
}
\[ \log_2 \text{FC} = \bar{y}_{p1}-\bar{y}_{p2} \]
\[ T_g=\frac{\log_2 \text{FC}}{\text{se}_{\log_2 \text{FC}}} \]
\[ T_g=\frac{\widehat{\text{signal}}}{\widehat{\text{Noise}}} \]
If we can assume equal variance in both treatment groups:
\[ \text{se}_{\log_2 \text{FC}}=\text{SD}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \]
\(\rightarrow\) if the number of biological repeats increases, we have higher power!
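For example, with the equal-variance formula above, doubling the number of biological repeats per group shrinks the standard error of the log2 FC by a factor sqrt(2) and hence increases the test statistic for the same effect size (a small numeric sketch with a hypothetical SD):
SD <- 0.3                                        # hypothetical protein-level SD
seLog2FC <- function(n1, n2) SD * sqrt(1 / n1 + 1 / n2)

c(n3 = seLog2FC(3, 3), n6 = seLog2FC(6, 6))      # se drops by a factor sqrt(2)
c(t_n3 = 0.5 / seLog2FC(3, 3), t_n6 = 0.5 / seLog2FC(6, 6))  # same log2 FC of 0.5, larger t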
\[\sigma^2= \sigma^2_{bio}+\sigma^2_\text{lab} +\sigma^2_\text{extraction} + \sigma^2_\text{run} + \ldots\]
Duguet et al. (2017) MCP 16(8):1416-1432. doi: 10.1074/mcp.m116.062745
Our R/Bioconductor package msqrob2 can be used in R markdown scripts or with a GUI/shinyApp in the msqrob2gui package.
The GUI is intended as an introduction to the key concepts of proteomics data analysis for users who have no experience in R.
However, learning how to code your data analyses in R markdown scripts is key for open and reproducible science, and for reporting your proteomics data analyses and interpretation in a reproducible way.
More information on our tools can be found in our papers (L. J. Goeminne, Gevaert, and Clement 2016), (L. J. E. Goeminne et al. 2020) and (Sticker et al. 2020). Please refer to our work when using our tools.
Clips on the code for importing and preprocessing the data can be found in Part I: Preprocessing.
A clip on the code for modelling and statistical inference with msqrob2 is included below