This is part of the online course Proteomics Data Analysis 2021 (PDA21)
Note, that the R-code is included for learners who are aiming to develop R/markdown scripts to automate their quantitative proteomics data analyses. According to the target audience of the course we either work with a graphical user interface (GUI) in a R/shiny App msqrob2gui (e.g. Proteomics Bioinformatics course of the EBI and the Proteomics Data Analysis course at the Gulbenkian institute) or with R/markdowns scripts (e.g. Bioinformatics Summer School at UCLouvain or the Statistical Genomics Course at Ghent University).
library(tidyverse)
library(limma)
library(QFeatures)
library(msqrob2)
library(plotly)
library(ggplot2)
"https://raw.githubusercontent.com/statOmics/PDA21/data/quantification/francisella/peptides.txt" peptidesFile <-
grep("Intensity\\.", names(read.delim(peptidesFile))) ecols <-
readQFeatures(
pe <-table = peptidesFile,
fnames = 1,
ecol = ecols,
name = "peptideRaw", sep="\t")
colData(pe)$genotype <- pe[[1]] %>%
colnames %>%
substr(12,13) %>%
as.factor %>%
relevel("WT")
%>% colData pe
## DataFrame with 6 rows and 1 column
## genotype
## <factor>
## Intensity.1WT_20_2h_n3_3 WT
## Intensity.1WT_20_2h_n4_3 WT
## Intensity.1WT_20_2h_n5_3 WT
## Intensity.3D8_20_2h_n3_3 D8
## Intensity.3D8_20_2h_n4_3 D8
## Intensity.3D8_20_2h_n5_3 D8
rowData(pe[["peptideRaw"]])$nNonZero <- rowSums(assay(pe[["peptideRaw"]]) > 0)
NA
value rather than 0
. zeroIsNA(pe, "peptideRaw") # convert 0 to NA pe <-
logTransform(pe, base = 2, i = "peptideRaw", name = "peptideLog") pe <-
filterFeatures(pe, ~ Proteins %in% smallestUniqueGroups(rowData(pe[["peptideLog"]])$Proteins)) pe <-
filterFeatures(pe,~Reverse != "+")
pe <- filterFeatures(pe,~ Contaminant != "+") pe <-
filterFeatures(pe,~ nNonZero >=2)
pe <-nrow(pe[["peptideLog"]])
## [1] 6525
We keep 6525 peptides upon filtering.
normalize(pe,
pe <-i = "peptideLog",
name = "peptideNorm",
method = "center.median")
aggregateFeatures(pe,
pe <-i = "peptideNorm",
fcol = "Proteins",
na.rm = TRUE,
name = "protein")
## Your quantitative and row data contain missing values. Please read the
## relevant section(s) in the aggregateFeatures manual page regarding the
## effects of missing values on data aggregation.
Plot of preprocessed data
"peptideNorm"]] %>%
pe[[ assay %>%
as.data.frame() %>%
gather(sample, intensity) %>%
mutate(genotype = colData(pe)[sample,"genotype"]) %>%
ggplot(aes(x = intensity,group = sample,color = genotype)) +
geom_density() +
ggtitle("Peptide-level")
## Warning: Removed 7561 rows containing non-finite values (stat_density).
"protein"]] %>%
pe[[ assay %>%
as.data.frame() %>%
gather(sample, intensity) %>%
mutate(genotype = colData(pe)[sample,"genotype"]) %>%
ggplot(aes(x = intensity,group = sample,color = genotype)) +
geom_density() +
ggtitle("Protein-level")
## Warning: Removed 428 rows containing non-finite values (stat_density).
%>%
pe colData %>%
knitr::kable()
genotype | |
---|---|
Intensity.1WT_20_2h_n3_3 | WT |
Intensity.1WT_20_2h_n4_3 | WT |
Intensity.1WT_20_2h_n5_3 | WT |
Intensity.3D8_20_2h_n3_3 | D8 |
Intensity.3D8_20_2h_n4_3 | D8 |
Intensity.3D8_20_2h_n5_3 | D8 |
"protein"]] %>% assay() %>% head() %>% knitr::kable() pe[[
Intensity.1WT_20_2h_n3_3 | Intensity.1WT_20_2h_n4_3 | Intensity.1WT_20_2h_n5_3 | Intensity.3D8_20_2h_n3_3 | Intensity.3D8_20_2h_n4_3 | Intensity.3D8_20_2h_n5_3 | |
---|---|---|---|---|---|---|
WP_003013731 | -0.2748775 | -0.0856247 | 0.1595370 | -0.2809009 | 0.0035526 | 0.0567110 |
WP_003013860 | NA | NA | -0.2512039 | NA | NA | -0.4865646 |
WP_003013909 | -0.6851118 | -0.8161658 | -0.7557906 | -0.4591476 | -0.5449424 | -0.4962482 |
WP_003014068 | 0.6495386 | 0.8522239 | 1.1344852 | 0.5459176 | 0.9187714 | 0.5974741 |
WP_003014122 | -0.7630863 | -1.0430741 | -0.8091715 | -1.1743951 | -1.1924725 | -1.2565893 |
WP_003014123 | -0.2051672 | -0.3361704 | -0.2151930 | -0.3855747 | -0.2802011 | -0.5801771 |
log2FC=ˉyp1−ˉyp2
Tg=log2FCselog2FC
Tg=^signal^Noise
If we can assume equal variance in both treatment groups:
selog2FC=SD√1n1+1n2
003023392 <- data.frame(
WP_intensity = assay(pe[["protein"]]["WP_003023392",]) %>% c(),
genotype = colData(pe)[,1])
003023392 %>%
WP_ ggplot(aes(x=genotype,y=intensity)) +
geom_point() +
ggtitle("Protein WP_003023392")
t=log2^FCselog2^FC=−1.430.0577=−24.7
Is t = -24.7 indicating that there is an effect?
How likely is it to observe t = -24.7 when there is no effect of the argP KO on the protein expression?
With data we can never prove a hypothesis (falsification principle of Popper)
With data we can only reject a hypothesis
In general we start from alternative hypothese H1: we want to show an effect of the KO on a protein
t.test(intensity ~ genotype, data = WP_003023392, var.equal=TRUE)
##
## Two Sample t-test
##
## data: intensity by genotype
## t = 24.747, df = 4, p-value = 1.582e-05
## alternative hypothesis: true difference in means between group WT and group D8 is not equal to 0
## 95 percent confidence interval:
## 1.267666 1.588058
## sample estimates:
## mean in group WT mean in group D8
## -0.1821147 -1.6099769
How likely is it to observe an equal or more extreme effect than the one observed in the sample when the null hypothesis is true?
When we make assumptions about the distribution of our test statistic we can quantify this probability: p-value. The p-value will only be calculated correctly if the underlying assumptions hold!
When we repeat the experiment, the probability to observe a fold change for this gene that is more extreme than a 2.69 fold (log2FC=−1.43) down or up regulation by random change (if H0 is true) is 16 out of 1 000 000.
If the p-value is below a significance threshold α we reject the null hypothesis. We control the probability on a false positive result at the α-level (type I error)
Note, that the p-values are uniform under the null hypothesis, i.e. when H0 is true all p-values are equally likely.
Consider testing DA for all m=1066 proteins simultaneously
What if we assess each individual test at level α? → Probability to have a false positive (FP) among all m simultatenous test >>>α=0.05
Indeed for each non DA protein we have a probability of 5% to return a FP.
In a typical experiment the majority of the proteins are non DA.
So an upperbound of the expected FP is m×α or 1066×0.05=53.
→ Hence, we are bound to call many false positive proteins each time we run the experiment.
The family-wise error rate (FWER) addresses the multiple testing issue by no longer controlling the individual type I error for each protein, instead it controls:
FWER=P[FP≥1].
The Bonferroni method is widely used to control the type I error:
Problem, the method is very conservative!
The False Discovery Proportion (FDP) is the fraction of false positives that are returned, i.e.
FDP=FPR
Therefore, Benjamini and Hochberg, 1995, defined The False Discovery Rate (FDR) as FDR=E[FPR]=E[FDP] the expected FDP.
Consider m=1000 tests
Suppose that a researcher rejects all null hypotheses for which p<0.01.
If we use p<0.01, we expect 0.01×m0 tests to return false positives.
A conservative estimate of the number of false positives that we can expect can be obtained by considering that the null hypotheses are true for all features, m0=m=1000.
We then would expect 0.01×1000=10 false positives (FP=10).
Suppose that the researcher found 200 genes with p<0.01 (R=200).
The proportion of false positive results (FDP = false positive proportion) among the list of R=200 genes can then be estimated as ^FDP=FPR=10200=0.01×1000200=0.05.
Let p(1)≤…≤p(m) denote the ordered p-values.
Find the largest integer k so that p(k)×mk≤α or p(k)≤k×α/m
If such a k exists, reject the k null hypotheses associated with p(1),…,p(k). Otherwise none of the null hypotheses is rejected.
The adjusted p-value (also known as the q-value in FDR literature): q(i)=˜p(i)=min[minj=i,…,m(mp(j)/j),1]. In the hypothetical example above: k=200, p(k)=0.01, m=1000 and α=0.05.
function(y,group) {
ttestMx <- try(t.test(y[group],y[!group],var.equal=TRUE),silent=TRUE)
test <-if(is(test,"try-error")) {
return(c(log2FC=NA,se=NA,tstat=NA,p=NA))
else {
} return(c(log2FC= (test$estimate%*%c(1,-1)),se=test$stderr,tstat=test$statistic,pval=test$p.value))
}
}
apply(
res <-assay(pe[["protein"]]),
1,
ttestMx,group = colData(pe)$genotype=="D8") %>%
t
colnames(res) <- c("logFC","se","tstat","pval")
res %>% as.data.frame %>% na.exclude %>% arrange(pval)
res <-$adjPval <- p.adjust(res$pval, "fdr")
res 0.05
alpha <-$adjAlphaForm <- paste0(1:nrow(res)," x ",alpha,"/",nrow(res))
res$adjAlpha <- alpha * (1:nrow(res))/nrow(res)
res$"pval < adjAlpha" <- res$pval < res$adjAlpha
res$"adjPval < alpha" <- res$adjPval < alpha res
FWER: Bonferroni method:αadj=α/m=0.05/1066=5×10−5
logFC | pval | adjPval | adjAlphaForm | adjAlpha | pval < adjAlpha | adjPval < alpha | |
---|---|---|---|---|---|---|---|
WP_003038940 | -0.2876290 | 0.0000146 | 0.0084347 | 1 x 0.05/1066 | 0.0000469 | TRUE | TRUE |
WP_003023392 | -1.4278622 | 0.0000158 | 0.0084347 | 2 x 0.05/1066 | 0.0000938 | TRUE | TRUE |
WP_003039212 | -0.2658247 | 0.0000820 | 0.0291520 | 3 x 0.05/1066 | 0.0001407 | TRUE | TRUE |
WP_003026016 | -1.0800305 | 0.0001395 | 0.0346124 | 4 x 0.05/1066 | 0.0001876 | TRUE | TRUE |
WP_003039615 | -0.3992190 | 0.0001623 | 0.0346124 | 5 x 0.05/1066 | 0.0002345 | TRUE | TRUE |
WP_011733588 | -0.4323262 | 0.0002291 | 0.0407034 | 6 x 0.05/1066 | 0.0002814 | TRUE | TRUE |
WP_003014552 | -0.9843865 | 0.0003224 | 0.0440266 | 7 x 0.05/1066 | 0.0003283 | TRUE | TRUE |
WP_003040849 | -1.2780743 | 0.0003304 | 0.0440266 | 8 x 0.05/1066 | 0.0003752 | TRUE | TRUE |
WP_003038430 | -0.4331987 | 0.0004505 | 0.0489078 | 9 x 0.05/1066 | 0.0004221 | FALSE | TRUE |
WP_003033975 | -0.2949061 | 0.0005047 | 0.0489078 | 10 x 0.05/1066 | 0.0004690 | FALSE | TRUE |
WP_011733645 | 0.3531405 | 0.0005171 | 0.0489078 | 11 x 0.05/1066 | 0.0005159 | FALSE | TRUE |
WP_011733723 | -0.3935768 | 0.0005506 | 0.0489078 | 12 x 0.05/1066 | 0.0005629 | TRUE | TRUE |
WP_003038679 | -0.3909725 | 0.0007083 | 0.0580821 | 13 x 0.05/1066 | 0.0006098 | FALSE | FALSE |
WP_003033719 | -1.1865453 | 0.0008426 | 0.0603810 | 14 x 0.05/1066 | 0.0006567 | FALSE | FALSE |
… | … | … | … | … | … | … | … |
WP_003040562 | 0.0039480 | 0.9976429 | 0.9985797 | 1065 x 0.05/1066 | 0.0499531 | FALSE | FALSE |
WP_003041130 | 0.0002941 | 0.9992812 | 0.9992812 | 1066 x 0.05/1066 | 0.05 | FALSE | FALSE |
res %>%
volcanoT <- ggplot(aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal()
volcanoT
Problems with ordinary t-test
list()
problemPlots <-1]] <- res %>%
problemPlots[[ ggplot(aes(x = logFC, y = se, color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal()
for (i in 2:3)
{ colData(pe) %>%
problemPlots[[i]] <- as.data.frame %>%
mutate(intensity = pe[["protein"]][rownames(res)[i],] %>%
assay %>%
c) %>%
ggplot(aes(x=genotype,y=intensity)) +
geom_point() +
ylim(-3,0) +
ggtitle(rownames(res)[i])
}
problemPlots
## [[1]]
##
## [[2]]
##
## [[3]]
A general class of moderated test statistics is given by Tmodg=ˉYg1−ˉYg2C˜Sg, where ˜Sg is a moderated standard deviation estimate.
sapply(res$se/sqrt(1/3+1/3),function(n,mean,sd) rnorm(n,mean,sd),n=6,mean=0) %>% t
simI<- apply(
resSim <-
simI, 1,
ttestMx,group = colData(pe)$genotype=="D8") %>%
t
colnames(resSim) <- c("logFC","se","tstat","pval")
as.data.frame(resSim)
resSim <- resSim %>%
tstatSimPlot <- ggplot(aes(x=tstat)) +
geom_histogram(aes(y=..density.., fill=..count..),bins=30) +
stat_function(fun=dt,
color="red",
args=list(df=4)) +
ylim(0,.6) +
ggtitle("t-statistic")
$C <- sqrt(1/3+1/3)
resSim$sd <- resSim$se/resSim$C
resSim resSim %>%
tstatSimPerseus <- ggplot(aes(x=logFC/((sd+.1)*C))) +
geom_histogram(aes(y=..density.., fill=..count..),bins=30) +
stat_function(fun=dt,
color="red",
args=list(df=4)) +
ylim(0,.6) +
ggtitle("Perseus")
::grid.arrange(tstatSimPlot,tstatSimPerseus,nrow=1) gridExtra
Figure courtesy to Rafael Irizarry
Tmodg=ˉYg1−ˉYg2C˜Sg,
˜Sg=√dgS2g+d0S20dg+d0,
msqrob
function. By default msqrob2
estimates the model parameters using robust regression.We will model the data with a different group mean for every genotype. The group is incoded in the variable genotype
of the colData. We can specify this model by using a formula with the factor genotype
as its predictor: formula = ~genotype
.
Note, that a formula always starts with a symbol ‘~’.
msqrob(object = pe, i = "protein", formula = ~genotype) pe <-
We first explore the design of the model that we specified using the the package ExploreModelMatrix
library(ExploreModelMatrix)
VisualizeDesign(colData(pe),~genotype)$plotlist[[1]]
We have two model parameters, the (Intercept) and genotypeD8. This results in a model with two group means:
E[Y|genotype=WT]=(Intercept)
E[Y|genotype=D8]=(Intercept)+genotypeD8
The average log2FC between D8 and WT is thus log2FCD8−WT=E[Y|genotype=D8]−E[Y|genotype=WT]=genotypeD8
Hence, assessing the null hypothesis that there is no differential abundance between D8 and WT can be reformulated as
H0:genotypeD8=0 We can implement a hypothesis test for each protein in msqrob2 using the code below:
makeContrast("genotypeD8 = 0", parameterNames = c("genotypeD8"))
L <- hypothesisTest(object = pe, i = "protein", contrast = L) pe <-
We can show the list with all significant DE proteins at the 5% FDR using
rowData(pe[["protein"]])$genotypeD8 %>%
arrange(pval) %>%
filter(adjPval<0.05)
logFC <dbl> | se <dbl> | df <dbl> | t <dbl> | pval <dbl> | adjPval <dbl> | |
---|---|---|---|---|---|---|
WP_003023392 | -1.4278622 | 0.09546104 | 6.353267 | -14.957538 | 3.435824e-06 | 0.003456439 |
WP_003040849 | -1.2857733 | 0.12226820 | 6.236222 | -10.516007 | 3.372825e-05 | 0.016965307 |
WP_003014552 | -0.9843865 | 0.10770682 | 6.353267 | -9.139501 | 6.920701e-05 | 0.022593404 |
WP_003033719 | -1.1970073 | 0.13408775 | 6.220667 | -8.927044 | 8.983461e-05 | 0.022593404 |
WP_003040790 | -1.3531336 | 0.16187947 | 6.353267 | -8.358896 | 1.174333e-04 | 0.023627585 |
WP_003026016 | -1.0584448 | 0.11240892 | 5.442252 | -9.416022 | 1.429981e-04 | 0.023976009 |
WP_003033905 | -0.8815374 | 0.12199689 | 6.353267 | -7.225901 | 2.740575e-04 | 0.039385984 |
WP_003039713 | -0.7635044 | 0.10902744 | 6.353267 | -7.002865 | 3.279672e-04 | 0.041241874 |
WP_003039530 | -0.9184399 | 0.13755363 | 6.353267 | -6.676959 | 4.299645e-04 | 0.041477282 |
WP_003014581 | -0.9789058 | 0.12767782 | 5.353267 | -7.667000 | 4.409838e-04 | 0.041477282 |
We can also visualise the results using a volcanoplot
ggplot(
volcano <-rowData(pe[["protein"]])$genotypeD8,
aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)
+
) geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal() +
ggtitle("msqrob2")
::grid.arrange(
gridExtra+
volcanoT xlim(-3,3) +
ggtitle("ordinary t-test"),
+
volcano xlim(-3,3)
nrow=2) ,
## Warning: Removed 109 rows containing missing values (geom_point).
The volcano plot opens up when using the EB variance estimator
Borrowing strength to estimate the variance using empirical Bayes solves the issue of returning proteins with a low fold change as significant due to a low variance.
qplot(
sapply(rowData(pe[["protein"]])$msqrobModels,getSigma),
sapply(rowData(pe[["protein"]])$msqrobModels,getSigmaPosterior)) +
xlab("SD") +
ylab("moderated SD") +
geom_abline(intercept = 0,slope = 1) +
geom_hline(yintercept = )
## Warning: Removed 109 rows containing missing values (geom_point).
rowData(pe[["protein"]])$genotypeD8 %>%
sigNames <- rownames_to_column("protein") %>%
filter(adjPval < 0.05) %>%
pull(protein)
heatmap(assay(pe[["protein"]])[sigNames, ])
for (protName in sigNames)
{ pe[protName, , c("peptideNorm", "protein")]
pePlot <- data.frame(longFormat(pePlot))
pePlotDf <-$assay <- factor(pePlotDf$assay,
pePlotDflevels = c("peptideNorm", "protein")
)$genotype <- as.factor(colData(pePlot)[pePlotDf$colname, "genotype"])
pePlotDf
# plotting
ggplot(
p1 <-data = pePlotDf,
aes(x = colname, y = value, group = rowname)
+
) geom_line() +
geom_point() +
facet_grid(~assay) +
theme(axis.text.x = element_text(angle = 70, hjust = 1, vjust = 0.5)) +
ggtitle(protName)
print(p1)
# plotting 2
ggplot(pePlotDf, aes(x = colname, y = value, fill = genotype)) +
p2 <- geom_boxplot(outlier.shape = NA) +
geom_point(
position = position_jitter(width = .1),
aes(shape = rowname)
+
) scale_shape_manual(values = 1:nrow(pePlotDf)) +
labs(title = protName, x = "sample", y = "peptide intensity (log2)") +
theme(axis.text.x = element_text(angle = 70, hjust = 1, vjust = 0.5)) +
facet_grid(~assay)
print(p2)
}
log2FC=ˉyp1−ˉyp2
Tg=log2FCselog2FC
Tg=^signal^Noise
If we can assume equal variance in both treatment groups:
selog2FC=SD√1n1+1n2
→ if number of bio-repeats increases we have a higher power!
σ2=σ2bio+σ2lab+σ2extraction+σ2run+…
Duguet et al. (2017) MCP 16(8):1416-1432. doi: 10.1074/mcp.m116.062745
↕
Our R/Bioconductor package msqrob2 can be used in R markdown scripts or with a GUI/shinyApp in the msqrob2gui package.
The GUI is intended as a introduction to the key concepts of proteomics data analysis for users who have no experience in R.
However, learning how to code data analyses in R markdown scripts is key for open en reproducible science and for reporting your proteomics data analyses and interpretation in a reproducible way.
More information on our tools can be found in our papers (Goeminne, Gevaert, and Clement 2016), (Goeminne et al. 2020) and (Sticker et al. 2020). Please refer to our work when using our tools.
Clips on the code on importing the data and preprocessing can be found in Part I Preprocessing
A clip on the code for modelling and statistical inference with msqrob2 is included below
Goeminne, L. J. E., A. Sticker, L. Martens, K. Gevaert, and L. Clement. 2020. “MSqRob Takes the Missing Hurdle: Uniting Intensity- and Count-Based Proteomics.” Anal Chem 92 (9): 6278–87.
Goeminne, L. J., K. Gevaert, and L. Clement. 2016. “Peptide-level Robust Ridge Regression Improves Estimation, Sensitivity, and Specificity in Data-dependent Quantitative Label-free Shotgun Proteomics.” Mol Cell Proteomics 15 (2): 657–68.
Sticker, A., L. Goeminne, L. Martens, and L. Clement. 2020. “Robust Summarization and Inference in Proteome-wide Label-free Quantification.” Mol Cell Proteomics 19 (7): 1209–19.