This is part of the online course Experimental Design and Data-Analysis in Label-Free Quantitative LC/MS Proteomics - A Tutorial with msqrob2 (hupo21)
\[y_{i}= \beta_0 + \beta_1 x_{i,1} + \beta x_{i,2} + ... + \epsilon_i\]
with
Example:
We model the data for a single protein using a model:
\[y_{i}= \beta_0 + \beta_{PD} x_{i,PD} + \epsilon_i\]
with \(x_{i,PD}=\begin{cases} 0& \text{good outcome}\\ 1& \text{poor outcome} \end{cases}\).
peptidesFile <- "https://raw.githubusercontent.com/statOmics/pda21/data/quantification/cancer/peptides3vs3.txt"
## CharacterList of length 1
## [["peptideRaw"]] Intensity.OR.01 Intensity.OR.04 ... Intensity.PD.04
## DataFrame with 6 rows and 1 column
## prognosis
## <factor>
## Intensity.OR.01 OR
## Intensity.OR.04 OR
## Intensity.OR.07 OR
## Intensity.PD.02 PD
## Intensity.PD.03 PD
## Intensity.PD.04 PD
NA
value rather than 0
.
In our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.
pe <- filterFeatures(pe, ~ Proteins %in% smallestUniqueGroups(rowData(pe[["peptideLog"]])$Proteins))
We now remove the contaminants, peptides that map to decoy sequences, and proteins which were only identified by peptides with modifications.
We keep peptides that were observed at last twice.
## [1] 22413
We keep 22413 peptides upon filtering.
## [[1]]
\[ \begin{array}{rclcl} E[Y|OR] &=& \beta_0\\ E[Y|PD] &=& \beta_0 + \beta_{PD}\\ \log_2 FC_{PD-OR} &=& \beta_0 + \beta_{PD} - \beta_0 &=&\beta_{PD} \end{array} \]
We want to find proteins that are differential abundant
\(\rightarrow\) use a statistical test
We typically start from the alternative hypothesis
But we can not use data to prove a hypothesis, we therefore falsify the opposite:
\[T = \frac{\hat\beta_{PD} - 0}{\text{se}_{\hat\beta_{PD}}}\]
which follows a t-distribution under \(H_0\) if the errors are \[ \epsilon_i \text{ i.i.d. } N(0,\sigma^2) \]
We do this test for all proteins (typically thousands of them)
Adjust p-values for multiple testing using the false discovery rate \[FDR = E\left[\frac{FP}{FP + TP}\right]\]
Empirical Bayes variance estimation: Note, that the massive parallel data structure also allows you to stabilize the variance estimation by borrowing information across proteins!
Note, if you want to refresh some fundamental concepts of hypothesis testing:
pe <- msqrob(object = pe, i = "protein", formula = ~prognosis)
L <- makeContrast("prognosisPD=0", parameterNames = c("prognosisPD"))
pe <- hypothesisTest(object = pe, i = "protein", contrast = L)
volcano3x3 <- ggplot(rowData(pe[["protein"]])$prognosisPD %>% na.exclude,
aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal() +
ggtitle(paste0(sum(rowData(pe[["protein"]])$prognosisPD$adjPval<0.05,na.rm=TRUE)," proteins are found to be DA"))
Upon correction for multiple testing with using the false discovery rate (FDR) method no proteins are found to be differentially expressed.
Note, that you can refresh the concept of multiple testing and FDR [here] (https://statomics.github.io/PDA21/pda_quantification_inference.html#14_Multiple_hypothesis_testing).
peptidesFile <- "https://raw.githubusercontent.com/statOmics/pda21/data/quantification/cancer/peptides6vs6.txt"
## CharacterList of length 1
## [["peptideRaw"]] Intensity.OR.01 Intensity.OR.04 ... Intensity.PD.08
## DataFrame with 12 rows and 1 column
## prognosis
## <factor>
## Intensity.OR.01 OR
## Intensity.OR.04 OR
## Intensity.OR.07 OR
## Intensity.OR.09 OR
## Intensity.OR.10 OR
## ... ...
## Intensity.PD.03 PD
## Intensity.PD.04 PD
## Intensity.PD.06 PD
## Intensity.PD.07 PD
## Intensity.PD.08 PD
NA
value rather than 0
.
In our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.
pe <- filterFeatures(pe, ~ Proteins %in% smallestUniqueGroups(rowData(pe[["peptideLog"]])$Proteins))
We now remove the contaminants, peptides that map to decoy sequences, and proteins which were only identified by peptides with modifications.
We keep peptides that were observed at last twice.
## [1] 25452
We keep 25452 peptides upon filtering.
pe <- msqrob(object = pe, i = "protein", formula = ~prognosis)
L <- makeContrast("prognosisPD=0", parameterNames = c("prognosisPD"))
pe <- hypothesisTest(object = pe, i = "protein", contrast = L)
volcano6x6 <- ggplot(rowData(pe[["protein"]])$prognosisPD %>% na.exclude,
aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal() +
ggtitle(paste0(sum(rowData(pe[["protein"]])$prognosisPD$adjPval<0.05,na.rm=TRUE)," proteins are found to be DA"))
peptidesFile <- "https://raw.githubusercontent.com/statOmics/pda21/data/quantification/cancer/peptides9vs9.txt"
## CharacterList of length 1
## [["peptideRaw"]] Intensity.OR.01 Intensity.OR.04 ... Intensity.PD.11
## DataFrame with 18 rows and 1 column
## prognosis
## <factor>
## Intensity.OR.01 OR
## Intensity.OR.04 OR
## Intensity.OR.07 OR
## Intensity.OR.09 OR
## Intensity.OR.10 OR
## ... ...
## Intensity.PD.07 PD
## Intensity.PD.08 PD
## Intensity.PD.09 PD
## Intensity.PD.10 PD
## Intensity.PD.11 PD
NA
value rather than 0
.
In our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.
pe <- filterFeatures(pe, ~ Proteins %in% smallestUniqueGroups(rowData(pe[["peptideLog"]])$Proteins))
We now remove the contaminants, peptides that map to decoy sequences, and proteins which were only identified by peptides with modifications.
We keep peptides that were observed at last twice.
## [1] 26696
We keep 26696 peptides upon filtering.
pe <- msqrob(object = pe, i = "protein", formula = ~prognosis)
L <- makeContrast("prognosisPD=0", parameterNames = c("prognosisPD"))
pe <- hypothesisTest(object = pe, i = "protein", contrast = L)
volcano9x9 <- ggplot(rowData(pe[["protein"]])$prognosisPD %>% na.exclude,
aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal() +
ggtitle(paste0(sum(rowData(pe[["protein"]])$prognosisPD$adjPval<0.05,na.rm=TRUE)," proteins are found to be DA"))
We have seen that the sample size is key to recover DA proteins
Indeed, if a protein is differentially expressed, the value of T-test depends on the effect size, the variability of the protein expression values and the sample size.
\[ T_g=\frac{\log_2 \text{FC}}{\text{se}_{\log_2 \text{FC}}} \]
\[ T_g=\frac{\widehat{\text{signal}}}{\widehat{\text{Noise}}} \]
For a two group comparison the standard error on the fold change equals
\[ \text{se}_{\log_2 \text{FC}}=\text{SD}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \]
\(\rightarrow\) if number of bio-repeats increases we have a higher power!
\[\sigma^2= \sigma^2_{bio}+\sigma^2_\text{lab} +\sigma^2_\text{extraction} + \sigma^2_\text{run} + \ldots\]
Duguet et al. (2017) MCP 16(8):1416-1432. doi: 10.1074/mcp.m116.062745
To illustrate the power of blocking we have subsetted the data of Duguet et al. in a
completely randomized design with
randomized complete block design with four mice for which we both have
library(tidyverse)
library(limma)
library(QFeatures)
library(msqrob2)
library(plotly)
library(gridExtra)
peptidesFile <- "https://raw.githubusercontent.com/statOmics/PDA21/data/quantification/mouseTcell/peptidesRCB.txt"
peptidesFile2 <- "https://raw.githubusercontent.com/statOmics/PDA21/data/quantification/mouseTcell/peptidesCRD.txt"
peptidesFile3 <- "https://raw.githubusercontent.com/statOmics/PDA21/data/quantification/mouseTcell/peptides.txt"
ecols <- grep("Intensity\\.", names(read.delim(peptidesFile)))
pe <- readQFeatures(
table = peptidesFile,
fnames = 1,
ecol = ecols,
name = "peptideRaw", sep="\t")
ecols2 <- grep("Intensity\\.", names(read.delim(peptidesFile2)))
pe2 <- readQFeatures(
table = peptidesFile2,
fnames = 1,
ecol = ecols2,
name = "peptideRaw", sep="\t")
ecols3 <- grep("Intensity\\.", names(read.delim(peptidesFile3)))
pe3 <- readQFeatures(
table = peptidesFile3,
fnames = 1,
ecol = ecols3,
name = "peptideRaw", sep="\t")
### Design
colData(pe)$celltype <- substr(
colnames(pe[["peptideRaw"]]),
11,
14) %>%
unlist %>%
as.factor
colData(pe)$mouse <- pe[[1]] %>%
colnames %>%
strsplit(split="[.]") %>%
sapply(function(x) x[3]) %>%
as.factor
colData(pe2)$celltype <- substr(
colnames(pe2[["peptideRaw"]]),
11,
14) %>%
unlist %>%
as.factor
colData(pe2)$mouse <- pe2[[1]] %>%
colnames %>%
strsplit(split="[.]") %>%
sapply(function(x) x[3]) %>%
as.factor
colData(pe3)$celltype <- substr(
colnames(pe3[["peptideRaw"]]),
11,
14) %>%
unlist %>%
as.factor
colData(pe3)$mouse <- pe3[[1]] %>%
colnames %>%
strsplit(split="[.]") %>%
sapply(function(x) x[3]) %>%
as.factor
rowData(pe[["peptideRaw"]])$nNonZero <- rowSums(assay(pe[["peptideRaw"]]) > 0)
rowData(pe2[["peptideRaw"]])$nNonZero <- rowSums(assay(pe2[["peptideRaw"]]) > 0)
rowData(pe3[["peptideRaw"]])$nNonZero <- rowSums(assay(pe3[["peptideRaw"]]) > 0)
NA
value rather than 0
.pe <- zeroIsNA(pe, "peptideRaw") # convert 0 to NA
pe2 <- zeroIsNA(pe2, "peptideRaw") # convert 0 to NA
pe3 <- zeroIsNA(pe3, "peptideRaw") # convert 0 to NA
In our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.
pe <- filterFeatures(pe, ~ Proteins %in% smallestUniqueGroups(rowData(pe[["peptideLog"]])$Proteins))
pe2 <- filterFeatures(pe2, ~ Proteins %in% smallestUniqueGroups(rowData(pe2[["peptideLog"]])$Proteins))
pe3 <- filterFeatures(pe3, ~ Proteins %in% smallestUniqueGroups(rowData(pe3[["peptideLog"]])$Proteins))
We now remove the contaminants, peptides that map to decoy sequences, and proteins which were only identified by peptides with modifications.
pe <- filterFeatures(pe,~Reverse != "+")
pe <- filterFeatures(pe,~ Potential.contaminant != "+")
pe2 <- filterFeatures(pe2,~Reverse != "+")
pe2 <- filterFeatures(pe2,~ Potential.contaminant != "+")
pe3 <- filterFeatures(pe3,~Reverse != "+")
pe3 <- filterFeatures(pe3,~ Potential.contaminant != "+")
We keep peptides that were observed at last twice.
## [1] 44449
## [1] 43401
## [1] 47431
## Your quantitative and row data contain missing values. Please read the
## relevant section(s) in the aggregateFeatures manual page regarding the
## effects of missing values on data aggregation.
## Your quantitative and row data contain missing values. Please read the
## relevant section(s) in the aggregateFeatures manual page regarding the
## effects of missing values on data aggregation.
## Your quantitative and row data contain missing values. Please read the
## relevant section(s) in the aggregateFeatures manual page regarding the
## effects of missing values on data aggregation.
levels(colData(pe3)$mouse) <- paste0("m",1:7)
mdsObj3 <- plotMDS(assay(pe3[["protein"]]), plot = FALSE)
mdsOrig <- colData(pe3) %>%
as.data.frame %>%
mutate(mds1 = mdsObj3$x,
mds2 = mdsObj3$y,
lab = paste(mouse,celltype,sep="_")) %>%
ggplot(aes(x = mds1, y = mds2, label = lab, color = celltype, group = mouse)) +
geom_text(show.legend = FALSE) +
geom_point(shape = 21) +
geom_line(color = "black", linetype = "dashed") +
xlab(
paste0(
mdsObj3$axislabel,
" ",
1,
" (",
paste0(
round(mdsObj3$var.explained[1] *100,0),
"%"
),
")"
)
) +
ylab(
paste0(
mdsObj3$axislabel,
" ",
2,
" (",
paste0(
round(mdsObj3$var.explained[2] *100,0),
"%"
),
")"
)
) +
ggtitle("Original (RCB)")
levels(colData(pe)$mouse) <- paste0("m",1:4)
mdsObj <- plotMDS(assay(pe[["protein"]]), plot = FALSE)
mdsRCB <- colData(pe) %>%
as.data.frame %>%
mutate(mds1 = mdsObj$x,
mds2 = mdsObj$y,
lab = paste(mouse,celltype,sep="_")) %>%
ggplot(aes(x = mds1, y = mds2, label = lab, color = celltype, group = mouse)) +
geom_text(show.legend = FALSE) +
geom_point(shape = 21) +
geom_line(color = "black", linetype = "dashed") +
xlab(
paste0(
mdsObj$axislabel,
" ",
1,
" (",
paste0(
round(mdsObj$var.explained[1] *100,0),
"%"
),
")"
)
) +
ylab(
paste0(
mdsObj$axislabel,
" ",
2,
" (",
paste0(
round(mdsObj$var.explained[2] *100,0),
"%"
),
")"
)
) +
ggtitle("Randomized Complete Block (RCB)")
levels(colData(pe2)$mouse) <- paste0("m",1:8)
mdsObj2 <- plotMDS(assay(pe2[["protein"]]), plot = FALSE)
mdsCRD <- colData(pe2) %>%
as.data.frame %>%
mutate(mds1 = mdsObj2$x,
mds2 = mdsObj2$y,
lab = paste(mouse,celltype,sep="_")) %>%
ggplot(aes(x = mds1, y = mds2, label = lab, color = celltype, group = mouse)) +
geom_text(show.legend = FALSE) +
geom_point(shape = 21) +
xlab(
paste0(
mdsObj$axislabel,
" ",
1,
" (",
paste0(
round(mdsObj2$var.explained[1] *100,0),
"%"
),
")"
)
) +
ylab(
paste0(
mdsObj$axislabel,
" ",
2,
" (",
paste0(
round(mdsObj2$var.explained[2] *100,0),
"%"
),
")"
)
) +
ggtitle("Completely Randomized Design (CRD)")
We observe that the leading fold change is according to mouse
In the second dimension we see a separation according to cell-type
With the Randomized Complete Block design (RCB) we can remove the mouse effect from the analysis!
We can isolate the between block variability from the analysis using linear model:
Formula in R \[ y \sim \text{celltype} + \text{mouse} \]
Formula
\[ y_i = \beta_0 + \beta_\text{Treg} x_{i,\text{Treg}} + \beta_{m2}x_{i,m2} + \beta_{m3}x_{i,m3} + \beta_{m4}x_{i,m4} + \epsilon_i \]
with
\(x_{i,Treg}=\begin{cases} 1& \text{Treg}\\ 0& \text{Tcon} \end{cases}\)
\(x_{i,m2}=\begin{cases} 1& \text{m2}\\ 0& \text{otherwise} \end{cases}\)
\(x_{i,m3}=\begin{cases} 1& \text{m3}\\ 0& \text{otherwise} \end{cases}\)
\(x_{i,m4}=\begin{cases} 1& \text{m4}\\ 0& \text{otherwise} \end{cases}\)
Possible in msqrob2 and MSstats but not possible with Perseus!
Effect size in RCB
## [[1]]
Effect size in CRD
## [[1]]
accessions <- rownames(pe[["protein"]])[rownames(pe[["protein"]])%in%rownames(pe2[["protein"]])]
dat <- data.frame(
sigmaRBC = sapply(rowData(pe[["protein"]])$msqrobModels[accessions], getSigmaPosterior),
sigmaCRD <- sapply(rowData(pe2[["protein"]])$msqrobModels[accessions], getSigmaPosterior)
)
plotRBCvsCRD <- ggplot(data = dat, aes(sigmaRBC, sigmaCRD)) +
geom_point(alpha = 0.1, shape = 20) +
scale_x_log10() +
scale_y_log10() +
geom_abline(intercept=0,slope=1)
## Warning: Removed 743 rows containing missing values (geom_point).
We clearly observe that the standard deviation of the protein expression in the RCB is smaller for the majority of the proteins than that obtained with the CRD
Why are some of the standard deviations for the RCB with the correct analysis larger than than of the RCB with the incorrect analysis that ignored the mouse blocking factor?
Can you think of a reason why it would not be useful to block on a particular factor?