
Basics of Statistical Inference and Some Experimental Design Concepts
Outline
- Francisella tularensis Example
- Hypothesis testing
- Multiple testing
- Moderated statistics
- Experimental design
Note, that this dataset has been acquired with DDA (data dependent acquisition). It was searched with MaxQuant. The analysis starts from the peptides.txt file, which is a table in the output directory of MaxQuant with the data summarized at the peptide-level.
The preprocessing steps are slightly different. The modeling concepts, however, are also applicable to DIA data.
Movie clip
1 Francisella tularensis experiment
Movie clip

- Pathogen: causes tularemia
- Metabolic adaptation key for intracellular life cycle of pathogenic microorganisms.
- Upon entry into host cells quick phasomal escape and active multiplication in cytosolic compartment.
- Franciscella is auxotroph for several amino acids, including arginine.
- Inactivation of arginine transporter delayed bacterial phagosomal escape and intracellular multiplication.
- Experiment to assess difference in proteome using 3 WT vs 3 ArgP KO mutants
1.1 Import the data in R
Click to see code
- Load libraries
library(tidyverse)
library(limma)
library(QFeatures)
library(msqrob2)
library(plotly)
library(ggplot2)- We use a peptides.txt file from MS-data quantified with maxquant that contains MS1 intensities summarized at the peptide level.
peptidesFile <- "https://raw.githubusercontent.com/statOmics/PDA22GTPB/data/quantification/francisella/peptides.txt"- Maxquant stores the intensity data for the different samples in columnns that start with Intensity. We can retreive the column names with the intensity data with the code below:
ecols <- grep("Intensity\\.", names(read.delim(peptidesFile)))- Read the data and store it in QFeatures object
qf <- readQFeatures(
assayData = read.delim(peptidesFile),
fnames = 1,
quantCols = ecols,
name = "peptideRaw")- Update data with information on design
colData(qf)$genotype <- qf[[1]] |>
colnames() |>
substr(12,13) |>
as.factor() |>
relevel("WT")
colData(qf)DataFrame with 6 rows and 1 column
genotype
<factor>
Intensity.1WT_20_2h_n3_3 WT
Intensity.1WT_20_2h_n4_3 WT
Intensity.1WT_20_2h_n5_3 WT
Intensity.3D8_20_2h_n3_3 D8
Intensity.3D8_20_2h_n4_3 D8
Intensity.3D8_20_2h_n5_3 D8
1.2 Preprocessing
Click to see code to log-transfrom the data
- Log transform
- peptides with zero intensities are missing peptides and should be represent with a
NAvalue rather than0.
qf <- zeroIsNA(qf, "peptideRaw") # convert 0 to NA- Logtransform data with base 2
qf <- logTransform(qf, base = 2, i = "peptideRaw", name = "peptides_log")- Filtering
Only use proteotypic proteins
qf <- filterFeatures(qf, ~ !grepl(pattern = ";", Proteins))- Remove reverse sequences (decoys) and contaminants. Note that this is indicated by the column names Reverse and depending on the version of maxQuant with Potential.contaminants or Contaminants.
qf <- filterFeatures(qf,~Reverse != "+")
qf <- filterFeatures(qf,~ Contaminant != "+")- We keep peptides that were observed at last 4 times out of the \(n = 6\) samples, so that we can estimate the peptide characteristics.
We tolerate the following proportion of NAs: \(\text{pNA} = \frac{(n - 4)}{n} = 0.33\), so we keep peptides that are observed in at least 66% of the samples. This is an arbitrary value that may need to be adjusted depending on the experiment and the data set.
nObs <- 4
n <- ncol(qf[["peptides_log"]])
(qf <- filterNA(qf, i = "peptides_log", pNA = (n - nObs) / n))An instance of class QFeatures (type: bulk) with 2 sets:
[1] peptideRaw: SummarizedExperiment with 10461 rows and 6 columns
[2] peptides_log: SummarizedExperiment with 5052 rows and 6 columns
We keep 5052 peptides upon filtering.
- Normalization by median centering
qf <- sweep( #4. Subtract log2 norm factor column-by-column (MARGIN = 2)
qf,
MARGIN = 2,
STATS = nfLogMedianOfRatios(qf, i="peptides_log"), #1.
i = "peptides_log",
name = "peptides_norm"
)- Summarization. We use maxLFQ from the iq package.
qf <- aggregateFeatures(qf,
i = "peptides_norm",
fcol = "Proteins",
name = "proteins",
fun = function(X) iq::maxLFQ(X)$estimate
)Plot of preprocessed data
qf[, , "peptides_norm"] |> #1.
longForm(colvars = c( "genotype")) |> #2.
data.frame() |>
filter(!is.na(value)) |>
ggplot() + #3.
aes(x = value,
colour = genotype,
group = colname) +
geom_density() +
theme_minimal() +
ggtitle("Peptide-level")
qf[, , "proteins"] |> #1.
longForm(colvars = c( "genotype")) |> #2.
data.frame() |>
filter(!is.na(value)) |>
ggplot() + #3.
aes(x = value,
colour = genotype,
group = colname) +
geom_density() +
theme_minimal() +
ggtitle("Protein-level")
1.3 Summarized data structure
1.3.1 Design
qf |>
colData() |>
knitr::kable()| genotype | |
|---|---|
| Intensity.1WT_20_2h_n3_3 | WT |
| Intensity.1WT_20_2h_n4_3 | WT |
| Intensity.1WT_20_2h_n5_3 | WT |
| Intensity.3D8_20_2h_n3_3 | D8 |
| Intensity.3D8_20_2h_n4_3 | D8 |
| Intensity.3D8_20_2h_n5_3 | D8 |
- WT vs KO
- 3 vs 3 repeats
1.3.2 Summarized intensity matrix
qf[["proteins"]] |>
assay() |>
head() |>
knitr::kable()| Intensity.1WT_20_2h_n3_3 | Intensity.1WT_20_2h_n4_3 | Intensity.1WT_20_2h_n5_3 | Intensity.3D8_20_2h_n3_3 | Intensity.3D8_20_2h_n4_3 | Intensity.3D8_20_2h_n5_3 | |
|---|---|---|---|---|---|---|
| WP_003013731 | 25.91971 | 26.03992 | 26.27133 | 25.93598 | 26.19855 | 26.22295 |
| WP_003013909 | 25.87033 | 25.65042 | 25.74652 | 26.07970 | 25.90919 | 26.03624 |
| WP_003014068 | 26.79067 | 26.76619 | 27.08102 | 26.62292 | 26.80336 | 26.77253 |
| WP_003014122 | 25.30922 | 24.96019 | 25.18033 | 24.92020 | 24.80622 | 24.86137 |
| WP_003014123 | 25.86714 | 25.66709 | 25.77431 | 25.70902 | 25.71849 | 25.53778 |
| WP_003014302 | 27.85088 | 27.93109 | 27.87884 | 28.01439 | 28.02191 | 28.05122 |
- 1020 proteins
1.3.3 Hypothesis testing: a single protein
Movie clip

T-test
\[ \log_2 \text{FC} = \bar{y}_{p1}-\bar{y}_{p2} \]
\[ T_g=\frac{\log_2 \text{FC}}{\text{se}_{\log_2 \text{FC}}} \]
\[ T_g=\frac{\widehat{\text{signal}}}{\widehat{\text{Noise}}} \]
If we can assume equal variance in both treatment groups:
\[ \text{se}_{\log_2 \text{FC}}=\text{SD}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \]
WP_003023392 <- data.frame(
intensity = assay(qf[["proteins"]]["WP_003023392",]) |> c(),
genotype = colData(qf)[,1])
WP_003023392 |>
ggplot(aes(x=genotype,y=intensity)) +
geom_point() +
ggtitle("Protein WP_003023392") +
theme_minimal()
\[ t=\frac{\log_2\widehat{\text{FC}}}{\text{se}_{\log_2\widehat{\text{FC}}}}=\frac{-1.37}{0.103}=-13.2 \]
Is t = -13.2 indicating that there is an effect?
How likely is it to observe t = -13.2 when there is no effect of the argP KO on the protein expression?
Null hypothesis (\(H_0\)) and alternative hypothesis (\(H_1\))
With data we can never prove a hypothesis (falsification principle of Popper)
With data we can only reject a hypothesis
In general we start from alternative hypothese \(H_1\): we want to show an effect of the KO on a protein
- But, we will assess this by falsifying the opposite:
\(H_0\): On average the protein abundance in WT is equal to that in KO<-
t.test(intensity ~ genotype, data = WP_003023392, var.equal=TRUE)
Two Sample t-test
data: intensity by genotype
t = 13.247, df = 4, p-value = 0.0001877
alternative hypothesis: true difference in means between group WT and group D8 is not equal to 0
95 percent confidence interval:
1.082085 1.655946
sample estimates:
mean in group WT mean in group D8
26.10746 24.73845
How likely is it to observe an equal or more extreme effect than the one observed in the sample when the null hypothesis is true?
When we make assumptions about the distribution of our test statistic we can quantify this probability: p-value. The p-value will only be calculated correctly if the underlying assumptions hold!
When we repeat the experiment, the probability to observe a fold change for this gene that is more extreme than a 2.58 fold (\(\log_2 FC=-1.37\)) down or up regulation by random change (if \(H_0\) is true) is 188 out of 1 000 000.
If the p-value is below a significance threshold \(\alpha\) we reject the null hypothesis. We control the probability on a false positive result at the \(\alpha\)-level (type I error)
Note, that the p-values are uniform under the null hypothesis, i.e. when \(H_0\) is true all p-values are equally likely.
1.4 Multiple hypothesis testing
Movie clip
Consider testing DA for all \(m=1066\) proteins simultaneously
What if we assess each individual test at level \(\alpha\)? \(\rightarrow\) Probability to have a false positive (FP) among all m simultatenous test \(>>> \alpha= 0.05\)
Indeed for each non DA protein we have a probability of 5% to return a FP.
In a typical experiment the majority of the proteins are non DA.
So an upperbound of the expected FP is \(m \times \alpha\) or \(1066 \times 0.05=53\).
\(\rightarrow\) Hence, we are bound to call many false positive proteins each time we run the experiment.
1.4.1 Multiple testing
Family-wise error rate
Movie clip
The family-wise error rate (FWER) addresses the multiple testing issue by no longer controlling the individual type I error for each protein, instead it controls:
\[ \text{FWER} = \text{P}\left[FP \geq 1 \right] \]
The Bonferroni method is widely used to control the type I error:
assess each test at \[\alpha_\text{adj}=\frac{\alpha}{m}\]
or use adjusted p-values and compare them to \(\alpha\): \[p_\text{adj}=\text{min}\left(p \times m,1\right)\]
Problem, the method is very conservative!
False discovery rate
Movie clip
- FDR: Expected proportion of false positives on the total number of positives you return.
- An FDR of 1% means that on average we expect 1% false positive proteins in the list of proteins that are called significant.
- Defined by Benjamini and Hochberg in their seminal paper Benjamini, Y. and Hochberg, Y. (1995). “Controlling the false discovery rate: a practical and powerful approach to multiple testing”. Journal of the Royal Statistical Society Series B, 57 (1): 289–300.
The False Discovery Proportion (FDP) is the fraction of false positives that are returned, i.e.
\[ FDP = \frac{FP}{R} \]
- However, this quantity cannot be observed because in practice we only know the number of proteins for which we rejected \(H_0\), \(R\).
- But, we do not know the number of false positives, \(FP\).
Therefore, Benjamini and Hochberg, 1995, defined The False Discovery Rate (FDR) as \[ \text{FDR} = \text{E}\left[\frac{FP}{R}\right] =\text{E}\left[\text{FDP}\right] \] the expected FDP.
- Controlling the FDR allows for more discoveries (i.e. longer lists with significant results), while the fraction of false discoveries among the significant results in well controlled on average. As a consequence, more of the true positive hypotheses will be detected.
Intuition of BH-FDR procedure
Consider \(m = 1000\) tests
Suppose that a researcher rejects all null hypotheses for which \(p < 0.01\).
If we use \(p < 0.01\), we expect \(0.01 \times m_0\) tests to return false positives.
A conservative estimate of the number of false positives that we can expect can be obtained by considering that the null hypotheses are true for all features, \(m_0 = m = 1000\).
We then would expect \(0.01 \times 1000 = 10\) false positives (\(FP=10\)).
Suppose that the researcher found 200 genes with \(p<0.01\) (\(R=200\)).
The proportion of false positive results (FDP = false positive proportion) among the list of \(R=200\) genes can then be estimated as
\[ \widehat{\text{FDP}}=\frac{\widehat{FP}}{R}=\frac{10}{200}=\frac{0.01 \times 1000}{200} = 0.05. \]
Benjamini and Hochberg (1995) procedure for controlling the FDR at \(\alpha\)
Let \(p_{(1)}\leq \ldots \leq p_{(m)}\) denote the ordered \(p\)-values.
Find the largest integer \(k\) so that \[ \frac{p_{(k)} \times m}{k} \leq \alpha \] \[\text{or}\] \[ p_{(k)} \leq k \times \alpha/m \]
If such a \(k\) exists, reject the \(k\) null hypotheses associated with \(p_{(1)}, \ldots, p_{(k)}\). Otherwise none of the null hypotheses is rejected.
The adjusted \(p\)-value (also known as the \(q\)-value in FDR literature): \[ q_{(i)}=\tilde{p}_{(i)} = \min\left[\min_{j=i,\ldots, m}\left(m p_{(j)}/j\right), 1 \right]. \] In the hypothetical example above: \(k=200\), \(p_{(k)}=0.01\), \(m=1000\) and \(\alpha=0.05\).
Francisella Example
Movie clip
Click to see code
ttestMx <- function(y,group) {
test <- try(t.test(y[group],y[!group],var.equal=TRUE),silent=TRUE)
if(is(test,"try-error")) {
return(c(log2FC=NA,se=NA,tstat=NA,p=NA))
} else {
return(c(log2FC= (test$estimate%*%c(1,-1)),se=test$stderr,tstat=test$statistic,pval=test$p.value))
}
}
res <- apply(
assay(qf[["proteins"]]),
1,
ttestMx,
group = colData(qf)$genotype=="D8") |>
t()
colnames(res) <- c("logFC","se","tstat","pval")
res <- res |>
as.data.frame() |>
na.exclude() |>
arrange(pval)
res$adjPval <- p.adjust(res$pval, "fdr")
alpha <- 0.05
res$adjAlphaForm <- paste0(1:nrow(res)," x ",alpha,"/",nrow(res))
res$adjAlpha <- alpha * (1:nrow(res))/nrow(res)
res$"pval < adjAlpha" <- res$pval < res$adjAlpha
res$"adjPval < alpha" <- res$adjPval < alpha FWER: Bonferroni method:\(\alpha_\text{adj} = \alpha/m = 0.05 / 1020= 5\times 10^{-5}\)
| logFC | pval | adjPval | adjAlphaForm | adjAlpha | pval < adjAlpha | adjPval < alpha | |
|---|---|---|---|---|---|---|---|
| WP_003034781 | 0.4224142 | 0.0000022 | 0.0022721 | 1 x 0.05/1020 | 0.0000490 | TRUE | TRUE |
| WP_003040849 | -1.2263612 | 0.0000076 | 0.0026390 | 2 x 0.05/1020 | 0.0000980 | TRUE | TRUE |
| WP_011733723 | -0.3258931 | 0.0000078 | 0.0026390 | 3 x 0.05/1020 | 0.0001471 | TRUE | TRUE |
| WP_003041244 | 0.2057070 | 0.0000791 | 0.0201736 | 4 x 0.05/1020 | 0.0001961 | TRUE | TRUE |
| WP_003038430 | -0.3836478 | 0.0001838 | 0.0280097 | 5 x 0.05/1020 | 0.0002451 | TRUE | TRUE |
| WP_003023392 | -1.3690155 | 0.0001877 | 0.0280097 | 6 x 0.05/1020 | 0.0002941 | TRUE | TRUE |
| WP_011733588 | -0.3823744 | 0.0001922 | 0.0280097 | 7 x 0.05/1020 | 0.0003431 | TRUE | TRUE |
| WP_011733645 | 0.3040827 | 0.0002362 | 0.0301195 | 8 x 0.05/1020 | 0.0003922 | TRUE | TRUE |
| WP_004339068 | 0.1975424 | 0.0006535 | 0.0730468 | 9 x 0.05/1020 | 0.0004412 | FALSE | FALSE |
| WP_003018840 | -0.5013825 | 0.0007161 | 0.0730468 | 10 x 0.05/1020 | 0.0004902 | FALSE | FALSE |
...
| WP_003022463 | -0.0003682 | 0.9989893 | 0.9999324 | 1019 x 0.05/1020 | 0.049951 | FALSE | FALSE |
| WP_003033335 | -0.0000299 | 0.9999324 | 0.9999324 | 1020 x 0.05/1020 | 0.050000 | FALSE | FALSE |
Results
Click to see code
volcanoT <- plotVolcano(res) volcanoT
1.5 Moderated Statistics
Movie clip
Problems with ordinary t-test
Click to see code
problemPlots <- list()
problemPlots[[1]] <- res |>
ggplot(aes(x = logFC, y = se, color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) +
theme_minimal()
for (i in 2:3)
{
problemPlots[[i]] <- colData(qf) |>
as.data.frame() |>
mutate(intensity = qf[["proteins"]][rownames(res)[i],] |>
assay() |>
c()) |>
ggplot(aes(x=genotype,y=intensity)) +
geom_point() +
ggtitle(rownames(res)[i]) +
ylim(c(24,27)) +
theme_minimal()
}problemPlots[[1]]

[[2]]

[[3]]

A general class of moderated test statistics is given by \[ T_g^{mod} = \frac{\bar{Y}_{g1} - \bar{Y}_{g2}}{C \quad \tilde{S}_g} , \] where \(\tilde{S}_g\) is a moderated standard deviation estimate.
- \(C\) is a constant depending on the design e.g. \(\sqrt{1/{n_1}+1/n_2}\) for a t-test and of another form for linear models.
- \(\tilde{S}_g=S_g+S_0\): add small positive constant to denominator of t-statistic.
- This can be adopted in perseus.
Click to see code
simI<-sapply(res$se/sqrt(1/3+1/3),function(n,mean,sd) rnorm(n,mean,sd),n=6,mean=0) |> t()
resSim <- apply(
simI,
1,
ttestMx,
group = colData(qf)$genotype=="D8") |>
t()
colnames(resSim) <- c("logFC","se","tstat","pval")
resSim <- as.data.frame(resSim)
tstatSimPlot <- resSim |>
ggplot(aes(x=tstat)) +
geom_histogram(aes(y=..density.., fill=..count..),bins=30) +
stat_function(fun=dt,
color="red",
args=list(df=4)) +
ylim(0,.6) +
ggtitle("t-statistic") +
theme_minimal()
resSim$C <- sqrt(1/3+1/3)
resSim$sd <- resSim$se/resSim$C
tstatSimperseus <- resSim |>
ggplot(aes(x=logFC/((sd+.1)*C))) +
geom_histogram(aes(y=..density.., fill=..count..),bins=30) +
stat_function(fun=dt,
color="red",
args=list(df=4)) +
ylim(0,.6) +
ggtitle("perseus") +
theme_minimal()
- The choice of \(S_0\) in perseus is ad hoc and the t-statistic is no-longer t-distributed.
- permutation test, but is difficult for more complex designs.
- Allows for Data Dredging because user can choose \(S_0\)
1.5.1 Empirical Bayes

Figure courtesy to Rafael Irizarry
\[ T_g^{mod} = \frac{\bar{Y}_{g1} - \bar{Y}_{g2}}{C \quad \tilde{S}_g} , \]
- empirical Bayes theory provides formal framework for borrowing strength across proteins,
- Implemented in popular bioconductor package limma and msqrob2
\[ \tilde{S}_g=\sqrt{\frac{d_gS_g^2+d_0S_0^2}{d_g+d_0}}, \]
- \(S_0^2\): common variance (over all proteins)
- Moderated t-statistic is t-distributed with \(d_0+d_g\) degrees of freedom.
- Note that the degrees of freedom increase by borrowing strength across proteins!
Click to see the code
- We model the protein level expression values using the
msqrobfunction. By defaultmsqrob2estimates the model parameters using robust regression.
We will model the data with a different group mean for every genotype. The group is incoded in the variable genotype of the colData. We can specify this model by using a formula with the factor genotype as its predictor: formula = ~genotype.
Note, that a formula always starts with a symbol ‘~’.
qf <- msqrob(object = qf, i = "proteins", formula = ~genotype)- Inference
We first explore the design of the model that we specified using the the package ExploreModelMatrix
library(ExploreModelMatrix)
VisualizeDesign(colData(qf),~genotype)$plotlist[[1]]
We have two model parameters, the (Intercept) and genotypeD8. This results in a model with two group means:
- For the wild type (WT) the expected value (mean) of the log2 transformed intensity y for a protein will be modelled using
\[\text{E}[Y\vert \text{genotype}=\text{WT}] = \text{(Intercept)}\]
- For the knockout genotype D8 the expected value (mean) of the log2 transformed intensity y for a protein will be modelled using
\[\text{E}[Y\vert \text{genotype}=\text{D8}] = \text{(Intercept)} + \text{genotypeD8}\]
The average log2FC between D8 and WT is thus \[\log_2\text{FC}_{D8-WT}= \text{E}[Y\vert \text{genotype}=\text{D8}] - \text{E}[Y\vert \text{genotype}=\text{WT}] = \text{genotypeD8} \]
Hence, assessing the null hypothesis that there is no differential abundance between D8 and WT can be reformulated as
\[H_0: \text{genotypeD8}=0\] We can implement a hypothesis test for each protein in msqrob2 using the code below:
L <- makeContrast("genotypeD8 = 0", parameterNames = c("genotypeD8"))
qf <- hypothesisTest(object = qf, i = "proteins", contrast = L)We can show the list with all significant DE proteins at the 5% FDR using
msqrobCollect(qf[["proteins"]],L) |>
arrange(pval) |>
filter(adjPval<0.05) logFC se df t pval
genotypeD8.WP_003040849 -1.2263612 0.08020151 6.087933 -15.290999 4.358069e-06
genotypeD8.WP_003023392 -1.3690155 0.11105902 6.087933 -12.326919 1.560512e-05
genotypeD8.WP_003033719 -1.1615474 0.12602389 5.828253 -9.216882 1.087304e-04
genotypeD8.WP_003026016 -0.9540456 0.11530658 6.080135 -8.273991 1.573910e-04
genotypeD8.WP_003033046 -0.8219170 0.10214460 5.815009 -8.046603 2.307079e-04
genotypeD8.WP_003038350 0.8077325 0.10952806 6.087933 7.374663 2.978545e-04
adjPval contrast feature
genotypeD8.WP_003040849 0.004301414 genotypeD8 WP_003040849
genotypeD8.WP_003023392 0.007701128 genotypeD8 WP_003023392
genotypeD8.WP_003033719 0.035772317 genotypeD8 WP_003033719
genotypeD8.WP_003026016 0.038836240 genotypeD8 WP_003026016
genotypeD8.WP_003033046 0.045541744 genotypeD8 WP_003033046
genotypeD8.WP_003038350 0.048997070 genotypeD8 WP_003038350
We can also visualise the results using a volcanoplot
volcano <- plotVolcano(
rowData(qf[["proteins"]])$genotypeD8) +
ggtitle("msqrob2")
The volcano plot opens up when using the EB variance estimator
Borrowing strength to estimate the variance using empirical Bayes solves the issue of returning proteins with a low fold change as significant due to a low variance.
1.5.2 Shrinkage of the variance and moderated t-statistics

- Small variances are shrunken towards the common variance resulting in large EB variance estimates
- Large variances are shrunken towards the common variance resulting in smaller EB variance estimates
- Pooled degrees of freedom of the EB variance estimator are larger because information is borrowed across proteins to estimate the variance
1.6 Plots













2 Experimental Design
2.1 Sample size
Movie clip
\[ \log_2 \text{FC} = \bar{y}_{p1}-\bar{y}_{p2} \]
\[ T_g=\frac{\log_2 \text{FC}}{\text{se}_{\log_2 \text{FC}}} \]
\[ T_g=\frac{\widehat{\text{signal}}}{\widehat{\text{Noise}}} \]
If we can assume equal variance in both treatment groups:
\[ \text{se}_{\log_2 \text{FC}}=\text{SD}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \]
\(\rightarrow\) if number of bio-repeats increases we have a higher power!
- cfr. Study of tamoxifen treated Estrogen Recepter (ER) positive breast cancer patients
2.2 Randomized complete block designs
\[\sigma^2= \sigma^2_{bio}+\sigma^2_\text{lab} +\sigma^2_\text{extraction} + \sigma^2_\text{run} + \ldots\]
- Biological: fluctuations in protein level between mice, fluctations in protein level between cells, …
- Technical: cage effect, lab effect, week effect, plasma extraction, MS-run, …
2.3 Nature methods: Points of significance - Blocking
https://www.nature.com/articles/nmeth.3005.pdf
2.3.1 Mouse example

Duguet et al. (2017) MCP 16(8):1416-1432. doi: 10.1074/mcp.m116.062745
- All treatments of interest are present within block!
- We can estimate the effect of the treatment within block!
- We can isolate the between block variability from the analysis using linear model: \[ y \sim \text{type} + \text{mouse} \]
- Not possible with perseus!
2.3.2 Assess the impact of blocking in the tutorial session!
- Completely randomized design with only one cell type per mouse (Treg and Tconv)
\[\updownarrow\]
- Randomized complete block design assessing Treg and Tconv in each mouse