Motivating example
(Brunner et al. 2022)
Brunner et al. 2022. Mol Syst Biol. 18(3): e10798. doi:10.15252/msb.202110798
- Single-cell study on 231 HeLa cells with drug-induced cell cycle
arrest, for which over 2500 proteins were measured with MS-SCP.
- Authors report:
“The proteomes of the different cell cycle states grouped together in
a principal component analysis (PCA) plot”.
“Our single‐cell data set also highlighted proteins not previously
associated with the cell cycle and the G2/M transition”
Reanalysis of the results
We could reproduce the results:
Another way to look at the same plot
What happened
Confounding between acquisition batch and cell cycle arrest!!
To consult the statistician after an experiment is finished is often
merely to ask him to conduct a post mortem examination. He can perhaps
say what the experiment died of. – Ronald Fisher
Thinking about design before your experiment is key!!!!
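A cross-tabulation of the planned design can reveal this kind of confounding before any sample is acquired. The sketch below uses a hypothetical annotation table: the column names `batch` and `arrest`, the labels, and the number of cells are illustrative assumptions and are not taken from Brunner et al.

```r
# Hypothetical sample annotation: each arrest condition is acquired in its own batch
design <- data.frame(
  batch  = rep(c("batch1", "batch2", "batch3"), each = 4),
  arrest = rep(c("G1", "G1/S", "G2/M"), each = 4)
)

# Cross-tabulate acquisition batch against condition
tab <- table(design$batch, design$arrest)
print(tab)

# If every batch contains a single condition, batch and condition effects
# cannot be separated: the design is confounded
confounded <- all(rowSums(tab > 0) == 1)
confounded
```

A table with non-zero counts in only one column per row is a warning sign; a design where each batch contains all conditions would fill every cell.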
Stages of an experiment
Stages
1. Define hypothesis
2. Experimental design
   - Technology
   - Lab protocol
   - Think about sources of variation
   - Choice of design
   - Experimental conditions
   - Replicates
   - How will we analyse the data?
   - How can we translate our research question into a quantity that we can
     measure?
   - Power analysis
3. Conduct experiment
4. Data analysis
   - QC
   - Preprocessing & data exploration
   - Statistical inference
5. Optimisation of your experiment (go back to 2)
6. Validation
7. Report your results
It is always good to discuss with a statistician before the experiment!
Sources of variability
(B: biological source of variability, T: technical source of variability)
Experimental unit
- Animal, subject, plant, culture, cage to which the treatment is
randomized
- Colony (B)
- Strain (B)
- Culture (B)
- Treatment / Condition of interest (B)
- Cage (T)
- Sex (B)
- Individual (B)
- Life style (B)
- …
Sample prep
- Organs from sacrificed animal (B)
- Single cells (B)
- Runs for Dissociation, Extraction and Digestion (T)
- Multipipette / pipetting robot (T)
- Plate (T)
- Position on plate (T)
Proteomics acquisition
- LC column (T)
- Run (T)
- Technical repeat (T)
- Labeling (T)
- Acquisition order
- …
Observational unit
- Unit on which the measurement is conducted
- Cell
- Cell bulk
- Animal
- If observational unit \(\neq\) experimental unit: pseudoreplication
Avoid confounding
- Random sampling & Randomisation
- Blocking
Random Sampling & Randomisation
Random Sampling
Random sampling is closely related to the concept of the
population or the scope of the study.
Based on a sample of subjects, the researchers want to come to
conclusions that hold for
- all kinds of people
- only male students
Scope of the study should be well specified before the start of
the study.
Representative sample: For the statistical analysis to be valid,
it is required that the subjects are selected completely at random from
the population to which we want to generalize our conclusions.
Selecting completely at random from a population implies:
- all subjects in the population should have the same probability of
being selected in the sample,
- the selection of a subject in the sample should be independent from
the selection of the other subjects in the sample.
Randomisation
- Make sure that groups are comparable / Avoid systematic differences
between groups
- Randomisation: treatments of interest are assigned at random to
the experimental units
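Randomisation as described above can be sketched in a few lines of base R. The unit names, the group labels and the group size of 6 are hypothetical; the sketch also randomises the processing order, which is one possible way to avoid systematic differences between groups.

```r
set.seed(1)  # fix the seed so that the allocation can be reproduced

# Hypothetical experimental units: 12 mice
units <- paste0("mouse", 1:12)

# Balanced allocation: shuffle a vector that holds 6 labels of each treatment
treatment <- sample(rep(c("control", "treated"), each = 6))
allocation <- data.frame(unit = units, treatment = treatment)

# Randomise the processing order as well
allocation$runOrder <- sample(nrow(allocation))
allocation[order(allocation$runOrder), ]
```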
Consequences of Random sampling & Randomisation
National Health and Nutrition Examination Survey (NHANES) study
- Since 1960 individuals of all ages are interviewed in their homes
every year
- The health examination component of the survey is conducted in a
mobile examination centre (MEC).
- We will use this large study to select random subjects from the
American population.
- This will help us to understand how the results of an analysis and
the conclusions vary from sample to sample.
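library(NHANES)
library(tidyverse)

# Keep adults with a recorded height
nhanesSub <- NHANES %>%
  filter(Age >= 18 & !is.na(Height)) %>%
  select(c("Gender", "Height"))

# Height distribution per gender
nhanesSub %>%
  ggplot(aes(x = Height)) +
  geom_histogram() +
  facet_grid(Gender ~ .) +
  xlab("Height (cm)")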
Gender | Mean (cm) | SD (cm)
female | 162.1 | 7.3
male | 175.9 | 7.5
- Data bell-shaped
- Allows us to summarize data with two statistics: mean and standard
deviation
Unfortunately, we cannot sample the entire population! We have to draw
conclusions based on a small sample.
Experiment
- We can simulate an experiment on the American population by sampling
from the NHANES study
- 5 males and 5 females above 18 years.
Note that the sample mean is different from that of the large
experiment (“population”) we sampled from.
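One such experiment can be mimicked as follows. This is a minimal sketch under an assumption: instead of resampling real subjects from the NHANES data, it draws heights from normal distributions with the means and standard deviations reported above. The helper `drawSample` is hypothetical.

```r
set.seed(42)

# Hypothetical helper: draw n heights per gender from normal distributions
# whose mean and sd mimic the NHANES summary statistics (an approximation
# that replaces sampling real subjects from the NHANES data)
drawSample <- function(n) {
  data.frame(
    Gender = rep(c("female", "male"), each = n),
    Height = c(rnorm(n, mean = 162.1, sd = 7.3),
               rnorm(n, mean = 175.9, sd = 7.5))
  )
}

samp <- drawSample(5)

# The sample means differ from the "population" values 162.1 and 175.9
aggregate(Height ~ Gender, data = samp, FUN = mean)
```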
We test for the difference in mean height between males and females:
t.test(Height ~ Gender, samp, var.equal = TRUE)
Two Sample t-test
data: Height by Gender
t = -0.82599, df = 8, p-value = 0.4327
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-18.276478 8.636478
sample estimates:
mean in group female mean in group male
168.72 173.54
Repeat experiment
If we do the experiment again we select other people and we obtain
different results.
Two Sample t-test
data: Height by Gender
t = -4.8876, df = 8, p-value = 0.001213
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-29.08282 -10.43718
sample estimates:
mean in group female mean in group male
158.04 177.80
And again
Two Sample t-test
data: Height by Gender
t = 3.1182, df = 8, p-value = 0.01427
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
1.255452 8.384548
sample estimates:
mean in group female mean in group male
172.60 167.78
Summary
We drew different subjects at random in each sample.
As a result, height measurements vary from sample to sample.
So do the estimated means and standard deviations.
Consequently, our conclusions are also uncertain and may change
from sample to sample.
For the height example, samples in which the effect is opposite to
that in the population and in which we declare the difference
significant are rare.
\(\rightarrow\) With statistics, we control the probability of drawing
wrong conclusions.
Control of Decision Errors
We have two types of errors:
- false negatives: there is an effect but we do not pick it up
- false positives: there is no effect but we report a difference
between both groups
Control of false negatives
- We repeat the experiment with 5 females and 5 males 10000
times.
There are 7234 samples for which we return a true positive \(\rightarrow\) the power is 72.3%.
There are 2766 samples for which we cannot report a significant
difference.
There are 0 samples for which we report that females are significantly
taller than males.
The sample shown earlier, where we concluded that females
were taller than males, was very unlikely. We had to draw 88605 samples
before we found such an extreme sample.
Why do we have a considerable number of samples for which we do
not find a significant height difference between males and
females?
Larger sample size
When we take 20 subjects in each group:
Larger sample size:
- Larger power to pick up a real difference in the population.
- The mean is estimated more precisely
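The repeated experiments can be sketched with a small simulation. As an assumption, heights are drawn from normal distributions with the means and standard deviations reported earlier rather than resampled from NHANES, and only 2000 repeats are used to keep the run short. The function name `simPower` is hypothetical.

```r
set.seed(123)

# Estimate power by simulation: the fraction of simulated experiments in which
# the two-sample t-test is significant at level alpha.
# Heights are drawn from normal distributions (an assumption that stands in
# for resampling NHANES subjects).
simPower <- function(n, nSim = 2000, alpha = 0.05) {
  pvals <- replicate(nSim, {
    female <- rnorm(n, mean = 162.1, sd = 7.3)
    male   <- rnorm(n, mean = 175.9, sd = 7.5)
    t.test(female, male, var.equal = TRUE)$p.value
  })
  mean(pvals < alpha)
}

power5  <- simPower(5)
power20 <- simPower(20)
c(n5 = power5, n20 = power20)
```

Because the data are simulated rather than resampled, the estimates will not match the numbers on the slides exactly, but the power for 20 subjects per group should be clearly higher than for 5.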
Control of false positives
Suppose that we set up an experiment with two groups that are
both sampled from the females in the NHANES study
Both groups come from the same population: so no
difference
We again draw repeated experiments with 5 subjects in each
group.
In only 451 out of 10000 samples (4.5%) do we conclude that the means
of both groups differ.
With the statistical analysis we correctly control the probability of a
false positive result at the 5% significance level.
Larger sample size
We perform the simulations again with 20 subjects in each group.
In 517 out of 10000 samples (5.2%) we conclude that the means of
both groups differ.
So with the statistical analysis, also when taking a larger
sample, we correctly control the probability of a false positive result
at 5%.
The mean difference is again estimated more precisely
(fluctuating less around the real difference of 0).
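The control of false positives can be checked with a similar simulation sketch. Both groups are drawn from the same normal distribution, here assumed to mimic the female heights, so any significant result is a false positive. The function name `simFalsePositives` and the 5000 repeats are illustrative choices.

```r
set.seed(2024)

# Both groups are drawn from the same normal distribution (assumed to mimic
# female heights), so every significant result is a false positive.
simFalsePositives <- function(n, nSim = 5000, alpha = 0.05) {
  pvals <- replicate(nSim, {
    group1 <- rnorm(n, mean = 162.1, sd = 7.3)
    group2 <- rnorm(n, mean = 162.1, sd = 7.3)
    t.test(group1, group2, var.equal = TRUE)$p.value
  })
  mean(pvals < alpha)
}

fpr5  <- simFalsePositives(5)
fpr20 <- simFalsePositives(20)
c(n5 = fpr5, n20 = fpr20)
```

Both fractions should fluctuate around 0.05, regardless of the sample size.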
Control treatment
Captopril study: systolic blood pressure (SBP) before and after dosing captopril.
Paired t-test
data: SBPa and SBPb
t = -8.1228, df = 14, p-value = 1.146e-06
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-23.93258 -13.93409
sample estimates:
mean difference
-18.93333
[1] 20.56511
[1] 20.00357
[1] 9.027471
pre-test/post-test experiment
Good control is necessary
\(\rightarrow\) Placebo-controlled double-blind experiments
\(\rightarrow\) Injection of control animals with a blank containing the
same solvents, etc.
Replication
Within an experiment: enables us to estimate uncertainty / biological
variability
Between experiments: Any true finding should be reproducible
At which level do we have to replicate?
- Genuine replicates include all sources of variability: technical +
biological
- Technical replicates are important if you assess new
technologies
Francisella tularensis study (Ramond et al. 2015)
- Proteome of wild type F. tularensis (WT) vs ArgP-gene deleted F.
tularensis (knock-out, D8).
- Each bio-rep in technical triplicate on MS
- Illustration with 50S ribosomal protein L5 A0Q4J5
If we analyse the original data with a t-test, we act as if we had
9 genuine repeats in each condition.
The effect of interest is between bio-reps, so this is not a block design.
Because each genuine repeat has the same number of technical repeats, we
can first average over the technical repeats.
If that is not the case, more complex data analysis methods have
to be used, e.g. mixed models!
Caution: never average over genuine repeats/experimental
units!!!
Correct analysis
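# Average the technical repeats: one mean log2 intensity per bio-rep
lmBiorep <- lm(intensityLog2 ~ -1 + biorep, franc)
francSum <- data.frame(
  genotype = rep(c("D8", "WT"), each = 3) %>% as.factor %>% relevel("WT"),
  intensityLog2 = lmBiorep$coef)
francSum
# Two-sample t-test on the 3 versus 3 bio-rep averages
t.test(intensityLog2 ~ genotype, francSum, var.equal = TRUE)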
Two Sample t-test
data: intensityLog2 by genotype
t = 3.662, df = 4, p-value = 0.02154
alternative hypothesis: true difference in means between group WT and group D8 is not equal to 0
95 percent confidence interval:
0.05426166 0.39452610
sample estimates:
mean in group WT mean in group D8
27.58266 27.35826
Wrong analysis
Two Sample t-test
data: intensityLog2 by genotype
t = -4.5904, df = 16, p-value = 0.0003017
alternative hypothesis: true difference in means between group D8 and group WT is not equal to 0
95 percent confidence interval:
-0.3280223 -0.1207654
sample estimates:
mean in group D8 mean in group WT
27.35826 27.58266
- Result much more significant because we erroneously act as if we
have a 9 by 9 comparison.
Simulation under \(H_0\)
- We use data to estimate variance components: technical and
biological variance.
- We simulate 10000 experiments with similar design from a normal
distribution under the assumption that group means are the same.
- We analyse them with both the correct and the wrong analysis
Probability of a false positive when using a 5% significance level
for the correct and the wrong analysis:
mean(resCorrect$pvalue < 0.05)
[1] 0.0452
mean(resWrong$pvalue < 0.05)
[1] 0.156
- We no longer control the false positives at the \(\alpha = 5\)% level!
- We report too many false positives!
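The comparison of both analyses under the null hypothesis can be sketched as follows. The variance components `sdBio` and `sdTech`, the common mean of 27.5 and the 2000 repeats are hypothetical values chosen for illustration; they are not the estimates obtained from the Francisella data.

```r
set.seed(99)

# Hypothetical variance components (illustrative values, not the estimates
# from the Francisella data)
sdBio  <- 0.10   # standard deviation between bio-reps
sdTech <- 0.05   # technical standard deviation within a bio-rep
nBio   <- 3      # bio-reps per genotype
nTech  <- 3      # technical repeats per bio-rep

simOne <- function() {
  genotypeBio <- rep(c("WT", "D8"), each = nBio)
  # Same mean in both genotypes: the null hypothesis is true
  bioMean <- rnorm(2 * nBio, mean = 27.5, sd = sdBio)
  dat <- data.frame(
    genotype = rep(genotypeBio, each = nTech),
    biorep   = rep(seq_len(2 * nBio), each = nTech),
    y        = rep(bioMean, each = nTech) + rnorm(2 * nBio * nTech, sd = sdTech)
  )
  # Correct: average the technical repeats first, then a 3 versus 3 t-test
  avg <- aggregate(y ~ biorep + genotype, data = dat, FUN = mean)
  pCorrect <- t.test(y ~ genotype, data = avg, var.equal = TRUE)$p.value
  # Wrong: treat all 9 measurements per genotype as independent
  pWrong <- t.test(y ~ genotype, data = dat, var.equal = TRUE)$p.value
  c(correct = pCorrect, wrong = pWrong)
}

pvals <- replicate(2000, simOne())
fprCorrect <- mean(pvals["correct", ] < 0.05)
fprWrong   <- mean(pvals["wrong", ] < 0.05)
c(correct = fprCorrect, wrong = fprWrong)
```

How strongly the wrong analysis inflates the false positive rate depends on the ratio of biological to technical variability, so the exact fractions will differ from those reported on the slide.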
Implications of different technologies
Slide courtesy of Lisa Breckels
Power analysis
- How many replicates do we need?
- Differs for different experimental designs
- Requires knowledge of total variability / magnitude of sources of
variability
- Use literature or pilot experiment
- Can be done using simulations
Mouse example
In 2021, Choa et al. published that the cytokine thymic stromal
lymphopoietin (TSLP) induces fat loss through sebum secretion.
Suppose that you would like to set up a similar study to test whether the
cytokine interleukin 25 (IL) also has a beneficial effect.
You plan to set up a study with a control group of high fat diet
(HFD) fed mice and a treatment group that receives the HFD and
IL.
What sample size do you need to pick up the effect of the
treatment?
How will we analyse the data of this experiment?
- Two groups: two sample t-test
\(H_0\): The average weight difference is equal to zero
\(H_1\): The average weight difference is different from zero
Power? Is design specific. For two-group comparison it depends on
- Real weight difference between the group means.
- Variability of the weight measurements
- Significance level \(\alpha\)
- Sample size in both groups
We can estimate the power if
- The assumptions of the model are met: weights are normally
distributed with same variance
and we know
- Standard deviation of the weight measurements around their mean
for HFD-fed mice
- Real effect size in the population
- Sample sizes in each group
Use data from a previous experiment to get insight in mice data
- Suppose that we have access to the data of a preliminary experiment
(e.g. provided by Karen Svenson via Gary Churchill and Dan Gatti and
partially funded by P50 GM070683 on PH525x)
In the experiment we have data from two diets:
- Regular cereal- and grain-based diet (Chow)
- High Fat (hf)
We can use the hf mice as input for our power analysis.
- The data of hf mice seem to be normally distributed
- The mean weight is 26.8g
- The SD of the weight is 4.1g
Effect size?
The alternative hypothesis is complex.
It includes all possible effects!
In order to do the power analysis we will have to choose a
minimum effect size that we would like to detect.
Suppose that we would like to pick up a weight change of at least
10%.
delta <- abs(round(miceSum$mean[2] * .1, 1))
delta
[1] 2.7
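For a two-sample t-test, the power can also be computed analytically with `power.t.test` from the `stats` package. The sketch below plugs in the standard deviation of 4.1g and the minimum effect size of 2.7g from above; the target power of 80% is an assumption made for illustration.

```r
# Analytical power calculation for a two-sample t-test (stats::power.t.test).
# sd and delta are taken from the slides; 80% power is an assumed target.
res <- power.t.test(delta = 2.7, sd = 4.1, sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")
nPerGroup <- ceiling(res$n)   # round up to whole mice per group
nPerGroup

# Power that a 3 versus 3 design would reach for the same effect size
power3 <- power.t.test(n = 3, delta = 2.7, sd = 4.1, sig.level = 0.05)$power
power3
```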
Simulation based power analysis
We can use simulations to assess the power.
- E.g. for a 3 by 3 comparison where there is a weight difference of
3g.
- We simulate 3 observations from a normal with mean 26.8g and sd 4.1g
and 3 observations from a normal with mean 29.8g and sd 4.1g.
- We perform a t-test and assess if we can conclude that there is a
significant difference in the average weight between both groups based
on this sample
- We repeat it many times and calculate the probability to find a
significant difference.
We can repeat the same procedure for many different sample sizes…
We can also repeat that for different effect sizes
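The procedure above can be sketched in base R. The mean of 26.8g and the sd of 4.1g come from the hf mice; the function name `simPowerMice`, the grid of sample sizes, the effect sizes of 3g and 5g and the 2000 repeats are illustrative assumptions.

```r
set.seed(7)

# Simulated power of a two-sample t-test with n mice per group.
# mu and sigma come from the hf mice; delta is the assumed weight difference.
simPowerMice <- function(n, delta, mu = 26.8, sigma = 4.1,
                         nSim = 2000, alpha = 0.05) {
  pvals <- replicate(nSim, {
    control <- rnorm(n, mean = mu, sd = sigma)
    treated <- rnorm(n, mean = mu + delta, sd = sigma)
    t.test(control, treated, var.equal = TRUE)$p.value
  })
  mean(pvals < alpha)
}

# Power for a range of sample sizes and two effect sizes
sampleSizes <- c(3, 5, 10, 20, 40)
powerGrid <- expand.grid(n = sampleSizes, delta = c(3, 5))
powerGrid$power <- mapply(simPowerMice, n = powerGrid$n, delta = powerGrid$delta)
powerGrid
```

Plotting `power` against `n` for each `delta` gives a power curve from which a sample size can be read off for a chosen target power.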
