The gene is extremely significantly DE between groups.
set.seed(65)
nPerGroup <- 8
y <- c(rpois(n=nPerGroup, lambda = 7),
rpois(n=nPerGroup, lambda = 11))
group <- rep(1:2, each = nPerGroup)
plot(group, y)
# Poisson GLM, no library size: signifcantly DE
m <- glm(y ~ factor(group),
family = "poisson")
summary(m)
##
## Call:
## glm(formula = y ~ factor(group), family = "poisson")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7195 -0.7771 0.0556 0.8105 1.1133
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.6582 0.1543 10.747 < 2e-16 ***
## factor(group)2 0.7621 0.1869 4.078 4.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 30.916 on 15 degrees of freedom
## Residual deviance: 13.055 on 14 degrees of freedom
## AIC: 78.587
##
## Number of Fisher Scoring iterations: 4
Since library sizes are different between groups, accounting for library size results in the gene no longer being DE at the 5% significance level. Not correcting for sequencing depth would thus result in spurious results.
# Suppose library sizes are different between groups
libSize <- c(rpois(n=nPerGroup, lambda = 1e5),
rpois(n=nPerGroup, lambda = 1.5e5))
# Poisson GLM with library size offset: no longer significantly DE on 5% level.
m <- glm(y ~ factor(group) + offset(log(libSize)),
family = "poisson")
summary(m)
##
## Call:
## glm(formula = y ~ factor(group) + offset(log(libSize)), family = "poisson")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.73231 -0.78231 0.05288 0.81088 1.11070
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.8536 0.1543 -63.859 <2e-16 ***
## factor(group)2 0.3545 0.1869 1.897 0.0578 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 16.851 on 15 degrees of freedom
## Residual deviance: 13.111 on 14 degrees of freedom
## AIC: 78.643
##
## Number of Fisher Scoring iterations: 4
You have previously worked with the dataset from Hammer et al. (2010). Explore if there’s confounding between library size and biological group. Perform an analysis with and without accounting for library size. What is the impact of accounting for sequencing depth?
In RNA-seq, we are working with count data. Count data has an inherent mean-variance structure, where the variance is positively associated with the mean. For example, assume a random variable \(Y_{i} \sim Poi(\mu)\), where \(i \in \{1, \ldots, n\}\). We can visualize the mean-variance relationship by simulating different random variables \(\mathbf{Y}_1, \mathbf{Y}_2, \mathbf{Y}_3\). Say,
\[ Y_{1i} \sim Poi(\mu_1 = 5) \] \[ Y_{2i} \sim Poi(\mu_2 = 50) \] \[ Y_{3i} \sim Poi(\mu_3 = 500) \]
Below, we enforce the same mean for all three random variables, i.e. we create the set of random variables \(\{ Y_{1i} \times 10, Y_{2i}, Y_{3i} / 10 \}\), and plot the density of each variable. It is clear that the distributions are drastically different. This is because, although they all have the same mean, their variance is different, and indeed this is because of the mean-variance relationship that was impliclty used when simulating the data.
library(ggplot2)
set.seed(50)
n <- 1e4
y1 <- rpois(n = n, lambda = 5)
y2 <- rpois(n = n, lambda = 50)
y3 <- rpois(n = n, lambda = 500)
df <- data.frame(y = c(y1 * 10, y2, y3 / 10),
gr = factor(rep(1:3, each = n)))
ggplot(df, aes(x=y)) +
geom_density() +
facet_wrap(. ~ gr) +
geom_vline(xintercept = 50, col="red") +
theme_bw()
But didn’t you say a higher mean was associated with a higher variance?
The critical reader would rightly so be somewhat confused at this point. Indeed, above we wrote that “the variance is positively associated with the mean”, but the density plots are showing that \(\mathbf{Y}_3\), which was simulated to have the highest mean, has the smallest variance. Based on the simulation, we have that \(Var(\mathbf{Y}_3) = 500\). However, \(Var(\mathbf{Y}_3 / 10) = \frac{1}{100} Var(\mathbf{Y}_3) = 5\). Thus, \(\mathbf{Y}_3\) indeed will have a small variance upon rescaling.
One could also say that the relative certainty of the mean of a Poisson random variable increases with its mean. With relative certainty, we mean the relative deviation from the mean. Indeed,
\[ \frac{\arg \max_{i} Y_{3i}}{\arg \min_{i} Y_{3i}} \le \frac{\arg \max_{i} Y_{2i}}{\arg \min_{i} Y_{2i}} \le \frac{\arg \max_{i} Y_{1i}}{\arg \min_{i} Y_{1i}}. \]
We show this below using our simulated data.
# note that logX - logY = log X/Y.
diff(log(range(y1) + 1e-10))
## [1] 25.79844
diff(log(range(y2) + 1e-10))
## [1] 1.163151
diff(log(range(y3) + 1e-10))
## [1] 0.3340655