Introduction
We here make use of the publication of Anna Cuomo et al. (last author Oliver Stegle), which we will refer to as the iPSC dataset
. The paper that describes this dataset can be found using this link.
In the experiment, the authors harvested induced pluripotent stem cells (iPSCs) from 125 healthy human donors. These cells were used to study the endoderm differentiation process. In this process, iPSCs differentiate to endoderm cells, a process which takes approximately three days. As such, the authors cultered the iPSCs cell lines and allowed for differentiation for three days. During the experiment, cells were harvested at four different time points: day0 (directly at to incubation), day1, day2 and day3. Knowing the process of endoderm differentiation, these time points should correspond with different cell types: day0 are (undifferentiated) iPSCs, day1 are mesendoderm cells, day2 are “intermediate” cells and day3 are fully differentiated endoderm cells.
This dataset was generated using the SMART-Seq2 scRNA-seq protocol.
The final goal of the experiment was to characterize population variation in the process of endoderm differentiation.
Download data
For this lab session, we will work with a subset of the data, i.e., the data for the first (alphabetically) 15 patients in the experiment. These can be downloaded through the belnet filesender link provided through email, https://filesender.belnet.be/?s=download&token=eb8136df-67d3-4869-b2a9-f65767054e81.
The data original (125 patient) could be downloaded from Zenodo. At the bottom of this web-page, we can download the files raw_counts.csv.zip
and cell_metadata_cols.tsv
and store these files locally. We do not recommend doing this during the lab session, to avoid overloading the system.
Import data
First we read in the count matrix:
library(SingleCellExperiment)
sce <- readRDS("/Users/jg/Desktop/sce_15_cuomo.rds") # change to YOUR path!
Obtaining and including rowData
The rowData
slot of a SingleCellExperiment
object allows for storing information on the features, i.e. the genes, in a dataset. In our object, the rowData
slot currently contains the following:
To improve our gene-level information, we may:
Split V1
into two columns, one with the ENSEMBL ID and the other with the gene symbol.
Display which chromosome the gene is located
Many more options are possible, but are not necessary for us right now.
rowData(sce) <- data.frame(Ensembl = gsub("_.*", "", rowData(sce)$V1),
Symbol = gsub("^[^_]*_", "", rowData(sce)$V1))
head(rowData(sce))
library("biomaRt")
ensembl75 <- useEnsembl(biomart = 'genes',
dataset = 'hsapiens_gene_ensembl',
version = 75)
GeneInfo <- getBM(attributes = c("ensembl_gene_id", # To match with rownames SCE
"chromosome_name"), # Info on chromose
mart = ensembl75)
GeneInfo <- GeneInfo[match(rowData(sce)$Ensembl, GeneInfo$ensembl_gene_id),]
rowData(sce) <- cbind(rowData(sce), GeneInfo)
head(rowData(sce))
all(rowData(sce)$Ensembl == rowData(sce)$ensembl_gene_id)
# identical, as desired, so we could optionally remove one of the two
Quality control
Calculate QC variables
library(scater)
# check ERCC spike-in transcripts
sum(grepl("^ERCC-", rowData(sce)$Symbol)) # no spike-in transcripts available
is.mito <- grepl("^MT", rowData(sce)$chromosome_name)
sum(is.mito) # 13 mitochondrial genes
df <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
head(df)
## add the QC variables to sce object
colData(sce) <- cbind(colData(sce), df)
Exploratory data analysis
In the figure below, we see that several cells have a very low number of expressed genes, and where most of the molecules are derived from mitochondrial genes. This indicates likely damaged cells, presumably because of loss of cytoplasmic RNA from perforated cells, so we should remove these for the downstream analysis.
# Number of genes vs library size
plotColData(sce, x = "sum", y="detected", colour_by="day")
# Mitochondrial genes
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by="day")
QC using adaptive thresholds
Below, we remove cells that are outlying with respect to
- A low sequencing depth (number of UMIs);
- A low number of genes detected;
- A high percentage of reads from mitochondrial genes.
We remove a total of \(301\) cells, mainly due to low sequencing depth and low number of genes detected.
lowLib <- isOutlier(df$sum, type="lower", log=TRUE)
lowFeatures <- isOutlier(df$detected, type="lower", log=TRUE)
highMito <- isOutlier(df$subsets_Mito_percent, type="higher")
table(lowLib)
table(lowFeatures)
table(highMito)
discardCells <- (lowLib | lowFeatures | highMito)
table(discardCells)
colData(sce)$discardCells <- discardCells
# visualize cells to be removed
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by = "discardCells")
plotColData(sce, x = "sum", y="detected", colour_by="discardCells")
# visualize cells to be removed
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by = "donor")
plotColData(sce, x = "sum", y="detected", colour_by="donor")
# visualize cells to be removed
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by = "experiment")
plotColData(sce, x = "sum", y="detected", colour_by="experiment")
table(sce$donor, sce$discardCells)
table(sce$donor, sce$discardCells)/rowSums(table(sce$donor, sce$discardCells))
#fractions of removed cells per donor
Most removed cells (fraction) are from patients dixh
and babz
.
table(sce$experiment, sce$discardCells)
table(sce$experiment, sce$donor)
Most removed cells (fraction) are from patients dixh
and babz
. Most low library sizes seem to come from patient dixh
; for patient babz
the effect is less pronounced.
plotColData(sce[,sce$donor=="dixh"], x = "sum", y="detected")
plotColData(sce[,sce$donor=="babz"], x = "sum", y="detected")
As such, we are mainly removing cells from specific patients and the respective batches in which they were sequenced. However, we want to be careful; we only want to remove technical artefacts, while retaining as much of the biology as possible. In our exploratory figure, we see that the cells we are removing based on the number of genes detected, are quite far apart from the bulk of the data cloud; as such, these cells are indeed suspicious. For the criterion of library size, we see that the cells removed there are still strongly connected to the data cloud. As such, we may want to relax the filtering criterion there a little bit. When we think of how the adaptive threshold strategy works, we may want to remove cells that are 4MADs away from the center, rather than the default 3 MADs.
# previously
lowLib <- isOutlier(df$sum, type="lower", log=TRUE)
table(lowLib)
# after seeing appropriate exploratory figure
lowLib <- isOutlier(df$sum, nmads=4, type="lower", log=TRUE)
table(lowLib)
discardCells <- (lowLib | lowFeatures | highMito)
table(discardCells)
colData(sce)$discardCells <- discardCells
Note that these steps are not exact; different analysts will come with different filtering criteria for many of the steps. The key ideas are that we let appropriate exploratory figures guide us to make reasonable choices; i.e., we look at the data rather than blindly following a standardized pipeline that may work well in many cases, but maybe not our particular dataset.
# remove cells identified using adaptive thresholds
sce <- sce[, !colData(sce)$discardCells]
Identifying and removing empty droplets
This does not make much sense for plate-based data! While we could imagine the presence of empty plate wells for the SMART-Seq2 experiment, these are typically detected and removed upstream of our analysis using laser technology.
Identifying and removing doublets
Again, this step typically does not make much sense for plate-based data. While we could imagine the presence of plate wells in which more than two cells are present for the SMART-Seq2 experiment, these are again typically detected and removed upstream of our analysis using laser technology.
If we would use scDblFinder to detect doublet cells on this dataset, we would see that only a very small fraction of the cells would be removed. These removed cells could correspond to actual doublets that were missed by the upstream doublet detection technology, or more likely could correspond to false positive doublets of the scDblFinder
doublet detection procedure.
Normalization
For normalization, the size factors \(s_i\) computed here are simply scaled library sizes:
\[ N_i = \sum_g Y_{gi} \] \[ s_i = N_i / \bar{N}_i \]
sce <- logNormCounts(sce)
# note we also returned log counts: see the additional logcounts assay.
sce
# you can extract size factors using
sf <- librarySizeFactors(sce)
mean(sf) # equal to 1 due to scaling.
plot(x= log(colSums(assays(sce)$counts)),
y=sf)
— end lab session 1 —
Visualize the effect of normalization using MD-plots.
In order to only work with cells that could be considered technical replicates, we should select cells coming from the same biological group. We can do this approximately by selecting cells from the same sampling day, the same donor and the same sequencing experiment.
table(sce$donor, sce$day)
table(sce$experiment, sce$day)
table(sce$experiment, sce$donor)
Below, we will visualize the normalization occurring between two cells of the same cell type - donor/experiment combination (which could be considered technical repeats):
cs <- colSums(assays(sce)$counts[,which(sce$day == "day2" &
sce$donor == "eipl")])
cs[order(cs, decreasing = TRUE)][c(1,10,100)]
Let’s take a look at how comparable two cells (replicates) of this biological group are. We will compare the cell with the highest library size with the cell that has the 10th and 100th highest library size using MD-plots (mean-difference plots, as introduced by Dudoit et al. (2002)), also sometimes referred to as MA-plots.
targetCells <- names(cs[order(cs, decreasing = TRUE)][c(1,10)])
M <- rowMeans(assays(sce)$counts[,targetCells])
D <- assays(sce)$counts[,targetCells[2]] / assays(sce)$counts[,targetCells[1]]
plot(x = log(M), y = log2(D),
pch = 16, cex=1/3,
main = paste0("Cell ", targetCells[2], " vs cell ", targetCells[1]),
xlab = "Log mean", ylab = "Log2 fold-change",
bty = 'l')
abline(h = 0, col="orange", lwd=2)
targetCells <- names(cs[order(cs, decreasing = TRUE)][c(1,100)])
M <- rowMeans(assays(sce)$counts[,targetCells])
D <- assays(sce)$counts[,targetCells[2]] / assays(sce)$counts[,targetCells[1]]
plot(x = log(M), y = log2(D),
pch = 16, cex=1/3,
main = paste0("Cell ", targetCells[2], " vs cell ", targetCells[1]),
xlab = "Log mean", ylab = "Log2 fold-change",
bty = 'l')
abline(h = 0, col="orange", lwd=2)
We see clear bias in the comparison of the 1st and 10th/100th most deeply sequenced cell from this day/patient combination. We see that the log f old-changes are biased downwards. This means that, on average, a gene is lower expressed in cell 1 versus cell 10 (and cell 100). Looking at the library sizes, we can indeed see that the library size for cell 1 is 1.200.691 counts, while it is only 701.550 counts for cell 10 and 468.762 counts for cell 100! This is a clear library size effect that we should take into account.
# normalize the count data using the previously computed "size factors"
assay(sce, "normed") <- normalizeCounts(sce,
log=FALSE,
size.factors=sf,
pseudo.count=0)
targetCells <- names(cs[order(cs, decreasing = TRUE)][c(1,10)])
M <- rowMeans(assays(sce)$normed[,targetCells])
D <- assays(sce)$normed[,targetCells[2]] / assays(sce)$normed[,targetCells[1]]
plot(x = log(M), y = log2(D),
pch = 16, cex=1/3,
main = paste0("Cell ", targetCells[2], " vs cell ", targetCells[1]),
xlab = "Log mean", ylab = "Log2 fold-change",
bty = 'l')
abline(h = 0, col="orange", lwd=2)
targetCells <- names(cs[order(cs, decreasing = TRUE)][c(1,100)])
M <- rowMeans(assays(sce)$normed[,targetCells])
D <- assays(sce)$normed[,targetCells[2]] / assays(sce)$normed[,targetCells[1]]
plot(x = log(M), y = log2(D),
pch = 16, cex=1/3,
main = paste0("Cell ", targetCells[2], " vs cell ", targetCells[1]),
xlab = "Log mean", ylab = "Log2 fold-change",
bty = 'l')
abline(h = 0, col="orange", lwd=2)
Feature selection
Highly variable genes
library(scran)
rownames(sce) <- rowData(sce)$Ensembl
dec <- modelGeneVar(sce)
head(dec)
fit <- metadata(dec)
plot(fit$mean, fit$var,
xlab="Mean of log-expression",
ylab="Variance of log-expression")
curve(fit$trend(x), col="dodgerblue", add=TRUE, lwd=2)
# get 10% highly variable genes
hvg <- getTopHVGs(dec,
prop=0.1)
head(hvg)
# plot these
plot(fit$mean, fit$var,
col = c("orange", "darkseagreen3")[(names(fit$mean) %in% hvg)+1],
xlab="Mean of log-expression",
ylab="Variance of log-expression")
curve(fit$trend(x), col="dodgerblue", add=TRUE, lwd=2)
legend("topleft",
legend = c("Selected", "Not selected"),
col = c("darkseagreen3", "orange"),
pch = 16,
bty='n')
High deviance genes
#BiocManager::install("scry")
library(scry)
sce <- devianceFeatureSelection(object = sce,
assay = "counts",
sorted = FALSE)
plot(sort(rowData(sce)$binomial_deviance, decreasing = TRUE),
type="l",
xlab="ranked genes",
ylab="binomial deviance",
main="Feature Selection with Deviance")
abline(v=2000, lty=2, col="red")
Seurat VST
library(Seurat)
rownames(sce) <- rowData(sce)$Ensembl
seurat_obj <- as.Seurat(sce)
seurat_obj # notice the "0 variable features"
seurat_obj <- Seurat::NormalizeData(seurat_obj,
normalization.method = "LogNormalize",
scale.factor = 10000)
seurat_obj <- FindVariableFeatures(object = seurat_obj,
selection.method = "vst")
seurat_obj # notice the "2000 variable features" (default)
head(VariableFeatures(seurat_obj)) # here they are
rm(seurat_obj)
Dimensionality reduction
The most basic DR
colData(sce)$day <- as.factor(colData(sce)$day)
day <- colData(sce)$day
par(bty='l')
plot(x = assays(sce)$counts[hvg[1],],
y = assays(sce)$counts[hvg[2],],
col = as.numeric(day),
pch = 16, cex = 1/3,
xlab = "Most informative gene",
ylab = "Second most informative gene",
main = "Cells colored acc to cell type")
Just by looking at the top two genes based on our feature selection criterion, we can already see some separation according to the cell type; the cells colored red seem to have low expression values for both genes, and are those solely retireved in the bottom left of the x-y plane.
Linear dimensionality reduction: PCA
PCA with feature selection
set.seed(1234)
sce <- runPCA(sce,
ncomponents=30,
subset_row=hvg)
PCA has been performed. The PCA information has been automatically stored in the reducedDim slot of the SingleCellExperiment object.
head(reducedDim(sce,
type="PCA"))
The plotPCA
function of the scater
package now allows us to visualize the cells in PCA space, based on the PCA information stored in our object:
plotPCA(sce,
colour_by = "day")
We observe a clear-cut clustering of the cells!
percent.var <- attr(reducedDim(sce), "percentVar")
plot(percent.var, log="y", xlab="PC", ylab="Variance explained (%)")
Here, retaining ±15PCs seems reasonable. If you really prefer a more data-driven way for determining this,
library(PCAtools)
chosen.elbow <- findElbowPoint(percent.var)
chosen.elbow
PCA without feature selection
set.seed(1234)
sceNoFS <- runPCA(sce,
ncomponents=30,
subset_row=1:nrow(sce))
plotPCA(sceNoFS, colour_by = "day")
rm(sceNoFS)
While we use more information to make this PCA plot (10.374 genes) as compared to the feature selected PCA plot (506 genes), we seem to retrieve less structure in the data. Indeed, we are here not able to distinguish between day2 and day3 cells. This is the power of feature selection, an increase in the signal-to-noise ratio!
A generalization of PCA for exponential family distributions.
library(glmpca)
set.seed(211103)
poipca <- glmpca(Y = assays(sce)$counts[hvg,],
L = 2,
fam = "poi",
minibatch = "stochastic")
reducedDim(sce, "PoiPCA") <- poipca$factors
plotReducedDim(sce,
dimred="PoiPCA",
colour_by = "day")
Non-linear dimensionality reduction: T-SNE
set.seed(564654)
sce <- runTSNE(sce,
dimred = 'PCA',
external_neighbors=TRUE)
plotTSNE(sce,
colour_by = "day")
plotTSNE(sce,
colour_by = "donor")
plotTSNE(sce,
colour_by = "experiment")
Non-linear dimensionality reduction: UMAP
set.seed(65187)
sce <- runUMAP(sce,
dimred = 'PCA',
external_neighbors = TRUE)
plotUMAP(sce,
colour_by = "day")
plotUMAP(sce,
colour_by = "donor")
plotUMAP(sce,
colour_by = "experiment")
---
title: 'Normalization, feature selection and dimension reduction for the Cuomo dataset'
author: "Koen Van den Berge and Jeroen Gilis"
date: "30/11/2021"
output: 
  html_document:
    code_download: true
    toc: true
    toc_float: true
---

# Introduction

We here make use of the publication of Anna Cuomo et al.
(last author Oliver Stegle), which we will refer to as the `iPSC dataset`. The 
paper that describes this dataset can be found using this 
[link](https://www.nature.com/articles/s41467-020-14457-z).

In the experiment, the authors harvested induced pluripotent stem cells (iPSCs)
from 125 healthy human donors. These cells were used to study the endoderm 
differentiation process. In this process, iPSCs differentiate to endoderm cells,
a process which takes approximately three days. As such, the authors 
cultered the iPSCs cell lines and allowed for differentiation for three days. 
During the experiment, cells were harvested at four different time points: 
day0 (directly at to incubation), day1, day2 and day3. Knowing the process of 
endoderm differentiation, these time points should correspond with different 
cell types: day0 are (undifferentiated) iPSCs, day1 are mesendoderm cells, day2
are "intermediate" cells and day3 are fully differentiated endoderm cells.

This dataset was generated using the **SMART-Seq2** scRNA-seq protocol.

The final goal of the experiment was to characterize population variation in the
process of endoderm differentiation.

# Download data

For this lab session, we will work with a subset of the data, i.e., the data
for the first (alphabetically) 15 patients in the experiment. These can be
downloaded through the *belnet filesender* link provided through email,
https://filesender.belnet.be/?s=download&token=eb8136df-67d3-4869-b2a9-f65767054e81.

The data original (125 patient) could be downloaded from 
[Zenodo](https://zenodo.org/record/3625024#.YWfahtlBxB1). At the bottom of this
web-page, we can download the files `raw_counts.csv.zip` and 
`cell_metadata_cols.tsv` and store these files locally. We do not recommend 
doing this during the lab session, to avoid overloading the system.

# Import data

First we read in the count matrix:

```{r, eval=FALSE, message=FALSE, warning=FALSE}
library(SingleCellExperiment)
sce <- readRDS("/Users/jg/Desktop/sce_15_cuomo.rds") # change to YOUR path!
```

```{r, eval=FALSE}
sce
```

# Explore metadata

Exploration of the metadata is essential to get a better idea of what the
experiment was about and how it was organized.

```{r, eval=FALSE}
colData(sce)[1:5,1:10]
colnames(colData(sce))
```

As stated in the paper, cells were sampled on 4 time points. Each of these 
time points is expected to correspond with different cell types (day0 = iPSC,
day1 = mesendoderm, day2 = intermediate and day3 = endoderm).

```{r, eval=FALSE}
table(colData(sce)$day)
```

As stated in the paper, cells were harvested from 125 patients. Here, we are
working on a subset with 15 patients. The number of cells harvested per patient 
(over all time points) ranges from 31 to 637.

```{r, eval=FALSE}
length(table(colData(sce)$donor)) # number of donors
range(table(colData(sce)$donor)) # cells per donor
```

Below, we look how many cells are harvest per patent and per time point.

```{r, eval=FALSE}
table(colData(sce)$donor,colData(sce)$day)
```

We see that for many patients the data is complete, i.e. cells were sampled
on all time points.

Practically, the cells were prepared in 28 batches. Since we here only look
at a subset of the data, we see that only 14 of these batches are represented 
here.

```{r, eval=FALSE}
length(table(colData(sce)$experiment))
table(colData(sce)$experiment, colData(sce)$day)
```

# Obtaining and including rowData

The `rowData` slot of a `SingleCellExperiment` object allows for storing 
information on the features, i.e. the genes, in a dataset. In our object,
the `rowData` slot currently contains the following:

```{r, eval=FALSE}
head(rowData(sce))
```

To improve our gene-level information, we may:

1. Split `V1` into two columns, one with the ENSEMBL ID and the other with 
the gene symbol.

2. Display which chromosome the gene is located

Many more options are possible, but are not necessary for us right now.

```{r, eval=FALSE}
rowData(sce) <- data.frame(Ensembl = gsub("_.*", "", rowData(sce)$V1),
                           Symbol = gsub("^[^_]*_", "", rowData(sce)$V1))
head(rowData(sce))
```


```{r, eval=FALSE, message=FALSE, warning=FALSE}
library("biomaRt")
ensembl75 <- useEnsembl(biomart = 'genes', 
                        dataset = 'hsapiens_gene_ensembl',
                        version = 75)

GeneInfo <- getBM(attributes = c("ensembl_gene_id", # To match with rownames SCE
                                 "chromosome_name"), # Info on chromose
                  mart = ensembl75)
GeneInfo <- GeneInfo[match(rowData(sce)$Ensembl, GeneInfo$ensembl_gene_id),]

rowData(sce) <- cbind(rowData(sce), GeneInfo)
head(rowData(sce))
all(rowData(sce)$Ensembl == rowData(sce)$ensembl_gene_id) 
# identical, as desired, so we could optionally remove one of the two
```

# Filtering non-informative genes

Let us first try the very simple and very lenient filtering criterion that we
adopted for the Macosko dataset.

```{r, eval=FALSE}
keep <- rowSums(assays(sce)$counts > 0) > 10
table(keep)

sce <- sce[keep,]
```

We see that this filtering strategy does not remove any genes for this dataset.
In general, datasets from plate-based scRNA-seq dataset have a far higher
sequencing depth than data from droplet-based protocols. As requiring a minimum
expression of 1 count in at least 10 cells is a very lenient criterion if we 
consider that we have 36.000 cells, we should consider adopting a more stringent
filtering criterium, like the `filterByExpr` from `edgeR`:

```{r, eval=FALSE, message=FALSE, warning=FALSE}
library(edgeR)

table(colData(sce)$day)

keep2 <- edgeR::filterByExpr(y=sce,
                             group = colData(sce)$day,
                             min.count = 5,
                             min.prop = 0.4)
table(keep2)

sce <- sce[keep2,]
```

```{r, eval=FALSE}
xas <- rowSums(assay(sce) > 0)
yas <- rowSums(assay(sce))


plot(x = xas,
     y = yas,
     log = "y",
     pch = 19,
     cex=0.3,
     col = as.factor(keep2))
```


# Quality control

## Calculate QC variables

```{r, eval=FALSE}
library(scater)

# check ERCC spike-in transcripts
sum(grepl("^ERCC-", rowData(sce)$Symbol)) # no spike-in transcripts available

is.mito <- grepl("^MT", rowData(sce)$chromosome_name)
sum(is.mito) # 13 mitochondrial genes

df <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
head(df)

## add the QC variables to sce object
colData(sce) <- cbind(colData(sce), df)
```

## Exploratory data analysis

In the figure below, we see that several cells have a very low number of 
expressed genes, and where most of the molecules are derived from 
mitochondrial genes. This indicates likely damaged cells, presumably because 
of loss of cytoplasmic RNA from perforated cells, so we should remove these for 
the downstream analysis.

```{r, eval=FALSE}
# Number of genes vs library size
plotColData(sce, x = "sum", y="detected", colour_by="day") 

# Mitochondrial genes
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by="day")
```

## QC using adaptive thresholds

Below, we remove cells that are outlying with respect to

 1. A low sequencing depth (number of UMIs);
 2. A low number of genes detected;
 3. A high percentage of reads from mitochondrial genes.
 
We remove a total of $301$ cells, mainly due to low sequencing depth and
low number of genes detected.

```{r, eval=FALSE}
lowLib <- isOutlier(df$sum, type="lower", log=TRUE)
lowFeatures <- isOutlier(df$detected, type="lower", log=TRUE)
highMito <- isOutlier(df$subsets_Mito_percent, type="higher")

table(lowLib)
table(lowFeatures)
table(highMito)

discardCells <- (lowLib | lowFeatures | highMito)
table(discardCells)
colData(sce)$discardCells <- discardCells

# visualize cells to be removed
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by = "discardCells")
plotColData(sce, x = "sum", y="detected", colour_by="discardCells")
```

```{r, eval=FALSE}
# visualize cells to be removed
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by = "donor")
plotColData(sce, x = "sum", y="detected", colour_by="donor")
```

```{r, eval=FALSE}
# visualize cells to be removed
plotColData(sce, x = "detected", y="subsets_Mito_percent", colour_by = "experiment")
plotColData(sce, x = "sum", y="detected", colour_by="experiment")
```

```{r, eval=FALSE}
table(sce$donor, sce$discardCells)
table(sce$donor, sce$discardCells)/rowSums(table(sce$donor, sce$discardCells))
#fractions of removed cells per donor
```

Most removed cells (fraction) are from patients `dixh` and `babz`.

```{r, eval=FALSE}
table(sce$experiment, sce$discardCells)
table(sce$experiment, sce$donor)
```

Most removed cells (fraction) are from patients `dixh` and `babz`.
Most low library sizes seem to come from patient `dixh`; for patient `babz`
the effect is less pronounced.

```{r, eval=FALSE}
plotColData(sce[,sce$donor=="dixh"], x = "sum", y="detected")
plotColData(sce[,sce$donor=="babz"], x = "sum", y="detected")
```

As such, we are mainly removing cells from specific patients and the respective
batches in which they were sequenced. However, we want to be careful; we only
want to remove technical artefacts, while retaining as much of the biology as
possible. In our exploratory figure, we see that the cells we are removing based
on the number of genes detected, are quite far apart from the bulk of the data
cloud; as such, these cells are indeed suspicious. For the criterion of
library size, we see that the cells removed there are still strongly connected
to the data cloud. As such, we may want to relax the filtering criterion there a
little bit. When we think of how the adaptive threshold strategy works, we
may want to remove cells that are 4MADs away from the center, rather than
the default 3 MADs.

```{r, eval=FALSE}
# previously
lowLib <- isOutlier(df$sum, type="lower", log=TRUE)
table(lowLib)

# after seeing appropriate exploratory figure
lowLib <- isOutlier(df$sum, nmads=4, type="lower", log=TRUE)
table(lowLib)

discardCells <- (lowLib | lowFeatures | highMito)
table(discardCells)
colData(sce)$discardCells <- discardCells
```

Note that these steps are not exact; different analysts will come with different
filtering criteria for many of the steps. The key ideas are that
we let appropriate exploratory figures guide us to make reasonable choices;
i.e., we look at the data rather than blindly following a standardized pipeline
that may work well in many cases, but maybe not our particular dataset.

```{r, eval=FALSE}
# remove cells identified using adaptive thresholds
sce <- sce[, !colData(sce)$discardCells]
```

## Identifying and removing empty droplets

This does not make much sense for plate-based data! While we could imagine the
presence of empty plate wells for the SMART-Seq2 experiment, these are typically
detected and removed upstream of our analysis using laser technology.

## Identifying and removing doublets

Again, this step typically does not make much sense for plate-based 
data. While we could imagine the presence of plate wells in which more than two
cells are present for the SMART-Seq2 experiment, these are again typically
detected and removed upstream of our analysis using laser technology.

If we would use
[scDblFinder](https://bioconductor.org/packages/3.14/bioc/html/scDblFinder.html) 
to detect doublet cells on this dataset, we would see that only a very small 
fraction of the cells would be removed. These removed cells could correspond to 
actual doublets that were missed by the upstream doublet detection technology,
or more likely could correspond to false positive doublets of the `scDblFinder`
doublet detection procedure.

# Normalization

For normalization, the size factors $s_i$ computed here are simply scaled 
library sizes:

\[ N_i = \sum_g Y_{gi} \]
\[ s_i = N_i / \bar{N}_i \]

```{r, eval=FALSE}
sce <- logNormCounts(sce)

# note we also returned log counts: see the additional logcounts assay.
sce

# you can extract size factors using
sf <- librarySizeFactors(sce)
mean(sf) # equal to 1 due to scaling.
plot(x= log(colSums(assays(sce)$counts)), 
     y=sf)
```

---

--- end lab session 1 ---

---

Visualize the effect of normalization using MD-plots.

In order to only work with cells that could be considered technical replicates,
we should select cells coming from the same biological group. We can do this
approximately by selecting cells from the same sampling day, the same 
donor and the same sequencing experiment.

```{r, eval=FALSE}
table(sce$donor, sce$day)
table(sce$experiment, sce$day)
table(sce$experiment, sce$donor)
```

Below, we will visualize the normalization occurring between two cells of the 
same cell type - donor/experiment combination (which could be considered 
technical repeats): 

```{r, eval=FALSE}
cs <- colSums(assays(sce)$counts[,which(sce$day == "day2" &
                                        sce$donor == "eipl")])
cs[order(cs, decreasing = TRUE)][c(1,10,100)]
```

Let’s take a look at how comparable two cells (replicates) of this biological group
are. We will compare the cell with the highest library size with the cell that 
has the 10th and 100th highest library size using MD-plots (mean-difference 
plots, as introduced by 
[Dudoit et al. (2002)](https://www.jstor.org/stable/24307038?seq=1#metadata_info_tab_contents)), 
also sometimes referred to as MA-plots.

```{r, eval=FALSE}
targetCells <- names(cs[order(cs, decreasing = TRUE)][c(1,10)])

M <- rowMeans(assays(sce)$counts[,targetCells])
D <- assays(sce)$counts[,targetCells[2]] / assays(sce)$counts[,targetCells[1]]
plot(x = log(M), y = log2(D),
     pch = 16, cex=1/3,
     main = paste0("Cell ", targetCells[2], " vs cell ", targetCells[1]),
     xlab = "Log mean", ylab = "Log2 fold-change",
     bty = 'l')
abline(h = 0, col="orange", lwd=2)

targetCells <- names(cs[order(cs, decreasing = TRUE)][c(1,100)])

M <- rowMeans(assays(sce)$counts[,targetCells])
D <- assays(sce)$counts[,targetCells[2]] / assays(sce)$counts[,targetCells[1]]
plot(x = log(M), y = log2(D),
     pch = 16, cex=1/3,
     main = paste0("Cell ", targetCells[2], " vs cell ", targetCells[1]),
     xlab = "Log mean", ylab = "Log2 fold-change",
     bty = 'l')
abline(h = 0, col="orange", lwd=2)
```

We see clear bias in the comparison of the 1st and 10th/100th most deeply 
sequenced cell from this day/patient combination. We see that the log f
old-changes are biased downwards. This means that, on average, a gene is lower 
expressed in cell 1 versus cell 10 (and cell 100). Looking at the library sizes, 
we can indeed see that the library size for cell 1 is 1.200.691 counts, while it
is only 701.550 counts for cell 10 and 468.762 counts for cell 100! This is a
clear library size effect that we should take into account.

```{r, eval=FALSE}
# normalize the count data using the previously computed "size factors"
assay(sce, "normed") <- normalizeCounts(sce, 
                                        log=FALSE,
                                        size.factors=sf, 
                                        pseudo.count=0)

targetCells <- names(cs[order(cs, decreasing = TRUE)][c(1,10)])

M <- rowMeans(assays(sce)$normed[,targetCells])
D <- assays(sce)$normed[,targetCells[2]] / assays(sce)$normed[,targetCells[1]]
plot(x = log(M), y = log2(D),
     pch = 16, cex=1/3,
     main = paste0("Cell ", targetCells[2], " vs cell ", targetCells[1]),
     xlab = "Log mean", ylab = "Log2 fold-change",
     bty = 'l')
abline(h = 0, col="orange", lwd=2)

targetCells <- names(cs[order(cs, decreasing = TRUE)][c(1,100)])

M <- rowMeans(assays(sce)$normed[,targetCells])
D <- assays(sce)$normed[,targetCells[2]] / assays(sce)$normed[,targetCells[1]]
plot(x = log(M), y = log2(D),
     pch = 16, cex=1/3,
     main = paste0("Cell ", targetCells[2], " vs cell ", targetCells[1]),
     xlab = "Log mean", ylab = "Log2 fold-change",
     bty = 'l')
abline(h = 0, col="orange", lwd=2)
```


# Feature selection

## Highly variable genes

```{r, eval=FALSE}
library(scran)
rownames(sce) <- rowData(sce)$Ensembl
dec <- modelGeneVar(sce)
head(dec)
```

```{r, eval=FALSE}
fit <- metadata(dec)
plot(fit$mean, fit$var, 
     xlab="Mean of log-expression",
    ylab="Variance of log-expression")
curve(fit$trend(x), col="dodgerblue", add=TRUE, lwd=2)
```

```{r, eval=FALSE}
# get 10% highly variable genes
hvg <- getTopHVGs(dec, 
                  prop=0.1)
head(hvg)

# plot these 
plot(fit$mean, fit$var, 
     col = c("orange", "darkseagreen3")[(names(fit$mean) %in% hvg)+1],
     xlab="Mean of log-expression",
    ylab="Variance of log-expression")
curve(fit$trend(x), col="dodgerblue", add=TRUE, lwd=2)
legend("topleft", 
       legend = c("Selected", "Not selected"), 
       col = c("darkseagreen3", "orange"),
       pch = 16,
       bty='n')
```

## High deviance genes

```{r, eval=FALSE}
#BiocManager::install("scry")
library(scry)
sce <- devianceFeatureSelection(object = sce, 
                                assay = "counts", 
                                sorted = FALSE)

plot(sort(rowData(sce)$binomial_deviance, decreasing = TRUE), 
     type="l", 
     xlab="ranked genes", 
     ylab="binomial deviance", 
     main="Feature Selection with Deviance")
abline(v=2000, lty=2, col="red")
```

## Seurat VST

```{r, eval=FALSE}
library(Seurat)
rownames(sce) <- rowData(sce)$Ensembl
seurat_obj <- as.Seurat(sce)
seurat_obj # notice the "0 variable features"
```

```{r, eval=FALSE}
seurat_obj <- Seurat::NormalizeData(seurat_obj, 
                                    normalization.method = "LogNormalize", 
                                    scale.factor = 10000)

seurat_obj <-  FindVariableFeatures(object = seurat_obj,
                                    selection.method = "vst")
```

```{r, eval=FALSE}
seurat_obj  # notice the "2000 variable features" (default)
head(VariableFeatures(seurat_obj)) # here they are
rm(seurat_obj)
```

# Dimensionality reduction

## The most basic DR

```{r, eval=FALSE}
colData(sce)$day <- as.factor(colData(sce)$day)
day <- colData(sce)$day
par(bty='l')
plot(x = assays(sce)$counts[hvg[1],],
     y = assays(sce)$counts[hvg[2],],
     col = as.numeric(day),
     pch = 16, cex = 1/3,
     xlab = "Most informative gene",
     ylab = "Second most informative gene",
     main = "Cells colored acc to cell type")
```

Just by looking at the top two genes based on our feature selection criterion, 
we can already see some separation according to the cell type; the cells 
colored red seem to have low expression values for both genes, and are those
solely retireved in the bottom left of the x-y plane.

## Linear dimensionality reduction: PCA

### PCA with feature selection

```{r, eval=FALSE}
set.seed(1234)
sce <- runPCA(sce, 
              ncomponents=30, 
              subset_row=hvg)
```

PCA has been performed. The PCA information has been automatically stored in the
*reducedDim* slot of the SingleCellExperiment object.

```{r, eval=FALSE}
reducedDimNames(sce)
```

```{r, eval=FALSE}
head(reducedDim(sce,
           type="PCA"))
```

The `plotPCA` function of the `scater` package now allows us to visualize
the cells in PCA space, based on the PCA information stored in our object:

```{r, eval=FALSE}
plotPCA(sce, 
        colour_by = "day")
```

We observe a clear-cut clustering of the cells!

```{r, eval=FALSE}
percent.var <- attr(reducedDim(sce), "percentVar")
plot(percent.var, log="y", xlab="PC", ylab="Variance explained (%)")
```

Here, retaining ±15PCs seems reasonable. If you really prefer a more data-driven
way for determining this, 

```{r, eval = FALSE}
library(PCAtools)
chosen.elbow <- findElbowPoint(percent.var)
chosen.elbow
```

### PCA without feature selection

```{r, eval=FALSE}
set.seed(1234)
sceNoFS <- runPCA(sce, 
                  ncomponents=30, 
                  subset_row=1:nrow(sce))
plotPCA(sceNoFS, colour_by = "day")
rm(sceNoFS)
```

While we use more information to make this PCA plot (10.374 genes) as compared
to the feature selected PCA plot (506 genes), we seem to retrieve less structure
in the data. Indeed, we are here not able to distinguish between day2 and day3
cells. This is the power of feature selection, an increase in the
signal-to-noise ratio!

## A generalization of PCA for exponential family distributions.

```{r, eval=FALSE}
library(glmpca)
set.seed(211103)
poipca <- glmpca(Y = assays(sce)$counts[hvg,],
                 L = 2, 
                 fam = "poi",
                 minibatch = "stochastic")
reducedDim(sce, "PoiPCA") <- poipca$factors
plotReducedDim(sce, 
               dimred="PoiPCA",
               colour_by = "day")
```

## Non-linear dimensionality reduction: T-SNE

```{r, eval=FALSE}
set.seed(564654)
sce <- runTSNE(sce, 
               dimred = 'PCA',
               external_neighbors=TRUE)
plotTSNE(sce,
         colour_by = "day")

plotTSNE(sce,
         colour_by = "donor")

plotTSNE(sce,
         colour_by = "experiment")
```

## Non-linear dimensionality reduction: UMAP

```{r, eval=FALSE}
set.seed(65187)
sce <- runUMAP(sce, 
               dimred = 'PCA', 
               external_neighbors = TRUE)
plotUMAP(sce,
         colour_by = "day")
plotUMAP(sce,
         colour_by = "donor")
plotUMAP(sce,
         colour_by = "experiment")
```

