Contents

1 Background

The data used in this workflow comes from an RNA-seq experiment where airway smooth muscle cells were treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects (Himes et al. 2014). Glucocorticoids are used, for example, by people with asthma to reduce inflammation of the airways. In the experiment, four human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. For more description of the experiment see the article, PubMed entry 24926665, and for raw data see the GEO entry GSE52778.

In this file we will assess DE by starting from transcript counts that we summarize at the gene level. The DE analysis is performed by DESeq2. The analysis is based on the R/markdown script from Charlotte Soneson given at shortcourse in 2019: https://uclouvain-cbio.github.io/BSS2019/ https://uclouvain-cbio.github.io/BSS2019/rnaseq_gene_summerschool_belgium_2019.html

2 Data

FastQ files with a small subset of the reads can be found on https://github.com/statOmics/SGA2019/tree/data-rnaseq

2.1 Read Meta Data

target<-readRDS("airwayMetaData.rds")

We do not have the Albuterol samples

rownames(target)<-target$Run
target<-target[-grep("Albuterol",target[,"treatment:ch1"]),]
target[,grep(":ch1",colnames(target))]
##            cell line:ch1              cell type:ch1 ercc_mix:ch1
## SRR1039508        N61311 airway smooth muscle cells            -
## SRR1039509        N61311 airway smooth muscle cells            -
## SRR1039512       N052611 airway smooth muscle cells            -
## SRR1039513       N052611 airway smooth muscle cells            1
## SRR1039516       N080611 airway smooth muscle cells            -
## SRR1039517       N080611 airway smooth muscle cells            -
## SRR1039520       N061011 airway smooth muscle cells            1
## SRR1039521       N061011 airway smooth muscle cells            1
##                                  tissue:ch1 treatment:ch1
## SRR1039508 human airway smooth muscle cells     Untreated
## SRR1039509 human airway smooth muscle cells Dexamethasone
## SRR1039512 human airway smooth muscle cells     Untreated
## SRR1039513 human airway smooth muscle cells Dexamethasone
## SRR1039516 human airway smooth muscle cells     Untreated
## SRR1039517 human airway smooth muscle cells Dexamethasone
## SRR1039520 human airway smooth muscle cells     Untreated
## SRR1039521 human airway smooth muscle cells Dexamethasone
target$cellLine<-as.factor(target[,grep("cell line:ch1",colnames(target))])
target$treatment<-as.factor(target[,grep("treatment:ch1",colnames(target))])
target$treatment<-relevel(target$treatment,"Untreated")

3 Alignment-free transcript quantification

Software such as kallisto [@Bray2016Near], Salmon [@Patro2017Salmon] and Sailfish [@Patro2014Sailfish], as well as other transcript quantification methods like Cufflinks [@Trapnell2010Cufflinks; @Trapnell2013Cufflinks2] and RSEM [@Li2011RSEM], differ from the counting methods introduced in the previous tutorials in that they provide quantifications (usually both as counts and as TPMs) for each transcript. These can then be summarized on the gene level by adding all values for transcripts from the same gene. A simple way to import results from these packages into R is provided by the tximport and tximeta packages. Here, tximport reads the quantifications into a list of matrices, and tximeta aggregates the information into a SummarizedExperiment object, and also automatically adds additional annotations for the features. Both packages can return quantifications on the transcript level or aggregate them on the gene level. They also calculate average transcript lengths for each gene and each sample, which can be used as offsets to improve the differential expression analysis by accounting for differential isoform usage across samples [@Soneson2015Differential]. aturecounts object

3.1 Salmon

The code below imports the Salmon quantifications into R using the tximeta package. Note how the transcriptome that was used for the quantification is automatically recognized and used to annotate the resulting data object. In order for this to work, tximeta requires that the output folder structure from Salmon is retained, since it reads information from the associated log files in addition to the quantified abundances themselves. With the addIds() function, we can add additional annotation columns.

suppressPackageStartupMessages({
  library(tximeta)
  library(DESeq2)
  library(org.Hs.eg.db)
  library(SummarizedExperiment)
  library(ggplot2)
  library(tidyverse)
  library(pheatmap)
})

## List all quant.sf output files from Salmon
salmonfiles <- paste0("./salmon/", target$Run, "/quant.sf")
names(salmonfiles) <- target$Run
stopifnot(all(file.exists(salmonfiles)))

## Add a column "files" to the metadata table. This table must contain at least
## two columns: "names" and "files"
coldata <- cbind(target, files = salmonfiles, stringsAsFactors = FALSE)
coldata$names<-coldata$Run

## Import quantifications on the transcript level
st <- tximeta::tximeta(coldata)
## importing quantifications
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 
## found matching transcriptome:
## [ Gencode - Homo sapiens - release 29 ]
## loading existing TxDb created: 2019-11-25 07:49:13
## Loading required package: GenomicFeatures
## generating transcript ranges
## fetching genome info
## Summarize quantifications on the gene level
sg <- tximeta::summarizeToGene(st)
## loading existing TxDb created: 2019-11-25 07:49:13
## obtaining transcript-to-gene mapping from TxDb
## summarizing abundance
## summarizing counts
## summarizing length
## Add gene symbols
sg <- tximeta::addIds(sg, "SYMBOL", gene = TRUE)
## mapping to new IDs using 'org.Hs.eg.db' data package
## if all matching IDs are desired, and '1:many mappings' are reported,
## set multiVals='list' to obtain all the matching IDs
## 'select()' returned 1:many mapping between keys and columns
sg
## class: RangedSummarizedExperiment 
## dim: 58294 8 
## metadata(5): tximetaInfo quantInfo countsFromAbundance txomeInfo
##   txdbInfo
## assays(3): counts abundance length
## rownames(58294): ENSG00000000003.14 ENSG00000000005.5 ...
##   ENSG00000285993.1 ENSG00000285994.1
## rowData names(2): gene_id SYMBOL
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(101): SampleName title ... treatment names

Note that Salmon returns estimated counts, which are not necessarily integers. They may need to be rounded before they are passed to count-based statistical methods (e.g. DESeq2). This is not necesary for edgeR.

counts_salmon <- round(assay(sg, "counts"))

4 Data Analysis

4.1 Setup count object in DESeq2

ds_se <- DESeqDataSet(sg, design = ~ cellLine + treatment)
## using counts and average transcript lengths from tximeta

4.2 Data exploration

First we transform the counts with a variance stabilizing transformation.

vsd <- DESeq2::vst(ds_se)
## using 'avgTxLength' from assays(dds), correcting for library size

PCA plot

plotPCA(vsd, intgroup = c("cellLine","treatment"))

plotPCA(vsd, intgroup = "cellLine")

plotPCA(vsd, intgroup = "treatment")

PC1 treatment effect. PC2 cell line effect.

4.3 Main analysis

4.3.1 Fit model

ds_se <- DESeq2::DESeq(ds_se)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
DESeq2::plotDispEsts(ds_se)

4.3.2 Results

res <- results(ds_se)
head(res)
## log2 fold change (MLE): treatment Dexamethasone vs Untreated 
## Wald test p-value: treatment Dexamethasone vs Untreated 
## DataFrame with 6 rows and 6 columns
##                             baseMean     log2FoldChange             lfcSE
##                            <numeric>          <numeric>         <numeric>
## ENSG00000000003.14  735.850023907492 -0.366237048569725 0.106107674175317
## ENSG00000000005.5                  0                 NA                NA
## ENSG00000000419.12  512.271227978686  0.214909679541509   0.1263334851897
## ENSG00000000457.13  310.647989299582 0.0251964430492595 0.162137539262632
## ENSG00000000460.16  80.6428080390289 -0.136946821759294 0.312082297819594
## ENSG00000000938.12 0.310167355708762  -1.37185321258226  3.50391481168779
##                                  stat               pvalue
##                             <numeric>            <numeric>
## ENSG00000000003.14   -3.4515604212058 0.000557354949043603
## ENSG00000000005.5                  NA                   NA
## ENSG00000000419.12   1.70112998322499   0.0889185818211641
## ENSG00000000457.13   0.15540166184739    0.876504674117252
## ENSG00000000460.16 -0.438816372207241    0.660794596213206
## ENSG00000000938.12 -0.391520138562232    0.695412805880732
##                                   padj
##                              <numeric>
## ENSG00000000003.14 0.00417173237215644
## ENSG00000000005.5                   NA
## ENSG00000000419.12   0.250778420670953
## ENSG00000000457.13   0.954096286210509
## ENSG00000000460.16   0.845668672522145
## ENSG00000000938.12                  NA
summary(res)
## 
## out of 35374 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2370, 6.7%
## LFC < 0 (down)     : 1965, 5.6%
## outliers [1]       : 0, 0%
## low counts [2]     : 18548, 52%
## (mean count < 8)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
hist(res$pvalue,xlab="p-value")

volcanoEarly<- ggplot(res %>% as.data.frame,aes(x=log2FoldChange,y=-log10(pvalue),color=padj<0.05)) + geom_point() + scale_color_manual(values=c("black","red")) + ggtitle(paste("contrast","early"))
print(volcanoEarly)
## Warning: Removed 41468 rows containing missing values (geom_point).

plotMA(res)

mat <- assay(vsd)[head(order(res$padj), 30), ]
pheatmap(mat)

Modify the file for an edgeR analysis. You can use the script of the short course as resource. s