Statistical Genomics Analysis 2020 (SGA2020)

Course Description

High-throughput ‘omics studies generate ever larger datasets and, as a consequence, complex data interpretation challenges. This course focuses on the statistical concepts involved in preprocessing, quantification and differential analysis of high-throughput ‘omics data. More advanced experimental designs and blocking will also be introduced. The core focus will be on shotgun proteomics and next-generation sequencing. The course will rely exclusively on free, user-friendly, open-source tools in R/Bioconductor. It will provide a solid basis for beginners, but will also bring new perspectives to those already familiar with standard data analysis workflows for proteomics and next-generation sequencing applications.

Target Audience

This course is oriented towards biologists and bioinformaticians with a particular interest in differential analysis for quantitative ‘omics.

Prerequisites

The prerequisite for the Statistical Genomics course is the successful completion of a basic statistics course that covers data exploration and descriptive statistics, statistical modeling, and inference: linear models, confidence intervals, t-tests, F-tests, ANOVA and the chi-squared test.

The basic concepts may be revisited in my online course https://gtpb.github.io/PSLS20/ and in https://statomics.github.io/statistiekCursusNotas/

A primer on R and data visualisation in R can be found in:

R Basics: https://dodona.ugent.be/nl/courses/335/
R Data Exploration: https://dodona.ugent.be/nl/courses/345/

Topics

Introduction

Slides: Intro
Software: Install and Launch Statistical Software
Recap linear models: case study
Entire analysis for KPNA2 gene: KPNA2

Part I: Quantitative proteomics

Download Tutorial Data

Bioinformatics for proteomics
- Slides: Bioinformatics for Proteomics
- Students can sharpen their background knowledge on Mass Spectrometry, Proteomics & Bioinformatics for Proteomics here: Mass Spectrometry and Bioinformatics for Proteomics
Identification
- Slides: False Discovery Rate and Target Decoy Approach
- Tutorial: Evaluating Target Decoy Quality, example script identification, All searches
Preprocessing & Analysis of Label Free Quantitative Proteomics Experiments with Simple Designs
- Install Software: Installation instructions msqrob2
- Slides: Preprocessing
- Tutorial: preprocessing
Statistical Inference & Analysis of Experiments with Factorial Designs
- Slides: Inference
- Tutorial: Statistical Data Analysis with MSqRob for Factorial Designs
Reading Material and Technical details
- Paper: Sticker et al. (2020) Robust summarization and inference in proteome-wide label-free quantification
- part of PhD dissertation: Extensive Background on proteomics and proteomics data analysis
- Inference upon summarization
Stagewise testing: Omnibus test and post hoc analysis: slides
Solutions

Part II: Next-generation sequencing

Download Tutorial Data

Introduction to transcriptomics with next-generation sequencing
- slides: intro
- Background: RNA sequencing data: hitchhiker’s guide to expression analysis
- tutorial
  - Mapping: html
  - Differential Analysis: html. Which source of variability is not included in the analysis, and how could we account for it? Try to adjust the script accordingly.
  - Background for the airway example (count table on small fastQ files available in the Tutorial Data): html
More Complex Designs
- Researchers assessed the effect of spinal nerve ligation (SNL) on the transcriptome of rats. In this experiment, transcriptome profiling occurred at two weeks and two months after treatment, for both the SNL group and a control group. Two biological replicates were used for every treatment-time combination. The researchers are interested in early and late effects and in genes for which the effect changes over time. The data can be downloaded from the ReCount project website (http://bowtie-bio.sourceforge.net/recount/, dataset Hammer et al.). The following code can be used to download an R/Bioconductor ExpressionSet object.
```
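# Location of the ReCount ExpressionSet for the Hammer et al. dataset
eset_url <- "http://bowtie-bio.sourceforge.net/recount/ExpressionSets/hammer_eset.RData"
# Load the object `hammer.eset` into the current session
load(url(eset_url))
# Inspect the loaded object
hammer.eset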
```
- Paired-end sequencing was performed on primary cultures from parathyroid tumors of 4 patients at 2 time points over 3 conditions (control, treatment with diarylpropionitrile (DPN) and treatment with 4-hydroxytamoxifen (OHT)). DPN is a selective estrogen receptor agonist and OHT is a selective estrogen receptor modulator. One sample (patient 4, 24 hours, control) was omitted by the paper authors due to low quality. The data, the count table and information on the experiment are available at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211. Read mapping is not required!
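For the spinal nerve ligation experiment above, the three questions of interest (early effect, late effect, change of the effect over time) can be written as contrasts in a two-factor model with interaction. The following is a minimal sketch in base R, not the course solution. The sample sheet and the names `samples`, `design` and `contrast_matrix` are hypothetical and only mirror the described layout (2 treatments x 2 time points x 2 replicates); the variable names and factor labels stored in `hammer.eset` may differ.

```r
# Hypothetical sample sheet: 2 treatments x 2 time points x 2 biological replicates
samples <- expand.grid(
  replicate = 1:2,
  time = c("2weeks", "2months"),
  treatment = c("control", "SNL")
)
# Fix the reference levels explicitly: control and the early time point
samples$time <- factor(samples$time, levels = c("2weeks", "2months"))
samples$treatment <- factor(samples$treatment, levels = c("control", "SNL"))

# Main effects plus treatment x time interaction
design <- model.matrix(~ treatment * time, data = samples)
colnames(design)

# One column per research question, one row per model coefficient
contrast_matrix <- cbind(
  early  = c(0, 1, 0, 0),  # SNL vs control at two weeks
  late   = c(0, 1, 0, 1),  # SNL vs control at two months
  change = c(0, 0, 0, 1)   # late effect minus early effect (interaction)
)
rownames(contrast_matrix) <- colnames(design)
contrast_matrix
```

Each column of such a matrix could then be supplied to a differential expression tool, for example through the `contrast` argument of edgeR's `glmQLFTest`.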
Technical details on transcriptomics with next-generation sequencing: generalized linear models are introduced in the slides, and the bulk RNA-seq tools via their corresponding papers.

DE analysis starting from transcript level counts

Soneson et al. 2016

Building the index with salmon:

 salmon index --gencode -t gencode.v32.transcripts.fa -i gencode.v32_salmon_index

Mapping one sample with salmon:

 salmon quant -i gencode.v32_salmon_index -l A --gcBias -1 SRR1039508_subset_1.fastq -2 SRR1039508_subset_2.fastq --validateMappings -o quant/SRR1039508_subset_quant

Intro to salmon
airway with DESeq2
airway with edgeR

Solutions
- Airway Example: GenomeIndex, read mapping and count table, DE analysis
- Hammer Example: edgeR, edgeQL, DESeq2

Part III: Single-cell RNA-sequencing

Single-cell analysis: Concepts and a general workflow for single-cell RNA-sequencing (scRNA-seq) datasets.
- Slides: Introduction to general concepts, and discussion of analysis pipeline.
- An intuitive primer on why offsets are used for normalization, as opposed to simple count scaling: HTML file, RMarkdown file.
- Main resources: Orchestrating Single-Cell Analysis with Bioconductor and corresponding paper.
- A reproducible workflow for the Drop-seq dataset from Macosko et al. (2015): HTML file, RMarkdown file.
- Working with scRNA-seq data: Assess the effect of feature selection on (i) dimensionality reduction and (ii) clustering, using the dataset from Tasic et al. (2016). You can download the data from the Gene Expression Omnibus, accession number GSE71585.
- Understanding UMAP.
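The point about offsets versus simple count scaling, mentioned in the primer item above, can be illustrated with a small simulation. This is a sketch under assumed settings and not the primer's own code: the simulated depths, the expression rates and the names `lib_size`, `true_rate` and `fit_offset` are all hypothetical. Counts are generated as Poisson with mean equal to sequencing depth times a relative expression level, and a Poisson GLM with `offset(log(lib_size))` is fitted to the raw counts.

```r
set.seed(1)
n_cells <- 200
group <- factor(rep(c("A", "B"), each = n_cells / 2))

# Hypothetical sequencing depths spanning two orders of magnitude
lib_size <- round(runif(n_cells, min = 1e3, max = 1e5))

# Relative expression: group B expresses the gene twice as highly as group A
true_rate <- ifelse(group == "A", 2e-4, 4e-4)
counts <- rpois(n_cells, lambda = lib_size * true_rate)

# Offset approach: model the raw counts; log depth enters with a fixed coefficient of 1
fit_offset <- glm(counts ~ group + offset(log(lib_size)), family = poisson)
log_fc_offset <- unname(coef(fit_offset)["groupB"])

# Scaling approach: divide each count by its depth and average per group
scaled <- counts / lib_size
log_fc_scaled <- log(mean(scaled[group == "B"]) / mean(scaled[group == "A"]))

# Both estimates target log(2), but they weight the cells differently
c(truth = log(2), offset = log_fc_offset, scaled = log_fc_scaled)

# The offset fit pools counts and depths within a group
rate_A_offset <- unname(exp(coef(fit_offset)["(Intercept)"]))
rate_A_pooled <- sum(counts[group == "A"]) / sum(lib_size[group == "A"])
all.equal(rate_A_offset, rate_A_pooled, tolerance = 1e-4)
```

In the offset fit the response remains a count, so the Poisson mean-variance relationship is retained and deeply sequenced cells contribute more information. Simple scaling gives every cell the same weight, so the noisy ratios from shallow cells influence the estimate as much as those from deep cells.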
Selected topics in single-cell RNA-seq analysis.
- Trajectory-based differential expression analysis: paper, slides, workshop.
- Differential transcript usage analysis for scRNA-seq data: presentation of unpublished work.
- Local FDR paper

Projects
