Statistical analysis of mass spectrometry-based proteomics data

A dive into the msqrob2 universe

Authors

Christophe Vanderaa

Stijn Vandenbulcke

Lieven Clement

Published

December 12, 2025

Preamble

This book provides comprehensive hands-on tutorials on how to apply the msqrob2 software for the statistical analysis of mass spectrometry (MS)-based proteomics data. It includes the latest improvements of the software that enable statistical modelling for a wide panel of use cases. The book first introduces general concepts of statistical proteomics data analysis and msqrob2. Later chapters demonstrate the application of msqrob2 to different biological questions, starting from datasets with different experimental designs, acquisition strategies, instruments, and search engines. The book aims to help proteomics researchers and data analysts tailor their statistical analysis workflow to their specific datasets and research questions.

The sticker is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Why msqrob2?

MS-based proteomics experiments often impose a complex correlation structure among observations. Addressing this correlation is key for correct statistical inference and reliable biomarker discovery. This msqrob2 book provides a set of (mixed) model-based workflows dedicated to differential abundance analysis for label-free as well as labeled MS-based proteomics data. The key features of msqrob2 workflows are:

Modularity: all core functions rely on the QFeatures class, a standardised data structure, meaning that the output of one function can be fed as input to any other function. Hence, functions are assembled as modular blocks into a complete data analysis workflow that can easily be adapted to the peculiarities of any MS-based proteomics data set. The approach therefore extends well beyond the use cases presented in this book.
Flexibility: the msqrob2 modelling approach relies on the lme4::lmer() model specification syntax, meaning that any linear model can be specified. For fixed effects, this includes modelling categorical and numerical variables, as well as their interaction. Moreover, msqrob2 can model both sample-specific and feature-specific (e.g. peptide or protein) covariates, which enables inference for experiments with arbitrarily complex designs and allows explicit correction for feature-specific properties.
Performance: thanks to the inclusion of robust ridge regression, msqrob2 workflows have shown improved performance over competing software (Goeminne, Gevaert, and Clement 2016; Sticker et al. 2020; Vandenbulcke et al. 2025).
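
To make the model specification syntax mentioned under Flexibility more concrete, the sketch below builds a design matrix from a formula with a categorical variable, a numerical variable and their interaction, and then writes an lme4-style formula with a random intercept. It uses base R only. The annotation table and the variable names (condition, concentration, run) are hypothetical and are not taken from any dataset in this book.

```r
# Hypothetical sample annotation (illustration only):
# two conditions, three concentrations, measured over three runs
design <- data.frame(
    condition = factor(rep(c("A", "B"), each = 3)),
    concentration = c(1, 2, 3, 1, 2, 3),
    run = factor(rep(c("r1", "r2", "r3"), times = 2))
)

# Fixed effects: a categorical variable, a numerical variable
# and their interaction (the `*` operator expands to all three terms)
fixed <- ~ condition * concentration
X <- model.matrix(fixed, data = design)
colnames(X)

# lme4-style syntax: `(1 | run)` denotes a random intercept for each run.
# Here we only build the formula object and list the variables it uses.
mixed <- ~ condition + (1 | run)
all.vars(mixed)
```

In an msqrob2 workflow, a formula of this kind would be supplied to the modelling step; the chapters in the Concepts part describe the actual calls.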

Outline

The book is divided into three parts.

Concepts

This part introduces the user to the key concepts in differential proteomics data analysis and provides an extensive description of the code. While this part is conceptual, the concepts are illustrated using a real spike-in study.

1 Statistical analysis with msqrob2 introduces the basic concepts for MS-based proteomics analysis. We recommend reading this chapter first, before any other chapter.
2 Advanced statistical analysis with msqrob2 builds upon the previous chapter and introduces more advanced concepts that will be used in later chapters that involve complex designs and analyses.

Benchmarking

This part illustrates how to benchmark data analysis workflows and demonstrates how the guidelines presented in this book were derived. The main sections of the chapters in this part are intended for advanced users with R programming skills. However, the conclusions in each chapter are more accessible and are also intended for entry-level users who want to understand how to apply the guidelines and recommendations to their analyses.

3 Benchmarking workflows explains how to conduct a benchmarking experiment that compares different workflows. As an example, the chapter compares the performance when starting from the different MaxQuant input files: the evidence file, the peptides file, and the protein-group file. The same benchmark strategy could be used to compare different data sources, such as data generated by different sample preparation protocols, LC-MS approaches and/or search engines.
4 Optimisation of a data analysis workflow demonstrates how to optimise a data analysis workflow. As an example, the chapter compares the performance of two normalisation approaches (median of ratios and median centering) in combination with three summarisation approaches (median, median polish, and robust regression).
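
As a rough illustration of two of the building blocks named above, the sketch below applies median centering to a small matrix of log2 intensities and then summarises peptides to a protein-level value per sample using the median and median polish. It relies on base R only (`sweep()`, `apply()` and `stats::medpolish()`) and is not the msqrob2 or QFeatures implementation. The intensity values are made up for the example; median of ratios normalisation and robust regression summarisation are not shown.

```r
# Made-up log2 intensities (illustration only):
# 4 peptides (rows) of one protein measured in 3 samples (columns)
log_int <- matrix(
    c(20, 21, 22,
      18, 19, 20,
      22, 23, 24,
      19, 20, 21),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("pep", 1:4), paste0("s", 1:3))
)

# Median centering: subtract each sample's median from its column
sample_medians <- apply(log_int, 2, median)
centered <- sweep(log_int, 2, sample_medians)

# Median summarisation: one protein-level value per sample
protein_median <- apply(log_int, 2, median)

# Median polish summarisation: fit overall + peptide + sample effects,
# then report overall + sample effect as the protein-level value
fit <- medpolish(log_int, trace.iter = FALSE)
protein_medpolish <- fit$overall + fit$col

protein_median
protein_medpolish
```

In this toy matrix the peptide and sample effects are perfectly additive, so both summaries agree; on real data with missing values and outlying peptides they generally differ, which is what a benchmark of this kind would quantify.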

Use cases

This part contains a set of chapters that illustrate the data analysis for a range of experimental designs and technological setups.

5 The francisella use case: a MaxQuant LFQ DDA dataset with technical replication.
6 Heart use case: a MaxQuant LFQ DDA dataset with a more complex design.
7 The mouse diet use case: a Skyline TMT DDA dataset.

TODO make table with dataset descriptions to let users easily find their relevant use case.

Targeted audience and assumed background

The course material is aimed at proteomics practitioners and at data analysts or bioinformaticians who would like to learn how to analyse proteomics data.

A working knowledge of R (R syntax, commonly used functions, basic data structures such as data frames, vectors, matrices, … and their manipulation) is required. Familiarity with MS or proteomics in general is recommended, as it allows for a better understanding of the modelling assumptions made throughout this book. Familiarity with other Bioconductor omics data classes and the tidyverse syntax is useful.

We highly recommend reading the quantitative proteomics chapter of the R for mass spectrometry book.

Setup

To install all the necessary packages, please use the latest release of R and execute:

# Install BiocManager from CRAN if it is not yet available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install the packages used in this book (Bioconductor and CRAN)
BiocManager::install(c(
    "BiocParallel",
    "BiocFileCache",
    "ComplexHeatmap",
    "dplyr",
    "ExploreModelMatrix",
    "ggpattern",
    "ggplot2",
    "ggrepel",
    "impute",
    "MsDataHub",
    "patchwork",
    "scater",
    "tidyr",
    "bookdown"
))
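
After installation, it can be useful to confirm that every package in the list is actually available. The following is a minimal sketch using base R only; the object names (`pkgs`, `installed`, `missing_pkgs`) are ours, and the package vector simply repeats the list above.

```r
# Packages listed in the installation command above
pkgs <- c(
    "BiocParallel", "BiocFileCache", "ComplexHeatmap", "dplyr",
    "ExploreModelMatrix", "ggpattern", "ggplot2", "ggrepel",
    "impute", "MsDataHub", "patchwork", "scater", "tidyr", "bookdown"
)

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
    message("Packages still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
    message("All listed packages are available.")
}
```

Any package reported as missing can be passed again to `BiocManager::install()`.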

All software versions used to generate this document are recorded at the end of the book in 8 Additional information.

Citation

If you need to cite this book, please use the following reference:

TODO add citation when available on Zenodo (and paper)

Acknowledgments

We thank the R for Mass Spectrometry initiative:

For developing the QFeatures package, on which msqrob2 depends.
For openly sharing their book, which we used as a template.

License

This material is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. You are free to share (copy and redistribute the material in any medium or format) and adapt (remix, transform, and build upon the material) for any purpose, even commercially, as long as you give appropriate credit and distribute your contributions under the same license as the original.