Tutorial: Intro to DIA proteomics data processing

Author

Affiliation

Lieven Clement

Ghent University

1 Background

We will use using a publicly available spike-in study published by Staes et al. (Staes et al. 2024). They spiked digested UPS proteins in a yeast digested background at the following ratio’s (yeast:ups ratio 10:1, 10:2, 10:4, 10:8, 10:10). Here we will use a subset of the data, i.e. dilutions 10:2 and 10:4. We will use output of the search engine DIA-NN 2.2.0. The main search output for this DIA-NN version was stored in the report.parquet file in the DIA-NN output directory, which can be found under data/spikein24-staesetal2024.parquet

DIA-NN provides multiple quantifications, e.g. derived from the MS1 or MS2 spectra, and at precursor or protein (protein group) level. The term ‘precursor’ refers to a charged peptide species and is the basic unit of identification and quantification in DIA. Hence, in the context of DIA we refer to a precursor table, instead of to a PSM table in DDA.

Examples of different quantities are:

raw MS1 area: Ms1.Area, normalised MS1 Area: Ms1.Normalised, MS2 Precursor quantities: Precursor.Quantity, Normalised MS2 Precursor quantities: Precursor.Normalised, etc., which are all at the precursor level
MS2 based summary at the protein (protein group)-level: PG.MaxLFQ

Here, we will use the Precursor.Quantity column.

2 Load packages

We load the msqrob2 package, along with additional packages for data manipulation and visualisation.

library("QFeatures")

Loading required package: MultiAssayExperiment

Warning: package 'MultiAssayExperiment' was built under R version 4.5.3

Loading required package: SummarizedExperiment

Loading required package: MatrixGenerics

Loading required package: matrixStats


Attaching package: 'MatrixGenerics'

The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars

Loading required package: GenomicRanges

Loading required package: stats4

Loading required package: BiocGenerics

Loading required package: generics


Attaching package: 'generics'

The following objects are masked from 'package:base':

    as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
    setequal, union


Attaching package: 'BiocGenerics'

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
    unsplit, which.max, which.min

Loading required package: S4Vectors


Attaching package: 'S4Vectors'

The following object is masked from 'package:utils':

    findMatches

The following objects are masked from 'package:base':

    expand.grid, I, unname

Loading required package: IRanges

Loading required package: Seqinfo

Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: 'Biobase'

The following object is masked from 'package:MatrixGenerics':

    rowMedians

The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians


Attaching package: 'QFeatures'

The following object is masked from 'package:base':

    sweep

library("dplyr")


Attaching package: 'dplyr'

The following object is masked from 'package:Biobase':

    combine

The following objects are masked from 'package:GenomicRanges':

    intersect, setdiff, union

The following object is masked from 'package:Seqinfo':

    intersect

The following objects are masked from 'package:IRanges':

    collapse, desc, intersect, setdiff, slice, union

The following objects are masked from 'package:S4Vectors':

    first, intersect, rename, setdiff, setequal, union

The following objects are masked from 'package:BiocGenerics':

    combine, intersect, setdiff, setequal, union

The following object is masked from 'package:generics':

    explain

The following object is masked from 'package:matrixStats':

    count

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library("tidyr")


Attaching package: 'tidyr'

The following object is masked from 'package:S4Vectors':

    expand

library("ggplot2")
library("msqrob2")    
library("stringr")
library("ExploreModelMatrix")
library("MsCoreUtils")


Attaching package: 'MsCoreUtils'

The following object is masked from 'package:dplyr':

    between

The following object is masked from 'package:GenomicRanges':

    reduce

The following object is masked from 'package:IRanges':

    reduce

The following object is masked from 'package:MatrixGenerics':

    colCounts

The following object is masked from 'package:matrixStats':

    colCounts

The following object is masked from 'package:stats':

    smooth

library("matrixStats")
library("patchwork")
library("kableExtra")


Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

library("ComplexHeatmap")

Loading required package: grid

========================================
ComplexHeatmap version 2.26.1
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite either one:
- Gu, Z. Complex Heatmap Visualization. iMeta 2022.
- Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
    genomic data. Bioinformatics 2016.


The new InteractiveComplexHeatmap package can directly export static 
complex heatmaps into an interactive Shiny app with zero effort. Have a try!

This message can be suppressed by:
  suppressPackageStartupMessages(library(ComplexHeatmap))
========================================

library("purrr")


Attaching package: 'purrr'

The following object is masked from 'package:MsCoreUtils':

    reduce

The following object is masked from 'package:GenomicRanges':

    reduce

The following object is masked from 'package:IRanges':

    reduce

library("tibble")
library("scater")

Loading required package: SingleCellExperiment

Loading required package: scuttle

3 Data

3.1 Precursor table

We load the output from DIA-NN parquet file. Can be file path to local file or url to file that lives on the web.

precursorFile <- "https://github.com/statOmics/msqrob2data/raw/refs/heads/main/dia/spikeinStaes/spikein24-staesetal2024.parquet"

We can import the report.parquet file using the read_parquet function from the arrow package. Note, that older versions of DIA-NN store the output as report.tsv.

precursors <- arrow::read_parquet(precursorFile) # function from the arrow package
#precursors <- data.table::fread(precursorFile) # For older versions of DIA-NN, where the results are stored as tsv files. Note that the precursorFile then would point to "report.tsv"

Each row in the precursor data table is in “long format” and contains information about one precursor in a specific run (the table below shows the first 6 rows). The columns contains various descriptors about the precursor, such as its sequence, its charge, run, etc. Some of these columns contain the quantification values, e.g. Ms1.Area, Precursor.Quantity etc.

Run.Index	Run	Precursor.Id	Modified.Sequence	Stripped.Sequence	Precursor.Charge	Precursor.Lib.Index	Proteotypic	Precursor.Mz	Protein.Ids	Protein.Group	Protein.Names	Genes	RT	iRT	Predicted.RT	Predicted.iRT	Precursor.Quantity	Precursor.Normalised	Ms1.Area	Ms1.Normalised	Ms1.Apex.Area	Ms1.Apex.Mz.Delta	Normalisation.Factor	Quantity.Quality	Empirical.Quality	Normalisation.Noise	Ms1.Profile.Corr	Evidence	Mass.Evidence	Channel.Evidence	Ms1.Total.Signal.Before	Ms1.Total.Signal.After	RT.Start	RT.Stop	FWHM	PG.MaxLFQ	Genes.MaxLFQ	Genes.MaxLFQ.Unique	PG.MaxLFQ.Quality	Genes.MaxLFQ.Quality	Genes.MaxLFQ.Unique.Quality	Q.Value	PEP	Global.Q.Value	Lib.Q.Value	Peptidoform.Q.Value	Global.Peptidoform.Q.Value	Lib.Peptidoform.Q.Value	PTM.Site.Confidence	Site.Occupancy.Probabilities	Protein.Sites	Lib.PTM.Site.Confidence	PG.Q.Value	PG.PEP	GG.Q.Value	Protein.Q.Value	Global.PG.Q.Value	Lib.PG.Q.Value	Best.Fr.Mz	Best.Fr.Mz.Delta
0	B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA	(UniMod:1)AAVTLHLR2	(UniMod:1)AAVTLHLR	AAVTLHLR	2	0	1	461.7771	P38998	P38998	LYS1_YEAST	LYS1	89.48647	34.21962	89.64584	34.32372	1027105.2	1018664.5	1664103	1650427	1235652.9	-0.0019836	0.9917820	0.8068524	0.9587077	-0.0107658	0.8455485	2.328422	0.6659890	0.9694718	1446081536	1354918656	89.27837	89.83672	0.1471047	60736892	60736892	60736892	0.9713600	0.9713600	0.9713600	0.0037350	0.0346908	6.00e-07	1.80e-06	0.0058624	0.0006297	0.0004264	1	(UniMod:1){1.000000}AAVTLHLR2	[P38998:nA2]	1	0.0003065	0.0006127	0.0003067	0.0003116	0.0002956	0.0002814	639.3937	-0.0043335
1	B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA	(UniMod:1)AAVTLHLR2	(UniMod:1)AAVTLHLR	AAVTLHLR	2	0	1	461.7771	P38998	P38998	LYS1_YEAST	LYS1	88.51812	34.21962	88.65327	34.23263	1686047.0	1473639.0	2225964	1945538	2138060.2	-0.0012512	0.8740202	0.8539391	0.7675559	0.0606278	0.9467068	2.621551	0.9197075	0.7442634	1707504896	1714693888	88.38112	88.65714	0.1049480	59386180	59386180	59386180	0.9682981	0.9682981	0.9682981	0.0000006	0.0000010	6.00e-07	1.80e-06	0.0007060	0.0006297	0.0004264	1	(UniMod:1){1.000000}AAVTLHLR2	[P38998:nA2]	1	0.0003028	0.0006007	0.0003029	0.0003108	0.0002956	0.0002814	538.3460	-0.0010986
3	B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_2	(UniMod:1)AAVTLHLR2	(UniMod:1)AAVTLHLR	AAVTLHLR	2	0	1	461.7771	P38998	P38998	LYS1_YEAST	LYS1	88.94958	34.21962	89.25366	33.75805	1202390.5	1186230.2	1752379	1728827	1130526.8	-0.0038757	0.9865599	0.7660662	0.7737662	0.0300994	0.5843989	2.661267	0.6917942	0.8114025	2323055360	2137220352	88.61057	89.15668	0.2167290	61787356	61787356	61787356	0.9639115	0.9639115	0.9639115	0.0026284	0.0164647	6.00e-07	1.80e-06	0.0061239	0.0006297	0.0004264	1	(UniMod:1){1.000000}AAVTLHLR2	[P38998:nA2]	1	0.0003246	0.0006370	0.0003248	0.0003308	0.0002956	0.0002814	425.2619	0.0019226
6	B000302_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_3	(UniMod:1)AAVTLHLR2	(UniMod:1)AAVTLHLR	AAVTLHLR	2	0	1	461.7771	P38998	P38998	LYS1_YEAST	LYS1	88.94778	34.21962	89.22704	33.83727	1211047.2	1278841.4	1902638	2009148	1505334.4	-0.0023804	1.0559797	0.8122462	0.9176636	-0.0058500	0.8568795	2.149787	0.2013753	0.9174258	1576011776	1611237888	88.80907	89.22405	0.1854964	57830312	57830312	57830312	0.9690518	0.9690518	0.9690518	0.0038881	0.0323859	6.00e-07	1.80e-06	0.0095169	0.0006297	0.0004264	1	(UniMod:1){1.000000}AAVTLHLR2	[P38998:nA2]	1	0.0003108	0.0006149	0.0003110	0.0003160	0.0002956	0.0002814	639.3937	-0.0004883
7	B000306_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_3	(UniMod:1)AAVTLHLR2	(UniMod:1)AAVTLHLR	AAVTLHLR	2	0	1	461.7771	P38998	P38998	LYS1_YEAST	LYS1	89.02591	34.21962	89.40517	33.85801	952255.9	960217.9	2226282	2244896	724402.3	0.0000000	1.0083612	0.8162364	0.6293753	0.0134552	0.8299713	2.589256	0.3061074	0.8181052	1451911296	1386510976	88.81858	89.30609	0.1775406	58230180	58230180	58230180	0.9686975	0.9686975	0.9686975	0.0029205	0.0223659	6.00e-07	1.80e-06	0.0091256	0.0006297	0.0004264	1	(UniMod:1){1.000000}AAVTLHLR2	[P38998:nA2]	1	0.0003064	0.0006060	0.0003066	0.0003114	0.0002956	0.0002814	639.3937	0.0001221
0	B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA	(UniMod:1)ADFLLRPIK2	(UniMod:1)ADFLLRPIK	ADFLLRPIK	2	1	1	557.8346	P38768	P38768	PIH1_YEAST	PIH1	121.34409	76.86208	121.09219	77.28922	5792544.5	5805857.5	6862994	6878766	5909809.0	-0.0009155	1.0022984	0.9153399	0.8982903	-0.0820856	0.9530441	4.035011	0.9436575	0.9152272	2151185664	2523537408	121.13556	121.55209	0.2078343	5839721	5839721	5839721	0.9533337	0.9533337	0.9533337	0.0001971	0.0008056	5.94e-05	4.22e-04	0.0038466	0.0000434	0.0004264	1	(UniMod:1){1.000000}ADFLLRPIK2	[P38768:nA2]	1	0.0003065	0.0006127	0.0003067	0.0003116	0.0002956	0.0002814	739.5189	0.0064697

Quick check on distribution of precursors MS2 intensities.

precursors |> 
  ggplot(aes(x = log2(Precursor.Quantity))) +
  geom_density() +
  theme_minimal()

Warning: Removed 1 row containing non-finite outside the scale range
(`stat_density()`).

We filter the data to reduce the memory footprint.

precursors <- precursors |> 
  select(
         Run, 
         Precursor.Id, 
         Modified.Sequence, 
         Stripped.Sequence, 
         Precursor.Charge, 
         Protein.Group, 
         Protein.Names,
         Protein.Ids,
         Genes,
         Precursor.Quantity,
         Precursor.Normalised,
         Normalisation.Factor,
         Ms1.Area,
         Ms1.Normalised,
         PG.MaxLFQ,
         Q.Value, 
         Lib.Q.Value,
         PG.Q.Value,
         Lib.PG.Q.Value,
         Proteotypic,
         Decoy, # Not available in older versions of DIA-NN
         RT)

We known the ground truth: UPS proteins are differentially abundant (DA, spiked in), Yeast proteins are not.

precursors <- precursors |> 
  mutate(species = grepl(pattern = "UPS",Protein.Group) |> 
           as.factor() |>
           recode("TRUE"="ups","FALSE" = "yeast"))
precursors |> 
  pull(species) |>
  table()


 yeast    ups 
201827   2810

3.2 Sample annotation table

The sample annotation table) is not available and can be generated from the run labels, as the researchers included information on the design in the filenames. The sample annotation table) is not available and can be generated from the run labels, as the researchers included information on the design in the filenames.

We will make a new data frame with the annotation.

Extract unique Run labels
Split Run label into variables using the “_” separator
Rename the variable Run to runCol (needed for readQFeatures function)
Add sampleId
Sort annotation file according to sampleId

(annot <- precursors |> 
  dplyr::distinct(Run) |> 
  mutate(runCol=Run) |> 
  separate(Run, 
           into = c(paste("label",1:8), "condition","label9","rep")) |>
  mutate(
    condition = substr(condition,7,7), 
    ratio = as.double(condition),
    rep = ifelse(is.na(rep),1,rep),
    sampleId = paste(condition,rep, sep = "_")) |> 
  select(runCol, sampleId, condition,ratio,rep)
)

Warning: Expected 11 pieces. Missing pieces filled with `NA` in 2 rows [1, 2].

# A tibble: 6 × 5
  runCol                                          sampleId condition ratio rep  
  <chr>                                           <chr>    <chr>     <dbl> <chr>
1 B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02… 2_1      2             2 1    
2 B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04… 4_1      4             4 1    
3 B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02… 2_2      2             2 2    
4 B000302_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02… 2_3      2             2 3    
5 B000306_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04… 4_3      4             4 3    
6 B000282_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04… 4_2      4             4 2

3.3 Convert to QFeatures

First, recall that the precursor table is file in long format. Every quantitative column in the precursor table contains information for multiple runs. Therefore, the function split the table based on the run identifier, given by the runCol argument (for DIA-NN, that identifier is contained in run). So, the QFeatures object after import will contain as many sets as there are runs. Next, the function links the annotation table with the PSM data. To achieve this, the annotation table must contain a runCol column that provides the run identifier in which each sample has been acquired, and this information will be used to match the identifiers in the Run column of the precursor table.

Here, we will use the Precursor.Quantity column as quantification input.

Note, that we filter a number of variables to reduce the footprint of the QFeatures object.

(qf <- readQFeatures(assayData = precursors,  
                              colData = annot,
                              quantCols = "Precursor.Quantity",
                              runCol = "Run",
                              fnames = "Precursor.Id"))

Checking arguments.

Loading data as a 'SummarizedExperiment' object.

Splitting data in runs.

Formatting sample annotations (colData).

Formatting data as a 'QFeatures' object.

Setting assay rownames.

An instance of class QFeatures (type: bulk) with 6 sets:

 [1] B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA: SummarizedExperiment with 34471 rows and 1 columns 
 [2] B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA: SummarizedExperiment with 35142 rows and 1 columns 
 [3] B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_2: SummarizedExperiment with 33528 rows and 1 columns 
 [4] B000282_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_2: SummarizedExperiment with 33933 rows and 1 columns 
 [5] B000302_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_3: SummarizedExperiment with 33091 rows and 1 columns 
 [6] B000306_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_3: SummarizedExperiment with 34472 rows and 1 columns

4 Data preprocessing

The data preprocessing workflow for DIA data is similar to the workflow for DDA-LFQ data, but there are suble differences as we start from precursor level data and have additional columns.

4.1 Encoding missing values

We first replace any zero in the quantitative data with an NA.

qf <- zeroIsNA(qf, names(qf))

Note that msqrob2 can handle missing data without having to rely on hard-to-verify imputation assumptions, which is our general recommendation. However, msqrob2 does not prevent users from using imputation, which can be performed with impute() from the QFeatures package.

4.2 Precursor Filtering

Filtering removes low-quality and unreliable precursors that would otherwise introduce noise and artefacts in the data.

4.2.1 Remove questionable identifications

We apply standard filtering advised by DIA-NN:

q-value threshold of 0.01 for the identification of precursors (Q.Value) and protein groups (PG.Q.Value). If the file is processed via matching between runs, it is also useful to filter on Lib.Q.Value and Lib.PG.Q.Value.
Remove precursors that could not be mapped, i.e. when Precursor.Id column is an empty string.
Filter decoys, i.e. only keep precursors for which the Decoy column equals 0. (Note, that the Decoy column is not present in the output of older versions of DIA-NN)
Keeping only proteotypic peptides, which map uniquely to a specific protein.

qf <-  qf |>
  filterFeatures(~ Q.Value <= 0.01 & #1.
                     PG.Q.Value <= 0.01 &  #1.
                     Lib.Q.Value <= 0.01 & #1.
                     Lib.PG.Q.Value <= 0.01 & #1.
                     Precursor.Id != "" & #2.
                     Decoy == 0 & #3. (Note, that this filter is not available with previous versions of DIA-NN, as the report.tsv file did not include a Decoy column. So other strategies are needed if Decoys are in the output file.)
                     Proteotypic == 1) #4.

'Q.Value' found in 6 out of 6 assay(s).'PG.Q.Value' found in 6 out of 6 assay(s).'Lib.Q.Value' found in 6 out of 6 assay(s).'Lib.PG.Q.Value' found in 6 out of 6 assay(s).'Precursor.Id' found in 6 out of 6 assay(s).'Decoy' found in 6 out of 6 assay(s).'Proteotypic' found in 6 out of 6 assay(s).

Note, that it is important that the filtering criteria are not distorting the distribution of the test statistics in the downstream analysis for features that are non-DA. It can be shown that filtering will not induce bias results when the filtering criterion is independent of test statistic. The criteria that we proposed above are all based on the results of the identification step, hence, they are independent of the downstream test statistics that will be used to prioritize DA proteins.

4.2.2 Assay joining

Up to now, the data from different runs were kept in separate assays. We can now join the normalised sets into an precursor set using joinAssays(). Sets are joined by stacking the columns (samples) in a matrix and rows (features) are matched according to a row identifier, here the Precursor.Id. We will store the result in the assay with name: precursor.

(qf <- joinAssays(
  x = qf,
  i = names(qf),
  fcol = "Precursor.Id",
  name = "precursors"
  ))

Using 'Precursor.Id' to join assays.

An instance of class QFeatures (type: bulk) with 7 sets:

 [1] B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA: SummarizedExperiment with 32848 rows and 1 columns 
 [2] B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA: SummarizedExperiment with 33491 rows and 1 columns 
 [3] B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_2: SummarizedExperiment with 31945 rows and 1 columns 
 [4] B000282_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_2: SummarizedExperiment with 32327 rows and 1 columns 
 [5] B000302_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_3: SummarizedExperiment with 31496 rows and 1 columns 
 [6] B000306_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_3: SummarizedExperiment with 32829 rows and 1 columns 
 [7] precursors: SummarizedExperiment with 38131 rows and 6 columns

4.2.3 Filtering: Remove highly missing precursors

We keep peptides that were observed at last 4 times out of the \(n = 6\) samples, so that we can estimate the peptide characteristics. We tolerate the following proportion of NAs: \(\text{pNA} = \frac{(n - 4)}{n} = 0.33\), so we keep peptides that are observed in at least 66.6% of the samples, which roughly corresponds to 2 samples per condition. This is an arbitrary value that may need to be adjusted depending on the experiment and the data set.

nObs <- 4
n <- ncol(qf[["precursors"]])

(qf <- filterNA(qf, i = "precursors", pNA = (n - nObs) / n))

An instance of class QFeatures (type: bulk) with 7 sets:

 [1] B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA: SummarizedExperiment with 32848 rows and 1 columns 
 [2] B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA: SummarizedExperiment with 33491 rows and 1 columns 
 [3] B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_2: SummarizedExperiment with 31945 rows and 1 columns 
 [4] B000282_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_2: SummarizedExperiment with 32327 rows and 1 columns 
 [5] B000302_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_3: SummarizedExperiment with 31496 rows and 1 columns 
 [6] B000306_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_3: SummarizedExperiment with 32829 rows and 1 columns 
 [7] precursors: SummarizedExperiment with 31920 rows and 6 columns

4.2.4 Filter one-hit wonders

Here, we remove proteins that can only be found by one peptide, as such proteins may not be trustworthy.

We first calculate how many distinct peptides map to each protein (Protein.Group). We use the stripped precursor sequence, i.e. sequence of the base peptide for this purpose.
We store this information in the row data of the precursors assay
We filter precursors of one-hit wonder proteins.

# Filter for peptides per protein
pepsPerProtDf <- qf[["precursors"]] |> 
    rowData() |> 
    data.frame() |>
    dplyr::select("Stripped.Sequence", "Protein.Group") |>
    group_by(Protein.Group) |>
    mutate(pepsPerProt = Stripped.Sequence |>
               unique() |> length()
           ) #1.

rowData(qf[["precursors"]])$pepsPerProt <- pepsPerProtDf$pepsPerProt #2. 

qf <- filterFeatures(qf, 
                              ~ pepsPerProt > 1, 
                              keep = TRUE) #3.

'pepsPerProt' found in 1 out of 7 assay(s).

4.3 Log-transformation

We perform log2-transformation with logTransform() from the QFeatures package. We use base = 2 and store the result in a new summarized experiment named precursors_log

qf <- logTransform(qf, 
                            base = 2, 
                            i = "precursors", 
                            name = "precursors_log")

4.4 Normalisation

The most common objective of MS-based proteomics experiments is to understand the biological changes in protein abundance between experimental conditions. However, changes in measurements between groups can be caused due to technical factors. For instance, there are systematic fluctuations from run-to-run that shift the measured intensity distribution. We can this explore as follows:

We extract the sets containing the log transformed data. This is performed using QFeatures’ 3-way subsetting.
We use longForm() to convert the QFeatures object into a long table, including condition and concentration for filtering and colouring.
We visualise the density of the quantitative values within each sample. We colour each sample based on its spike-in condition.

qf[, , "precursors_log"] |> #1.
  longForm(colvars = colnames(colData(qf))) |>  #2. 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + #3.
  aes(x = value,
      colour = condition,
      group = colname) +
  geom_density() +
  theme_minimal()

Warning: 'experiments' dropped; see 'drops()'

harmonizing input:
  removing 12 sampleMap rows not in names(experiments)

There are many ways to perform the normalisation, e.g. median centering is a popular choice. If we subtract the sample median at the log scale from each precursor log2 intensity, this basically boils down to calculating log-ratio’s between the precursor intensity and its sample median.

Here, we will use the Median of Ratios method of DESeq2, which was originally developed for RNA-seq data analysis and can also correct for differences in composition of the proteomes in the different samples. The method first calculates pseudo-reference sample (row-wise mean of log2 intensities, which corresponds to the log2 transformed geometric mean). THen calculates log2 ratios of each sample w.r.t. the pseudo-reference. Finally the column wise median of these log2 ratios is taken to obtain the sample based normalisation factor on the log2 scale. This procedure is implemented in the msqrob2 function nfLogMedianOfRatios.

We calculate the log transformed normalisation factor from the Median of Ratios method
Subtract these log2-norm factors from the intensities of each corresponding column of the assay data and store the result in the new assay peptides_norm. (We adopt the sweep function to the peptides_log assay of the spikein QFeatures object with as statistic the log2 normfactor STATS = nfLog the default function FUN = "-", MARGIN = 2 to substract the column wise log2 norm factor from each entry of the corresponding assay data).

qf <- sweep( #4. Subtract log2 norm factor column-by-column (MARGIN = 2)
  qf,
  MARGIN = 2,
  STATS = nfLogMedianOfRatios(qf,"precursors_log"),
  i = "precursors_log",
  name = "precursors_norm"
)

This function aims to calculate norm factors on a log scale, 
          the input data are assumed to be on the log-scale!

We explore the effect of the global normalisation in the subsequent plot.

Formally, the function applies the following operation on each sample \(i\) across all precursors \(p\):

\[ y_{ip}^{\text{norm}} = y_{ip} - \log_2(nf_i) \]

with \(y_ip\) the log2-transformed intensities and \(nf_i\) the log2-transformed norm factor. Upon normalisation, we can see that the distribution of the \(y_{ip}^{\text{norm}}\) nicely overlap (using the same code as above)

qf[, , "precursors_norm"] |>
  longForm(colvars = colnames(colData(qf))) |> 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + 
  aes(x = value,
      colour = condition,
      group = colname) +
  geom_density() +
  labs(subtitle = "Normalised log2 precursor intensities") +
  theme_minimal()

Warning: 'experiments' dropped; see 'drops()'

harmonizing input:
  removing 18 sampleMap rows not in names(experiments)

4.5 Summarisation

Here, we summarise the precursor-level data into protein intensities using maxLFQ. maxLFQ first calculates all pairwise log ratio’s between samples only using their shared precursors. Particularly, it uses the median of the log ratio’s between the shared precursors s, i.e. \[r_{ij} = median(y_{sj}-y_{si})\].
Hence, it first eliminates peptide effects. It then estimates the summaries by solving

\[ \sum_i\sum_j(y^{prot}_j - y^{prot}_i - r_ij)^2 \] It is implemented in the maxLFQ function of the iq package.

aggregateFeatures() streamlines summarisation. It requires the name of a rowData column to group the precursors into proteins (or protein groups), here Protein.Group. We provide the summarisation approach through the fun argument. Other summarisation methods are available from the MsCoreUtils package, see ?aggregateFeatures for a comprehensive list. The function will return a QFeatures object with a new set that we called proteins.

(qf <- aggregateFeatures(
  qf, i = "precursors_norm", 
  name = "proteins",
  fcol = "Protein.Group", 
  fun = function(X, ...) iq::maxLFQ(X)$estimate
))

Your quantitative data contain missing values. Please read the relevant
section(s) in the aggregateFeatures manual page regarding the effects
of missing values on data aggregation.


Aggregated: 1/1

An instance of class QFeatures (type: bulk) with 10 sets:

 [1] B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA: SummarizedExperiment with 32848 rows and 1 columns 
 [2] B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA: SummarizedExperiment with 33491 rows and 1 columns 
 [3] B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_2: SummarizedExperiment with 31945 rows and 1 columns 
 ...
 [8] precursors_log: SummarizedExperiment with 31524 rows and 6 columns 
 [9] precursors_norm: SummarizedExperiment with 31524 rows and 6 columns 
 [10] proteins: SummarizedExperiment with 2991 rows and 6 columns

5 Data exploration and QC

Data exploration aims to highlight the main sources of variation in the data prior to data modelling and can pinpoint to outlying or off-behaving samples.

5.1 Marginal distribution at precursor and protein level

qf[, , "precursors_norm"] |>
  longForm(colvars = colnames(colData(qf))) |> 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + 
  aes(x = value,
      colour = condition,
      group = colname) +
  geom_density() +
  theme_minimal() +
   labs(subtitle = "Normalised log2 precursor intensities")

Warning: 'experiments' dropped; see 'drops()'

harmonizing input:
  removing 24 sampleMap rows not in names(experiments)

qf[, , "precursors_norm"] |>
  longForm(colvars = colnames(colData(qf))) |> 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + 
  aes(x = sampleId,
      y = value,
      colour = condition,
      group = colname) +
  xlab("sample") +
  geom_boxplot() +
  theme_minimal() +
   labs(subtitle = "Normalised log2 precursor intensities")

Warning: 'experiments' dropped; see 'drops()'

harmonizing input:
  removing 24 sampleMap rows not in names(experiments)

qf[, , "proteins"] |>
  longForm(colvars = colnames(colData(qf))) |> 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + 
  aes(x = sampleId,
      y = value,
      colour = condition,
      group = colname) +
  xlab("sample") +
  geom_boxplot() +
  theme_minimal() +
   labs(subtitle = "Normalised log2 protein intensities")

Warning: 'experiments' dropped; see 'drops()'

harmonizing input:
  removing 24 sampleMap rows not in names(experiments)

qf[, , "proteins"] |>
  longForm(colvars = colnames(colData(qf))) |> 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + 
  aes(x = value,
      colour = condition,
      group = colname) +
  geom_density() +
  theme_minimal() +
   labs(subtitle = "Normalised log2 protein intensities")

Warning: 'experiments' dropped; see 'drops()'

harmonizing input:
  removing 24 sampleMap rows not in names(experiments)

5.2 Charge state

qf[, , "precursors_norm"] |>
  longForm(colvars = colnames(colData(qf)), rowvars = "Precursor.Charge") |>
  as.data.frame() |>
  filter(!is.na(value)) |>
  filter(Precursor.Charge<=4) |> 
ggplot(aes(x = sampleId)) +
  geom_bar(aes(fill = factor(Precursor.Charge, levels = 4:1)),
           colour = "black") +
  labs(title = "Peptide types per sample",
       x = "Sample",
       fill = "Charge state") +
  theme_bw()

Warning: 'experiments' dropped; see 'drops()'

harmonizing input:
  removing 24 sampleMap rows not in names(experiments)

5.3 Identifications per sample

qf[,,"precursors_norm"] |> 
  longForm(colvars = colnames(colData(qf)), 
           rowvars= c("Precursor.Id", 
                      "Protein.Group", 
                      "Stripped.Sequence",
                      "Precursor.Charge")) |>
  data.frame() |>
  filter(!is.na(value)) |>
  mutate(cond_rep = paste0("ratio", condition, "_", rep)) |>
  group_by(condition, cond_rep) |>
  summarise(Precursors = length(unique(Precursor.Id)),
            `Protein Groups` = length(unique(Protein.Group))) |> 
  pivot_longer(-(1:2),
               names_to = "Feature",
               values_to = "IDs") |> 
  ggplot(aes(x = cond_rep, y = IDs, fill = condition)) +
  geom_col() +
  #scale_fill_observable() +
  facet_wrap(~Feature,
             scales = "free_y") +
  labs(title = "Precursor and protein group identificiations per sample",
       x = "Sample",
       y = "Identifications") +
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90))

Warning: 'experiments' dropped; see 'drops()'

harmonizing input:
  removing 24 sampleMap rows not in names(experiments)

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by condition and cond_rep.
ℹ Output is grouped by condition.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(condition, cond_rep))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

# Data Modeling (Robust Regression){#sec-modelling}

5.4 Dimensionality reduction plot

A common approach for data exploration is to perform dimension reduction, such as Multi Dimensional Scaling (MDS). We will first extract the set to explore along the sample annotations (used for plot colouring).

se <- getWithColData(qf, "proteins")

Warning: 'experiments' dropped; see 'drops()'

We then use the scater package to compute and plot the PCA. For technical reasons, it requires SingleCellExperiment class object, but these can easily be generated from a SummarizedExperiment object.

se <- runMDS(as(se, "SingleCellExperiment"), exprs_values = 1)

We can now explore the data structure while coloring for the factor of interest, here condition.

plotMDS(se, colour_by = "condition", shape_by = "rep")

MDS visualisation of the spike-in data set. Each point represents a sample and is coloured by the spike-in condiion.

This plot reveals interesting information. First, we see that the samples are nicely separated according to their spike-in condition. Interestingly, the conditions are sorted by the concentration.

5.5 Correlation matrix

corMat <- qf[["proteins"]] |> 
  assay() |>
  cor(method = "spearman", use = "pairwise.complete.obs") 

colnames(corMat) <- qf$sampleId
rownames(corMat) <- qf$sampleId
corMat |> 
  ggcorrplot::ggcorrplot() +
  scale_fill_viridis_c() +
  labs(title = "Correlation matrix",
       fill = "Correlation") +
  theme(axis.text.x = element_text(angle = 90))

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggcorrplot package.
  Please report the issue at <https://github.com/kassambara/ggcorrplot/issues>.

Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

6 Data Modeling (Robust Regression)

6.1 Model estimation

We first define the model. We only have one sources of variability in the experiment that we can model, i.e. the effect of the spike-in condition.
We fit the model to each protein in the proteins summarised experiment of the QFeatures object qf using the msqrob function.

model <- ~ condition

qf <- msqrob(
  qf,
  i = "proteins",
  formula = model,
  robust = TRUE)

We enabled M-estimation (robust = TRUE) for improved robustness against outliers.

The fitting results are available in the msqrobModels column of the rowData. More specifically, the modelling output is stored in the rowData as a statModel object, one model per row (protein). We will see in a later section how to perform statistical inference on the estimated parameters.

models <- rowData(qf[["proteins"]])[["msqrobModels"]]
models[1:3]

$D6VTK4
Object of class "StatModel"
The type is "rlm"
There number of elements is 5 

$O13297
Object of class "StatModel"
The type is "rlm"
There number of elements is 5 

$O13329
Object of class "StatModel"
The type is "rlm"
There number of elements is 5

6.2 Inference

We can now convert the research question “do the spike-in conditions affect the protein intensities?” into a statistical hypothesis.

In other words, we need to translate this question in a null and alternative hypothesis on a single model parameter or a linear combination of model parameters, which is also referred to with a contrast.

To aid defining contrasts, we will visualise the experimental design using the ExploreModelMatrix package.

vd <- ExploreModelMatrix::VisualizeDesign(
    sampleData =  colData(qf),
    designFormula = model,
    textSizeFitted = 4
)
vd$plotlist

[[1]]

We have a two group comparison and the average log2 fold change between condition 4 and condition 2 can be estimated using parameter:

\[ \log_2 \widehat{\text{FC}}^{4-2} = \hat\mu^4 - \hat\mu^2 = ((Intercept) + condition4) - (Intercept) = condition4 \] So we can assess the null hypothesis that a protein is not differentially abundant between spike-in condition 4 and spike-in condition 2 as follows \[ H_0 \log_2 \widehat{\text{FC}}^{4-2} = 0 \text{ or } condition4 = 0 \] Below we define the contrast and construct the corresponding contrast matrix.

contrast <- "condition4 = 0"
(L <- makeContrast(
  contrast,
   parameterNames = colnames(vd$designmatrix)
))

            condition4
(Intercept)          0
condition4           1

We assess the contrast for each protein.

qf <- hypothesisTest(qf, i = "proteins", contrast = L)

We extract the results table from the proteins summarised experiment in the qf object.

inference <- 
  msqrobCollect(qf[["proteins"]], L)

6.3 Report results

We report the results using a results table, volcano plots and heatmaps.

6.3.1 Results table

We report all results that are significant at the nominal FDR level of 5%.

We define the nominal FDR level
We filter the results
We arrange them according to statistical significance

alpha  <- 0.05
inference |> 
  filter(adjPval < alpha) |> 
  arrange(pval) |>
  knitr::kable()

	logFC	se	df	t	pval	adjPval	contrast	feature
condition4.UPS\|P02768ups\|ALBU_HUMAN_UPS	0.9528549	0.0634975	6.487500	15.006191	0.0000028	0.0039920	condition4	UPS\|P02768ups\|ALBU_HUMAN_UPS
condition4.UPS\|P63165ups\|SUMO1_HUMAN_UPS	1.0505340	0.0716982	6.484535	14.652160	0.0000033	0.0039920	condition4	UPS\|P63165ups\|SUMO1_HUMAN_UPS
condition4.UPS\|P00918ups\|CAH2_HUMAN_UPS	0.9423899	0.0667104	6.371761	14.126580	0.0000048	0.0039920	condition4	UPS\|P00918ups\|CAH2_HUMAN_UPS
condition4.UPS\|P02144ups\|MYG_HUMAN_UPS	0.9597937	0.0693280	6.366383	13.844253	0.0000055	0.0039920	condition4	UPS\|P02144ups\|MYG_HUMAN_UPS
condition4.UPS\|Q06830ups\|PRDX1_HUMAN_UPS	0.8758805	0.0682188	6.487500	12.839283	0.0000075	0.0039920	condition4	UPS\|Q06830ups\|PRDX1_HUMAN_UPS
condition4.UPS\|P06732ups\|KCRM_HUMAN_UPS	1.0579755	0.0778549	6.154138	13.589068	0.0000081	0.0039920	condition4	UPS\|P06732ups\|KCRM_HUMAN_UPS
condition4.UPS\|P04040ups\|CATA_HUMAN_UPS	0.9148158	0.0748686	6.407758	12.218950	0.0000112	0.0047610	condition4	UPS\|P04040ups\|CATA_HUMAN_UPS
condition4.UPS\|P00915ups\|CAH1_HUMAN_UPS	0.8637199	0.0787941	6.487500	10.961727	0.0000200	0.0069082	condition4	UPS\|P00915ups\|CAH1_HUMAN_UPS
condition4.UPS\|P02753ups\|RETBP_HUMAN_UPS	1.1035108	0.1012478	6.481259	10.899107	0.0000209	0.0069082	condition4	UPS\|P02753ups\|RETBP_HUMAN_UPS
condition4.UPS\|P62937ups\|PPIA_HUMAN_UPS	0.9416590	0.0757976	5.569455	12.423343	0.0000285	0.0076967	condition4	UPS\|P62937ups\|PPIA_HUMAN_UPS
condition4.UPS\|P12081ups\|SYHC_HUMAN_UPS	1.0073028	0.0826043	5.621870	12.194311	0.0000295	0.0076967	condition4	UPS\|P12081ups\|SYHC_HUMAN_UPS
condition4.UPS\|P15559ups\|NQO1_HUMAN_UPS	0.9345225	0.0915209	6.487500	10.211028	0.0000310	0.0076967	condition4	UPS\|P15559ups\|NQO1_HUMAN_UPS
condition4.UPS\|P62988ups\|UBIQ_HUMAN_UPS	0.8392221	0.0824325	6.148863	10.180720	0.0000448	0.0092216	condition4	UPS\|P62988ups\|UBIQ_HUMAN_UPS
condition4.UPS\|P63279ups\|UBC9_HUMAN_UPS	0.8895154	0.0926709	6.487500	9.598646	0.0000454	0.0092216	condition4	UPS\|P63279ups\|UBC9_HUMAN_UPS
condition4.UPS\|O76070ups\|SYUG_HUMAN_UPS	1.1986991	0.1253296	6.484535	9.564376	0.0000465	0.0092216	condition4	UPS\|O76070ups\|SYUG_HUMAN_UPS
condition4.UPS\|P01031ups\|CO5_HUMAN_UPS	0.9610913	0.1046469	6.487500	9.184136	0.0000594	0.0110382	condition4	UPS\|P01031ups\|CO5_HUMAN_UPS
condition4.UPS\|P01008ups\|ANT3_HUMAN_UPS	0.8674849	0.0971549	6.484535	8.928887	0.0000706	0.0123592	condition4	UPS\|P01008ups\|ANT3_HUMAN_UPS
condition4.UPS\|P68871ups\|HBB_HUMAN_UPS	0.8686510	0.1072121	6.487500	8.102172	0.0001264	0.0208844	condition4	UPS\|P68871ups\|HBB_HUMAN_UPS
condition4.UPS\|P69905ups\|HBA_HUMAN_UPS	0.9129100	0.1149937	6.282867	7.938786	0.0001682	0.0263382	condition4	UPS\|P69905ups\|HBA_HUMAN_UPS

We observe that the results only contain spiked UPS proteins.

6.3.2 Volcanoplots

volcanoplot4 <- plotVolcano(inference)
volcanoplot4

Note, that the log fold changes are nicely around the real log2 fold change: \(log2(4/2) = 1\)

6.3.3 Heatmaps

We make the heatmap as follows

We select the names of the proteins that were declared significant between condition A and condition B and extract their quantitative data.
We extract the quantitative data with assay() and scale by rows.
We will create a heatmap using the ComplexHeatmap package, which enables heatmap annotations. We will annotate the heatmap using our model variables condition and lab.
We make the heatmap

sig <- inference |> 
  filter(adjPval < alpha) |> 
  arrange(pval) |> 
  pull(feature) #1.

se <- getWithColData(qf, "proteins")

Warning: 'experiments' dropped; see 'drops()'

quants <- t(scale(t(assay(se[sig,])))) #2. 

colnames(quants) <- paste0("con", se$condition,"_rep",se$rep) #specific to this dataset to get short colnames
annotations <- columnAnnotation(
  condition = se$condition
) #3.

set.seed(1234) ## annotation colours are randomly generated by default
heatmap <- Heatmap(
                       quants, 
                       name = "log2 intensity",
                       top_annotation = annotations, 
                       column_title = paste0(contrast, " = 0")
                       ) #4.
heatmap

6.3.4 Detail plots

We can explore the data for a protein to validate the statistical inference results. For example, let’s explore the normalised precursor and the summarised protein intensities for the protein with the most significant log2 fold change.

(targetProtein <- inference |> 
   slice(which.min(pval)) |> 
   pull(feature)
)

[1] "UPS|P02768ups|ALBU_HUMAN_UPS"

To obtain the required data, we perform a little data manipulation pipeline:

We use the QFeatures subsetting functionality to retrieve all data related to and focusing on the peptides_log and proteins sets that contains the peptide ion data used for model fitting. We then convert the data with longForm() for plotting. Finally, we plot the log2 normalised intensities for each sample at the protein and at the peptide level. Since multiple peptides are recorded for the protein, we link peptides across samples using a grey line. Samples are colored according to E. coli spike-in condition.

qf[targetProtein, , c("precursors_norm", "proteins")] |> #1
  longForm(colvars = colnames(colData(qf))) |> #2
  data.frame() |> 
  ggplot() +
  aes(x = colname,
      y = value) +
  geom_line(aes(group = rowname), linewidth = 0.1) +
  geom_point(aes(colour = condition)) +
  facet_wrap(~ assay, scales = "free") +
  ggtitle(targetProtein) +
  theme_minimal() +
  theme(axis.text.x = element_blank())

Warning: 'experiments' dropped; see 'drops()'

harmonizing input:
  removing 18 sampleMap rows not in names(experiments)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).

Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_point()`).

qf[targetProtein, , c("precursors_norm", "proteins")] |> #1
  longForm(colvars = colnames(colData(qf))) |> #2
  data.frame() %>% 
  {
  ggplot(.) +
  aes(x = condition,
      y = value) +
  geom_boxplot(aes(colour = condition)) +
  facet_wrap(~ assay, scales = "free") +
  geom_jitter(aes(shape = rowname)) +
  scale_shape_manual(values = seq_len(dplyr::n_distinct(.$rowname))) +
  ggtitle(targetProtein) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) +
  guides(shape = "none")
  }

Warning: 'experiments' dropped; see 'drops()'

harmonizing input:
  removing 18 sampleMap rows not in names(experiments)

Warning: Removed 9 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_point()`).

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '29'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '28'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '29'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '28'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '26'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '26'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '29'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '28'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '26'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '29'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '28'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '26'

Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'

References

Staes, An, Teresa Mendes Maia, Sara Dufour, et al. 2024. “Benefit of in Silico Predicted Spectral Libraries in Data‑independent Acquisition Data Analysis Workflows.” Journal of Proteome Research 23 (6): 2078–89. https://doi.org/10.1021/acs.jproteome.4c00048.