Tutorial: Intro to DIA proteomics data processing

Author
Affiliation

Lieven Clement

Ghent University

1 Background

We will use using a publicly available spike-in study published by Staes et al. (Staes et al. 2024). They spiked digested UPS proteins in a yeast digested background at the following ratio’s (yeast:ups ratio 10:1, 10:2, 10:4, 10:8, 10:10). Here we will use a subset of the data, i.e. dilutions 10:2 and 10:4. We will use output of the search engine DIA-NN 2.2.0. The main search output for this DIA-NN version was stored in the report.parquet file in the DIA-NN output directory, which can be found under data/spikein24-staesetal2024.parquet

DIA-NN provides multiple quantifications, e.g. derived from the MS1 or MS2 spectra, and at precursor or protein (protein group) level. The term ‘precursor’ refers to a charged peptide species and is the basic unit of identification and quantification in DIA. Hence, in the context of DIA we refer to a precursor table, instead of to a PSM table in DDA.

Examples of different quantities are:

  • raw MS1 area: Ms1.Area, normalised MS1 Area: Ms1.Normalised, MS2 Precursor quantities: Precursor.Quantity, Normalised MS2 Precursor quantities: Precursor.Normalised, etc., which are all at the precursor level
  • MS2 based summary at the protein (protein group)-level: PG.MaxLFQ

Here, we will use the Precursor.Quantity column.

2 Load packages

We load the msqrob2 package, along with additional packages for data manipulation and visualisation.

library("QFeatures")  
Loading required package: MultiAssayExperiment
Warning: package 'MultiAssayExperiment' was built under R version 4.5.3
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: 'MatrixGenerics'
The following objects are masked from 'package:matrixStats':

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: generics

Attaching package: 'generics'
The following objects are masked from 'package:base':

    as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
    setequal, union

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
    unsplit, which.max, which.min
Loading required package: S4Vectors

Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':

    findMatches
The following objects are masked from 'package:base':

    expand.grid, I, unname
Loading required package: IRanges
Loading required package: Seqinfo
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':

    rowMedians
The following objects are masked from 'package:matrixStats':

    anyMissing, rowMedians

Attaching package: 'QFeatures'
The following object is masked from 'package:base':

    sweep
library("dplyr") 

Attaching package: 'dplyr'
The following object is masked from 'package:Biobase':

    combine
The following objects are masked from 'package:GenomicRanges':

    intersect, setdiff, union
The following object is masked from 'package:Seqinfo':

    intersect
The following objects are masked from 'package:IRanges':

    collapse, desc, intersect, setdiff, slice, union
The following objects are masked from 'package:S4Vectors':

    first, intersect, rename, setdiff, setequal, union
The following objects are masked from 'package:BiocGenerics':

    combine, intersect, setdiff, setequal, union
The following object is masked from 'package:generics':

    explain
The following object is masked from 'package:matrixStats':

    count
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library("tidyr")

Attaching package: 'tidyr'
The following object is masked from 'package:S4Vectors':

    expand
library("ggplot2")
library("msqrob2")    
library("stringr")
library("ExploreModelMatrix")
library("MsCoreUtils") 

Attaching package: 'MsCoreUtils'
The following object is masked from 'package:dplyr':

    between
The following object is masked from 'package:GenomicRanges':

    reduce
The following object is masked from 'package:IRanges':

    reduce
The following object is masked from 'package:MatrixGenerics':

    colCounts
The following object is masked from 'package:matrixStats':

    colCounts
The following object is masked from 'package:stats':

    smooth
library("matrixStats")
library("patchwork")
library("kableExtra")

Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':

    group_rows
library("ComplexHeatmap")
Loading required package: grid
========================================
ComplexHeatmap version 2.26.1
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite either one:
- Gu, Z. Complex Heatmap Visualization. iMeta 2022.
- Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
    genomic data. Bioinformatics 2016.


The new InteractiveComplexHeatmap package can directly export static 
complex heatmaps into an interactive Shiny app with zero effort. Have a try!

This message can be suppressed by:
  suppressPackageStartupMessages(library(ComplexHeatmap))
========================================
library("purrr")

Attaching package: 'purrr'
The following object is masked from 'package:MsCoreUtils':

    reduce
The following object is masked from 'package:GenomicRanges':

    reduce
The following object is masked from 'package:IRanges':

    reduce
library("tibble")
library("scater")
Loading required package: SingleCellExperiment
Loading required package: scuttle

3 Data

3.1 Precursor table

We load the output from DIA-NN parquet file. Can be file path to local file or url to file that lives on the web.

precursorFile <- "https://github.com/statOmics/msqrob2data/raw/refs/heads/main/dia/spikeinStaes/spikein24-staesetal2024.parquet" 

We can import the report.parquet file using the read_parquet function from the arrow package. Note, that older versions of DIA-NN store the output as report.tsv.

precursors <- arrow::read_parquet(precursorFile) # function from the arrow package
#precursors <- data.table::fread(precursorFile) # For older versions of DIA-NN, where the results are stored as tsv files. Note that the precursorFile then would point to "report.tsv" 

Each row in the precursor data table is in “long format” and contains information about one precursor in a specific run (the table below shows the first 6 rows). The columns contains various descriptors about the precursor, such as its sequence, its charge, run, etc. Some of these columns contain the quantification values, e.g. Ms1.Area, Precursor.Quantity etc.

Run.Index Run Channel Precursor.Id Modified.Sequence Stripped.Sequence Precursor.Charge Precursor.Lib.Index Decoy Proteotypic Precursor.Mz Protein.Ids Protein.Group Protein.Names Genes RT iRT Predicted.RT Predicted.iRT IM iIM Predicted.IM Predicted.iIM Precursor.Quantity Precursor.Normalised Ms1.Area Ms1.Normalised Ms1.Apex.Area Ms1.Apex.Mz.Delta Normalisation.Factor Quantity.Quality Empirical.Quality Normalisation.Noise Ms1.Profile.Corr Evidence Mass.Evidence Channel.Evidence Ms1.Total.Signal.Before Ms1.Total.Signal.After RT.Start RT.Stop FWHM PG.TopN PG.MaxLFQ Genes.TopN Genes.MaxLFQ Genes.MaxLFQ.Unique PG.MaxLFQ.Quality Genes.MaxLFQ.Quality Genes.MaxLFQ.Unique.Quality Q.Value PEP Global.Q.Value Lib.Q.Value Peptidoform.Q.Value Global.Peptidoform.Q.Value Lib.Peptidoform.Q.Value PTM.Site.Confidence Site.Occupancy.Probabilities Protein.Sites Lib.PTM.Site.Confidence Translated.Q.Value Channel.Q.Value PG.Q.Value PG.PEP GG.Q.Value Protein.Q.Value Global.PG.Q.Value Lib.PG.Q.Value Best.Fr.Mz Best.Fr.Mz.Delta
0 B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA (UniMod:1)AAVTLHLR2 (UniMod:1)AAVTLHLR AAVTLHLR 2 0 0 1 461.7771 P38998 P38998 LYS1_YEAST LYS1 89.48647 34.21962 89.64584 34.32372 0 0 0 0 1027105.2 1018664.5 1664103 1650427 1235652.9 -0.0019836 0.9917820 0.8068524 0.9587077 -0.0107658 0.8455485 2.328422 0.6659890 0.9694718 1446081536 1354918656 89.27837 89.83672 0.1471047 0 60736892 0 60736892 60736892 0.9713600 0.9713600 0.9713600 0.0037350 0.0346908 6.00e-07 1.80e-06 0.0058624 0.0006297 0.0004264 1 (UniMod:1){1.000000}AAVTLHLR2 [P38998:nA2] 1 0 0 0.0003065 0.0006127 0.0003067 0.0003116 0.0002956 0.0002814 639.3937 -0.0043335
1 B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA (UniMod:1)AAVTLHLR2 (UniMod:1)AAVTLHLR AAVTLHLR 2 0 0 1 461.7771 P38998 P38998 LYS1_YEAST LYS1 88.51812 34.21962 88.65327 34.23263 0 0 0 0 1686047.0 1473639.0 2225964 1945538 2138060.2 -0.0012512 0.8740202 0.8539391 0.7675559 0.0606278 0.9467068 2.621551 0.9197075 0.7442634 1707504896 1714693888 88.38112 88.65714 0.1049480 0 59386180 0 59386180 59386180 0.9682981 0.9682981 0.9682981 0.0000006 0.0000010 6.00e-07 1.80e-06 0.0007060 0.0006297 0.0004264 1 (UniMod:1){1.000000}AAVTLHLR2 [P38998:nA2] 1 0 0 0.0003028 0.0006007 0.0003029 0.0003108 0.0002956 0.0002814 538.3460 -0.0010986
3 B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_2 (UniMod:1)AAVTLHLR2 (UniMod:1)AAVTLHLR AAVTLHLR 2 0 0 1 461.7771 P38998 P38998 LYS1_YEAST LYS1 88.94958 34.21962 89.25366 33.75805 0 0 0 0 1202390.5 1186230.2 1752379 1728827 1130526.8 -0.0038757 0.9865599 0.7660662 0.7737662 0.0300994 0.5843989 2.661267 0.6917942 0.8114025 2323055360 2137220352 88.61057 89.15668 0.2167290 0 61787356 0 61787356 61787356 0.9639115 0.9639115 0.9639115 0.0026284 0.0164647 6.00e-07 1.80e-06 0.0061239 0.0006297 0.0004264 1 (UniMod:1){1.000000}AAVTLHLR2 [P38998:nA2] 1 0 0 0.0003246 0.0006370 0.0003248 0.0003308 0.0002956 0.0002814 425.2619 0.0019226
6 B000302_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_3 (UniMod:1)AAVTLHLR2 (UniMod:1)AAVTLHLR AAVTLHLR 2 0 0 1 461.7771 P38998 P38998 LYS1_YEAST LYS1 88.94778 34.21962 89.22704 33.83727 0 0 0 0 1211047.2 1278841.4 1902638 2009148 1505334.4 -0.0023804 1.0559797 0.8122462 0.9176636 -0.0058500 0.8568795 2.149787 0.2013753 0.9174258 1576011776 1611237888 88.80907 89.22405 0.1854964 0 57830312 0 57830312 57830312 0.9690518 0.9690518 0.9690518 0.0038881 0.0323859 6.00e-07 1.80e-06 0.0095169 0.0006297 0.0004264 1 (UniMod:1){1.000000}AAVTLHLR2 [P38998:nA2] 1 0 0 0.0003108 0.0006149 0.0003110 0.0003160 0.0002956 0.0002814 639.3937 -0.0004883
7 B000306_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_3 (UniMod:1)AAVTLHLR2 (UniMod:1)AAVTLHLR AAVTLHLR 2 0 0 1 461.7771 P38998 P38998 LYS1_YEAST LYS1 89.02591 34.21962 89.40517 33.85801 0 0 0 0 952255.9 960217.9 2226282 2244896 724402.3 0.0000000 1.0083612 0.8162364 0.6293753 0.0134552 0.8299713 2.589256 0.3061074 0.8181052 1451911296 1386510976 88.81858 89.30609 0.1775406 0 58230180 0 58230180 58230180 0.9686975 0.9686975 0.9686975 0.0029205 0.0223659 6.00e-07 1.80e-06 0.0091256 0.0006297 0.0004264 1 (UniMod:1){1.000000}AAVTLHLR2 [P38998:nA2] 1 0 0 0.0003064 0.0006060 0.0003066 0.0003114 0.0002956 0.0002814 639.3937 0.0001221
0 B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA (UniMod:1)ADFLLRPIK2 (UniMod:1)ADFLLRPIK ADFLLRPIK 2 1 0 1 557.8346 P38768 P38768 PIH1_YEAST PIH1 121.34409 76.86208 121.09219 77.28922 0 0 0 0 5792544.5 5805857.5 6862994 6878766 5909809.0 -0.0009155 1.0022984 0.9153399 0.8982903 -0.0820856 0.9530441 4.035011 0.9436575 0.9152272 2151185664 2523537408 121.13556 121.55209 0.2078343 0 5839721 0 5839721 5839721 0.9533337 0.9533337 0.9533337 0.0001971 0.0008056 5.94e-05 4.22e-04 0.0038466 0.0000434 0.0004264 1 (UniMod:1){1.000000}ADFLLRPIK2 [P38768:nA2] 1 0 0 0.0003065 0.0006127 0.0003067 0.0003116 0.0002956 0.0002814 739.5189 0.0064697

Quick check on distribution of precursors MS2 intensities.

precursors |> 
  ggplot(aes(x = log2(Precursor.Quantity))) +
  geom_density() +
  theme_minimal()
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_density()`).

We filter the data to reduce the memory footprint.

precursors <- precursors |> 
  select(
         Run, 
         Precursor.Id, 
         Modified.Sequence, 
         Stripped.Sequence, 
         Precursor.Charge, 
         Protein.Group, 
         Protein.Names,
         Protein.Ids,
         Genes,
         Precursor.Quantity,
         Precursor.Normalised,
         Normalisation.Factor,
         Ms1.Area,
         Ms1.Normalised,
         PG.MaxLFQ,
         Q.Value, 
         Lib.Q.Value,
         PG.Q.Value,
         Lib.PG.Q.Value,
         Proteotypic,
         Decoy, # Not available in older versions of DIA-NN
         RT)

We known the ground truth: UPS proteins are differentially abundant (DA, spiked in), Yeast proteins are not.

precursors <- precursors |> 
  mutate(species = grepl(pattern = "UPS",Protein.Group) |> 
           as.factor() |>
           recode("TRUE"="ups","FALSE" = "yeast"))
precursors |> 
  pull(species) |>
  table()

 yeast    ups 
201827   2810 

3.2 Sample annotation table

The sample annotation table) is not available and can be generated from the run labels, as the researchers included information on the design in the filenames. The sample annotation table) is not available and can be generated from the run labels, as the researchers included information on the design in the filenames.

We will make a new data frame with the annotation.

  1. Extract unique Run labels
  2. Split Run label into variables using the “_” separator
  3. Rename the variable Run to runCol (needed for readQFeatures function)
  4. Add sampleId
  5. Sort annotation file according to sampleId
(annot <- precursors |> 
  dplyr::distinct(Run) |> 
  mutate(runCol=Run) |> 
  separate(Run, 
           into = c(paste("label",1:8), "condition","label9","rep")) |>
  mutate(
    condition = substr(condition,7,7), 
    ratio = as.double(condition),
    rep = ifelse(is.na(rep),1,rep),
    sampleId = paste(condition,rep, sep = "_")) |> 
  select(runCol, sampleId, condition,ratio,rep)
)
Warning: Expected 11 pieces. Missing pieces filled with `NA` in 2 rows [1, 2].
# A tibble: 6 × 5
  runCol                                          sampleId condition ratio rep  
  <chr>                                           <chr>    <chr>     <dbl> <chr>
1 B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02… 2_1      2             2 1    
2 B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04… 4_1      4             4 1    
3 B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02… 2_2      2             2 2    
4 B000302_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02… 2_3      2             2 3    
5 B000306_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04… 4_3      4             4 3    
6 B000282_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04… 4_2      4             4 2    

3.3 Convert to QFeatures

First, recall that the precursor table is file in long format. Every quantitative column in the precursor table contains information for multiple runs. Therefore, the function split the table based on the run identifier, given by the runCol argument (for DIA-NN, that identifier is contained in run). So, the QFeatures object after import will contain as many sets as there are runs. Next, the function links the annotation table with the PSM data. To achieve this, the annotation table must contain a runCol column that provides the run identifier in which each sample has been acquired, and this information will be used to match the identifiers in the Run column of the precursor table.

Here, we will use the Precursor.Quantity column as quantification input.

Note, that we filter a number of variables to reduce the footprint of the QFeatures object.

(qf <- readQFeatures(assayData = precursors,  
                              colData = annot,
                              quantCols = "Precursor.Quantity",
                              runCol = "Run",
                              fnames = "Precursor.Id"))
Checking arguments.
Loading data as a 'SummarizedExperiment' object.
Splitting data in runs.
Formatting sample annotations (colData).
Formatting data as a 'QFeatures' object.
Setting assay rownames.
An instance of class QFeatures (type: bulk) with 6 sets:

 [1] B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA: SummarizedExperiment with 34471 rows and 1 columns 
 [2] B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA: SummarizedExperiment with 35142 rows and 1 columns 
 [3] B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_2: SummarizedExperiment with 33528 rows and 1 columns 
 [4] B000282_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_2: SummarizedExperiment with 33933 rows and 1 columns 
 [5] B000302_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_3: SummarizedExperiment with 33091 rows and 1 columns 
 [6] B000306_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_3: SummarizedExperiment with 34472 rows and 1 columns 

4 Data preprocessing

The data preprocessing workflow for DIA data is similar to the workflow for DDA-LFQ data, but there are suble differences as we start from precursor level data and have additional columns.

4.1 Encoding missing values

We first replace any zero in the quantitative data with an NA.

qf <- zeroIsNA(qf, names(qf))

Note that msqrob2 can handle missing data without having to rely on hard-to-verify imputation assumptions, which is our general recommendation. However, msqrob2 does not prevent users from using imputation, which can be performed with impute() from the QFeatures package.

4.2 Precursor Filtering

Filtering removes low-quality and unreliable precursors that would otherwise introduce noise and artefacts in the data.

4.2.1 Remove questionable identifications

We apply standard filtering advised by DIA-NN:

  1. q-value threshold of 0.01 for the identification of precursors (Q.Value) and protein groups (PG.Q.Value). If the file is processed via matching between runs, it is also useful to filter on Lib.Q.Value and Lib.PG.Q.Value.

  2. Remove precursors that could not be mapped, i.e. when Precursor.Id column is an empty string.

  3. Filter decoys, i.e. only keep precursors for which the Decoy column equals 0. (Note, that the Decoy column is not present in the output of older versions of DIA-NN)

  4. Keeping only proteotypic peptides, which map uniquely to a specific protein.

qf <-  qf |>
  filterFeatures(~ Q.Value <= 0.01 & #1.
                     PG.Q.Value <= 0.01 &  #1.
                     Lib.Q.Value <= 0.01 & #1.
                     Lib.PG.Q.Value <= 0.01 & #1.
                     Precursor.Id != "" & #2.
                     Decoy == 0 & #3. (Note, that this filter is not available with previous versions of DIA-NN, as the report.tsv file did not include a Decoy column. So other strategies are needed if Decoys are in the output file.)
                     Proteotypic == 1) #4.
'Q.Value' found in 6 out of 6 assay(s).'PG.Q.Value' found in 6 out of 6 assay(s).'Lib.Q.Value' found in 6 out of 6 assay(s).'Lib.PG.Q.Value' found in 6 out of 6 assay(s).'Precursor.Id' found in 6 out of 6 assay(s).'Decoy' found in 6 out of 6 assay(s).'Proteotypic' found in 6 out of 6 assay(s).

Note, that it is important that the filtering criteria are not distorting the distribution of the test statistics in the downstream analysis for features that are non-DA. It can be shown that filtering will not induce bias results when the filtering criterion is independent of test statistic. The criteria that we proposed above are all based on the results of the identification step, hence, they are independent of the downstream test statistics that will be used to prioritize DA proteins.

4.2.2 Assay joining

Up to now, the data from different runs were kept in separate assays. We can now join the normalised sets into an precursor set using joinAssays(). Sets are joined by stacking the columns (samples) in a matrix and rows (features) are matched according to a row identifier, here the Precursor.Id. We will store the result in the assay with name: precursor.

(qf <- joinAssays(
  x = qf,
  i = names(qf),
  fcol = "Precursor.Id",
  name = "precursors"
  ))
Using 'Precursor.Id' to join assays.
An instance of class QFeatures (type: bulk) with 7 sets:

 [1] B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA: SummarizedExperiment with 32848 rows and 1 columns 
 [2] B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA: SummarizedExperiment with 33491 rows and 1 columns 
 [3] B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_2: SummarizedExperiment with 31945 rows and 1 columns 
 [4] B000282_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_2: SummarizedExperiment with 32327 rows and 1 columns 
 [5] B000302_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_3: SummarizedExperiment with 31496 rows and 1 columns 
 [6] B000306_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_3: SummarizedExperiment with 32829 rows and 1 columns 
 [7] precursors: SummarizedExperiment with 38131 rows and 6 columns 

4.2.3 Filtering: Remove highly missing precursors

We keep peptides that were observed at last 4 times out of the \(n = 6\) samples, so that we can estimate the peptide characteristics. We tolerate the following proportion of NAs: \(\text{pNA} = \frac{(n - 4)}{n} = 0.33\), so we keep peptides that are observed in at least 66.6% of the samples, which roughly corresponds to 2 samples per condition. This is an arbitrary value that may need to be adjusted depending on the experiment and the data set.

nObs <- 4
n <- ncol(qf[["precursors"]])

(qf <- filterNA(qf, i = "precursors", pNA = (n - nObs) / n))
An instance of class QFeatures (type: bulk) with 7 sets:

 [1] B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA: SummarizedExperiment with 32848 rows and 1 columns 
 [2] B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA: SummarizedExperiment with 33491 rows and 1 columns 
 [3] B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_2: SummarizedExperiment with 31945 rows and 1 columns 
 [4] B000282_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_2: SummarizedExperiment with 32327 rows and 1 columns 
 [5] B000302_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_3: SummarizedExperiment with 31496 rows and 1 columns 
 [6] B000306_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA_3: SummarizedExperiment with 32829 rows and 1 columns 
 [7] precursors: SummarizedExperiment with 31920 rows and 6 columns 

4.2.4 Filter one-hit wonders

Here, we remove proteins that can only be found by one peptide, as such proteins may not be trustworthy.

  1. We first calculate how many distinct peptides map to each protein (Protein.Group). We use the stripped precursor sequence, i.e. sequence of the base peptide for this purpose.

  2. We store this information in the row data of the precursors assay

  3. We filter precursors of one-hit wonder proteins.

# Filter for peptides per protein
pepsPerProtDf <- qf[["precursors"]] |> 
    rowData() |> 
    data.frame() |>
    dplyr::select("Stripped.Sequence", "Protein.Group") |>
    group_by(Protein.Group) |>
    mutate(pepsPerProt = Stripped.Sequence |>
               unique() |> length()
           ) #1.

rowData(qf[["precursors"]])$pepsPerProt <- pepsPerProtDf$pepsPerProt #2. 

qf <- filterFeatures(qf, 
                              ~ pepsPerProt > 1, 
                              keep = TRUE) #3.
'pepsPerProt' found in 1 out of 7 assay(s).

4.3 Log-transformation

We perform log2-transformation with logTransform() from the QFeatures package. We use base = 2 and store the result in a new summarized experiment named precursors_log

qf <- logTransform(qf, 
                            base = 2, 
                            i = "precursors", 
                            name = "precursors_log")

4.4 Normalisation

The most common objective of MS-based proteomics experiments is to understand the biological changes in protein abundance between experimental conditions. However, changes in measurements between groups can be caused due to technical factors. For instance, there are systematic fluctuations from run-to-run that shift the measured intensity distribution. We can this explore as follows:

  1. We extract the sets containing the log transformed data. This is performed using QFeatures’ 3-way subsetting.
  2. We use longForm() to convert the QFeatures object into a long table, including condition and concentration for filtering and colouring.
  3. We visualise the density of the quantitative values within each sample. We colour each sample based on its spike-in condition.
qf[, , "precursors_log"] |> #1.
  longForm(colvars = colnames(colData(qf))) |>  #2. 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + #3.
  aes(x = value,
      colour = condition,
      group = colname) +
  geom_density() +
  theme_minimal()
Warning: 'experiments' dropped; see 'drops()'
harmonizing input:
  removing 12 sampleMap rows not in names(experiments)

There are many ways to perform the normalisation, e.g. median centering is a popular choice. If we subtract the sample median at the log scale from each precursor log2 intensity, this basically boils down to calculating log-ratio’s between the precursor intensity and its sample median.

Here, we will use the Median of Ratios method of DESeq2, which was originally developed for RNA-seq data analysis and can also correct for differences in composition of the proteomes in the different samples. The method first calculates pseudo-reference sample (row-wise mean of log2 intensities, which corresponds to the log2 transformed geometric mean). THen calculates log2 ratios of each sample w.r.t. the pseudo-reference. Finally the column wise median of these log2 ratios is taken to obtain the sample based normalisation factor on the log2 scale. This procedure is implemented in the msqrob2 function nfLogMedianOfRatios.

  1. We calculate the log transformed normalisation factor from the Median of Ratios method
  2. Subtract these log2-norm factors from the intensities of each corresponding column of the assay data and store the result in the new assay peptides_norm. (We adopt the sweep function to the peptides_log assay of the spikein QFeatures object with as statistic the log2 normfactor STATS = nfLog the default function FUN = "-", MARGIN = 2 to substract the column wise log2 norm factor from each entry of the corresponding assay data).
qf <- sweep( #4. Subtract log2 norm factor column-by-column (MARGIN = 2)
  qf,
  MARGIN = 2,
  STATS = nfLogMedianOfRatios(qf,"precursors_log"),
  i = "precursors_log",
  name = "precursors_norm"
)
This function aims to calculate norm factors on a log scale, 
          the input data are assumed to be on the log-scale!

We explore the effect of the global normalisation in the subsequent plot.

Formally, the function applies the following operation on each sample \(i\) across all precursors \(p\):

\[ y_{ip}^{\text{norm}} = y_{ip} - \log_2(nf_i) \]

with \(y_ip\) the log2-transformed intensities and \(nf_i\) the log2-transformed norm factor. Upon normalisation, we can see that the distribution of the \(y_{ip}^{\text{norm}}\) nicely overlap (using the same code as above)

qf[, , "precursors_norm"] |>
  longForm(colvars = colnames(colData(qf))) |> 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + 
  aes(x = value,
      colour = condition,
      group = colname) +
  geom_density() +
  labs(subtitle = "Normalised log2 precursor intensities") +
  theme_minimal()
Warning: 'experiments' dropped; see 'drops()'
harmonizing input:
  removing 18 sampleMap rows not in names(experiments)

4.5 Summarisation

Here, we summarise the precursor-level data into protein intensities using maxLFQ. maxLFQ first calculates all pairwise log ratio’s between samples only using their shared precursors. Particularly, it uses the median of the log ratio’s between the shared precursors s, i.e.  \[r_{ij} = median(y_{sj}-y_{si})\].
Hence, it first eliminates peptide effects. It then estimates the summaries by solving

\[ \sum_i\sum_j(y^{prot}_j - y^{prot}_i - r_ij)^2 \] It is implemented in the maxLFQ function of the iq package.

aggregateFeatures() streamlines summarisation. It requires the name of a rowData column to group the precursors into proteins (or protein groups), here Protein.Group. We provide the summarisation approach through the fun argument. Other summarisation methods are available from the MsCoreUtils package, see ?aggregateFeatures for a comprehensive list. The function will return a QFeatures object with a new set that we called proteins.

(qf <- aggregateFeatures(
  qf, i = "precursors_norm", 
  name = "proteins",
  fcol = "Protein.Group", 
  fun = function(X, ...) iq::maxLFQ(X)$estimate
))
Your quantitative data contain missing values. Please read the relevant
section(s) in the aggregateFeatures manual page regarding the effects
of missing values on data aggregation.

Aggregated: 1/1
An instance of class QFeatures (type: bulk) with 10 sets:

 [1] B000254_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA: SummarizedExperiment with 32848 rows and 1 columns 
 [2] B000258_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio04_DIA: SummarizedExperiment with 33491 rows and 1 columns 
 [3] B000278_Ap_6883_EXT-765_DIA_Yeast_UPS2_ratio02_DIA_2: SummarizedExperiment with 31945 rows and 1 columns 
 ...
 [8] precursors_log: SummarizedExperiment with 31524 rows and 6 columns 
 [9] precursors_norm: SummarizedExperiment with 31524 rows and 6 columns 
 [10] proteins: SummarizedExperiment with 2991 rows and 6 columns 

5 Data exploration and QC

Data exploration aims to highlight the main sources of variation in the data prior to data modelling and can pinpoint to outlying or off-behaving samples.

5.1 Marginal distribution at precursor and protein level

qf[, , "precursors_norm"] |>
  longForm(colvars = colnames(colData(qf))) |> 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + 
  aes(x = value,
      colour = condition,
      group = colname) +
  geom_density() +
  theme_minimal() +
   labs(subtitle = "Normalised log2 precursor intensities")
Warning: 'experiments' dropped; see 'drops()'
harmonizing input:
  removing 24 sampleMap rows not in names(experiments)

qf[, , "precursors_norm"] |>
  longForm(colvars = colnames(colData(qf))) |> 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + 
  aes(x = sampleId,
      y = value,
      colour = condition,
      group = colname) +
  xlab("sample") +
  geom_boxplot() +
  theme_minimal() +
   labs(subtitle = "Normalised log2 precursor intensities")
Warning: 'experiments' dropped; see 'drops()'
harmonizing input:
  removing 24 sampleMap rows not in names(experiments)

qf[, , "proteins"] |>
  longForm(colvars = colnames(colData(qf))) |> 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + 
  aes(x = sampleId,
      y = value,
      colour = condition,
      group = colname) +
  xlab("sample") +
  geom_boxplot() +
  theme_minimal() +
   labs(subtitle = "Normalised log2 protein intensities")
Warning: 'experiments' dropped; see 'drops()'
harmonizing input:
  removing 24 sampleMap rows not in names(experiments)

qf[, , "proteins"] |>
  longForm(colvars = colnames(colData(qf))) |> 
  data.frame() |> 
  filter(!is.na(value)) |>
  ggplot() + 
  aes(x = value,
      colour = condition,
      group = colname) +
  geom_density() +
  theme_minimal() +
   labs(subtitle = "Normalised log2 protein intensities")
Warning: 'experiments' dropped; see 'drops()'
harmonizing input:
  removing 24 sampleMap rows not in names(experiments)

5.2 Charge state

qf[, , "precursors_norm"] |>
  longForm(colvars = colnames(colData(qf)), rowvars = "Precursor.Charge") |>
  as.data.frame() |>
  filter(!is.na(value)) |>
  filter(Precursor.Charge<=4) |> 
ggplot(aes(x = sampleId)) +
  geom_bar(aes(fill = factor(Precursor.Charge, levels = 4:1)),
           colour = "black") +
  labs(title = "Peptide types per sample",
       x = "Sample",
       fill = "Charge state") +
  theme_bw()
Warning: 'experiments' dropped; see 'drops()'
harmonizing input:
  removing 24 sampleMap rows not in names(experiments)

5.3 Identifications per sample

qf[,,"precursors_norm"] |> 
  longForm(colvars = colnames(colData(qf)), 
           rowvars= c("Precursor.Id", 
                      "Protein.Group", 
                      "Stripped.Sequence",
                      "Precursor.Charge")) |>
  data.frame() |>
  filter(!is.na(value)) |>
  mutate(cond_rep = paste0("ratio", condition, "_", rep)) |>
  group_by(condition, cond_rep) |>
  summarise(Precursors = length(unique(Precursor.Id)),
            `Protein Groups` = length(unique(Protein.Group))) |> 
  pivot_longer(-(1:2),
               names_to = "Feature",
               values_to = "IDs") |> 
  ggplot(aes(x = cond_rep, y = IDs, fill = condition)) +
  geom_col() +
  #scale_fill_observable() +
  facet_wrap(~Feature,
             scales = "free_y") +
  labs(title = "Precursor and protein group identificiations per sample",
       x = "Sample",
       y = "Identifications") +
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90))
Warning: 'experiments' dropped; see 'drops()'
harmonizing input:
  removing 24 sampleMap rows not in names(experiments)
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by condition and cond_rep.
ℹ Output is grouped by condition.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(condition, cond_rep))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

# Data Modeling (Robust Regression){#sec-modelling}

5.4 Dimensionality reduction plot

A common approach for data exploration is to perform dimension reduction, such as Multi Dimensional Scaling (MDS). We will first extract the set to explore along the sample annotations (used for plot colouring).

se <- getWithColData(qf, "proteins")
Warning: 'experiments' dropped; see 'drops()'

We then use the scater package to compute and plot the PCA. For technical reasons, it requires SingleCellExperiment class object, but these can easily be generated from a SummarizedExperiment object.

se <- runMDS(as(se, "SingleCellExperiment"), exprs_values = 1)

We can now explore the data structure while coloring for the factor of interest, here condition.

plotMDS(se, colour_by = "condition", shape_by = "rep") 

MDS visualisation of the spike-in data set. Each point represents a sample and is coloured by the spike-in condiion.

This plot reveals interesting information. First, we see that the samples are nicely separated according to their spike-in condition. Interestingly, the conditions are sorted by the concentration.

5.5 Correlation matrix

corMat <- qf[["proteins"]] |> 
  assay() |>
  cor(method = "spearman", use = "pairwise.complete.obs") 

colnames(corMat) <- qf$sampleId
rownames(corMat) <- qf$sampleId
corMat |> 
  ggcorrplot::ggcorrplot() +
  scale_fill_viridis_c() +
  labs(title = "Correlation matrix",
       fill = "Correlation") +
  theme(axis.text.x = element_text(angle = 90))
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggcorrplot package.
  Please report the issue at <https://github.com/kassambara/ggcorrplot/issues>.
Scale for fill is already present.
Adding another scale for fill, which will replace the existing scale.

6 Data Modeling (Robust Regression)

6.1 Model estimation

  1. We first define the model. We only have one sources of variability in the experiment that we can model, i.e. the effect of the spike-in condition.

  2. We fit the model to each protein in the proteins summarised experiment of the QFeatures object qf using the msqrob function.

model <- ~ condition

qf <- msqrob(
  qf,
  i = "proteins",
  formula = model,
  robust = TRUE)

We enabled M-estimation (robust = TRUE) for improved robustness against outliers.

The fitting results are available in the msqrobModels column of the rowData. More specifically, the modelling output is stored in the rowData as a statModel object, one model per row (protein). We will see in a later section how to perform statistical inference on the estimated parameters.

models <- rowData(qf[["proteins"]])[["msqrobModels"]]
models[1:3]
$D6VTK4
Object of class "StatModel"
The type is "rlm"
There number of elements is 5 

$O13297
Object of class "StatModel"
The type is "rlm"
There number of elements is 5 

$O13329
Object of class "StatModel"
The type is "rlm"
There number of elements is 5 

6.2 Inference

We can now convert the research question “do the spike-in conditions affect the protein intensities?” into a statistical hypothesis.

In other words, we need to translate this question in a null and alternative hypothesis on a single model parameter or a linear combination of model parameters, which is also referred to with a contrast.

To aid defining contrasts, we will visualise the experimental design using the ExploreModelMatrix package.

vd <- ExploreModelMatrix::VisualizeDesign(
    sampleData =  colData(qf),
    designFormula = model,
    textSizeFitted = 4
)
vd$plotlist
[[1]]

We have a two group comparison and the average log2 fold change between condition 4 and condition 2 can be estimated using parameter:

\[ \log_2 \widehat{\text{FC}}^{4-2} = \hat\mu^4 - \hat\mu^2 = ((Intercept) + condition4) - (Intercept) = condition4 \] So we can assess the null hypothesis that a protein is not differentially abundant between spike-in condition 4 and spike-in condition 2 as follows \[ H_0 \log_2 \widehat{\text{FC}}^{4-2} = 0 \text{ or } condition4 = 0 \] Below we define the contrast and construct the corresponding contrast matrix.

contrast <- "condition4 = 0"
(L <- makeContrast(
  contrast,
   parameterNames = colnames(vd$designmatrix)
))
            condition4
(Intercept)          0
condition4           1

We assess the contrast for each protein.

qf <- hypothesisTest(qf, i = "proteins", contrast = L)

We extract the results table from the proteins summarised experiment in the qf object.

inference <- 
  msqrobCollect(qf[["proteins"]], L)

6.3 Report results

We report the results using a results table, volcano plots and heatmaps.

6.3.1 Results table

We report all results that are significant at the nominal FDR level of 5%.

  1. We define the nominal FDR level
  2. We filter the results
  3. We arrange them according to statistical significance
alpha  <- 0.05
inference |> 
  filter(adjPval < alpha) |> 
  arrange(pval) |>
  knitr::kable()
logFC se df t pval adjPval contrast feature
condition4.UPS|P02768ups|ALBU_HUMAN_UPS 0.9528549 0.0634975 6.487500 15.006191 0.0000028 0.0039920 condition4 UPS|P02768ups|ALBU_HUMAN_UPS
condition4.UPS|P63165ups|SUMO1_HUMAN_UPS 1.0505340 0.0716982 6.484535 14.652160 0.0000033 0.0039920 condition4 UPS|P63165ups|SUMO1_HUMAN_UPS
condition4.UPS|P00918ups|CAH2_HUMAN_UPS 0.9423899 0.0667104 6.371761 14.126580 0.0000048 0.0039920 condition4 UPS|P00918ups|CAH2_HUMAN_UPS
condition4.UPS|P02144ups|MYG_HUMAN_UPS 0.9597937 0.0693280 6.366383 13.844253 0.0000055 0.0039920 condition4 UPS|P02144ups|MYG_HUMAN_UPS
condition4.UPS|Q06830ups|PRDX1_HUMAN_UPS 0.8758805 0.0682188 6.487500 12.839283 0.0000075 0.0039920 condition4 UPS|Q06830ups|PRDX1_HUMAN_UPS
condition4.UPS|P06732ups|KCRM_HUMAN_UPS 1.0579755 0.0778549 6.154138 13.589068 0.0000081 0.0039920 condition4 UPS|P06732ups|KCRM_HUMAN_UPS
condition4.UPS|P04040ups|CATA_HUMAN_UPS 0.9148158 0.0748686 6.407758 12.218950 0.0000112 0.0047610 condition4 UPS|P04040ups|CATA_HUMAN_UPS
condition4.UPS|P00915ups|CAH1_HUMAN_UPS 0.8637199 0.0787941 6.487500 10.961727 0.0000200 0.0069082 condition4 UPS|P00915ups|CAH1_HUMAN_UPS
condition4.UPS|P02753ups|RETBP_HUMAN_UPS 1.1035108 0.1012478 6.481259 10.899107 0.0000209 0.0069082 condition4 UPS|P02753ups|RETBP_HUMAN_UPS
condition4.UPS|P62937ups|PPIA_HUMAN_UPS 0.9416590 0.0757976 5.569455 12.423343 0.0000285 0.0076967 condition4 UPS|P62937ups|PPIA_HUMAN_UPS
condition4.UPS|P12081ups|SYHC_HUMAN_UPS 1.0073028 0.0826043 5.621870 12.194311 0.0000295 0.0076967 condition4 UPS|P12081ups|SYHC_HUMAN_UPS
condition4.UPS|P15559ups|NQO1_HUMAN_UPS 0.9345225 0.0915209 6.487500 10.211028 0.0000310 0.0076967 condition4 UPS|P15559ups|NQO1_HUMAN_UPS
condition4.UPS|P62988ups|UBIQ_HUMAN_UPS 0.8392221 0.0824325 6.148863 10.180720 0.0000448 0.0092216 condition4 UPS|P62988ups|UBIQ_HUMAN_UPS
condition4.UPS|P63279ups|UBC9_HUMAN_UPS 0.8895154 0.0926709 6.487500 9.598646 0.0000454 0.0092216 condition4 UPS|P63279ups|UBC9_HUMAN_UPS
condition4.UPS|O76070ups|SYUG_HUMAN_UPS 1.1986991 0.1253296 6.484535 9.564376 0.0000465 0.0092216 condition4 UPS|O76070ups|SYUG_HUMAN_UPS
condition4.UPS|P01031ups|CO5_HUMAN_UPS 0.9610913 0.1046469 6.487500 9.184136 0.0000594 0.0110382 condition4 UPS|P01031ups|CO5_HUMAN_UPS
condition4.UPS|P01008ups|ANT3_HUMAN_UPS 0.8674849 0.0971549 6.484535 8.928887 0.0000706 0.0123592 condition4 UPS|P01008ups|ANT3_HUMAN_UPS
condition4.UPS|P68871ups|HBB_HUMAN_UPS 0.8686510 0.1072121 6.487500 8.102172 0.0001264 0.0208844 condition4 UPS|P68871ups|HBB_HUMAN_UPS
condition4.UPS|P69905ups|HBA_HUMAN_UPS 0.9129100 0.1149937 6.282867 7.938786 0.0001682 0.0263382 condition4 UPS|P69905ups|HBA_HUMAN_UPS

We observe that the results only contain spiked UPS proteins.

6.3.2 Volcanoplots

volcanoplot4 <- plotVolcano(inference)
volcanoplot4

Note, that the log fold changes are nicely around the real log2 fold change: \(log2(4/2) = 1\)

6.3.3 Heatmaps

We make the heatmap as follows

  1. We select the names of the proteins that were declared significant between condition A and condition B and extract their quantitative data.

  2. We extract the quantitative data with assay() and scale by rows.

  3. We will create a heatmap using the ComplexHeatmap package, which enables heatmap annotations. We will annotate the heatmap using our model variables condition and lab.

  4. We make the heatmap

sig <- inference |> 
  filter(adjPval < alpha) |> 
  arrange(pval) |> 
  pull(feature) #1.

se <- getWithColData(qf, "proteins")
Warning: 'experiments' dropped; see 'drops()'
quants <- t(scale(t(assay(se[sig,])))) #2. 

colnames(quants) <- paste0("con", se$condition,"_rep",se$rep) #specific to this dataset to get short colnames
annotations <- columnAnnotation(
  condition = se$condition
) #3.

set.seed(1234) ## annotation colours are randomly generated by default
heatmap <- Heatmap(
                       quants, 
                       name = "log2 intensity",
                       top_annotation = annotations, 
                       column_title = paste0(contrast, " = 0")
                       ) #4.
heatmap

6.3.4 Detail plots

We can explore the data for a protein to validate the statistical inference results. For example, let’s explore the normalised precursor and the summarised protein intensities for the protein with the most significant log2 fold change.

(targetProtein <- inference |> 
   slice(which.min(pval)) |> 
   pull(feature)
)
[1] "UPS|P02768ups|ALBU_HUMAN_UPS"

To obtain the required data, we perform a little data manipulation pipeline:

We use the QFeatures subsetting functionality to retrieve all data related to and focusing on the peptides_log and proteins sets that contains the peptide ion data used for model fitting. We then convert the data with longForm() for plotting. Finally, we plot the log2 normalised intensities for each sample at the protein and at the peptide level. Since multiple peptides are recorded for the protein, we link peptides across samples using a grey line. Samples are colored according to E. coli spike-in condition.

qf[targetProtein, , c("precursors_norm", "proteins")] |> #1
  longForm(colvars = colnames(colData(qf))) |> #2
  data.frame() |> 
  ggplot() +
  aes(x = colname,
      y = value) +
  geom_line(aes(group = rowname), linewidth = 0.1) +
  geom_point(aes(colour = condition)) +
  facet_wrap(~ assay, scales = "free") +
  ggtitle(targetProtein) +
  theme_minimal() +
  theme(axis.text.x = element_blank())
Warning: 'experiments' dropped; see 'drops()'
harmonizing input:
  removing 18 sampleMap rows not in names(experiments)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_point()`).

qf[targetProtein, , c("precursors_norm", "proteins")] |> #1
  longForm(colvars = colnames(colData(qf))) |> #2
  data.frame() %>% 
  {
  ggplot(.) +
  aes(x = condition,
      y = value) +
  geom_boxplot(aes(colour = condition)) +
  facet_wrap(~ assay, scales = "free") +
  geom_jitter(aes(shape = rowname)) +
  scale_shape_manual(values = seq_len(dplyr::n_distinct(.$rowname))) +
  ggtitle(targetProtein) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) +
  guides(shape = "none")
  }
Warning: 'experiments' dropped; see 'drops()'
harmonizing input:
  removing 18 sampleMap rows not in names(experiments)
Warning: Removed 9 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '29'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '28'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '29'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '28'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '26'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '26'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '29'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '28'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '26'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '27'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '31'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '29'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '28'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '26'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'
Warning in grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size): unimplemented
pch value '30'

References

Staes, An, Teresa Mendes Maia, Sara Dufour, et al. 2024. “Benefit of in Silico Predicted Spectral Libraries in Data‑independent Acquisition Data Analysis Workflows.” Journal of Proteome Research 23 (6): 2078–89. https://doi.org/10.1021/acs.jproteome.4c00048.