Usage

Quickstart

[todo]

Usage modes

MS²PIP has several usage modes, each of which can be accessed through the command-line interface or through the Python API.

predict-single

In this mode, a single peptide spectrum is predicted with MS²PIP and optionally plotted with spectrum_utils. For instance,

ms2pip predict-single "PGAQANPYSR/3" --model TMT

results in:

[Figure: Predicted spectrum]

predict-batch

Provide a list of peptidoforms (see Peptides / PSMs) to predict multiple spectra at once. For instance,

ms2pip predict-batch peptides.tsv --model TMT

results in a file test_predictions.csv with the predicted spectra.

predict-library

Predict spectra for a full peptide search space generated from a protein FASTA file. Various peptide search space parameters can be configured to control the peptidoforms that are generated. See ms2pip.search_space for more information.

This mode was first developed in collaboration with the ProGenTomics group for the MS²PIP for DIA project.

correlate

Predict spectrum intensities for a list of peptides and correlate them with observed intensities from a spectrum file. This mode is useful for evaluating MS²PIP models or for (re)scoring peptide-spectrum matches.

get-training-data

Given a list of peptides and corresponding spectra, generate training data for MS²PIP. This includes observed intensities for the supported ion types and the feature vectors for each ion. For more info, see Training new MS²PIP models.

annotate-spectra

Given a list of peptides, annotate the peaks in the corresponding spectra.

Input

Peptides / PSMs

PSM file types

For peptide information input, MS²PIP accepts any file format that is supported by psm_utils. See Supported file formats for the full list. The simplest format is a tab-separated file with at least the columns peptidoform and spectrum_id present.

  • peptidoform is the full ProForma 2.0 notation including amino acid modifications and precursor charge state.

  • spectrum_id should match the TITLE or nativeID field of the related spectrum in the optional MGF or mzML file, if provided. Otherwise, any value is accepted.

For example:

peptidoform spectrum_id
RNVIM[Oxidation]DKVAK/2     1
KHLEQHPK/2  2
...

See psm_utils.io.tsv for the full specification.
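As an illustration of the minimal tab-separated layout described above, the following sketch writes the two example rows with Python's standard csv module. The helper name write_psm_tsv is hypothetical and not part of MS²PIP or psm_utils; only the two column names come from the text.

```python
import csv
import io


def write_psm_tsv(rows, handle):
    """Write (peptidoform, spectrum_id) pairs as a tab-separated table.

    Hypothetical helper for illustration; only the two required columns
    named in the text are written.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["peptidoform", "spectrum_id"])
    writer.writerows(rows)


# The two example PSMs from the table above.
rows = [("RNVIM[Oxidation]DKVAK/2", "1"), ("KHLEQHPK/2", "2")]

# An in-memory buffer stands in for a real file such as peptides.tsv.
buffer = io.StringIO()
write_psm_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead, pass a file opened with open("peptides.tsv", "w", newline="") as the handle.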

Peptide sequence properties

Peptides must be strictly longer than 2 and shorter than 100 amino acids and cannot contain the following amino acid one-letter codes: B, J, O, U, X or Z. Peptides not fulfilling these requirements will be filtered out and will not be reported in the output.
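To check in advance which peptides would be filtered out, the rule above can be sketched as a small predicate. This is not MS²PIP's own filtering code: the function name is hypothetical, it expects a bare amino acid sequence (no modifications or charge), and it assumes that both length bounds are exclusive, that is, 3 to 99 residues.

```python
# One-letter codes that the text lists as unsupported.
UNSUPPORTED_RESIDUES = set("BJOUXZ")


def is_supported_sequence(sequence):
    """Return True if a bare sequence meets the stated requirements.

    Assumption: "strictly longer than 2 and shorter than 100" is read as
    2 < length < 100.
    """
    has_valid_length = 2 < len(sequence) < 100
    has_unsupported_residue = bool(set(sequence) & UNSUPPORTED_RESIDUES)
    return has_valid_length and not has_unsupported_residue


candidates = ["KHLEQHPK", "PK", "PEPTIXDE"]
supported = [seq for seq in candidates if is_supported_sequence(seq)]
print(supported)
```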

Amino acid modifications

Amino acid modification labels must be resolvable to a known mass shift. This means that accepted labels are:

  • A name or accession from a controlled vocabulary, such as Unimod or PSI-MOD (e.g., Oxidation, U:Oxidation, U:35, MOD:00046…)

  • An elemental formula (e.g., Formula:C12H20O2)

  • A mass shift in Da (e.g., +15.9949)

Any unresolvable modification will result in an error. If needed, PSM files can be converted with psm_utils.io and modifications can be renamed with the rename_modifications() method.
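The three accepted label forms can be told apart with a rough triage like the one below. This is only a sketch: the function name and the returned category strings are hypothetical, and the code neither looks anything up in Unimod or PSI-MOD nor computes a mass, so a label classified as a vocabulary entry may still be unresolvable in practice.

```python
import re

# A signed decimal number, such as +15.9949 or -17.0265.
MASS_SHIFT_PATTERN = re.compile(r"^[+-]\d+(\.\d+)?$")


def label_kind(label):
    """Guess which of the three accepted forms a modification label uses.

    Hypothetical helper: it only inspects the shape of the label and does
    not verify that the label resolves to a known mass shift.
    """
    if MASS_SHIFT_PATTERN.match(label):
        return "mass_shift"
    if label.startswith("Formula:"):
        return "formula"
    # Everything else is treated as a candidate name or accession.
    return "cv_name_or_accession"


for example in ["Oxidation", "U:35", "Formula:C12H20O2", "+15.9949"]:
    print(example, label_kind(example))
```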

Spectrum file

In the correlate and get-training-data usage modes, an MGF or mzML file with observed spectra must be provided to MS²PIP. Make sure that the PSM file spectrum_id matches the MGF TITLE field or the mzML nativeID field. Spectra that are present in the spectrum file but missing from the PSM file, and vice versa, will be skipped.
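Because unmatched entries are skipped, it can help to compare identifiers before running MS²PIP. The sketch below does this for MGF input using only the standard library. Both function names are hypothetical, the MGF parsing is deliberately minimal (it only reads TITLE= lines), and the exact matching MS²PIP performs may differ.

```python
def mgf_titles(lines):
    """Collect the TITLE values found in the lines of an MGF file."""
    titles = set()
    for line in lines:
        line = line.strip()
        if line.startswith("TITLE="):
            titles.add(line[len("TITLE="):])
    return titles


def split_by_match(spectrum_ids, titles):
    """Return (matched, only_in_psm_file, only_in_spectrum_file) as sets."""
    ids = set(spectrum_ids)
    return ids & titles, ids - titles, titles - ids


# Small in-memory stand-ins for a real MGF file and PSM file.
mgf_lines = [
    "BEGIN IONS", "TITLE=1", "PEPMASS=593.8", "END IONS",
    "BEGIN IONS", "TITLE=3", "PEPMASS=344.5", "END IONS",
]
psm_spectrum_ids = ["1", "2"]

matched, psm_only, spectra_only = split_by_match(
    psm_spectrum_ids, mgf_titles(mgf_lines)
)
print(matched, psm_only, spectra_only)
```

Here only spectrum_id 1 appears in both inputs, so under the rule above the other two entries would be skipped.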

Output

The predictions are saved in the output file(s) specified in the command. Note that the normalization of intensities depends on the output file format. In the CSV file output, intensities are log2-transformed. To “unlog” the intensities, use the following formula:

intensity = (2 ** log2_intensity) - 0.001
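The formula above translates directly into Python. The function name unlog_intensity is hypothetical, and the sample values are made up for illustration rather than taken from real MS²PIP output.

```python
def unlog_intensity(log2_intensity):
    """Reverse the log2 transform using the formula given above."""
    return (2 ** log2_intensity) - 0.001


# Made-up log2-transformed values, for illustration only.
log2_values = [-9.96578, -3.0, 0.0]
intensities = [unlog_intensity(value) for value in log2_values]
print(intensities)
```

The 0.001 offset implies that a stored value of log2(0.001), roughly -9.97, corresponds to an intensity of zero.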