ms2pip.search_space

Define and build the search space for in silico spectral library generation.

This module defines the search space for in silico spectral library generation as a ProteomeSearchSpace object. Variable and fixed modifications can be configured as ModificationConfig objects.

The peptide search space can be built from a protein FASTA file and a set of parameters, and then converted to a psm_utils.PSMList object for use in ms2pip. All parameters are listed below under ProteomeSearchSpace and can be passed as a dictionary, a JSON file, or a ProteomeSearchSpace object. For example:

{
  "fasta_file": "test.fasta",
  "min_length": 8,
  "max_length": 30,
  "cleavage_rule": "trypsin",
  "missed_cleavages": 2,
  "semi_specific": false,
  "add_decoys": true,
  "modifications": [
    {
      "label": "UNIMOD:Oxidation",
      "amino_acid": "M"
    },
    {
      "label": "UNIMOD:Carbamidomethyl",
      "amino_acid": "C",
      "fixed": true
    }
  ],
  "max_variable_modifications": 3,
  "charges": [2, 3]
}
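
Because the parameters can come from a JSON file, it can help to sanity-check a configuration before building the search space. The following is a minimal sketch using only the Python standard library. `check_search_space_config` is a hypothetical helper, not part of ms2pip, and the checks it performs are assumptions about what a sensible configuration looks like; the fallback values mirror the defaults documented for ProteomeSearchSpace.

```python
import json


def check_search_space_config(config: dict) -> list:
    """Return a list of human-readable problems found in a config dict.

    Hypothetical helper for illustration only; not part of ms2pip.
    """
    problems = []
    if "fasta_file" not in config:
        problems.append("fasta_file is required")
    # Fall back to the documented defaults when a key is absent.
    min_length = config.get("min_length", 8)
    max_length = config.get("max_length", 30)
    if min_length > max_length:
        problems.append(
            f"min_length ({min_length}) exceeds max_length ({max_length})"
        )
    if any(charge <= 0 for charge in config.get("charges", [2, 3])):
        problems.append("charges must be positive integers")
    for mod in config.get("modifications", []):
        if "label" not in mod:
            problems.append("every modification needs a label")
    return problems


config = json.loads("""
{
  "fasta_file": "test.fasta",
  "min_length": 8,
  "max_length": 30,
  "modifications": [
    {"label": "UNIMOD:Oxidation", "amino_acid": "M"},
    {"label": "UNIMOD:Carbamidomethyl", "amino_acid": "C", "fixed": true}
  ],
  "charges": [2, 3]
}
""")
problems = check_search_space_config(config)
print(problems)  # an empty list means no problems were found
```

A length range with the bounds swapped (for example a minimum of 8 and a maximum of 3) would be reported by this check rather than silently yielding an empty search space.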

For an unspecific protein digestion, set the cleavage rule to "unspecific". This results in a cleavage rule that allows cleavage after any amino acid, with an unlimited number of missed cleavages.

To disable protein digestion when the FASTA file contains peptides, set the cleavage rule to "-". Each line in the FASTA file is then treated as a separate peptide sequence, while modifications and charges are still added.
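
To make the difference between a specific and an unspecific digestion concrete, here is a rough standard-library sketch. It is not ms2pip's implementation: the function names are hypothetical, and the trypsin rule used here (cleave after K or R unless followed by P) is an assumption about what the "trypsin" rule stands for.

```python
import re


def tryptic_digest(sequence, missed_cleavages=2, min_length=8, max_length=30):
    """Yield tryptic peptides (assumed rule: after K/R, not before P)."""
    # Cleavage positions are the ends of K/R residues not followed by P.
    sites = [0] + [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    if sites[-1] != len(sequence):
        sites.append(len(sequence))
    for i in range(len(sites) - 1):
        # Join each fragment with up to `missed_cleavages` following fragments.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(sites))):
            peptide = sequence[sites[i]:sites[j]]
            if min_length <= len(peptide) <= max_length:
                yield peptide


def unspecific_digest(sequence, min_length=8, max_length=30):
    """Yield every substring within the length bounds (cleavage anywhere)."""
    for start in range(len(sequence)):
        last_end = min(start + max_length, len(sequence))
        for end in range(start + min_length, last_end + 1):
            yield sequence[start:end]


protein = "PEPTIDEKAAAGRPLLKMMM"
print(list(tryptic_digest(protein, missed_cleavages=0)))
# ['PEPTIDEK', 'AAAGRPLLK']
print(len(list(unspecific_digest(protein))))
# 91
```

Even for this 20-residue toy sequence, the unspecific digestion yields far more candidates than the tryptic one, which is why the length bounds matter much more in that mode.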

Examples

>>> from ms2pip.search_space import ProteomeSearchSpace, ModificationConfig
>>> search_space = ProteomeSearchSpace(
...     fasta_file="tests/data/test_proteins.fasta",
...     min_length=8,
...     max_length=30,
...     cleavage_rule="trypsin",
...     missed_cleavages=2,
...     semi_specific=False,
...     modifications=[
...         ModificationConfig(label="UNIMOD:Oxidation", amino_acid="M"),
...         ModificationConfig(label="UNIMOD:Carbamidomethyl", amino_acid="C", fixed=True),
...     ],
...     charges=[2, 3],
... )
>>> psm_list = search_space.into_psm_list()

>>> from ms2pip.search_space import ProteomeSearchSpace
>>> search_space = ProteomeSearchSpace.from_any("tests/data/test_search_space.json")
>>> psm_list = search_space.into_psm_list()

class ms2pip.search_space.ModificationConfig(*, label, amino_acid=None, peptide_n_term=False, protein_n_term=False, peptide_c_term=False, protein_c_term=False, fixed=False)[source]

Bases: BaseModel

Configuration for a single modification in the search space.

Parameters:
  • label (str) – Label of the modification. This can be any valid ProForma 2.0 label.

  • amino_acid (str | None) – Amino acid target of the modification. None if the modification is not specific to an amino acid. Default is None.

  • peptide_n_term (bool | None) – Whether the modification occurs only on the peptide N-terminus. Default is False.

  • protein_n_term (bool | None) – Whether the modification occurs only on the protein N-terminus. Default is False.

  • peptide_c_term (bool | None) – Whether the modification occurs only on the peptide C-terminus. Default is False.

  • protein_c_term (bool | None) – Whether the modification occurs only on the protein C-terminus. Default is False.

  • fixed (bool | None) – Whether the modification is fixed. Default is False.

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'amino_acid': FieldInfo(annotation=Union[str, NoneType], required=False, default=None), 'fixed': FieldInfo(annotation=Union[bool, NoneType], required=False, default=False), 'label': FieldInfo(annotation=str, required=True), 'peptide_c_term': FieldInfo(annotation=Union[bool, NoneType], required=False, default=False), 'peptide_n_term': FieldInfo(annotation=Union[bool, NoneType], required=False, default=False), 'protein_c_term': FieldInfo(annotation=Union[bool, NoneType], required=False, default=False), 'protein_n_term': FieldInfo(annotation=Union[bool, NoneType], required=False, default=False)}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

class ms2pip.search_space.ProteomeSearchSpace(*, fasta_file, min_length=8, max_length=30, min_precursor_mz=0, max_precursor_mz=inf, cleavage_rule='trypsin', missed_cleavages=2, semi_specific=False, add_decoys=False, modifications=[ModificationConfig(label='UNIMOD:Oxidation', amino_acid='M', peptide_n_term=False, protein_n_term=False, peptide_c_term=False, protein_c_term=False, fixed=False), ModificationConfig(label='UNIMOD:Carbamidomethyl', amino_acid='C', peptide_n_term=False, protein_n_term=False, peptide_c_term=False, protein_c_term=False, fixed=True)], max_variable_modifications=3, charges=[2, 3])[source]

Bases: BaseModel

Search space for in silico spectral library generation.

Parameters:
  • fasta_file (pathlib.Path) – Path to FASTA file with protein sequences.

  • min_length (int) – Minimum peptide length. Default is 8.

  • max_length (int) – Maximum peptide length. Default is 30.

  • min_precursor_mz (float | None) – Minimum precursor m/z for peptides. Default is 0.

  • max_precursor_mz (float | None) – Maximum precursor m/z for peptides. Default is np.inf.

  • cleavage_rule (str) – Cleavage rule for peptide digestion. Default is “trypsin”.

  • missed_cleavages (int) – Maximum number of missed cleavages. Default is 2.

  • semi_specific (bool) – Allow semi-specific cleavage. Default is False.

  • add_decoys (bool) – Add decoy sequences to search space. Default is False.

  • modifications (List[ms2pip.search_space.ModificationConfig]) – List of modifications to consider. Default is oxidation of M and carbamidomethylation of C.

  • max_variable_modifications (int) – Maximum number of variable modifications per peptide. Default is 3.

  • charges (List[int]) – List of charges to consider. Default is [2, 3].
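
The effect of max_variable_modifications and charges on the size of the search space can be illustrated with a small sketch. This is not how ms2pip enumerates peptidoforms: `variable_forms` is a hypothetical function that handles a single residue-specific variable modification only, ignores fixed and terminal modifications, and writes the label in square brackets after the residue as an assumed notation.

```python
from itertools import combinations


def variable_forms(sequence, label, amino_acid, max_variable_modifications=3):
    """Yield the sequence with 0..max copies of one variable modification."""
    # Positions of every residue the modification can target.
    sites = [i for i, aa in enumerate(sequence) if aa == amino_acid]
    for n in range(min(max_variable_modifications, len(sites)) + 1):
        for chosen in combinations(sites, n):
            yield "".join(
                f"{aa}[{label}]" if i in chosen else aa
                for i, aa in enumerate(sequence)
            )


forms = list(
    variable_forms("MAMK", "UNIMOD:Oxidation", "M", max_variable_modifications=1)
)
print(forms)
# ['MAMK', 'M[UNIMOD:Oxidation]AMK', 'MAM[UNIMOD:Oxidation]K']

# Each form is considered once per charge state.
charges = [2, 3]
print(len(forms) * len(charges))
# 6
```

Raising the limit to 2 adds the doubly oxidized form, so the number of candidates grows combinatorially with the number of target residues and the allowed number of variable modifications.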

classmethod from_any(_input)[source]

Create ProteomeSearchSpace from various input types.

Parameters:

_input (dict | str | Path | ProteomeSearchSpace) – Search space parameters as a dictionary, a path to a JSON file, or an existing ProteomeSearchSpace object.

Return type:

ProteomeSearchSpace
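
The dispatch on input type described for from_any can be pictured with the following sketch. It uses a hypothetical stand-in dataclass, `SearchSpaceSketch`, with only two of the documented parameters, so that it runs without ms2pip installed; raising TypeError for unsupported input is an assumption, not documented behavior.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SearchSpaceSketch:
    """Hypothetical stand-in carrying two of the documented parameters."""

    fasta_file: str
    min_length: int = 8


def from_any_sketch(_input):
    """Accept an existing object, a dict of parameters, or a JSON file path."""
    if isinstance(_input, SearchSpaceSketch):
        return _input  # already the right type, return unchanged
    if isinstance(_input, dict):
        return SearchSpaceSketch(**_input)
    if isinstance(_input, (str, Path)):
        with open(_input) as handle:
            return SearchSpaceSketch(**json.load(handle))
    # Assumed behavior for anything else.
    raise TypeError(f"Unsupported input type: {type(_input).__name__}")


# Round-trip through a temporary JSON file.
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "search_space.json"
    json_path.write_text(json.dumps({"fasta_file": "test.fasta", "min_length": 7}))
    from_json = from_any_sketch(json_path)

from_dict = from_any_sketch({"fasta_file": "test.fasta"})
same_object = from_any_sketch(from_dict)
```

Accepting all three forms in one place lets calling code pass along whatever the user supplied without first checking its type.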

build(processes=1)[source]

Build peptide search space from FASTA file.

Parameters:

processes (int) – Number of processes to use for parallelization.

filter_psms_by_mz(psms)[source]

Filter PSMs by precursor m/z range.

Parameters:

psms (PSMList) – PSMs to filter.

Return type:

PSMList
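
As an illustration of what filtering by precursor m/z involves, the sketch below computes m/z from a neutral peptide mass and a charge, then keeps the candidates that fall inside the range. It works on plain tuples rather than a PSMList; the function names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
PROTON_MASS = 1.007276  # Da, approximate


def precursor_mz(neutral_mass, charge):
    """Return the m/z of a peptide ion carrying `charge` protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def filter_by_mz(candidates, min_precursor_mz=0.0, max_precursor_mz=float("inf")):
    """Keep (name, neutral_mass, charge) tuples within the m/z range."""
    return [
        (name, mass, charge)
        for name, mass, charge in candidates
        if min_precursor_mz <= precursor_mz(mass, charge) <= max_precursor_mz
    ]


candidates = [("pep_a", 1000.0, 2), ("pep_a", 1000.0, 3), ("pep_b", 3000.0, 2)]
kept = filter_by_mz(candidates, min_precursor_mz=400.0, max_precursor_mz=1200.0)
print(kept)
# [('pep_a', 1000.0, 2)]
```

The same peptide can pass at one charge state and fail at another, so the filter has to be applied per peptide and charge combination rather than per peptide.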

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'add_decoys': FieldInfo(annotation=bool, required=False, default=False), 'charges': FieldInfo(annotation=List[int], required=False, default=[2, 3]), 'cleavage_rule': FieldInfo(annotation=str, required=False, default='trypsin'), 'fasta_file': FieldInfo(annotation=Path, required=True), 'max_length': FieldInfo(annotation=int, required=False, default=30), 'max_precursor_mz': FieldInfo(annotation=Union[float, NoneType], required=False, default=inf), 'max_variable_modifications': FieldInfo(annotation=int, required=False, default=3), 'min_length': FieldInfo(annotation=int, required=False, default=8), 'min_precursor_mz': FieldInfo(annotation=Union[float, NoneType], required=False, default=0), 'missed_cleavages': FieldInfo(annotation=int, required=False, default=2), 'modifications': FieldInfo(annotation=List[ms2pip.search_space.ModificationConfig], required=False, default=[ModificationConfig(label='UNIMOD:Oxidation', amino_acid='M', peptide_n_term=False, protein_n_term=False, peptide_c_term=False, protein_c_term=False, fixed=False), ModificationConfig(label='UNIMOD:Carbamidomethyl', amino_acid='C', peptide_n_term=False, protein_n_term=False, peptide_c_term=False, protein_c_term=False, fixed=True)]), 'semi_specific': FieldInfo(annotation=bool, required=False, default=False)}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.