.summarize_modifications

proteopy.pp.summarize_modifications(adata, mod_regex='\\\\s*\\\\(.*?\\\\)', method='sum', layer=None, zero_to_na=False, fill_na=None, skip_na=True, sort=True, keep_var_cols=None, inplace=True, verbose=False)[source]

Aggregate modified peptides by their stripped sequence.

Removes modification annotations from peptide identifiers using a regular expression, groups peptides sharing the same stripped sequence, and summarizes intensities with the chosen aggregation method.

Parameters:
  • adata (AnnData) – AnnData with peptide-level data.

  • mod_regex (str, optional) – Regular expression matching modification annotations to strip from peptide identifiers.

  • method ({'sum', 'mean', 'median', 'max'}, optional) – Aggregation applied to each group of modified peptides.

  • layer (str, optional) – Key in adata.layers to use instead of adata.X.

  • zero_to_na (bool, optional) – If True, replace zeros with np.nan before aggregation.

  • fill_na (float or int, optional) – Replace np.nan values with this constant before aggregation.

  • skip_na (bool, optional) – If True, ignore np.nan during aggregation. If False, any np.nan in a group produces np.nan in the result.

  • sort (bool, optional) – If True, sort the output variables alphabetically by stripped sequence. If False, preserve the order of first appearance.

  • keep_var_cols (list of str, optional) – Additional .var columns to carry over to the output. Values are aggregated per group (unique values joined by ';'). 'peptide_id' and 'protein_id' are always included.

  • inplace (bool, optional) – If True, modify adata in place. Otherwise return a new AnnData.

  • verbose (bool, optional) – Print status messages.

Returns:

Aggregated AnnData when inplace=False; otherwise None. Two columns are added to .var: n_peptidoforms (total variants per stripped sequence) and n_modifications (unique modification sites, identified by their position in the stripped sequence and the matched annotation text, collected across all peptidoforms in the group).

Return type:

AnnData or None

Examples

UniMod notation (default mod_regex).

>>> import proteopy as pr
>>> import numpy as np
>>> import pandas as pd
>>> from anndata import AnnData
>>> pids = [
...     "AAADLM(UniMod:35)AYC(UniMod:4)EAHAKEDPLLTPVPASENPFR",
...     "AAADLMAYC(UniMod:4)EAHAKEDPLLTPVPASENPFR",
...     "AAADLMAYCEAHAKEDPLLTPVPASENPFR",
...     "VVDLMR",
... ]
>>> var = pd.DataFrame(
...     {
...         "peptide_id": pids,
...         "protein_id": ["P1", "P1", "P1", "P2"],
...     },
...     index=pids,
... )
>>> obs = pd.DataFrame(
...     {"sample_id": ["s1", "s2"]},
...     index=["s1", "s2"],
... )
>>> X = np.array([
...     [100.0, 200.0, np.nan, 50.0],
...     [150.0, 100.0, 300.0,  75.0],
... ])
>>> adata = AnnData(X=X, obs=obs, var=var)
>>> result = pr.pp.summarize_modifications(
...     adata,
...     method="sum",
...     inplace=False,
... )
>>> result.var_names.tolist()
['AAADLMAYCEAHAKEDPLLTPVPASENPFR', 'VVDLMR']
>>> result.X
array([[300.,  50.],
       [550.,  75.]])
>>> result.var["n_peptidoforms"].tolist()
[3, 1]
>>> result.var["n_modifications"].tolist()
[2, 0]

With skip_na=False, any NaN in a group propagates:

>>> result = pr.pp.summarize_modifications(
...     adata,
...     method="sum",
...     skip_na=False,
...     inplace=False,
... )
>>> result.X
array([[ nan,  50.],
       [550.,  75.]])

With method="max", the maximum intensity per group is retained:

>>> result = pr.pp.summarize_modifications(
...     adata,
...     method="max",
...     inplace=False,
... )
>>> result.X
array([[200.,  50.],
       [300.,  75.]])

Mass-shift notation (e.g. M[+16]) requires a custom mod_regex:

>>> pids_ms = [
...     "AAADLM[+16]AYC[+57]R",
...     "AAADLMAYC[+57]R",
...     "AAADLMAYCR",
... ]
>>> var_ms = pd.DataFrame(
...     {
...         "peptide_id": pids_ms,
...         "protein_id": ["P1", "P1", "P1"],
...     },
...     index=pids_ms,
... )
>>> obs_ms = pd.DataFrame(
...     {"sample_id": ["s1"]},
...     index=["s1"],
... )
>>> X_ms = np.array([[10.0, 20.0, 30.0]])
>>> adata_ms = AnnData(
...     X=X_ms,
...     obs=obs_ms,
...     var=var_ms,
... )
>>> result = pr.pp.summarize_modifications(
...     adata_ms,
...     mod_regex="\[.*?\]",
...     method="sum",
...     inplace=False,
... )
>>> result.var_names.tolist()
['AAADLMAYCR']
>>> result.X
array([[60.]])
>>> result.var["n_peptidoforms"].tolist()
[3]
>>> result.var["n_modifications"].tolist()
[2]