.summarize_modifications
- proteopy.pp.summarize_modifications(adata, mod_regex='\\\\s*\\\\(.*?\\\\)', method='sum', layer=None, zero_to_na=False, fill_na=None, skip_na=True, sort=True, keep_var_cols=None, inplace=True, verbose=False)[source]
Aggregate modified peptides by their stripped sequence.
Removes modification annotations from peptide identifiers using a regular expression, groups peptides sharing the same stripped sequence, and summarizes intensities with the chosen aggregation method.
- Parameters:
adata (AnnData) –
AnnDatawith peptide-level data.mod_regex (str, optional) – Regular expression matching modification annotations to strip from peptide identifiers.
method ({'sum', 'mean', 'median', 'max'}, optional) – Aggregation applied to each group of modified peptides.
layer (str, optional) – Key in
adata.layersto use instead ofadata.X.zero_to_na (bool, optional) – If True, replace zeros with
np.nanbefore aggregation.fill_na (float or int, optional) – Replace
np.nanvalues with this constant before aggregation.skip_na (bool, optional) – If True, ignore
np.nanduring aggregation. If False, anynp.nanin a group producesnp.nanin the result.sort (bool, optional) – If True, sort the output variables alphabetically by stripped sequence. If False, preserve the order of first appearance.
keep_var_cols (list of str, optional) – Additional
.varcolumns to carry over to the output. Values are aggregated per group (unique values joined by';').'peptide_id'and'protein_id'are always included.inplace (bool, optional) – If True, modify
adatain place. Otherwise return a new AnnData.verbose (bool, optional) – Print status messages.
- Returns:
Aggregated AnnData when
inplace=False; otherwise None. Two columns are added to.var:n_peptidoforms(total variants per stripped sequence) andn_modifications(unique modification sites, identified by their position in the stripped sequence and the matched annotation text, collected across all peptidoforms in the group).- Return type:
AnnData or None
Examples
UniMod notation (default
mod_regex).>>> import proteopy as pr >>> import numpy as np >>> import pandas as pd >>> from anndata import AnnData >>> pids = [ ... "AAADLM(UniMod:35)AYC(UniMod:4)EAHAKEDPLLTPVPASENPFR", ... "AAADLMAYC(UniMod:4)EAHAKEDPLLTPVPASENPFR", ... "AAADLMAYCEAHAKEDPLLTPVPASENPFR", ... "VVDLMR", ... ] >>> var = pd.DataFrame( ... { ... "peptide_id": pids, ... "protein_id": ["P1", "P1", "P1", "P2"], ... }, ... index=pids, ... ) >>> obs = pd.DataFrame( ... {"sample_id": ["s1", "s2"]}, ... index=["s1", "s2"], ... ) >>> X = np.array([ ... [100.0, 200.0, np.nan, 50.0], ... [150.0, 100.0, 300.0, 75.0], ... ]) >>> adata = AnnData(X=X, obs=obs, var=var) >>> result = pr.pp.summarize_modifications( ... adata, ... method="sum", ... inplace=False, ... ) >>> result.var_names.tolist() ['AAADLMAYCEAHAKEDPLLTPVPASENPFR', 'VVDLMR'] >>> result.X array([[300., 50.], [550., 75.]]) >>> result.var["n_peptidoforms"].tolist() [3, 1] >>> result.var["n_modifications"].tolist() [2, 0]
With
skip_na=False, any NaN in a group propagates:>>> result = pr.pp.summarize_modifications( ... adata, ... method="sum", ... skip_na=False, ... inplace=False, ... ) >>> result.X array([[ nan, 50.], [550., 75.]])
With
method="max", the maximum intensity per group is retained:>>> result = pr.pp.summarize_modifications( ... adata, ... method="max", ... inplace=False, ... ) >>> result.X array([[200., 50.], [300., 75.]])
Mass-shift notation (e.g.
M[+16]) requires a custommod_regex:>>> pids_ms = [ ... "AAADLM[+16]AYC[+57]R", ... "AAADLMAYC[+57]R", ... "AAADLMAYCR", ... ] >>> var_ms = pd.DataFrame( ... { ... "peptide_id": pids_ms, ... "protein_id": ["P1", "P1", "P1"], ... }, ... index=pids_ms, ... ) >>> obs_ms = pd.DataFrame( ... {"sample_id": ["s1"]}, ... index=["s1"], ... ) >>> X_ms = np.array([[10.0, 20.0, 30.0]]) >>> adata_ms = AnnData( ... X=X_ms, ... obs=obs_ms, ... var=var_ms, ... ) >>> result = pr.pp.summarize_modifications( ... adata_ms, ... mod_regex="\[.*?\]", ... method="sum", ... inplace=False, ... ) >>> result.var_names.tolist() ['AAADLMAYCR'] >>> result.X array([[60.]]) >>> result.var["n_peptidoforms"].tolist() [3] >>> result.var["n_modifications"].tolist() [2]