proteodata: the ProteoPy data format
This tutorial explains the proteodata convention — the set of assumptions that every ProteoPy function relies on when working with an AnnData object. You will learn:
What constitutes a valid proteodata object at the protein level and at the peptide level
How to construct proteodata from scratch
How to use is_proteodata() and check_proteodata() to validate your data
Common pitfalls that break the format, and how to avoid them
Prerequisites: Basic familiarity with AnnData (observations, variables, .obs, .var, .X).
[1]:
import numpy as np
import pandas as pd
from anndata import AnnData
import proteopy as pr
from proteopy.utils.anndata import is_proteodata, check_proteodata
1. Anatomy of a proteodata object
ProteoPy stores proteomics data in the AnnData format, where:
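| Slot | Content |
|---|---|
| .X | Intensity matrix (samples x proteins/peptides); may contain np.nan but no infinite values |
| .obs | Sample metadata; must include a sample_id column |
| .var | Protein / peptide metadata; must include protein_id (and peptide_id at the peptide level) |
| .obs_names | Sample index; must be unique |
| .var_names | Protein/peptide index; must be unique |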
The is_proteodata() function checks all of these assumptions and returns a tuple: (True, "protein"), (True, "peptide"), or (False, None).
2. Building a valid protein-level proteodata
[2]:
# -- Sample metadata --
sample_ids = ["sample_A", "sample_B", "sample_C"]
obs = pd.DataFrame({"sample_id": sample_ids}, index=sample_ids)
# -- Protein metadata --
protein_ids = ["P12345", "Q67890", "O11223", "P44556"]
var = pd.DataFrame({"protein_id": protein_ids}, index=protein_ids)
# -- Intensity matrix (3 samples x 4 proteins) --
X = np.array([
    [100.0, 200.0, 50.0, 300.0],
    [110.0, np.nan, 55.0, 280.0],
    [ 95.0, 210.0, 48.0, 310.0],
])
adata_protein = AnnData(X=X, obs=obs, var=var)
adata_protein
[2]:
AnnData object with n_obs × n_vars = 3 × 4
obs: 'sample_id'
var: 'protein_id'
[3]:
is_proteodata(adata_protein)
[3]:
(True, 'protein')
The key rules for protein-level proteodata:
.var["protein_id"] must exist
.var["protein_id"] values must exactly match .var_names (same values, same order)
.obs["sample_id"] must exist
All indices must be unique
.X must not contain np.inf or -np.inf
protein_id must not contain NaN
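The rules above can be sketched with plain NumPy and pandas. The helper below is a hypothetical illustration and not ProteoPy's implementation; it takes the three components separately (instead of an AnnData object) and returns a list of violated rules.

```python
import numpy as np
import pandas as pd


def sketch_protein_level_problems(X, obs, var):
    """Return a list of rule violations (an empty list means all rules hold).

    Illustrative only: this is NOT how ProteoPy validates proteodata.
    """
    problems = []
    if "sample_id" not in obs.columns:
        problems.append("obs has no 'sample_id' column")
    if "protein_id" not in var.columns:
        problems.append("var has no 'protein_id' column")
    else:
        if var["protein_id"].isna().any():
            problems.append("protein_id contains NaN")
        # Same values in the same order as the index
        if list(var["protein_id"]) != list(var.index):
            problems.append("protein_id does not match the var index")
    if not obs.index.is_unique or not var.index.is_unique:
        problems.append("indices are not unique")
    # NaN is tolerated, +/-inf is not
    if np.isinf(np.asarray(X, dtype=float)).any():
        problems.append("X contains infinite values")
    return problems


obs = pd.DataFrame({"sample_id": ["s1", "s2"]}, index=["s1", "s2"])
var = pd.DataFrame({"protein_id": ["P1", "P2"]}, index=["P1", "P2"])
X = np.array([[1.0, np.nan], [2.0, 3.0]])

ok = sketch_protein_level_problems(X, obs, var)

# Break the index/column agreement and re-check
var_renamed = var.copy()
var_renamed.index = ["GeneA", "GeneB"]
broken = sketch_protein_level_problems(X, obs, var_renamed)
print(ok, broken)
```

In practice, is_proteodata() remains the authoritative check; the sketch only makes the individual rules explicit.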
3. Building a valid peptide-level proteodata
Peptide-level data requires an additional peptide_id column in .var. Each peptide must map to exactly one protein.
[4]:
# -- Sample metadata --
sample_ids = ["sample_A", "sample_B"]
obs = pd.DataFrame({"sample_id": sample_ids}, index=sample_ids)
# -- Peptide metadata --
# Two peptides from PROT_X and one from PROT_Y
peptide_ids = ["PEPTIDE_1", "PEPTIDE_2", "PEPTIDE_3"]
protein_ids = ["PROT_X", "PROT_X", "PROT_Y"]
var = pd.DataFrame(
    {"peptide_id": peptide_ids, "protein_id": protein_ids},
    index=peptide_ids,
)
# -- Intensity matrix (2 samples x 3 peptides) --
X = np.array([
    [500.0, 300.0, 800.0],
    [520.0, 310.0, 790.0],
])
adata_peptide = AnnData(X=X, obs=obs, var=var)
is_proteodata(adata_peptide)
[4]:
(True, 'peptide')
The additional rules for peptide-level proteodata:
.var["peptide_id"] must exist and match .var_names exactly
.var["protein_id"] must exist; every peptide needs a parent protein
Each peptide maps to exactly one protein (no multi-mapping like "PROT_A;PROT_B")
Neither peptide_id nor protein_id may contain NaN
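As a minimal sketch of the one-protein-per-peptide rule, the snippet below flags multi-mapped peptides in a pandas DataFrame that stands in for .var. It assumes that multi-mapping is encoded with a ";" separator, as in the "PROT_A;PROT_B" example; other encodings would need a different test. Dropping ambiguous peptides is only one possible policy.

```python
import pandas as pd

# Stand-in for adata.var at the peptide level (illustrative data)
var = pd.DataFrame(
    {
        "peptide_id": ["PEP_1", "PEP_2", "PEP_3"],
        "protein_id": ["PROT_A", "PROT_A;PROT_B", "PROT_C"],
    },
    index=["PEP_1", "PEP_2", "PEP_3"],
)

# Assumption: multi-mapped peptides carry a ";" separated protein list
is_multi = var["protein_id"].str.contains(";", regex=False)
multi_mapped = list(var.index[is_multi.to_numpy()])

# One possible policy: keep only unambiguous peptides
var_unique = var.loc[~is_multi]
print(multi_mapped, list(var_unique.index))
```

With a real AnnData object the intensity matrix has to be subset along the same peptides, so the filter would be applied to the whole object rather than to .var alone.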
4. is_proteodata vs check_proteodata
ProteoPy provides two validation functions:
| Function | On failure | Use case |
|---|---|---|
| is_proteodata() | Returns (False, None) | Conditional logic: “is this proteodata?” |
| check_proteodata() | Raises ValueError | Guard clauses: “this must be proteodata” |
is_proteodata also accepts raise_error=True to behave like check_proteodata.
Both accept a layers parameter to additionally validate layer matrices for infinite values.
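The difference between a soft and a hard check can be illustrated without ProteoPy. The functions below are hypothetical stand-ins that validate a plain list of identifiers; they only mirror the return-versus-raise pattern described above, not the actual checks.

```python
def soft_check(ids):
    """Hypothetical soft check: returns (is_valid, message)."""
    if len(set(ids)) != len(ids):
        return False, "identifiers are not unique"
    return True, None


def hard_check(ids):
    """Hypothetical hard check: raises ValueError on failure."""
    is_valid, message = soft_check(ids)
    if not is_valid:
        raise ValueError(message)


# Conditional logic: branch on the result
valid, _ = soft_check(["P1", "P2", "P2"])


# Guard clause: fail fast at the top of a function
def mean_id_length(ids):
    hard_check(ids)
    return sum(len(i) for i in ids) / len(ids)
```

The soft variant suits code that needs to branch (for example, choosing between a protein-level and a peptide-level path), while the hard variant keeps function bodies free of validation branches.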
[5]:
# is_proteodata: soft check — returns a tuple
result = is_proteodata(adata_protein)
print(f"Valid: {result[0]}, Level: {result[1]}")
Valid: True, Level: protein
[6]:
# check_proteodata: hard check — raises on failure
try:
    check_proteodata(adata_protein)
    print("Validation passed!")
except ValueError as e:
    print(f"Validation failed: {e}")
Validation passed!
5. Pitfalls: how valid proteodata can break
Even if you start with a valid proteodata object, common operations can silently break the format. We define a single adata below and work with deep copies (adata.copy()) in each example so the original remains untouched.
[7]:
# One adata for all pitfall demonstrations (3 samples x 4 proteins)
obs = pd.DataFrame(
    {"sample_id": ["s1", "s2", "s3"]},
    index=["s1", "s2", "s3"],
)
proteins = ["PROT_A", "PROT_B", "PROT_C", "PROT_D"]
var = pd.DataFrame(
    {"protein_id": proteins},
    index=proteins,
)
X = np.array([
    [100.0, 0.0, 50.0, 200.0],
    [200.0, 50.0, 50.0, 300.0],
    [150.0, 80.0, 50.0, 250.0],
])
adata = AnnData(X=X, obs=obs, var=var)
print("Starting point:", is_proteodata(adata))
adata
Starting point: (True, 'protein')
[7]:
AnnData object with n_obs × n_vars = 3 × 4
obs: 'sample_id'
var: 'protein_id'
5.1 Renaming an index without updating the corresponding metadata column
If you rename an index such as .var_names in a protein-level proteodata object, the protein_id column no longer matches the index, which violates the proteodata format.
[8]:
adata_mod = adata.copy()
# Rename the index to gene names
adata_mod.var_names = ["GeneA", "GeneB", "GeneC", "GeneD"]
print("After:", is_proteodata(adata_mod))
After: (False, None)
The protein_id column still contains ["PROT_A", "PROT_B", "PROT_C", "PROT_D"], but the index now says ["GeneA", "GeneB", "GeneC", "GeneD"]. To fix this, always update both the index and the ID column together.
[9]:
# Update protein_id to match the new index
adata_mod.var["protein_id"] = adata_mod.var_names
print("Repaired:", is_proteodata(adata_mod))
Repaired: (True, 'protein')
The same applies to renaming:
.var_names of a peptide-level proteodata object, where .var["peptide_id"] must be kept in sync.
.obs_names, where .obs["sample_id"] must be kept in sync.
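One way to avoid a stale column is to rename the index and the ID column in a single step. The helper below is a hypothetical sketch that operates on a pandas DataFrame standing in for .var or .obs; how the result is written back to an AnnData object is left open here.

```python
import pandas as pd


def rename_ids_in_sync(df, id_column, mapping):
    """Rename index labels and the matching ID column together.

    Hypothetical helper; labels missing from `mapping` are kept unchanged.
    """
    out = df.rename(index=mapping)          # returns a new DataFrame
    out[id_column] = out.index.to_numpy()   # copy the new labels into the column
    return out


var = pd.DataFrame(
    {"protein_id": ["PROT_A", "PROT_B"]},
    index=["PROT_A", "PROT_B"],
)
var_renamed = rename_ids_in_sync(var, "protein_id", {"PROT_A": "GeneA"})
print(var_renamed)
```

Because the column is always derived from the index after renaming, the two cannot drift apart.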
5.2 Renaming protein_id, peptide_id, or sample_id without updating the corresponding index
Conversely, if protein_id or peptide_id is modified, .var_names must be updated to match. The same holds for sample_id and .obs_names. Here we demonstrate with sample_id:
[10]:
adata_mod = adata.copy()
# Rename sample_id — obs_names is now stale
adata_mod.obs["sample_id"] = ["new_s1", "new_s2", "new_s3"]
print("obs_names:", list(adata_mod.obs_names))
print("sample_id:", list(adata_mod.obs["sample_id"]))
print("After:", is_proteodata(adata_mod))
obs_names: ['s1', 's2', 's3']
sample_id: ['new_s1', 'new_s2', 'new_s3']
After: (False, None)
[11]:
# Update obs_names to match
adata_mod.obs_names = ["new_s1", "new_s2", "new_s3"]
print("obs_names:", list(adata_mod.obs_names))
print("sample_id:", list(adata_mod.obs["sample_id"]))
print("Repaired:", is_proteodata(adata_mod))
obs_names: ['new_s1', 'new_s2', 'new_s3']
sample_id: ['new_s1', 'new_s2', 'new_s3']
Repaired: (True, 'protein')
5.3 Infinite values from mathematical operations
Infinite values can easily arise from common data transformations. A typical example is log-transforming data that contains zeros:
[12]:
adata_mod = adata.copy()
# Our matrix contains a zero in PROT_B (from the initial setup)
print("Current X:")
print(adata_mod.X)
Current X:
[[100. 0. 50. 200.]
[200. 50. 50. 300.]
[150. 80. 50. 250.]]
[13]:
print("\nBefore log2:", is_proteodata(adata_mod))
Before log2: (True, 'protein')
[14]:
# log2(0) = -inf!
adata_mod.X = np.log2(adata_mod.X)
print("Matrix after log2:")
print(adata_mod.X)
print("\nAfter log2:", is_proteodata(adata_mod))
Matrix after log2:
[[6.64385619 -inf 5.64385619 7.64385619]
[7.64385619 5.64385619 5.64385619 8.22881869]
[7.22881869 6.32192809 5.64385619 7.96578428]]
After log2: (False, None)
/tmp/ipykernel_2256571/3414031841.py:2: RuntimeWarning: divide by zero encountered in log2
adata_mod.X = np.log2(adata_mod.X)
[15]:
# The detailed error message tells you what went wrong
try:
    is_proteodata(adata_mod, raise_error=True)
except ValueError as e:
    print(e)
AnnData.X contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.
[16]:
# Fix: replace zeros with nan before log-transforming
adata_mod = adata.copy()
adata_mod.X[adata_mod.X == 0] = np.nan
adata_mod.X = np.log2(adata_mod.X) # log2(NaN) = NaN — safe
print("\nValid again:", is_proteodata(adata_mod))
Valid again: (True, 'protein')
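The same fix can be wrapped in a small function that works on a copy and verifies its own result. This is a minimal NumPy-only sketch; the function name is hypothetical, and only exact zeros are masked, as in the cell above.

```python
import numpy as np


def log2_zeros_to_nan(X):
    """log2-transform a copy of X, turning exact zeros into NaN first."""
    X = np.array(X, dtype=float)   # np.array copies, so the input stays untouched
    X[X == 0] = np.nan
    out = np.log2(X)
    # Postcondition: the transform must not introduce infinite values
    assert not np.isinf(out).any()
    return out


X_raw = np.array([[100.0, 0.0], [200.0, 50.0]])
X_log = log2_zeros_to_nan(X_raw)
print(X_log)
```

Whether a zero intensity should be treated as missing is a modelling decision; the sketch simply follows the convention used in this tutorial.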
Normalization is another source of infinite values. For example, variance scaling divides each protein’s intensities by its standard deviation. If a protein has identical intensities across all samples, its standard deviation is zero and the division yields inf for any nonzero intensity:
[17]:
adata_mod = adata.copy()
# Variance scaling: x / std (per protein)
stds = np.std(adata_mod.X, axis=0)
print("Per-protein std:", stds)
print("PROT_C has std = 0 — division by zero!\n")
Per-protein std: [40.82482905 32.99831646 0. 40.82482905]
PROT_C has std = 0 — division by zero!
[18]:
adata_mod.X = adata_mod.X / stds
print("Scaled X:")
print(adata_mod.X)
Scaled X:
[[2.44948974 0. inf 4.89897949]
[4.89897949 1.51522882 inf 7.34846923]
[3.67423461 2.42436611 inf 6.12372436]]
/tmp/ipykernel_2256571/3963694927.py:1: RuntimeWarning: divide by zero encountered in divide
adata_mod.X = adata_mod.X / stds
[19]:
print("After var-scaling:", is_proteodata(adata_mod))
After var-scaling: (False, None)
One fix is to remove zero-variance proteins before scaling:
[20]:
adata_mod = adata.copy()
pr.pp.remove_zero_variance_vars(adata_mod)
stds = np.std(adata_mod.X, axis=0)
adata_mod.X = adata_mod.X / stds
print("\nSubset and scaled X:")
print(adata_mod.X)
Removed 1 variables.
Subset and scaled X:
[[2.44948974 0. 4.89897949]
[4.89897949 1.51522882 7.34846923]
[3.67423461 2.42436611 6.12372436]]
[21]:
print("Valid again:", is_proteodata(adata_mod))
Valid again: (True, 'protein')
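The filtering step can also be expressed directly as a NumPy boolean mask, which shows what happens to the matrix. This is a sketch of the idea on the tutorial's matrix, not the implementation of pr.pp.remove_zero_variance_vars.

```python
import numpy as np

X = np.array([
    [100.0, 0.0, 50.0, 200.0],
    [200.0, 50.0, 50.0, 300.0],
    [150.0, 80.0, 50.0, 250.0],
])

stds = np.nanstd(X, axis=0)   # per-column std, ignoring NaN
keep = stds > 0               # False for constant columns
X_scaled = X[:, keep] / stds[keep]
print(keep)
print(X_scaled)
```

On an AnnData object the same mask would have to be applied to the variables as a whole, so that .var and .X stay aligned.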
Note: Z-score normalization with zero-variance variables yields np.nan, which is allowed in the proteodata format.
[22]:
adata_mod = adata.copy()
# Z-score normalization: (x - mean) / std (per protein)
stds = np.std(adata_mod.X, axis=0)
means = np.mean(adata_mod.X, axis=0)
adata_mod.X = (adata_mod.X - means) / stds
print("Normalized X:")
print(adata_mod.X)
Normalized X:
[[-1.22474487 -1.31319831 nan -1.22474487]
[ 1.22474487 0.20203051 nan 1.22474487]
[ 0. 1.1111678 nan 0. ]]
/tmp/ipykernel_2256571/1869068870.py:6: RuntimeWarning: invalid value encountered in divide
adata_mod.X = (adata_mod.X - means) / stds
[23]:
print("\nis_proteodata:", is_proteodata(adata_mod))
is_proteodata: (True, 'protein')
PROT_C is now entirely NaN (0 / 0). While NaN passes proteodata validation, the values are meaningless. With a nonzero numerator, division by zero would produce inf and fail validation.
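To make the handling of constant proteins explicit instead of relying on 0 / 0, the division can be restricted to columns with a nonzero standard deviation. The sketch below uses the out and where arguments of np.divide; filling the skipped columns with NaN is a choice made here for illustration.

```python
import numpy as np

X = np.array([
    [100.0, 0.0, 50.0, 200.0],
    [200.0, 50.0, 50.0, 300.0],
    [150.0, 80.0, 50.0, 250.0],
])

means = X.mean(axis=0)
stds = X.std(axis=0)

# Divide only where std > 0; skipped columns keep the NaN prefill
Z = np.full_like(X, np.nan)
np.divide(X - means, stds, out=Z, where=stds > 0)

# Report which columns were constant so they can be inspected or removed
constant_columns = np.flatnonzero(stds == 0)
print(constant_columns)
print(Z)
```

This variant runs without a RuntimeWarning and records which variables were affected, so they can be dropped deliberately rather than carried along as all-NaN columns.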
6. Validating layers
ProteoPy functions sometimes store intermediate results in adata.layers (e.g., raw intensities before transformation). Both is_proteodata and check_proteodata accept a layers parameter to validate those matrices as well.
We continue with a fresh copy of the same adata from section 5.
[24]:
adata_mod = adata.copy()
print("Without layers check:", is_proteodata(adata_mod))
Without layers check: (True, 'protein')
[25]:
# Add a layer with an infinite value
bad_layer = adata_mod.X.copy()
bad_layer[0, 0] = np.inf
adata_mod.layers["transformed"] = bad_layer
# Still passes without the layers parameter
print("Ignoring layers: ", is_proteodata(adata_mod))
Ignoring layers: (True, 'protein')
[26]:
# Fails when we explicitly check that layer
print("Checking layer: ", is_proteodata(adata_mod, layers="transformed"))
try:
    check_proteodata(adata_mod, layers="transformed")
except ValueError as e:
    print(e)
Checking layer: (False, None)
adata.layers['transformed'] contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.
[27]:
# You can check multiple layers at once
adata_mod.layers["raw"] = adata_mod.X.copy() # this one is fine
try:
    check_proteodata(adata_mod, layers=["raw", "transformed"])
except ValueError as e:
    print(e)
adata.layers['transformed'] contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.
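To find out which layers are affected before calling the validator, the layer matrices can be scanned directly. In the sketch below a plain dict stands in for adata.layers, and layers_with_inf is a hypothetical helper, not a ProteoPy function.

```python
import numpy as np


def layers_with_inf(layers, names=None):
    """Return the names of layers that contain +inf or -inf.

    `layers` is any mapping of name -> array. `names` may be None
    (check all layers), a single name, or a list of names.
    """
    if names is None:
        names = list(layers)
    elif isinstance(names, str):
        names = [names]
    return [
        name for name in names
        if np.isinf(np.asarray(layers[name], dtype=float)).any()
    ]


layers = {
    "raw": np.array([[1.0, 2.0], [3.0, np.nan]]),
    "transformed": np.array([[np.inf, 1.0], [1.5, 2.0]]),
}
bad = layers_with_inf(layers)
print(bad)
```

NaN entries are ignored on purpose, since only infinite values violate the format.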
7. Quick-reference checklist
Use this checklist when constructing or debugging a proteodata object:
| # | Check | Applies to |
|---|---|---|
| 1 | | Both |
| 2 | | Both |
| 3 | | Peptide only |
| 4 | | Protein only |
| 5 | Each peptide maps to exactly one protein | Peptide only |
| 6 | No | Both |
| 7 | No | Both |
| 8 | All indices (obs, var) are unique | Both |
| 9 | | Both |
| 10 | | Both |
Summary
The proteodata format is a thin but strict contract on top of AnnData that ensures ProteoPy functions can safely assume:
Who are the samples? Defined by .obs["sample_id"]
What are the variables? Defined by .var["protein_id"] (protein-level) or .var["peptide_id"] + .var["protein_id"] (peptide-level)
Is the data clean? No infinite values, no NaN identifiers, no duplicates
Always validate with check_proteodata() after constructing or modifying an AnnData. ProteoPy does this automatically in every public function, so you will get a clear error message if something is wrong.