`proteodata`: the ProteoPy data format

This tutorial explains the proteodata convention — the set of assumptions that every ProteoPy function relies on when working with an AnnData object. You will learn:

What constitutes a valid proteodata object at the protein level and at the peptide level
How to construct proteodata from scratch
How to use is_proteodata() and check_proteodata() to validate your data
Common pitfalls that break the format, and how to avoid them

Prerequisites: Basic familiarity with AnnData (observations, variables, .obs, .var, .X).

[1]:

import numpy as np
import pandas as pd
from anndata import AnnData
import proteopy as pr

from proteopy.utils.anndata import is_proteodata, check_proteodata

1. Anatomy of a proteodata object

ProteoPy stores proteomics data in the AnnData format, where:

Slot	Content
`.X`	Intensity matrix (samples x proteins/peptides), containing `float`, `int`, or `np.nan`
`.obs`	Sample metadata; must include a `sample_id` column whose values match `.obs_names`
`.var`	Protein / peptide metadata; must include `protein_id` (and `peptide_id` for peptide-level data)
`.obs_names`	Sample index; must be unique
`.var_names`	Protein/peptide index; must be unique

The is_proteodata() function checks all of these assumptions and returns a tuple: (True, "protein"), (True, "peptide"), or (False, None).

2. Building a valid protein-level proteodata

[2]:

# -- Sample metadata --
sample_ids = ["sample_A", "sample_B", "sample_C"]
obs = pd.DataFrame({"sample_id": sample_ids}, index=sample_ids)

# -- Protein metadata --
protein_ids = ["P12345", "Q67890", "O11223", "P44556"]
var = pd.DataFrame({"protein_id": protein_ids}, index=protein_ids)

# -- Intensity matrix (3 samples x 4 proteins) --
X = np.array([
    [100.0, 200.0,  50.0, 300.0],
    [110.0, np.nan, 55.0, 280.0],
    [ 95.0, 210.0,  48.0, 310.0],
])

adata_protein = AnnData(X=X, obs=obs, var=var)
adata_protein

[2]:

AnnData object with n_obs × n_vars = 3 × 4
    obs: 'sample_id'
    var: 'protein_id'

[3]:

is_proteodata(adata_protein)

[3]:

(True, 'protein')

The key rules for protein-level proteodata:

.var["protein_id"] must exist
.var["protein_id"] values must exactly match .var_names (same values, same order)
.obs["sample_id"] must exist and match .obs_names
All indices must be unique
.X must not contain np.inf or -np.inf
protein_id must not contain NaN
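
To make these rules concrete, here is a minimal sketch that checks a subset of them with plain pandas and NumPy. It is not ProteoPy's implementation: `sketch_protein_level_problems` is a hypothetical helper, and it works on the raw `X`, `obs`, and `var` pieces rather than on an AnnData object.

```python
import numpy as np
import pandas as pd


def sketch_protein_level_problems(X, obs, var):
    """Return a list of rule violations (hypothetical helper, not ProteoPy code)."""
    problems = []
    if "sample_id" not in obs.columns:
        problems.append("obs is missing 'sample_id'")
    if "protein_id" not in var.columns:
        problems.append("var is missing 'protein_id'")
    else:
        if var["protein_id"].isna().any():
            problems.append("protein_id contains NaN")
        # Same values in the same order as the index
        if list(var["protein_id"]) != list(var.index):
            problems.append("protein_id does not match var index")
    if not obs.index.is_unique or not var.index.is_unique:
        problems.append("indices are not unique")
    # NaN is allowed, infinite values are not
    if np.isinf(np.asarray(X, dtype=float)).any():
        problems.append("X contains infinite values")
    return problems


obs = pd.DataFrame({"sample_id": ["s1", "s2"]}, index=["s1", "s2"])
var = pd.DataFrame({"protein_id": ["P1", "P2"]}, index=["P1", "P2"])
X = np.array([[1.0, np.nan], [2.0, 3.0]])
ok_problems = sketch_protein_level_problems(X, obs, var)

# Break two rules: rename the var index and introduce an infinite value
X_bad = X.copy()
X_bad[0, 0] = np.inf
var_bad = var.copy()
var_bad.index = ["G1", "G2"]
bad_problems = sketch_protein_level_problems(X_bad, obs, var_bad)

print(ok_problems)
print(bad_problems)
```

Returning a list of messages instead of a single boolean is a choice made for this sketch only; it makes it easy to see every violated rule at once.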

3. Building a valid peptide-level proteodata

Peptide-level data requires an additional peptide_id column in .var. Each peptide must map to exactly one protein.

[4]:

# -- Sample metadata --
sample_ids = ["sample_A", "sample_B"]
obs = pd.DataFrame({"sample_id": sample_ids}, index=sample_ids)

# -- Peptide metadata --
# Two peptides from PROT_X and one from PROT_Y
peptide_ids = ["PEPTIDE_1", "PEPTIDE_2", "PEPTIDE_3"]
protein_ids = ["PROT_X", "PROT_X", "PROT_Y"]
var = pd.DataFrame(
    {"peptide_id": peptide_ids, "protein_id": protein_ids},
    index=peptide_ids,
)

# -- Intensity matrix (2 samples x 3 peptides) --
X = np.array([
    [500.0, 300.0, 800.0],
    [520.0, 310.0, 790.0],
])

adata_peptide = AnnData(X=X, obs=obs, var=var)
is_proteodata(adata_peptide)

[4]:

(True, 'peptide')

The additional rules for peptide-level proteodata:

.var["peptide_id"] must exist and match .var_names exactly
.var["protein_id"] must exist — every peptide needs a parent protein
Each peptide maps to exactly one protein (no multi-mapping like "PROT_A;PROT_B")
Neither peptide_id nor protein_id may contain NaN
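
The one-protein-per-peptide rule can be screened for before building the AnnData object. The sketch below is an illustration under assumptions: `multi_mapped_peptides` is a hypothetical helper, and it assumes multi-mapping shows up either as a delimiter-joined `protein_id` (such as "PROT_A;PROT_B") or as the same `peptide_id` listed with different `protein_id` values. How ProteoPy detects multi-mapping internally is not shown here.

```python
import pandas as pd


def multi_mapped_peptides(var, sep=";"):
    """Return sorted peptide IDs that appear to map to more than one protein.

    Hypothetical helper; `var` needs 'peptide_id' and 'protein_id' columns.
    """
    # Case 1: several proteins joined into one string
    joined = var["protein_id"].astype(str).str.contains(sep, regex=False)
    # Case 2: the same peptide listed with different proteins
    n_proteins = var.groupby("peptide_id")["protein_id"].transform("nunique")
    mask = joined.to_numpy() | (n_proteins.to_numpy() > 1)
    return sorted(set(var.loc[mask, "peptide_id"]))


var_clean = pd.DataFrame({
    "peptide_id": ["PEP_1", "PEP_2", "PEP_3"],
    "protein_id": ["PROT_X", "PROT_X", "PROT_Y"],
})
var_multi = pd.DataFrame({
    "peptide_id": ["PEP_1", "PEP_2", "PEP_3"],
    "protein_id": ["PROT_X", "PROT_X;PROT_Y", "PROT_Y"],
})

clean_hits = multi_mapped_peptides(var_clean)
multi_hits = multi_mapped_peptides(var_multi)
print(clean_hits, multi_hits)
```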

4. `is_proteodata` vs `check_proteodata`

ProteoPy provides two validation functions:

Function	On failure	Use case
`is_proteodata(adata)`	Returns `(False, None)`	Conditional logic — “is this proteodata?”
`check_proteodata(adata)`	Raises `ValueError`	Guard clauses — “this must be proteodata”

is_proteodata also accepts raise_error=True to behave like check_proteodata.

Both accept a layers parameter to additionally validate layer matrices for infinite values.
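
The soft/hard split is a general validation pattern. The sketch below shows it on a deliberately small case and is not ProteoPy's implementation: `is_finite_matrix` and `check_finite_matrix` are hypothetical stand-ins that only look for infinite values in a list of rows.

```python
def is_finite_matrix(rows, raise_error=False):
    """Soft check: return True/False, or raise if raise_error=True."""
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            # NaN compares unequal to everything, so it passes this test
            if value in (float("inf"), float("-inf")):
                if raise_error:
                    raise ValueError(f"infinite value at row {i}, column {j}")
                return False
    return True


def check_finite_matrix(rows):
    """Hard check: raise ValueError on failure, return None on success."""
    is_finite_matrix(rows, raise_error=True)


soft_ok = is_finite_matrix([[1.0, float("nan")]])
soft_bad = is_finite_matrix([[1.0, float("inf")]])

try:
    check_finite_matrix([[float("-inf")]])
    hard_raised = False
except ValueError:
    hard_raised = True

print(soft_ok, soft_bad, hard_raised)
```

Implementing the hard check as a thin wrapper around the soft one keeps the rules in a single place, so both functions always agree on what counts as valid.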

[5]:

# is_proteodata: soft check — returns a tuple
result = is_proteodata(adata_protein)
print(f"Valid: {result[0]}, Level: {result[1]}")

Valid: True, Level: protein

[6]:

# check_proteodata: hard check — raises on failure
try:
    check_proteodata(adata_protein)
    print("Validation passed!")
except ValueError as e:
    print(f"Validation failed: {e}")

Validation passed!

5. Pitfalls: how valid proteodata can break

Even if you start with a valid proteodata object, common operations can silently break the format. We define a single adata below and work with deep copies (adata.copy()) in each example so the original remains untouched.

[7]:

# One adata for all pitfall demonstrations (3 samples x 4 proteins)
obs = pd.DataFrame(
    {"sample_id": ["s1", "s2", "s3"]},
    index=["s1", "s2", "s3"],
)
proteins = ["PROT_A", "PROT_B", "PROT_C", "PROT_D"]
var = pd.DataFrame(
    {"protein_id": proteins},
    index=proteins,
)
X = np.array([
    [100.0,  0.0, 50.0, 200.0],
    [200.0, 50.0, 50.0, 300.0],
    [150.0, 80.0, 50.0, 250.0],
])

adata = AnnData(X=X, obs=obs, var=var)
print("Starting point:", is_proteodata(adata))
adata

Starting point: (True, 'protein')

[7]:

AnnData object with n_obs × n_vars = 3 × 4
    obs: 'sample_id'
    var: 'protein_id'

5.1 Renaming an index without updating the corresponding metadata column

If you rename an index such as .var_names in a protein-level proteodata object, the protein_id column no longer matches it, which violates a proteodata format requirement.

[8]:

adata_mod = adata.copy()

# Rename the index to gene names
adata_mod.var_names = ["GeneA", "GeneB", "GeneC", "GeneD"]

print("After:", is_proteodata(adata_mod))

After: (False, None)

The protein_id column still contains ["PROT_A", "PROT_B", "PROT_C", "PROT_D"], but the index now says ["GeneA", "GeneB", "GeneC", "GeneD"]. To fix this, always update both the index and the ID column together.

[9]:

# Update protein_id to match the new index
adata_mod.var["protein_id"] = adata_mod.var_names

print("Repaired:", is_proteodata(adata_mod))

Repaired: (True, 'protein')

The same applies to renaming:

.var_names of a peptide-level proteodata object, where .var["peptide_id"] must be kept in sync.
.obs_names, where .obs["sample_id"] must be kept in sync.
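
One way to avoid a stale column is to rename the index and the ID column in a single step. The sketch below uses a plain pandas DataFrame as a stand-in for `adata.var` or `adata.obs`; `rename_ids_in_sync` is a hypothetical helper, not a ProteoPy function.

```python
import pandas as pd


def rename_ids_in_sync(df, id_col, mapping):
    """Return a copy of df with the index and id_col renamed together.

    Hypothetical helper. IDs that are missing from `mapping` stay unchanged.
    """
    out = df.copy()
    new_ids = [mapping.get(old, old) for old in out.index]
    out.index = pd.Index(new_ids, name=out.index.name)
    out[id_col] = new_ids
    return out


var = pd.DataFrame(
    {"protein_id": ["PROT_A", "PROT_B"]},
    index=["PROT_A", "PROT_B"],
)
renamed = rename_ids_in_sync(var, "protein_id", {"PROT_A": "GeneA"})

print(list(renamed.index))
print(list(renamed["protein_id"]))
```

With an AnnData object, the equivalent is to assign `adata.var_names` and `adata.var["protein_id"]` from the same list of new IDs, then re-run `is_proteodata`.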

5.2 Renaming `protein_id`, `peptide_id`, or `sample_id` without updating the corresponding index

Conversely, if protein_id or peptide_id is modified, .var_names must be updated to match. The same holds for sample_id and .obs_names. Here we demonstrate with sample_id:

[10]:

adata_mod = adata.copy()

# Rename sample_id — obs_names is now stale
adata_mod.obs["sample_id"] = ["new_s1", "new_s2", "new_s3"]

print("obs_names:", list(adata_mod.obs_names))
print("sample_id:", list(adata_mod.obs["sample_id"]))

print("After:", is_proteodata(adata_mod))

obs_names: ['s1', 's2', 's3']
sample_id: ['new_s1', 'new_s2', 'new_s3']
After: (False, None)

[11]:

# Update obs_names to match
adata_mod.obs_names = ["new_s1", "new_s2", "new_s3"]

print("obs_names:", list(adata_mod.obs_names))
print("sample_id:", list(adata_mod.obs["sample_id"]))

print("Repaired:", is_proteodata(adata_mod))

obs_names: ['new_s1', 'new_s2', 'new_s3']
sample_id: ['new_s1', 'new_s2', 'new_s3']
Repaired: (True, 'protein')

5.3 Infinite values from mathematical operations

Infinite values can easily arise from common data transformations. A typical example is log-transforming data that contains zeros:

[12]:

adata_mod = adata.copy()

# Our matrix contains a zero in PROT_B (from the initial setup)
print("Current X:")
print(adata_mod.X)

Current X:
[[100.   0.  50. 200.]
 [200.  50.  50. 300.]
 [150.  80.  50. 250.]]

[13]:

print("\nBefore log2:", is_proteodata(adata_mod))


Before log2: (True, 'protein')

[14]:

# log2(0) = -inf!
adata_mod.X = np.log2(adata_mod.X)
print("Matrix after log2:")
print(adata_mod.X)

print("\nAfter log2:", is_proteodata(adata_mod))

Matrix after log2:
[[6.64385619       -inf 5.64385619 7.64385619]
 [7.64385619 5.64385619 5.64385619 8.22881869]
 [7.22881869 6.32192809 5.64385619 7.96578428]]

After log2: (False, None)

/tmp/ipykernel_2256571/3414031841.py:2: RuntimeWarning: divide by zero encountered in log2
  adata_mod.X = np.log2(adata_mod.X)

[15]:

# The detailed error message tells you what went wrong
try:
    is_proteodata(adata_mod, raise_error=True)
except ValueError as e:
    print(e)

AnnData.X contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.

[16]:

# Fix: replace zeros with NaN before log-transforming
adata_mod = adata.copy()
adata_mod.X[adata_mod.X == 0] = np.nan
adata_mod.X = np.log2(adata_mod.X)  # log2(NaN) = NaN — safe

print("\nValid again:", is_proteodata(adata_mod))


Valid again: (True, 'protein')
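
The same fix can be wrapped in a small reusable function. This is a sketch in plain NumPy; `log2_zeros_as_nan` is a hypothetical helper name, and it assumes a dense, non-negative matrix (negative values are not handled).

```python
import numpy as np


def log2_zeros_as_nan(X):
    """Return log2(X) with zeros mapped to NaN instead of -inf.

    Hypothetical helper; works on a copy, so the input is left unchanged.
    """
    X = np.asarray(X, dtype=float).copy()
    X[X == 0] = np.nan
    return np.log2(X)


X = np.array([
    [100.0,  0.0],
    [200.0, 50.0],
])
logged = log2_zeros_as_nan(X)
print(logged)
```

Whether zeros should become missing values is an analysis decision; this sketch simply follows the approach used in the cell above.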

Another source of infinite values is normalization. For example, variance scaling divides each protein’s intensities by its standard deviation. If a protein has identical intensities across all samples, its standard deviation is zero, and the division yields inf:

[17]:

adata_mod = adata.copy()

# Variance scaling: x / std (per protein)
stds = np.std(adata_mod.X, axis=0)
print("Per-protein std:", stds)
print("PROT_C has std = 0 — division by zero!\n")

Per-protein std: [40.82482905 32.99831646  0.         40.82482905]
PROT_C has std = 0 — division by zero!

[18]:

adata_mod.X = adata_mod.X / stds

print("Scaled X:")
print(adata_mod.X)

Scaled X:
[[2.44948974 0.                inf 4.89897949]
 [4.89897949 1.51522882        inf 7.34846923]
 [3.67423461 2.42436611        inf 6.12372436]]

/tmp/ipykernel_2256571/3963694927.py:1: RuntimeWarning: divide by zero encountered in divide
  adata_mod.X = adata_mod.X / stds

[19]:

print("After var-scaling:", is_proteodata(adata_mod))

After var-scaling: (False, None)

One fix is to remove zero-variance proteins before scaling:

[20]:

adata_mod = adata.copy()

pr.pp.remove_zero_variance_vars(adata_mod)

stds = np.std(adata_mod.X, axis=0)
adata_mod.X = adata_mod.X / stds

print("\nSubset and scaled X:")
print(adata_mod.X)

Removed 1 variables.

Subset and scaled X:
[[2.44948974 0.         4.89897949]
 [4.89897949 1.51522882 7.34846923]
 [3.67423461 2.42436611 6.12372436]]

[21]:

print("Valid again:", is_proteodata(adata_mod))

Valid again: (True, 'protein')
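
For readers who want to see the idea without the ProteoPy helper, the following plain NumPy sketch drops constant columns with a boolean mask before scaling. It illustrates the same idea on the matrix from this section and makes no claim about how `remove_zero_variance_vars` is implemented.

```python
import numpy as np

X = np.array([
    [100.0,  0.0, 50.0, 200.0],
    [200.0, 50.0, 50.0, 300.0],
    [150.0, 80.0, 50.0, 250.0],
])

stds = np.std(X, axis=0)
# False for constant columns; columns containing NaN also give False,
# because np.std returns NaN for them and NaN > 0 is False
keep = stds > 0
X_scaled = X[:, keep] / stds[keep]

print(keep)
print(X_scaled)
```

On an AnnData object the metadata has to be subset together with the matrix, which is why a dedicated function is the safer route there.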

Note

Z-score normalization with zero-variance variables yields np.nan, which is allowed in the proteodata format.

[22]:

adata_mod = adata.copy()

# Z-score normalization: (x - mean) / std (per protein)
stds = np.std(adata_mod.X, axis=0)
means = np.mean(adata_mod.X, axis=0)
adata_mod.X = (adata_mod.X - means) / stds

print("Normalized X:")
print(adata_mod.X)

Normalized X:
[[-1.22474487 -1.31319831         nan -1.22474487]
 [ 1.22474487  0.20203051         nan  1.22474487]
 [ 0.          1.1111678          nan  0.        ]]

/tmp/ipykernel_2256571/1869068870.py:6: RuntimeWarning: invalid value encountered in divide
  adata_mod.X = (adata_mod.X - means) / stds

[23]:

print("\nis_proteodata:", is_proteodata(adata_mod))


is_proteodata: (True, 'protein')

PROT_C is now entirely NaN (0 / 0). While NaN passes proteodata validation, the values are meaningless. With a nonzero numerator, division by zero would produce inf and fail validation.
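
The difference between the two cases follows from floating-point arithmetic and can be checked directly in NumPy. The snippet below is a small illustration; `np.errstate` is used only to silence the runtime warnings.

```python
import numpy as np

numerators = np.array([0.0, 50.0])
denominators = np.array([0.0, 0.0])

# Suppress the "divide by zero" and "invalid value" warnings for this demo
with np.errstate(divide="ignore", invalid="ignore"):
    result = numerators / denominators

# result[0] is 0 / 0 (NaN), result[1] is 50 / 0 (inf)
print(result)
```

This is why centering before scaling (z-score) yields NaN for a constant protein, while scaling alone yields inf.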

6. Validating layers

ProteoPy functions sometimes store intermediate results in adata.layers (e.g., raw intensities before transformation). Both is_proteodata and check_proteodata accept a layers parameter to validate those matrices as well.

We continue with a fresh copy of the same adata from section 5.

[24]:

adata_mod = adata.copy()

print("Without layers check:", is_proteodata(adata_mod))

Without layers check: (True, 'protein')

[25]:

# Add a layer with an infinite value
bad_layer = adata_mod.X.copy()
bad_layer[0, 0] = np.inf
adata_mod.layers["transformed"] = bad_layer

# Still passes without the layers parameter
print("Ignoring layers:    ", is_proteodata(adata_mod))

Ignoring layers:     (True, 'protein')

[26]:

# Fails when we explicitly check that layer
print("Checking layer:     ", is_proteodata(adata_mod, layers="transformed"))

try:
    check_proteodata(adata_mod, layers="transformed")
except ValueError as e:
    print(e)

Checking layer:      (False, None)
adata.layers['transformed'] contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.

[27]:

# You can check multiple layers at once
adata_mod.layers["raw"] = adata_mod.X.copy()  # this one is fine

try:
    check_proteodata(adata_mod, layers=["raw", "transformed"])
except ValueError as e:
    print(e)

adata.layers['transformed'] contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.
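
When it is unclear which layer is the culprit, it can help to list every layer that contains infinite values. The sketch below is an illustration in plain NumPy: `layers_with_infinite_values` is a hypothetical helper that accepts any mapping of layer name to dense array, for example `dict(adata.layers)`.

```python
import numpy as np


def layers_with_infinite_values(layers):
    """Return the names of layers containing np.inf or -np.inf.

    Hypothetical helper; assumes dense, numeric matrices.
    """
    return [
        name
        for name, matrix in layers.items()
        if np.isinf(np.asarray(matrix, dtype=float)).any()
    ]


raw = np.array([[100.0, np.nan], [200.0, 50.0]])
transformed = raw.copy()
transformed[0, 0] = np.inf

bad_layers = layers_with_infinite_values({"raw": raw, "transformed": transformed})
print(bad_layers)
```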

7. Quick-reference checklist

Use this checklist when constructing or debugging a proteodata object:

#	Check	Applies to
1	`.obs["sample_id"]` exists and matches `.obs_names`	Both
2	`.var["protein_id"]` exists	Both
3	`.var["peptide_id"]` exists and matches `.var_names`	Peptide only
4	`.var["protein_id"]` matches `.var_names`	Protein only
5	Each peptide maps to exactly one protein	Peptide only
6	No `NaN` in `protein_id` or `peptide_id`	Both
7	No `np.inf` or `-np.inf` in `.X` or checked layers	Both
8	All indices (obs, var) are unique	Both
9	`.X` is 2-dimensional	Both
10	`protein_id`/`peptide_id` not in `.obs`; `sample_id` not in `.var`	Both

Summary

The proteodata format is a thin but strict contract on top of AnnData that ensures ProteoPy functions can safely assume:

Who are the samples? — defined by .obs["sample_id"]
What are the variables? — defined by .var["protein_id"] (protein-level) or .var["peptide_id"] + .var["protein_id"] (peptide-level)
Is the data clean? — no infinite values, no NaN identifiers, no duplicates

Always validate with check_proteodata() after constructing or modifying an AnnData. ProteoPy does this automatically in every public function, so you will get a clear error message if something is wrong.
