proteodata: the ProteoPy data format

This tutorial explains the proteodata convention — the set of assumptions that every ProteoPy function relies on when working with an AnnData object. You will learn:

  • What constitutes a valid proteodata object at the protein level and at the peptide level

  • How to construct proteodata from scratch

  • How to use is_proteodata() and check_proteodata() to validate your data

  • Common pitfalls that break the format, and how to avoid them

Prerequisites: Basic familiarity with AnnData (observations, variables, .obs, .var, .X).

[1]:
import numpy as np
import pandas as pd
from anndata import AnnData
import proteopy as pr

from proteopy.utils.anndata import is_proteodata, check_proteodata

1. Anatomy of a proteodata object

ProteoPy stores proteomics data in the AnnData format, where:

Slot

Content

.X

Intensity matrix (samples x proteins/peptides), containing float, int, or np.nan

.obs

Sample metadata — must include a sample_id column

.var

Protein / peptide metadata — must include protein_id (and peptide_id for peptide-level data)

.obs_names

Sample index — must be unique

.var_names

Protein/peptide index — must be unique

The is_proteodata() function checks all of these assumptions and returns a tuple: (True, "protein"), (True, "peptide"), or (False, None).


2. Building a valid protein-level proteodata

[2]:
# -- Sample metadata --
sample_ids = ["sample_A", "sample_B", "sample_C"]
obs = pd.DataFrame({"sample_id": sample_ids}, index=sample_ids)

# -- Protein metadata --
protein_ids = ["P12345", "Q67890", "O11223", "P44556"]
var = pd.DataFrame({"protein_id": protein_ids}, index=protein_ids)

# -- Intensity matrix (3 samples x 4 proteins) --
X = np.array([
    [100.0, 200.0,  50.0, 300.0],
    [110.0, np.nan, 55.0, 280.0],
    [ 95.0, 210.0,  48.0, 310.0],
])

adata_protein = AnnData(X=X, obs=obs, var=var)
adata_protein
[2]:
AnnData object with n_obs × n_vars = 3 × 4
    obs: 'sample_id'
    var: 'protein_id'
[3]:
is_proteodata(adata_protein)
[3]:
(True, 'protein')

The key rules for protein-level proteodata:

  1. .var["protein_id"] must exist

  2. .var["protein_id"] values must exactly match .var_names (same values, same order)

  3. .obs["sample_id"] must exist

  4. All indices must be unique

  5. .X must not contain np.inf or -np.inf

  6. protein_id must not contain NaN


3. Building a valid peptide-level proteodata

Peptide-level data requires an additional peptide_id column in .var. Each peptide must map to exactly one protein.

[4]:
# -- Sample metadata --
sample_ids = ["sample_A", "sample_B"]
obs = pd.DataFrame({"sample_id": sample_ids}, index=sample_ids)

# -- Peptide metadata --
# Two peptides from PROT_X and one from PROT_Y
peptide_ids = ["PEPTIDE_1", "PEPTIDE_2", "PEPTIDE_3"]
protein_ids = ["PROT_X", "PROT_X", "PROT_Y"]
var = pd.DataFrame(
    {"peptide_id": peptide_ids, "protein_id": protein_ids},
    index=peptide_ids,
)

# -- Intensity matrix (2 samples x 3 peptides) --
X = np.array([
    [500.0, 300.0, 800.0],
    [520.0, 310.0, 790.0],
])

adata_peptide = AnnData(X=X, obs=obs, var=var)
is_proteodata(adata_peptide)
[4]:
(True, 'peptide')

The additional rules for peptide-level proteodata:

  1. .var["peptide_id"] must exist and match .var_names exactly

  2. .var["protein_id"] must exist — every peptide needs a parent protein

  3. Each peptide maps to exactly one protein (no multi-mapping like "PROT_A;PROT_B")

  4. Neither peptide_id nor protein_id may contain NaN


4. is_proteodata vs check_proteodata

ProteoPy provides two validation functions:

Function

On failure

Use case

is_proteodata(adata)

Returns (False, None)

Conditional logic — “is this proteodata?”

check_proteodata(adata)

Raises ValueError

Guard clauses — “this must be proteodata”

is_proteodata also accepts raise_error=True to behave like check_proteodata.

Both accept a layers parameter to additionally validate layer matrices for infinite values.

[5]:
# is_proteodata: soft check — returns a tuple
result = is_proteodata(adata_protein)
print(f"Valid: {result[0]}, Level: {result[1]}")
Valid: True, Level: protein
[6]:
# check_proteodata: hard check — raises on failure
try:
    check_proteodata(adata_protein)
    print("Validation passed!")
except ValueError as e:
    print(f"Validation failed: {e}")
Validation passed!

5. Pitfalls: how valid proteodata can break

Even if you start with a valid proteodata object, common operations can silently break the format. We define a single adata below and work with deep copies (adata.copy()) in each example so the original remains untouched.

[7]:
# One adata for all pitfall demonstrations (3 samples x 4 proteins)
obs = pd.DataFrame(
    {"sample_id": ["s1", "s2", "s3"]},
    index=["s1", "s2", "s3"],
)
proteins = ["PROT_A", "PROT_B", "PROT_C", "PROT_D"]
var = pd.DataFrame(
    {"protein_id": proteins},
    index=proteins,
)
X = np.array([
    [100.0,  0.0, 50.0, 200.0],
    [200.0, 50.0, 50.0, 300.0],
    [150.0, 80.0, 50.0, 250.0],
])

adata = AnnData(X=X, obs=obs, var=var)
print("Starting point:", is_proteodata(adata))
adata
Starting point: (True, 'protein')
[7]:
AnnData object with n_obs × n_vars = 3 × 4
    obs: 'sample_id'
    var: 'protein_id'

5.1 Renaming an index without updating the corresponding metadata column

If you rename an index such as .var_names in a protein-level proteodata object, the protein_id column no longer matches (a proteodata format requirement).

[8]:
adata_mod = adata.copy()

# Rename the index to gene names
adata_mod.var_names = ["GeneA", "GeneB", "GeneC", "GeneD"]

print("After:", is_proteodata(adata_mod))
After: (False, None)

The protein_id column still contains ["PROT_A", "PROT_B", "PROT_C", "PROT_D"], but the index now says ["GeneA", "GeneB", "GeneC", "GeneD"]. To fix this, always update both the index and the ID column together.

[9]:
# Update protein_id to match the new index
adata_mod.var["protein_id"] = adata_mod.var_names

print("Repaired:", is_proteodata(adata_mod))
Repaired: (True, 'protein')

The same applies to renaming:

  • .var_names of a peptide-level proteodata object, where .var["peptide_id"] must be kept in sync.

  • .obs_names, where .obs["sample_id"] must be kept in sync.

5.2 Renaming protein_id, peptide_id, or sample_id without updating the corresponding index

Conversely, if protein_id or peptide_id are modified, .var_names must be updated to match. The same holds for sample_id and .obs_names. Here we demonstrate with sample_id:

[10]:
adata_mod = adata.copy()

# Rename sample_id — obs_names is now stale
adata_mod.obs["sample_id"] = ["new_s1", "new_s2", "new_s3"]

print("obs_names:", list(adata_mod.obs_names))
print("sample_id:", list(adata_mod.obs["sample_id"]))

print("After:", is_proteodata(adata_mod))
obs_names: ['s1', 's2', 's3']
sample_id: ['new_s1', 'new_s2', 'new_s3']
After: (False, None)
[11]:
# Update obs_names to match
adata_mod.obs_names = ["new_s1", "new_s2", "new_s3"]

print("obs_names:", list(adata_mod.obs_names))
print("sample_id:", list(adata_mod.obs["sample_id"]))

print("Repaired:", is_proteodata(adata_mod))
obs_names: ['new_s1', 'new_s2', 'new_s3']
sample_id: ['new_s1', 'new_s2', 'new_s3']
Repaired: (True, 'protein')

5.3 Infinite values from mathematical operations

Infinite values can easily arise from common data transformations. A typical example is log-transforming data that contains zeros:

[12]:
adata_mod = adata.copy()

# Our matrix contains a zero in PROT_B (from the initial setup)
print("Current X:")
print(adata_mod.X)
Current X:
[[100.   0.  50. 200.]
 [200.  50.  50. 300.]
 [150.  80.  50. 250.]]
[13]:
print("\nBefore log2:", is_proteodata(adata_mod))

Before log2: (True, 'protein')
[14]:
# log2(0) = -inf!
adata_mod.X = np.log2(adata_mod.X)
print("Matrix after log2:")
print(adata_mod.X)

print("\nAfter log2:", is_proteodata(adata_mod))
Matrix after log2:
[[6.64385619       -inf 5.64385619 7.64385619]
 [7.64385619 5.64385619 5.64385619 8.22881869]
 [7.22881869 6.32192809 5.64385619 7.96578428]]

After log2: (False, None)
/tmp/ipykernel_2256571/3414031841.py:2: RuntimeWarning: divide by zero encountered in log2
  adata_mod.X = np.log2(adata_mod.X)
[15]:
# The detailed error message tells you what went wrong
try:
    is_proteodata(adata_mod, raise_error=True)
except ValueError as e:
    print(e)
AnnData.X contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.
[16]:
# Fix: replace zeros with nan before log-transforming
adata_mod = adata.copy()
adata_mod.X[adata_mod.X == 0] = np.nan
adata_mod.X = np.log2(adata_mod.X)  # log2(NaN) = NaN — safe

print("\nValid again:", is_proteodata(adata_mod))

Valid again: (True, 'protein')

Another source of division by zero is normalization. For example, variance scaling divides each protein’s intensities by its standard deviation. If a protein has identical intensities across all samples, its standard deviation is zero:

[17]:
adata_mod = adata.copy()

# Variance scaling: x / std (per protein)
stds = np.std(adata_mod.X, axis=0)
print("Per-protein std:", stds)
print("PROT_C has std = 0 — division by zero!\n")
Per-protein std: [40.82482905 32.99831646  0.         40.82482905]
PROT_C has std = 0 — division by zero!

[18]:
adata_mod.X = adata_mod.X / stds

print("Scaled X:")
print(adata_mod.X)
Scaled X:
[[2.44948974 0.                inf 4.89897949]
 [4.89897949 1.51522882        inf 7.34846923]
 [3.67423461 2.42436611        inf 6.12372436]]
/tmp/ipykernel_2256571/3963694927.py:1: RuntimeWarning: divide by zero encountered in divide
  adata_mod.X = adata_mod.X / stds
[19]:
print("After var-scaling:", is_proteodata(adata_mod))
After var-scaling: (False, None)

One fix is to remove zero-variance proteins before scaling:

[20]:
adata_mod = adata.copy()

pr.pp.remove_zero_variance_vars(adata_mod)

stds = np.std(adata_mod.X, axis=0)
adata_mod.X = adata_mod.X / stds

print("\nSubset and scaled X:")
print(adata_mod.X)
Removed 1 variables.

Subset and scaled X:
[[2.44948974 0.         4.89897949]
 [4.89897949 1.51522882 7.34846923]
 [3.67423461 2.42436611 6.12372436]]
[21]:
print("Valid again:", is_proteodata(adata_mod))
Valid again: (True, 'protein')
Note
Z-score normalization with zero-variance variables yields np.nan, which is allowed in the proteodata format.
[22]:
adata_mod = adata.copy()

# Z-score normalization: x / std (per protein)
stds = np.std(adata_mod.X, axis=0)
means = np.mean(adata_mod.X, axis=0)
adata_mod.X = (adata_mod.X - means) / stds

print("Normalized X:")
print(adata_mod.X)
Normalized X:
[[-1.22474487 -1.31319831         nan -1.22474487]
 [ 1.22474487  0.20203051         nan  1.22474487]
 [ 0.          1.1111678          nan  0.        ]]
/tmp/ipykernel_2256571/1869068870.py:6: RuntimeWarning: invalid value encountered in divide
  adata_mod.X = (adata_mod.X - means) / stds
[23]:
print("\nis_proteodata:", is_proteodata(adata_mod))

is_proteodata: (True, 'protein')

PROT_C is now entirely NaN (0 / 0). While NaN passes proteodata validation, the values are meaningless. With a nonzero numerator, division by zero would produce inf and fail validation.


6. Validating layers

ProteoPy functions sometimes store intermediate results in adata.layers (e.g., raw intensities before transformation). Both is_proteodata and check_proteodata accept a layers parameter to validate those matrices as well.

We continue with a fresh copy of the same adata from section 5.

[24]:
adata_mod = adata.copy()

print("Without layers check:", is_proteodata(adata_mod))
Without layers check: (True, 'protein')
[25]:
# Add a layer with an infinite value
bad_layer = adata_mod.X.copy()
bad_layer[0, 0] = np.inf
adata_mod.layers["transformed"] = bad_layer

# Still passes without the layers parameter
print("Ignoring layers:    ", is_proteodata(adata_mod))
Ignoring layers:     (True, 'protein')
[26]:
# Fails when we explicitly check that layer
print("Checking layer:     ", is_proteodata(adata_mod, layers="transformed"))

try:
    check_proteodata(adata_mod, layers="transformed")
except ValueError as e:
    print(e)
Checking layer:      (False, None)
adata.layers['transformed'] contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.
[27]:
# You can check multiple layers at once
adata_mod.layers["raw"] = adata_mod.X.copy()  # this one is fine

try:
    check_proteodata(adata_mod, layers=["raw", "transformed"])
except ValueError as e:
    print(e)
adata.layers['transformed'] contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.

7. Quick-reference checklist

Use this checklist when constructing or debugging a proteodata object:

#

Check

Applies to

1

.obs["sample_id"] exists

Both

2

.var["protein_id"] exists

Both

3

.var["peptide_id"] exists and matches .var_names

Peptide only

4

.var["protein_id"] matches .var_names

Protein only

5

Each peptide maps to exactly one protein

Peptide only

6

No NaN in protein_id or peptide_id

Both

7

No np.inf or -np.inf in .X or checked layers

Both

8

All indices (obs, var) are unique

Both

9

.X is 2-dimensional

Both

10

protein_id/peptide_id not in .obs; sample_id not in .var

Both


Summary

The proteodata format is a thin but strict contract on top of AnnData that ensures ProteoPy functions can safely assume:

  • Who are the samples? — defined by .obs["sample_id"]

  • What are the variables? — defined by .var["protein_id"] (protein-level) or .var["peptide_id"] + .var["protein_id"] (peptide-level)

  • Is the data clean? — no infinite values, no NaN identifiers, no duplicates

Always validate with check_proteodata() after constructing or modifying an AnnData. ProteoPy does this automatically in every public function, so you will get a clear error message if something is wrong.