{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `proteodata`: the ProteoPy data format\n",
    "\n",
    "This tutorial explains the **proteodata** convention — the set of assumptions that every ProteoPy function relies on when working with an `AnnData` object. You will learn:\n",
    "\n",
    "- What constitutes a valid proteodata object at the protein level and at the peptide level\n",
    "- How to construct proteodata from scratch\n",
    "- How to use `is_proteodata()` and `check_proteodata()` to validate your data\n",
    "- Common pitfalls that break the format, and how to avoid them\n",
    "\n",
    "**Prerequisites:** Basic familiarity with `AnnData` (observations, variables, `.obs`, `.var`, `.X`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from anndata import AnnData\n",
    "import proteopy as pr\n",
    "\n",
    "from proteopy.utils.anndata import is_proteodata, check_proteodata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 1. Anatomy of a proteodata object\n",
    "\n",
    "ProteoPy stores proteomics data in the [AnnData](https://anndata.readthedocs.io/) format, where:\n",
    "\n",
    "| Slot | Content |\n",
    "|------|---------|\n",
    "| `.X` | Intensity matrix (samples x proteins/peptides), containing `float`, `int`, or `np.nan` |\n",
    "| `.obs` | Sample metadata — **must** include a `sample_id` column |\n",
    "| `.var` | Protein / peptide metadata — **must** include `protein_id` (and `peptide_id` for peptide-level data) |\n",
    "| `.obs_names` | Sample index — must be unique |\n",
    "| `.var_names` | Protein/peptide index — must be unique |\n",
    "\n",
    "The `is_proteodata()` function checks all of these assumptions and returns a tuple: `(True, \"protein\")`, `(True, \"peptide\")`, or `(False, None)`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 2. Building a valid protein-level proteodata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AnnData object with n_obs × n_vars = 3 × 4\n",
       "    obs: 'sample_id'\n",
       "    var: 'protein_id'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# -- Sample metadata --\n",
    "sample_ids = [\"sample_A\", \"sample_B\", \"sample_C\"]\n",
    "obs = pd.DataFrame({\"sample_id\": sample_ids}, index=sample_ids)\n",
    "\n",
    "# -- Protein metadata --\n",
    "protein_ids = [\"P12345\", \"Q67890\", \"O11223\", \"P44556\"]\n",
    "var = pd.DataFrame({\"protein_id\": protein_ids}, index=protein_ids)\n",
    "\n",
    "# -- Intensity matrix (3 samples x 4 proteins) --\n",
    "X = np.array([\n",
    "    [100.0, 200.0,  50.0, 300.0],\n",
    "    [110.0, np.nan, 55.0, 280.0],\n",
    "    [ 95.0, 210.0,  48.0, 310.0],\n",
    "])\n",
    "\n",
    "adata_protein = AnnData(X=X, obs=obs, var=var)\n",
    "adata_protein"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(True, 'protein')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "is_proteodata(adata_protein)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The key rules for protein-level proteodata:\n",
    "\n",
    "1. `.var[\"protein_id\"]` must exist\n",
    "2. `.var[\"protein_id\"]` values must exactly match `.var_names` (same values, same order)\n",
    "3. `.obs[\"sample_id\"]` must exist\n",
    "4. All indices must be unique\n",
    "5. `.X` must not contain `np.inf` or `-np.inf`\n",
    "6. `protein_id` must not contain `NaN`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 3. Building a valid peptide-level proteodata\n",
    "\n",
    "Peptide-level data requires an additional `peptide_id` column in `.var`. Each peptide must map to exactly **one** protein."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(True, 'peptide')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# -- Sample metadata --\n",
    "sample_ids = [\"sample_A\", \"sample_B\"]\n",
    "obs = pd.DataFrame({\"sample_id\": sample_ids}, index=sample_ids)\n",
    "\n",
    "# -- Peptide metadata --\n",
    "# Two peptides from PROT_X and one from PROT_Y\n",
    "peptide_ids = [\"PEPTIDE_1\", \"PEPTIDE_2\", \"PEPTIDE_3\"]\n",
    "protein_ids = [\"PROT_X\", \"PROT_X\", \"PROT_Y\"]\n",
    "var = pd.DataFrame(\n",
    "    {\"peptide_id\": peptide_ids, \"protein_id\": protein_ids},\n",
    "    index=peptide_ids,\n",
    ")\n",
    "\n",
    "# -- Intensity matrix (2 samples x 3 peptides) --\n",
    "X = np.array([\n",
    "    [500.0, 300.0, 800.0],\n",
    "    [520.0, 310.0, 790.0],\n",
    "])\n",
    "\n",
    "adata_peptide = AnnData(X=X, obs=obs, var=var)\n",
    "is_proteodata(adata_peptide)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The additional rules for peptide-level proteodata:\n",
    "\n",
    "1. `.var[\"peptide_id\"]` must exist and match `.var_names` exactly\n",
    "2. `.var[\"protein_id\"]` must exist — every peptide needs a parent protein\n",
    "3. Each peptide maps to **exactly one** protein (no multi-mapping like `\"PROT_A;PROT_B\"`)\n",
    "4. Neither `peptide_id` nor `protein_id` may contain `NaN`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 4. `is_proteodata` vs `check_proteodata`\n",
    "\n",
    "ProteoPy provides two validation functions:\n",
    "\n",
    "| Function | On failure | Use case |\n",
    "|----------|------------|----------|\n",
    "| `is_proteodata(adata)` | Returns `(False, None)` | Conditional logic — \"is this proteodata?\" |\n",
    "| `check_proteodata(adata)` | Raises `ValueError` | Guard clauses — \"this *must* be proteodata\" |\n",
    "\n",
    "`is_proteodata` also accepts `raise_error=True` to behave like `check_proteodata`.\n",
    "\n",
    "Both accept a `layers` parameter to additionally validate layer matrices for infinite values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Valid: True, Level: protein\n"
     ]
    }
   ],
   "source": [
    "# is_proteodata: soft check — returns a tuple\n",
    "result = is_proteodata(adata_protein)\n",
    "print(f\"Valid: {result[0]}, Level: {result[1]}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Validation passed!\n"
     ]
    }
   ],
   "source": [
    "# check_proteodata: hard check — raises on failure\n",
    "try:\n",
    "    check_proteodata(adata_protein)\n",
    "    print(\"Validation passed!\")\n",
    "except ValueError as e:\n",
    "    print(f\"Validation failed: {e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 5. Pitfalls: how valid proteodata can break\n",
    "\n",
    "Even if you start with a valid proteodata object, common operations can silently break the format. We define a single `adata` below and work with deep copies (`adata.copy()`) in each example so the original remains untouched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Starting point: (True, 'protein')\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "AnnData object with n_obs × n_vars = 3 × 4\n",
       "    obs: 'sample_id'\n",
       "    var: 'protein_id'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# One adata for all pitfall demonstrations (3 samples x 4 proteins)\n",
    "obs = pd.DataFrame(\n",
    "    {\"sample_id\": [\"s1\", \"s2\", \"s3\"]},\n",
    "    index=[\"s1\", \"s2\", \"s3\"],\n",
    ")\n",
    "proteins = [\"PROT_A\", \"PROT_B\", \"PROT_C\", \"PROT_D\"]\n",
    "var = pd.DataFrame(\n",
    "    {\"protein_id\": proteins},\n",
    "    index=proteins,\n",
    ")\n",
    "X = np.array([\n",
    "    [100.0,  0.0, 50.0, 200.0],\n",
    "    [200.0, 50.0, 50.0, 300.0],\n",
    "    [150.0, 80.0, 50.0, 250.0],\n",
    "])\n",
    "\n",
    "adata = AnnData(X=X, obs=obs, var=var)\n",
    "print(\"Starting point:\", is_proteodata(adata))\n",
    "adata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1 Renaming an index without updating the corresponding metadata column\n",
    "\n",
    "If you rename an index such as `.var_names` in a protein-level proteodata object, the `protein_id` column no longer matches (a proteodata format requirement)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "After: (False, None)\n"
     ]
    }
   ],
   "source": [
    "adata_mod = adata.copy()\n",
    "\n",
    "# Rename the index to gene names\n",
    "adata_mod.var_names = [\"GeneA\", \"GeneB\", \"GeneC\", \"GeneD\"]\n",
    "\n",
    "print(\"After:\", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `protein_id` column still contains `[\"PROT_A\", \"PROT_B\", \"PROT_C\", \"PROT_D\"]`, but the index now says `[\"GeneA\", \"GeneB\", \"GeneC\", \"GeneD\"]`. To fix this, always update **both** the index and the ID column together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Repaired: (True, 'protein')\n"
     ]
    }
   ],
   "source": [
    "# Update protein_id to match the new index\n",
    "adata_mod.var[\"protein_id\"] = adata_mod.var_names\n",
    "\n",
    "print(\"Repaired:\", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same applies to renaming:\n",
    "- `.var_names` of a peptide-level proteodata object, where `.var[\"peptide_id\"]` must be kept in sync.\n",
    "- `.obs_names`, where `.obs[\"sample_id\"]` must be kept in sync."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 Renaming `protein_id`, `peptide_id`, or `sample_id` without updating the corresponding index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conversely, if `protein_id` or `peptide_id` are modified, `.var_names` must be updated to match. The same holds for `sample_id` and `.obs_names`. Here we demonstrate with `sample_id`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "obs_names: ['s1', 's2', 's3']\n",
      "sample_id: ['new_s1', 'new_s2', 'new_s3']\n",
      "After: (False, None)\n"
     ]
    }
   ],
   "source": [
    "adata_mod = adata.copy()\n",
    "\n",
    "# Rename sample_id — obs_names is now stale\n",
    "adata_mod.obs[\"sample_id\"] = [\"new_s1\", \"new_s2\", \"new_s3\"]\n",
    "\n",
    "print(\"obs_names:\", list(adata_mod.obs_names))\n",
    "print(\"sample_id:\", list(adata_mod.obs[\"sample_id\"]))\n",
    "\n",
    "print(\"After:\", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "obs_names: ['new_s1', 'new_s2', 'new_s3']\n",
      "sample_id: ['new_s1', 'new_s2', 'new_s3']\n",
      "Repaired: (True, 'protein')\n"
     ]
    }
   ],
   "source": [
    "# Update obs_names to match\n",
    "adata_mod.obs_names = [\"new_s1\", \"new_s2\", \"new_s3\"]\n",
    "\n",
    "print(\"obs_names:\", list(adata_mod.obs_names))\n",
    "print(\"sample_id:\", list(adata_mod.obs[\"sample_id\"]))\n",
    "\n",
    "print(\"Repaired:\", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3 Infinite values from mathematical operations\n",
    "\n",
    "Infinite values can easily arise from common data transformations. A typical example is log-transforming data that contains zeros:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Current X:\n",
      "[[100.   0.  50. 200.]\n",
      " [200.  50.  50. 300.]\n",
      " [150.  80.  50. 250.]]\n"
     ]
    }
   ],
   "source": [
    "adata_mod = adata.copy()\n",
    "\n",
    "# Our matrix contains a zero in PROT_B (from the initial setup)\n",
    "print(\"Current X:\")\n",
    "print(adata_mod.X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Before log2: (True, 'protein')\n"
     ]
    }
   ],
   "source": [
    "print(\"\\nBefore log2:\", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Matrix after log2:\n",
      "[[6.64385619       -inf 5.64385619 7.64385619]\n",
      " [7.64385619 5.64385619 5.64385619 8.22881869]\n",
      " [7.22881869 6.32192809 5.64385619 7.96578428]]\n",
      "\n",
      "After log2: (False, None)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_2256571/3414031841.py:2: RuntimeWarning: divide by zero encountered in log2\n",
      "  adata_mod.X = np.log2(adata_mod.X)\n"
     ]
    }
   ],
   "source": [
    "# log2(0) = -inf!\n",
    "adata_mod.X = np.log2(adata_mod.X)\n",
    "print(\"Matrix after log2:\")\n",
    "print(adata_mod.X)\n",
    "\n",
    "print(\"\\nAfter log2:\", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AnnData.X contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.\n"
     ]
    }
   ],
   "source": [
    "# The detailed error message tells you what went wrong\n",
    "try:\n",
    "    is_proteodata(adata_mod, raise_error=True)\n",
    "except ValueError as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Valid again: (True, 'protein')\n"
     ]
    }
   ],
   "source": [
    "# Fix: replace zeros with nan before log-transforming\n",
    "adata_mod = adata.copy()\n",
    "adata_mod.X[adata_mod.X == 0] = np.nan\n",
    "adata_mod.X = np.log2(adata_mod.X)  # log2(NaN) = NaN — safe\n",
    "\n",
    "print(\"\\nValid again:\", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another source of division by zero is normalization. For example, variance scaling divides each protein's intensities by its standard deviation. If a protein has identical intensities across all samples, its standard deviation is zero:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Per-protein std: [40.82482905 32.99831646  0.         40.82482905]\n",
      "PROT_C has std = 0 — division by zero!\n",
      "\n"
     ]
    }
   ],
   "source": [
    "adata_mod = adata.copy()\n",
    "\n",
    "# Variance scaling: x / std (per protein)\n",
    "stds = np.std(adata_mod.X, axis=0)\n",
    "print(\"Per-protein std:\", stds)\n",
    "print(\"PROT_C has std = 0 — division by zero!\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Scaled X:\n",
      "[[2.44948974 0.                inf 4.89897949]\n",
      " [4.89897949 1.51522882        inf 7.34846923]\n",
      " [3.67423461 2.42436611        inf 6.12372436]]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_2256571/3963694927.py:1: RuntimeWarning: divide by zero encountered in divide\n",
      "  adata_mod.X = adata_mod.X / stds\n"
     ]
    }
   ],
   "source": [
    "adata_mod.X = adata_mod.X / stds\n",
    "\n",
    "print(\"Scaled X:\")\n",
    "print(adata_mod.X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "After var-scaling: (False, None)\n"
     ]
    }
   ],
   "source": [
    "print(\"After var-scaling:\", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One fix is to remove zero-variance proteins before scaling:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Removed 1 variables.\n",
      "\n",
      "Subset and scaled X:\n",
      "[[2.44948974 0.         4.89897949]\n",
      " [4.89897949 1.51522882 7.34846923]\n",
      " [3.67423461 2.42436611 6.12372436]]\n"
     ]
    }
   ],
   "source": [
    "adata_mod = adata.copy()\n",
    "\n",
    "pr.pp.remove_zero_variance_vars(adata_mod)\n",
    "\n",
    "stds = np.std(adata_mod.X, axis=0)\n",
    "adata_mod.X = adata_mod.X / stds\n",
    "\n",
    "print(\"\\nSubset and scaled X:\")\n",
    "print(adata_mod.X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Valid again: (True, 'protein')\n"
     ]
    }
   ],
   "source": [
    "print(\"Valid again:\", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**  \n",
    "> Z-score normalization with zero-variance variables yields `np.nan`, which is allowed in the proteodata format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Normalized X:\n",
      "[[-1.22474487 -1.31319831         nan -1.22474487]\n",
      " [ 1.22474487  0.20203051         nan  1.22474487]\n",
      " [ 0.          1.1111678          nan  0.        ]]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_2256571/1869068870.py:6: RuntimeWarning: invalid value encountered in divide\n",
      "  adata_mod.X = (adata_mod.X - means) / stds\n"
     ]
    }
   ],
   "source": [
    "adata_mod = adata.copy()\n",
    "\n",
    "# Z-score normalization: x / std (per protein)\n",
    "stds = np.std(adata_mod.X, axis=0)\n",
    "means = np.mean(adata_mod.X, axis=0)\n",
    "adata_mod.X = (adata_mod.X - means) / stds\n",
    "\n",
    "print(\"Normalized X:\")\n",
    "print(adata_mod.X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "is_proteodata: (True, 'protein')\n"
     ]
    }
   ],
   "source": [
    "print(\"\\nis_proteodata:\", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PROT_C is now entirely NaN (0 / 0). While NaN passes proteodata validation, the values are meaningless.\n",
    "With a nonzero numerator, division by zero would produce inf and fail validation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 6. Validating layers\n",
    "\n",
    "ProteoPy functions sometimes store intermediate results in `adata.layers` (e.g., raw intensities before transformation). Both `is_proteodata` and `check_proteodata` accept a `layers` parameter to validate those matrices as well.\n",
    "\n",
    "We continue with a fresh copy of the same `adata` from section 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Without layers check: (True, 'protein')\n"
     ]
    }
   ],
   "source": [
    "adata_mod = adata.copy()\n",
    "\n",
    "print(\"Without layers check:\", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ignoring layers:     (True, 'protein')\n"
     ]
    }
   ],
   "source": [
    "# Add a layer with an infinite value\n",
    "bad_layer = adata_mod.X.copy()\n",
    "bad_layer[0, 0] = np.inf\n",
    "adata_mod.layers[\"transformed\"] = bad_layer\n",
    "\n",
    "# Still passes without the layers parameter\n",
    "print(\"Ignoring layers:    \", is_proteodata(adata_mod))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Checking layer:      (False, None)\n",
      "adata.layers['transformed'] contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.\n"
     ]
    }
   ],
   "source": [
    "# Fails when we explicitly check that layer\n",
    "print(\"Checking layer:     \", is_proteodata(adata_mod, layers=\"transformed\"))\n",
    "\n",
    "try:\n",
    "    check_proteodata(adata_mod, layers=\"transformed\")\n",
    "except ValueError as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "adata.layers['transformed'] contains infinite values (np.inf or -np.inf). Please remove or replace infinite values before proceeding.\n"
     ]
    }
   ],
   "source": [
    "# You can check multiple layers at once\n",
    "adata_mod.layers[\"raw\"] = adata_mod.X.copy()  # this one is fine\n",
    "\n",
    "try:\n",
    "    check_proteodata(adata_mod, layers=[\"raw\", \"transformed\"])\n",
    "except ValueError as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 7. Quick-reference checklist\n",
    "\n",
    "Use this checklist when constructing or debugging a proteodata object:\n",
    "\n",
    "| # | Check | Applies to |\n",
    "|---|-------|------------|\n",
    "| 1 | `.obs[\"sample_id\"]` exists | Both |\n",
    "| 2 | `.var[\"protein_id\"]` exists | Both |\n",
    "| 3 | `.var[\"peptide_id\"]` exists and matches `.var_names` | Peptide only |\n",
    "| 4 | `.var[\"protein_id\"]` matches `.var_names` | Protein only |\n",
    "| 5 | Each peptide maps to exactly one protein | Peptide only |\n",
    "| 6 | No `NaN` in `protein_id` or `peptide_id` | Both |\n",
    "| 7 | No `np.inf` or `-np.inf` in `.X` or checked layers | Both |\n",
    "| 8 | All indices (obs, var) are unique | Both |\n",
    "| 9 | `.X` is 2-dimensional | Both |\n",
    "| 10 | `protein_id`/`peptide_id` not in `.obs`; `sample_id` not in `.var` | Both |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Summary\n",
    "\n",
    "The proteodata format is a thin but strict contract on top of AnnData that ensures ProteoPy functions can safely assume:\n",
    "\n",
    "- **Who are the samples?** — defined by `.obs[\"sample_id\"]`\n",
    "- **What are the variables?** — defined by `.var[\"protein_id\"]` (protein-level) or `.var[\"peptide_id\"]` + `.var[\"protein_id\"]` (peptide-level)\n",
    "- **Is the data clean?** — no infinite values, no NaN identifiers, no duplicates\n",
    "\n",
    "Always validate with `check_proteodata()` after constructing or modifying an AnnData. ProteoPy does this automatically in every public function, so you will get a clear error message if something is wrong."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "proteopy-usage1",
   "language": "python",
   "name": "proteopy-usage1"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}