MaxQuant -- Contaminant Database
================================
.. bibliography::
:filter: key == "cox-2008"
.. rubric:: Overview
`MaxQuant `__ is a widely used quantitative
proteomics software platform :cite:p:`cox-2008`. It ships with a
built-in contaminant FASTA file that is automatically appended to
search databases during analysis. The contaminant list is not
independently published or documented outside of the MaxQuant
distribution.
The MaxQuant contaminant file contains **246 protein entries**
(MaxQuant_v2.7.5.0) and uses a non-standard FASTA header format (see
:ref:`maxquant-header-format` below).
.. rubric:: Library Composition
The entries can be broadly categorized as:
.. list-table::
:widths: 50 10
:header-rows: 1
* - Category
- Entries
* - Bovine serum and tissue proteins (predominantly serum albumin,
caseins, immunoglobulins, and other plasma proteins)
- 116
* - Keratins and keratin-associated proteins (human, mouse)
- 109
* - Proteolytic enzymes and laboratory reagents (trypsin,
chymotrypsinogen, Lys-C, Glu-C, Asp-N, pepsin, nuclease,
streptavidin)
- 12
* - Other (dermokine, hornerin, filaggrin,
and unannotated Ensembl/RefSeq entries)
- 7
* - Fluorescent proteins (GFP, YFP)
- 2
.. rubric:: Organism Breakdown
.. list-table::
:widths: 50 10
:header-rows: 1
* - Organism
- Proteins
* - *Bos taurus* (serum, tissue, and reagent proteins)
- 121
* - *Homo sapiens* (keratins, dermokine)
- 75
* - *Mus musculus* (mouse keratins)
- 32
* - Other organisms (enzymes, fluorescent proteins, viral)
- 18
.. rubric:: Obtaining the File
ProteoPy does **not** bundle the MaxQuant contaminant file. Users
must obtain it themselves from the MaxQuant distribution:
1. Read and accept the MaxQuant
`terms and conditions `__
before downloading.
2. Download the latest MaxQuant release from
https://maxquant.org/download_asset/maxquant/latest
3. Unzip the downloaded archive.
4. Copy the contaminant FASTA file from the extracted directory:
.. code-block:: text
MaxQuant_vX.X.X.X/bin/conf/contaminants.fasta
to your project directory.
.. _maxquant-header-format:
.. rubric:: Header Format and usage with ProteoPy
MaxQuant uses a non-standard FASTA header format where the UniProt
accession is the first whitespace-delimited token:
.. code-block:: text
>P00761 SWISS-PROT:P00761|TRYP_PIG Trypsin - Sus scrofa (Pig).
>Q32MB2 TREMBL:Q32MB2;Q86Y46 Tax_Id=9606 Gene_Symbol=KRT73 ...
Because this differs from the standard UniProt header format,
you must pass a custom ``header_parser`` to
:func:`pr.pp.remove_contaminants()
`.
.. code-block:: python
import proteopy as pr
def maxquant_header_parser(header: str) -> str:
"""Extract the UniProt accession from a MaxQuant FASTA header.
"""
return header.split()[0]
pr.pp.remove_contaminants(
adata,
contaminant_path="contaminants.fasta",
header_parser=maxquant_header_parser,
)
.. rubric:: Entries to Consider Removing
Four entries in the MaxQuant contaminant file use non-standard headers
that cannot be mapped to UniProt accessions. These will cause the
``header_parser`` above to return identifiers that do not match any
UniProt protein. Depending on your workflow, you may want to manually
remove them from the FASTA before use:
- ``Streptavidin (S.avidinii)``
- ``H-INV:HIT000016045`` -- Similar to Keratin, type II cytoskeletal 8
- ``H-INV:HIT000292931`` -- Similar to Keratin, type II cytoskeletal 8
- ``H-INV:HIT000015463`` -- Similar to Keratin 18 (gene symbol PTPN14)
.. rubric:: Resources
- **MaxQuant**: `maxquant.org `__