MaxQuant – Contaminant Database

[CM08]

Cox J and Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology, 26(12):1367–1372, 2008. URL: https://www.nature.com/articles/nbt.1511, doi:10.1038/nbt.1511.

Overview

MaxQuant is a widely used quantitative proteomics software platform [Cox and Mann, 2008]. It ships with a built-in contaminant FASTA file that is automatically appended to search databases during analysis. The contaminant list is not independently published or documented outside of the MaxQuant distribution.

The MaxQuant contaminant file contains 246 protein entries (as of release MaxQuant_v2.7.5.0) and uses a non-standard FASTA header format (see "Header Format and usage with ProteoPy" below).
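The entry count can be checked against your own copy of the file, since it may differ between MaxQuant releases. The following is a minimal sketch that counts FASTA records by counting header lines; the function names are illustrative and not part of ProteoPy or MaxQuant.

```python
from pathlib import Path


def count_fasta_entries(lines):
    """Count FASTA records in an iterable of text lines.

    Every record starts with exactly one header line beginning with '>'.
    """
    return sum(1 for line in lines if line.startswith(">"))


def count_fasta_file(fasta_path):
    """Count FASTA records in the file at fasta_path."""
    # errors="replace" keeps the count robust to stray non-UTF-8 bytes.
    with Path(fasta_path).open(encoding="utf-8", errors="replace") as handle:
        return count_fasta_entries(handle)
```

Calling `count_fasta_file("contaminants.fasta")` on the file from the release named above should agree with the total given here; a different number suggests a different release or a modified file.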

Library Composition

The entries can be broadly categorized as:

| Category | Entries |
|---|---|
| Bovine serum and tissue proteins (predominantly serum albumin, caseins, immunoglobulins, and other plasma proteins) | 116 |
| Keratins and keratin-associated proteins (human, mouse) | 109 |
| Proteolytic enzymes and laboratory reagents (trypsin, chymotrypsinogen, Lys-C, Glu-C, Asp-N, pepsin, nuclease, streptavidin) | 12 |
| Other (dermokine, hornerin, filaggrin, and unannotated Ensembl/RefSeq entries) | 7 |
| Fluorescent proteins (GFP, YFP) | 2 |

Organism Breakdown

| Organism | Proteins |
|---|---|
| Bos taurus (serum, tissue, and reagent proteins) | 121 |
| Homo sapiens (keratins, dermokine) | 75 |
| Mus musculus (mouse keratins) | 32 |
| Other organisms (enzymes, fluorescent proteins, viral) | 18 |

Obtaining the File

ProteoPy does not bundle the MaxQuant contaminant file. Users must obtain it themselves from the MaxQuant distribution:

  1. Read and accept the MaxQuant terms and conditions before downloading.

  2. Download the latest MaxQuant release from https://maxquant.org/download_asset/maxquant/latest

  3. Unzip the downloaded archive.

  4. Copy the contaminant FASTA file from the extracted directory to your project directory:

    MaxQuant_vX.X.X.X/bin/conf/contaminants.fasta
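The copy step can also be scripted. The sketch below assumes the archive has already been unzipped and that the extracted directory follows the layout shown above; `copy_contaminants` and its parameters are hypothetical names, not part of ProteoPy or MaxQuant.

```python
import shutil
from pathlib import Path


def copy_contaminants(extracted_root, project_dir):
    """Copy contaminants.fasta from an unzipped MaxQuant release.

    extracted_root: directory that contains the MaxQuant_v... folder.
    project_dir: destination directory (created if missing).
    """
    root = Path(extracted_root)
    # The version folder name varies by release, so match it with a glob.
    matches = sorted(root.glob("MaxQuant_v*/bin/conf/contaminants.fasta"))
    if not matches:
        raise FileNotFoundError(
            f"No MaxQuant_v*/bin/conf/contaminants.fasta found under {root}"
        )
    destination = Path(project_dir) / "contaminants.fasta"
    destination.parent.mkdir(parents=True, exist_ok=True)
    # If several releases are present, the lexicographically last path is used.
    shutil.copy2(matches[-1], destination)
    return destination
```

The function returns the destination path, which can then be used as the contaminant file path in later steps.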

Header Format and usage with ProteoPy

MaxQuant uses a non-standard FASTA header format in which the UniProt accession is the first whitespace-delimited token:

>P00761 SWISS-PROT:P00761|TRYP_PIG Trypsin - Sus scrofa (Pig).
>Q32MB2 TREMBL:Q32MB2;Q86Y46 Tax_Id=9606 Gene_Symbol=KRT73 ...

Because this differs from the standard UniProt header format, you must pass a custom header_parser to pr.pp.remove_contaminants().

import proteopy as pr

def maxquant_header_parser(header: str) -> str:
    """Extract the UniProt accession from a MaxQuant FASTA header."""
    # The accession is the first whitespace-delimited token.
    return header.split()[0]

# adata is the dataset object you have already loaded.
pr.pp.remove_contaminants(
    adata,
    contaminant_path="contaminants.fasta",
    header_parser=maxquant_header_parser,
)
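To see what such a parser returns, it can be run on the two example headers shown above without involving ProteoPy. This is a small sketch: `tolerant_header_parser` is a hypothetical variant that also strips a leading `>`, because this page does not state whether the header is passed to the parser with or without that character.

```python
def tolerant_header_parser(header: str) -> str:
    """Return the first whitespace-delimited token, ignoring a leading '>'."""
    return header.lstrip(">").split()[0]


# The two example headers from this page, copied verbatim
# (the second one is truncated in the source and is kept that way).
EXAMPLE_HEADERS = [
    ">P00761 SWISS-PROT:P00761|TRYP_PIG Trypsin - Sus scrofa (Pig).",
    ">Q32MB2 TREMBL:Q32MB2;Q86Y46 Tax_Id=9606 Gene_Symbol=KRT73 ...",
]

accessions = [tolerant_header_parser(h) for h in EXAMPLE_HEADERS]
print(accessions)  # ['P00761', 'Q32MB2']
```

Only the first token is used, so secondary accessions such as `Q86Y46` in the second header are ignored.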

Entries to Consider Removing

Four entries in the MaxQuant contaminant file have headers that cannot be mapped to UniProt accessions. For these entries, the header_parser above returns identifiers that do not match any UniProt protein. Depending on your workflow, you may want to remove them manually from the FASTA file before use:

  • Streptavidin (S.avidinii)

  • H-INV:HIT000016045 – Similar to Keratin, type II cytoskeletal 8

  • H-INV:HIT000292931 – Similar to Keratin, type II cytoskeletal 8

  • H-INV:HIT000015463 – Similar to Keratin 18 (gene symbol PTPN14)
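Removing these entries by hand is simple enough, but it can also be scripted. The sketch below drops every FASTA record whose header line contains one of a set of marker strings. It assumes that the strings listed above appear verbatim in the corresponding header lines, which should be checked against your copy of the file; the names `UNMAPPABLE_MARKERS`, `drop_fasta_records`, and `write_filtered_fasta` are illustrative only.

```python
from pathlib import Path

# Assumption: these substrings occur verbatim in the four header lines.
UNMAPPABLE_MARKERS = (
    "Streptavidin (S.avidinii)",
    "H-INV:HIT000016045",
    "H-INV:HIT000292931",
    "H-INV:HIT000015463",
)


def drop_fasta_records(lines, markers):
    """Yield FASTA lines, skipping records whose header contains a marker."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            # Decide once per record; sequence lines inherit the decision.
            keep = not any(marker in line for marker in markers)
        if keep:
            yield line


def write_filtered_fasta(src_path, dst_path, markers=UNMAPPABLE_MARKERS):
    """Write a copy of src_path without the records matching markers."""
    with Path(src_path).open(encoding="utf-8", errors="replace") as src, \
            Path(dst_path).open("w", encoding="utf-8") as dst:
        dst.writelines(drop_fasta_records(src, markers))
```

Writing to a new file rather than editing in place keeps the original download intact, so the filtered copy can be regenerated if the marker list changes.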

Resources