mzIdentML Conformance to MCP Guidelines

This table lists each point in the Molecular and Cellular Proteomics (MCP) publication guidelines for the analysis and documentation of peptide and protein identifications. For each point, it gives the XPath and/or controlled vocabulary (CV) term in mzIdentML that provides the required information.

Do not edit this page directly, because the online editor is useless for tables. The source for this page is under svn here. Edit the source HTML and copy/paste it here.

The MCP document is available here

MCP Section Item xPath (under mzIdentML) Notes a b c d e f g h i j
1 The method and/or program (including version number) used to create the "peak list" from the raw data and the parameters used in the creation of this peak list.   This is outside the scope of mzIdentML. If the source data is in mzML or mzData format, then this information should be available from there.
The name and version of the program(s) used for database searching and the values of search parameters. Examples include precursor-ion mass tolerance, fragment-ion mass tolerance, modifications allowed for, any missed cleavages, protein cleavage chemistry (if any), etc. AnalysisSoftwareList/AnalysisSoftware/name  
  na na
AnalysisProtocolCollection/SpectrumIdentificationProtocol/Enzymes missed cleavages, protein cleavage chemistry
AnalysisProtocolCollection/SpectrumIdentificationProtocol/ModificationParams/SearchModification Using the PSI-MS names available from Unimod na na na na
The name and version of the sequence database(s) used. If a database was compiled in-house, a complete description of the source of the sequences is required. The number of entries actually searched from each database should be included. Authors should justify the use of a very small database or database that excludes common contaminants, since this may generate misleading assignments. DataCollection/Inputs/SearchDatabase/DatabaseName
DataCollection/Inputs/SearchDatabase/numDatabaseSequences   na
Methods used to interpret MS/MS data, thresholds and values specific to judging certainty of identification, whether any statistical analysis was applied to validate the results, and a description of how applied. AnalysisProtocolCollection/SpectrumIdentificationProtocol/AdditionalSearchParams/cvParam e.g. Minimum scores or expect values                  
For example, MS:1001316, mascot:SigThreshold    
AnalysisProtocolCollection/ProteinDetectionProtocol/AnalysisParams/cvParam e.g. Minimum number of peptides per protein. na     na
For large scale experiments, provide the results of any additional statistical analyses that indicate or establish a measure of identification certainty, or allow a determination of the false-positive rate, e.g., the results of randomized database searches or other computational approaches. AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/PeptideEvidence <PeptideEvidence> elements have an isDecoy attribute. A number of controlled vocabulary terms are available for describing the FDR    
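As a rough illustration of how the isDecoy attribute could feed a false-positive estimate, the following Python sketch counts target and decoy identifications and reports the simple decoys/targets ratio. The sample XML is a hypothetical, heavily simplified fragment (no namespace, most attributes omitted), and both the ratio and the rule "an item is a decoy hit only if all of its PeptideEvidence elements are decoys" are assumptions made for this sketch; neither is prescribed by mzIdentML or by the MCP guidelines.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified fragment. Real mzIdentML files declare an
# XML namespace and carry many more attributes; both are omitted here.
SAMPLE = """<mzIdentML><DataCollection><AnalysisData><SpectrumIdentificationList>
  <SpectrumIdentificationResult>
    <SpectrumIdentificationItem><PeptideEvidence isDecoy="false"/></SpectrumIdentificationItem>
  </SpectrumIdentificationResult>
  <SpectrumIdentificationResult>
    <SpectrumIdentificationItem>
      <PeptideEvidence isDecoy="false"/>
      <PeptideEvidence isDecoy="true"/>
    </SpectrumIdentificationItem>
  </SpectrumIdentificationResult>
  <SpectrumIdentificationResult>
    <SpectrumIdentificationItem><PeptideEvidence isDecoy="true"/></SpectrumIdentificationItem>
  </SpectrumIdentificationResult>
  <SpectrumIdentificationResult>
    <SpectrumIdentificationItem><PeptideEvidence isDecoy="false"/></SpectrumIdentificationItem>
  </SpectrumIdentificationResult>
</SpectrumIdentificationList></AnalysisData></DataCollection></mzIdentML>"""

ITEM_PATH = (
    "DataCollection/AnalysisData/SpectrumIdentificationList/"
    "SpectrumIdentificationResult/SpectrumIdentificationItem"
)


def count_targets_and_decoys(root):
    """Return (targets, decoys) over all SpectrumIdentificationItem elements.

    Assumption: an item counts as a decoy hit only if every one of its
    PeptideEvidence children has isDecoy="true".
    """
    targets = decoys = 0
    for item in root.findall(ITEM_PATH):
        flags = [pe.get("isDecoy", "false") == "true"
                 for pe in item.findall("PeptideEvidence")]
        if not flags:
            continue  # no evidence recorded, so nothing to classify
        if all(flags):
            decoys += 1
        else:
            targets += 1
    return targets, decoys


def estimate_fdr(targets, decoys):
    """Simple decoys/targets ratio; returns None when there are no targets."""
    return decoys / targets if targets else None


if __name__ == "__main__":
    t, d = count_targets_and_decoys(ET.fromstring(SAMPLE))
    print("targets:", t, "decoys:", d, "estimated FDR:", estimate_fdr(t, d))
```

In practice the items would normally be filtered by score threshold before counting, and a file that already reports an FDR through CV terms should be preferred over a recomputed value.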
2 Information for each protein sequence identified should specify the following:
Accession number and database source SequenceCollection/DBSequence/accession Each protein result has an id which identifies the database and accession
score(s) and any associated statistical information obtained for searches conducted; DataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/cvParam Scores etc. are different for each search engine. na na na
sequence coverage, expressed as the number of amino acids spanned by the assigned peptides divided by the sequence length; DataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/cvParam
the total number of peptides assigned to the protein. To compute this number, multiple matches to peptides with the same primary sequence count as one, even if they represent different charge states or modification states; DataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/cvParam
accession="MS:1001097" or "MS:1001098"
See definitions for MS:1001097 and MS:1001098    
In addition to the above information, for single peptide-based identifications the following data should be provided:
peptide sequence, noting any deviation from the expected protein cleavage specificity; SequenceCollection/Peptide/peptideSequence Does not show deviation from expected cleavage specificity. This would need to be derived.
modifications SequenceCollection/Peptide/Modification Complete information about which sites are modified is available na na na
precursor mass, charge and mass error observed; DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/chargeState Charge state  
DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/experimentalMassToCharge Precursor mass  
Mass error observed  
score(s) and any associated statistical information; DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/cvParam   na
MS/MS spectrum annotated with masses observed as well as fragment assignments AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/Fragmentation/IonType/FragmentArray This contains the information required to be able to generate an annotated spectrum     na      
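The precursor mass and the observed mass error are not stored directly, but they can be derived from the attributes of SpectrumIdentificationItem. The Python sketch below shows one way to do this. It assumes that a calculatedMassToCharge attribute is present alongside experimentalMassToCharge (it is not listed in the table above), uses a hypothetical namespace-free sample, and treats the precursor as protonated, which may not hold for every experiment.

```python
import xml.etree.ElementTree as ET

PROTON_MASS = 1.007276466  # Da; assumes protonated precursor ions

# Hypothetical, simplified fragment (namespace and other attributes omitted).
SAMPLE = """<mzIdentML><DataCollection><AnalysisData><SpectrumIdentificationList>
  <SpectrumIdentificationResult>
    <SpectrumIdentificationItem chargeState="2"
        experimentalMassToCharge="500.2600"
        calculatedMassToCharge="500.2590"/>
  </SpectrumIdentificationResult>
</SpectrumIdentificationList></AnalysisData></DataCollection></mzIdentML>"""

ITEM_PATH = (
    "DataCollection/AnalysisData/SpectrumIdentificationList/"
    "SpectrumIdentificationResult/SpectrumIdentificationItem"
)


def precursor_summary(item):
    """Derive charge, neutral precursor mass and mass error for one item."""
    z = int(item.get("chargeState"))
    exp_mz = float(item.get("experimentalMassToCharge"))
    calc_mz = float(item.get("calculatedMassToCharge"))  # assumed to be present
    return {
        "charge": z,
        # Neutral mass from observed m/z: remove one proton per charge.
        "neutral_mass": (exp_mz - PROTON_MASS) * z,
        # Mass error on the m/z scale, absolute and in parts per million.
        "error_mz": exp_mz - calc_mz,
        "error_ppm": (exp_mz - calc_mz) / calc_mz * 1e6,
    }


if __name__ == "__main__":
    for item in ET.fromstring(SAMPLE).findall(ITEM_PATH):
        print(precursor_summary(item))
```

Reporting both the charge and the derived neutral mass addresses the guideline's request for precursor mass and charge rather than m/z alone.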
3 Additional potentially valuable information could include the retention time of each peptide, the observation of multiple charge states, multiple observations of the same peptide, flanking residues, start and end positions of peptides in proteins, and any platform-specific information.   The retention time may be available from the mzML, mzData or MGF file that is referenced. na na na na na na na na na na
  Multiple observations are permitted, so it is trivial to extract information about the observation of multiple charge states and multiple observations of the same peptide. na na
AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/PeptideEvidence/pre & post Flanking residues     na      
DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/PeptideEvidence/start & end Start and end positions
4 Manuscripts presenting any conclusions citing quantitative proteomic results should contain the following information:   Out of scope - will be included in mzQuantML                    
5 Studies focusing on posttranslational modifications
... require specialized methodology and documentation to assign the presence and the site(s) of modification. Certain modifications are also nominally isobaric (e.g., acetylation vs trimethylation, phosphorylation vs sulfation). If one of these modifications is being reported, then evidence for assigning a specific modification over another must be presented. Examples of methods able to distinguish between these include mass spectrometric approaches such as accurate mass determination, observation of signature fragment ions (e.g. m/z 79 vs m/z 80 in negative ion mode for assignment of phosphorylation over sulfation), or biological or chemical strategies.   Not currently supported                    
In the tabular presentation of the data, authors are required to show 1) the sequence of the peptide used to make each such assignment, 2) the precursor mass and charge (not just m/z) observed, and 3) the search engine score for this peptide. Frequently more than one possible site of modification exists within a given peptide sequence. Assignment of specific site(s) of modification requires observation of fragment ions that distinguish among the possible sites. When ambiguity with regard to the modification site cannot be resolved, then the ambiguity must be explicitly shown in the tables (e.g., ALEG(sss)YLLK where one of the three Ser residues in parentheses is phosphorylated, but the spectra do not permit assignment of which one). The number of detected modifications in each peptide (e.g., 1, 2 or 3 phosphates) must also be included in the table.   As described above. To produce a table, a simple XSLT stylesheet could be written, for example. na na na na
In all cases involving the assignment of a posttranslational modification(s), we require that copies of the annotated, mass labeled spectra for those modified peptides be submitted electronically together with the manuscript for review purposes. Authors are required to present representative spectra of posttranslationally modified peptides in the body of the text and the remaining annotated spectra as supplemental material. In addition, authors are encouraged to provide the corresponding peak (m/z and intensity) lists for review. AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/Fragmentation/IonType/FragmentArray This contains the information required to be able to generate an annotated spectrum     na
6 While more reliable results for peptide identification are generally produced by MS/MS data, in selected circumstances, such as analysis of 2D gel spots, peptide mass fingerprinting can be an effective choice of technique for protein identification. For each identification, an annotated mass spectrum must be supplied. We also encourage the submission of the peak lists for review. In the tabular presentation of the results the authors must supply: 1) the number of matched peaks, 2) the number of unmatched peaks, and 3) the sequence coverage. Some probability value attached to this identification should be presented; for example, in addition to the score for the top match the score for the highest ranked hit to a non-homologous protein. They must describe the parameters and thresholds used to analyze the data (see guideline 1, above), including mass accuracy, resolution, means of calibrating each spectrum, and exclusion of known contaminant ions (keratin, etc.). Authors are required to use and provide the results of scoring schemes that provide a measure of identification certainty, or perform some measure of the false-positive rate.   As described above.
The number of unmatched peaks is recorded using the CV term: accession="MS:1001362" name="number of unmatched peaks"
na na na na na na na na na
7 Identical peptide sequences can be included in multiple unique protein sequences due to biological variation such as single amino acid variants, alternative splice forms, homologs, orthologs and paralogs. Other reasons for apparent redundancy in protein sequence database entries are the inclusion of sequence fragments and sequences with errors. Apparent redundancy can also occur due to clerical errors arising from the merger of multiple sequence databases or identical protein sequences appearing under different names or accession numbers.

Experimental strategies based on proteolytic digestion of protein mixtures introduce the complication of loss of connectivity between peptides and their protein precursors. Assignment of peptide sequences results in two outcomes; distinct peptides that map to only one protein sequence or shared peptides that map to more than one protein sequence. Detection of shared peptides introduces an uncertainty between the possibility that a shared peptide can be mapped to more than one protein sequence (bioinformatics redundancy) versus the possibility that more than one precursor is in the original protein mixture (physical redundancy). The apparent ambiguity in peptide assignment requires reporting of a protein group. When assembling peptides into proteins and protein groups, authors should adhere to principles of parsimony, i.e., describe the minimum set of protein sequences that adequately accounts for all observed peptides. While the identification of shared peptides implies that multiple related protein sequences are present, the initial assumption should be that only a single form is being detected. Authors should explain and be able to justify cases where a single protein from a protein group has been singled out or that more than one member of a protein group is present. When reporting a summary list of peptides belonging to each protein group, peptides shared among multiple proteins and those unique to a specific protein should be clearly indicated. In addition, sometimes proteins are identified from a different species than the one being studied. For example, identification of a mouse or human protein in a hamster study. If such an orthologous protein is included, the circumstances should be mentioned and justified.
  mzIdentML does not require that the "parsimony" principle be applied. It does, however, allow protein groups to be represented and fully described using the ProteinAmbiguityGroup and ProteinDetectionHypothesis elements.
8 It is strongly encouraged (but not yet required) that all MS/MS spectra mentioned in the paper be submitted as supplemental material. Journals will vary in their ability to handle this information and authors are encouraged to provide access to raw MS data using other means, including group websites and public repositories, as they emerge, in addition to the journal itself.   Not applicable
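
To make the protein-level entries in sections 2 and 7 more concrete, the following Python sketch walks the ProteinAmbiguityGroup and ProteinDetectionHypothesis elements, collects the cvParam values attached to each hypothesis, and flags groups that contain more than one hypothesis. The sample XML, its id values and its cvParam values are hypothetical and simplified (no namespace, and cvParam attributes other than accession and value are omitted); the helper names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment (namespace and most attributes omitted).
SAMPLE = """<mzIdentML><DataCollection><AnalysisData><ProteinDetectionList>
  <ProteinAmbiguityGroup id="PAG_1">
    <ProteinDetectionHypothesis id="PDH_A">
      <cvParam accession="MS:1001097" value="4"/>
    </ProteinDetectionHypothesis>
    <ProteinDetectionHypothesis id="PDH_B">
      <cvParam accession="MS:1001097" value="3"/>
    </ProteinDetectionHypothesis>
  </ProteinAmbiguityGroup>
  <ProteinAmbiguityGroup id="PAG_2">
    <ProteinDetectionHypothesis id="PDH_C">
      <cvParam accession="MS:1001097" value="2"/>
    </ProteinDetectionHypothesis>
  </ProteinAmbiguityGroup>
</ProteinDetectionList></AnalysisData></DataCollection></mzIdentML>"""

GROUP_PATH = "DataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup"


def protein_groups(root):
    """Map each group id to a list of (hypothesis id, {accession: value})."""
    groups = {}
    for group in root.findall(GROUP_PATH):
        hypotheses = []
        for hyp in group.findall("ProteinDetectionHypothesis"):
            params = {cv.get("accession"): cv.get("value")
                      for cv in hyp.findall("cvParam")}
            hypotheses.append((hyp.get("id"), params))
        groups[group.get("id")] = hypotheses
    return groups


def ambiguous_groups(groups):
    """Return ids of groups holding more than one protein hypothesis."""
    return [gid for gid, hyps in groups.items() if len(hyps) > 1]


if __name__ == "__main__":
    groups = protein_groups(ET.fromstring(SAMPLE))
    for gid, hyps in groups.items():
        print(gid, hyps)
    print("groups needing justification:", ambiguous_groups(groups))
```

Groups flagged this way are the ones where, under the guideline, authors would be expected to justify singling out one protein or claiming that several members are present.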