Proteomics Informatics

Tools implementing mzIdentML

Status of support for mzIdentML in proteome informatics software

 

Tool | Type | Status / Description | URL | I/E | F/C
Byonic (Protein Metrics Inc.) | Search | Byonic search engine supports mzIdentML 1.1 as an output format | http://www.proteinmetrics.com/products/byonic/ | E | C
Crux | Search | Supports mzIdentML 1.1 as an output format and reads mzIdentML 1.1 to generate spectral count data | http://crux.ms/ | I & E | F
IDPicker | Grouping | Version 3.x implements mzIdentML 1.1 import | https://medschool.vanderbilt.edu/msrc-bioinformatics/software | I | F
IP2 | Search & Quant | Integrated Proteomics Pipeline supports export of results into mzIdentML 1.1 | http://www.integratedproteomics.com/ | E | C
Iquant | Quant | Automated pipeline for quantification using isobaric tags; identification results are imported via mzIdentML 1.1 | https://sourceforge.net/projects/iquant/ | I | F
jmzIdentML | IO | Java API for reading and writing mzIdentML 1.1 | https://github.com/PRIDE-Utilities/jmzIdentML | I & E | F
jPOST | Database | Identification result files can be uploaded in mzIdentML 1.1 | http://jpostdb.org/ | I | F
Mascot (Matrix Science) | Search & Quant | mzIdentML version 1.1 available in Mascot version 2.4+ | http://www.matrixscience.com/ | E | C
MassIVE | Database | Identification files can be uploaded in mzIdentML 1.1 | https://massive.ucsd.edu | I | F
ms-data-core-api | IO | Java API that supports reading of PSI standard and open formats, e.g. mzML, mzIdentML, mzTab, mgf and others | https://github.com/PRIDE-Utilities/ms-data-core-api | I | F
MS-GF+ | Search | Full support for exporting identification results into mzIdentML 1.1 | https://omics.pnl.gov/software/ms-gf | E | F
MyriMatch | Search | Identifications exported in mzIdentML 1.1 | https://medschool.vanderbilt.edu/msrc-bioinformatics/software | E | F
mzID package | IO | R package available through Bioconductor supporting v1.1 | http://www.bioconductor.org/packages/release/bioc/html/mzID.html | I | F
mzidLibrary | Post-processing | Routines and viewer (stats, protein inference, CSV import/export, proteogenomics) supporting v1.1 and 1.2 | https://github.com/PGB-LIV/mzidlib | I & E | F
OMSSA [mzidLib] | Search | Converter from OMSSA .omx files to v1.1 or 1.2 in mzidLibrary | https://github.com/PGB-LIV/mzidlib | E | F
OpenMS | Pipeline | mzIdentML 1.1 fully supported in release 1.9+ | https://www.openms.de/ | I & E | F
PAnalyzer | Grouping | Used for protein grouping; it imports and exports mzIdentML (v1.1 and 1.2) | https://github.com/akrogp/EhuBio/wiki/Panalyzer | I & E | F
PEAKS (Bioinformatics Solutions Inc.) | Search & Quant | Native export of mzIdentML version 1.1 | http://www.bioinfor.com/ | E | C
PeptideShaker | Post-processing | Java stand-alone tool for the analysis and post-processing of proteomics experiments; it supports mzIdentML 1.1 and 1.2 | http://compomics.github.io/projects/peptide-shaker.html | I & E | F
PGA | Proteogenomics | Software for creating RNA-Seq based databases; it supports v1.1 as an input format for post-processing | http://www.bioconductor.org/packages/devel/bioc/html/PGA.html | I | F
PIA | Grouping | Toolbox for protein inference and identification analysis; it supports mzIdentML 1.1 | https://github.com/mpc-bioinformatics/pia | I & E | F
ProteinLynx Global Server | Search & Quant | Peptide/protein identification and quantification software; it supports export to mzIdentML in version 3.0.3+ | http://www.waters.com/waters/en_GB/ProteinLynx-Global-SERVER-%28PLGS%29/... | E | C
PRIDE | Database | mzIdentML 1.1 fully supported as an import format as part of the “complete” dataset submission pipeline | https://www.ebi.ac.uk/pride/archive/ | I | F
PRIDE Inspector | Visualisation | Java stand-alone tool that can be used to visualise mzIdentML 1.1 files, independently or together with the corresponding mass spectra files (available in any open format, e.g. mzML, mzXML, mgf, dta, pkl, and apl) | https://github.com/PRIDE-Toolsuite/pride-inspector | I | F
Progenesis QI for proteomics (Waters Corp.) | Quant | Label-free quantification software that can read identifications from Byonic in mzIdentML 1.1 | http://www.nonlinear.com/progenesis/qi-for-proteomics/ | I | C
ProteinPilot | Search & Quant | ProteinPilot 5.0+ exports search results in mzIdentML version 1.2 | https://sciex.com/products/software/proteinpilot-software | E | C
ProteinScape (Bruker) | Search & Quant | Imports search engine results other than Mascot in mzIdentML 1.1 | https://www.bruker.com/products/mass-spectrometry-and-separations/ms-sof... | I | C
SEQUEST / Proteome Discoverer (Thermo) [m2Lite / ProCon] | Search & Quant | Conversion of msf files from Proteome Discoverer to mzIdentML 1.1 via m2Lite or ProCon (ProCon also supports ProteinScape and Comet conversions) | https://bitbucket.org/paiyetan/m2lite/downloads/ ; http://www.ruhr-uni-bochum.de/mpc/software/ProCon/index.html.en | E | F*
ProteoAnnotator | Proteogenomics | Proteogenomics software that uses mzIdentML 1.1 as its internal file format | http://www.proteoannotator.org/ | E | F
ProteoWizard | IO | pepXML converter available and support for reading/writing mzIdentML 1.1 | http://proteowizard.sourceforge.net/ | I & E | F
Scaffold | Search & Quant | Scaffold 4.0+ supports reading and writing of mzIdentML 1.1 | http://www.proteomesoftware.com/products/scaffold/ | I & E | C
Spectrum Identification Machine for Cross-Linked Peptides (SIM-XL) | Search | Spectrum Identification Machine for cross-linked peptides. PMID: 25638023 | http://patternlabforproteomics.org/sim-xl/ | E | F
Skyline | Quant | SRM/MRM/PRM, DIA and targeted DDA software that can import mzIdentML 1.1 for spectral library construction | https://skyline.ms | I | F
Trans Proteomic Pipeline [ProteoWizard] | Pipeline | pepXML to mzIdentML 1.1 converter available from ProteoWizard | http://proteowizard.sourceforge.net/ | I & E | F
X!Tandem [mzidLib] | Search | Converter from X!Tandem XML files to mzIdentML 1.1 or 1.2 as part of the mzidLibrary | https://github.com/PGB-LIV/mzidlib | E | F
PACom (Proteomics Assay Comparator) | Integration, visualisation and comparison | Tool for the integration, visualization and comparison of multiple datasets. It supports mzIdentML 1.1 and 1.2 | https://github.com/smdb21/PACOM/wiki | I |
DTASelect2MzId | IO | Converter from DTASelect output files to mzIdentML 1.2 | https://github.com/proteomicsyates/DTASelect2MzId | E |


mzIdentML

mzIdentML is one of the standards developed by the Proteomics Informatics working group of the PSI.

For general information on the activities and organization of this working group, see the Proteomics Informatics working group page.

Contents

  1. mzIdentML 1.2.0 (current release)
  2. mzIdentML 1.1.1
  3. mzIdentML 1.1.0: XML Schema, Documentation and Ontology
  4. mzIdentML Tools and Implementations
  5. mzIdentML 1.0.0 (Previous Version): Schema, documentation and ontology

 


mzIdentML 1.2.0 (Released March 2017 - current version of the standard)

Between 2013 and 2017, PSI-PI updated mzIdentML from version 1.1 to 1.2. The main update improves the representation of protein grouping relationships through the use of mandatory CV terms. Minor updates have also been made for capturing pre-fractionation of samples, de novo sequencing and the use of multiple search engines. Specifications have also been added for supporting proteogenomics and cross-linking MS.

 


mzIdentML 1.1.1: XML Schema, Documentation

Released in July 2015, as a minor update to version 1.1.0. This update should be viewed as a "bugfix" update only.
The only change is to ensure that mass deltas encoded in the format are consistently encoded as doubles and not as floats. As of March 2017, both mzIdentML 1.1.1 and 1.2 (see above) will be generally supported for some years, although we strongly encourage new implementers to work with mzIdentML 1.2.

This has resulted in a change to the schema (XSD) and the specification document only. All other resources are unchanged from version 1.1.0.
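
As a rough illustration of why the float-to-double change matters, the sketch below round-trips a mass value through 32-bit and 64-bit IEEE 754 encodings using Python's standard struct module. The value 1500.123456 is an arbitrary illustrative mass, not taken from any example file, and the helper names are hypothetical.

```python
import struct

def as_float32(x: float) -> float:
    """Round-trip x through a 32-bit IEEE 754 float (single precision)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def as_float64(x: float) -> float:
    """Round-trip x through a 64-bit IEEE 754 float (double precision)."""
    return struct.unpack("<d", struct.pack("<d", x))[0]

mass = 1500.123456  # arbitrary illustrative value in Da (assumption)

err32 = abs(as_float32(mass) - mass)  # rounding error introduced by single precision
err64 = abs(as_float64(mass) - mass)  # Python floats are already doubles, so no loss

print(f"float32 round-trip error: {err32:.2e} Da")
print(f"float64 round-trip error: {err64:.2e} Da")
```

Adjacent 32-bit values near 1500 are about 1.2e-4 apart, so a single-precision encoding cannot reliably preserve mass values beyond roughly the fourth decimal place at this magnitude, whereas a double does.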

 


mzIdentML 1.1.0: XML Schema, Documentation and Ontology

Released in August 2011.

More documentation is available in the HUPO-PSI GitHub page at https://github.com/HUPO-PSI/mzIdentML.

Direct Links to deliverables:

  • Example Instance Documents:
    • Mascot MS-MS example - a simple example of 4 MS-MS spectra searched against a protein database.
    • Mascot Nucleic Acid example - an example of a search against an EST database.
    • Mascot Top Down example - a single MS-MS spectrum from a protein.
    • MPC Use case - use peptides from different search engines to assemble proteins with a third-party algorithm;
      false-discovery estimation using a decoy database.
    • OMSSA - example MS-MS search results including decoy matches.
    • PMF Example - example Peptide Mass Fingerprint search.
    • Sequest - a simple example derived from a .out file.
    • X! Tandem - example MS-MS search results including decoy matches.

 


 

mzIdentML Tools and Implementations

The current status of tools that write and import mzIdentML is listed on this page.


 


 

mzIdentML 1.0.0 (Previous Version): Schema, documentation and ontology

 This was the first version of the mzIdentML format, released August 2009. mzIdentML 1.0.0 is NOW DEPRECATED - users should use mzIdentML 1.1.x or 1.2 versions.

mzIdentML was developed as an extension to the Functional Genomics Experiment (FuGE) object model. However, in a change agreed at the PSI Spring Meeting, 2008, the XML schema was developed directly rather than performing the design in UML and converting to XML. A cut-down version of the FuGE xsd has been developed to facilitate this. As a consequence, the UML class diagram in subversion is now out of date.

 


mzIdentML Use Cases

Use Cases for mzIdentML

  1. It should be possible to create a tool that loads an mzIdentML document and enables users to examine results from an MS, MS-MS, MSn or tag search. (For MSn searches, the assumption is that matches will be of a similar format to those from MS-MS searches and there will be no attempt to model combining, say, MS4 matches with the corresponding MS3 and MS-MS results.) There should be sufficient information for the tool to generate output reports that conform to the requirements made by journals for publication and that conform to the relevant MIAPE guidelines. For example:
    ·    For a PMF search, it should be possible to display the spectrum and show the matches of the peaks to the relevant peptides, but only if the spectrum is available.
    ·    For an MS-MS search, it should be possible to locate which spectrum matched to which peptide in the original file.
    ·    For a tag search, there should be sufficient information to validate that a result is correct.
  2. There should be sufficient information stored in the instance document to enable a user to run the same search on the same or another search engine. This means that all search parameters should be described in sufficient detail and that sufficient information is available to determine which database (if any) the data were searched against. The peak lists data (if any) do not need to be included in the instance document, but do need to be suitably referenced.
  3. A PMF search and an MS-MS search of the same sample can be saved in the same instance document as long as the result is one combined protein list.
  4. It should be possible to save the results of searching a decoy database in the same instance document as the results from the forward database. It should then be possible to write a viewer application that enables a user to investigate the effect of changing, for example, a threshold value on the false discovery rate. This would only be possible if all results (rather than just top matches) from the search are saved in the mzIdentML document and if the results from the decoy search are also saved. It would only be possible to do this at the peptide level for an MS-MS search, because changing thresholds would normally have some effect on the protein grouping algorithm.
  5. It should be possible to save manual or automated annotation of proteins/peptides in an instance document. A third-party tool could be used to save annotations and validations of identified proteins/peptides to an existing instance document.
  6. It should be possible to save the results from a search of a metabolically labeled sample. For example, with a 14N/15N experiment, two separate sets of amino acid masses are used, and it must be possible to tell which masses were used for each peptide result.
  7. For a search of multiple peak lists, it should be possible to identify the spectrum that obtained a match to a particular peptide or protein reported by the search engine. For example, in an LC-MS-MS run, it should be possible to refer back to the spectrum in the peak list file that was searched and from there, if the information is available, to determine the retention time of the spectrum. For an mzML file, the unique 'id' of the spectrum should be available. For other peak list formats, some other unique identifier should be stored where possible. There is no requirement to store other redundant information in the mzIdentML file that will be available in the peak list data.
  8. It should be possible to search an mzIdentML (formerly analysisXML) file to retrieve all molecules that have a specified modification.
  9. It should be possible to store the results of a search of spectra against other spectra - i.e. a spectral library search.
  10. It should be possible to store the results of a top down search i.e. analysis of complete proteins.
  11. Support for storing fragmentation data so that for example viewers could display which ions in the input data match predicted ion fragment masses.
  12. There should be support for storing the results of searches of peptides against nucleic acid databases, including the information about which translation frame the matches were found in.
  13. It should be possible to combine the results from multiple search engines into one mzIdentML document. For example, the peptide identification results from two different search engines could be combined using a third tool to give one set of protein results.
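
As an illustration of use case 4, the following minimal sketch shows how a viewer might recompute a false discovery rate estimate as a score threshold changes. It assumes the peptide-level matches have already been read into (score, is_decoy) pairs, that higher scores are better, and that FDR is estimated as decoys divided by targets above the threshold, which is one simple target-decoy estimator among several. The function name and the data are hypothetical.

```python
from typing import Iterable, Tuple

def fdr_at_threshold(psms: Iterable[Tuple[float, bool]], threshold: float) -> float:
    """Estimate FDR as (#decoy matches) / (#target matches) at or above threshold.

    psms: (score, is_decoy) pairs; higher score is assumed to mean a better match.
    Returns 0.0 when no target match passes the threshold.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    return decoys / targets if targets else 0.0

# Hypothetical peptide-spectrum matches: (score, is_decoy)
psms = [(90.0, False), (80.0, False), (75.0, True),
        (60.0, False), (55.0, True), (40.0, False)]

for threshold in (85.0, 70.0, 50.0):
    print(threshold, fdr_at_threshold(psms, threshold))
```

This only works when all matches, including decoy matches and lower-ranked hits, are present in the document, which is the condition the use case states.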

There will be limited support for the following use cases:

  1. De novo. De novo peptide sequencing results will be supported to the extent that it will be possible to enumerate and record all possible matches found by a de novo technique; however, we anticipate that this will produce extremely large files. In version 2, solutions will be investigated for defining a standard way of reporting ambiguous combinations of residues.

The following use cases will not be supported in version 1 of mzIdentML:

  1. It should be possible to store relative and absolute quantitation information at the peptide and protein level using all the popular techniques [Deferred to version 2].
  2. Support for LC-MS biomarker discovery.
  3. Support for complex workflows where multiple data processing algorithms are tagged together; i.e. only “final” results are represented in mzIdentML v1, no intermediate results.


mzIdentML Conformance to MIAPE

This table lists each point in the MIAPE guidelines and states the xpath/CV available to provide conformance


The MIAPE document is available here, and general information about MIAPE is here.

MIAPE SectionItemxPath (under mzIdentML)Notesabcdefghij
1Date stamp (as YYYY-MM-DD)creationDate (attribute)The creation date of the document itself. xsd:dateTime  
AnalysisCollection/SpectrumIdentification/activityDate (attribute)Date spectrum identification performed. xsd:dateTime 
AnalysisCollection/ProteinDetection/activityDate (attribute)Date protein inferencing performed. xsd:dateTimenana
Responsible person (or institutional role if more appropriate); provide name, affiliation and stable contact informationProvider/ContactRoleAn institutional email address can generally satisfy this requirement.  
Software name, version and manufacturerAnalysisSoftwareList/AnalysisSoftware/name 
AnalysisSoftwareList/AnalysisSoftware/version  
AnalysisSoftwareList/AnalysisSoftware/ContactRole 
Customisations made to that softwareAnalysisSoftwareList/AnalysisSoftware/CustomizationsNo customisations in some examples for illustration.
In the other cases this is just not applicable (na).
nananananana
Availability of that softwareAnalysisSoftwareList/AnalysisSoftware/URIThe references of the vendor or public url if a publicly available version has been used.  
Location of the files generated; parameter files, spectral data (input/output)DataCollection/Inputs/SourceFileThe location of the data generated. If made available in a public repository, describe the URI (for instance an url, or the url of the repository and the information on how to retrieve the data). If not made available for public access, describe the contact person reference or source and the internal coordinates of the data. e.g. Sequest .out, Mascot .dat. [Note to MIAPE Authors: This is confusing because of overlap with next section, so we just consider Inputs/SourceFile here and not the .dta files etc.]. 
2Input data – Description and type of MS dataDataCollection/Inputs/SpectraDataProvide a short description that can refer to the data in the experiment (e.g. LC-MS run1). [Refer to mzML source file for information - outside scope of mzIdentML]          
DataCollection/Inputs/SpectraData/fileFormat    
Input data – Availability of MS data (source of data)DataCollection/Inputs/SpectraDataLocation (URI) of input data file 
Input parameters - Databases queried; description and versions (including number of entries searched)DataCollection/Inputs/SearchDatabase/DatabaseName
and/or
DataCollection/Inputs/SearchDatabase/location
  
DataCollection/Inputs/SearchDatabase/version   
DataCollection/Inputs/SearchDatabase/numDatabaseSequences na
Input parameters - Taxonomical restrictions appliedAnalysisProtocolCollection/SpectrumIdentificationProtocol/DatabaseFiltersSpecify the ... subset of the databank(s) (for instance, “mammals”, a NCBI TaxId, a list of accession numbers).nananananananana
DataCollection/AnalysisData/SpectrumIdentificationList/numSequencesSearchedSpecify the number of entries searched.nananananananana
Input parameters - Description of tool and scoring schemeAnalysisProtocolCollection/SpectrumIdentificationProtocol/AdditionalSearchParams/cvParamDescriptor of the scoring algorithm in the search engine (such as ESI-TRAP in Mascot, ESI... [Note to MIAPE authors: These example parameters are somewhat search-engine specific]
Input parameters - Specified cleavage agent(s)AnalysisProtocolCollection/SpectrumIdentificationProtocol/EnzymesDescribe the cleavage agent as available on the search engine. If the cleavage agent rules have been defined by the user, describe the cleavage rules.
Input parameters - Allowed number of missed cleavagesAnalysisProtocolCollection/SpectrumIdentificationProtocol/Enzymes/Enzyme/missedCleavagesMaximum allowed number of cleavage sites missed by the specified agent during the in-silico cleavage process. For a no-enzyme search, use the "No Enzyme" CV term, and omit the number of missed cleavages.
Input parameters - Additional parameters related to cleavageAnalysisProtocolCollection/SpectrumIdentificationProtocol/EnzymesThe Enzymes section is flexible. Example 'a' shows a case of a mixed enzyme.nanananananananana
Input parameters - Permissible amino acids modificationsAnalysisProtocolCollection/SpectrumIdentificationProtocol/ModificationParams/SearchModificationUsing the PSI-MS names available from Unimodnananana
Input parameters - Precursor-ion and fragment ion mass tolerance for tandem MS (when applicable)AnalysisProtocolCollection/SpectrumIdentificationProtocol/FragmentTolerance
AnalysisProtocolCollection/SpectrumIdentificationProtocol/ParentTolerance
 nana
Input parameters - Mass tolerance for PMF (when applicable)AnalysisProtocolCollection/SpectrumIdentificationProtocol/ParentTolerance nanananananananana
Input parameters - Thresholds; minimum scores for peptides, proteins (probabilities, number of hits, other metrics)AnalysisProtocolCollection/SpectrumIdentificationProtocol/AdditionalSearchParams/cvParam          
AnalysisProtocolCollection/ProteinDetectionProtocol/AnalysisParams/cvParam na  na
Input parameters - Any other relevant parametersAnalysisProtocolCollection/SpectrumIdentificationProtocol/AdditionalSearchParams/cvParam 
3Identified proteins - Accession code in the queried databaseSequenceCollection/DBSequence/accession 
Identified proteins - Protein descriptionSequenceCollection/DBSequence/cvParam accession="MS:1001088" na  na
Identified proteins - Protein scoresDataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/cvParam nanana
Identified proteins - Validation statusDataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/cvParam accession="MS:1001060"For all protein hits in the search, specify if accepted without post-processing of search engine/de-novo interpretation (accept raw output of identification software) or if manually accepted as valid or as rejected (false positive).     na   na
Identified proteins - Number of different peptide sequences (without considering modifications) assigned to the proteinDataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/cvParam
accession="MS:1001097"
 na  na
Identified proteins - Percent peptide coverage of proteinDataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/cvParam
accession="MS:1001093"
 na  na
Identified proteins - Identity of supporting peptidesDataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/PeptideHypothesis nana
Identified proteins - In the case of PMF, number of matched/unmatched peaksDataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/cvParam
accession="MS:1001097" name="distinct peptide sequences"
accession="MS:1001362" name="number of unmatched peaks"
 nanananananananana
For identified peptides - Sequence (indicate any deviation from the expected protein cleavage specificity)SequenceCollection/Peptide/peptideSequence 
For identified peptides - Peptide scoresDataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/cvParam na
For identified peptides - Chemical modifications (artefactual) and post-translational modifications (naturally occurring); sequence polymorphisms with experimental evidence (particularly for isobaric modifications)SequenceCollection/Peptide/Modification nanana
For identified peptides - Corresponding spectrum locusDataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/start and end   
For identified peptides - Charge assumed for identification and a measurement of peptide mass errorDataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/chargeState  
DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/calculatedMassToCharge - DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/experimentalMassToCharge  
For identified peptides - Other additional information, when used for evaluation of confidenceDataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/cvParam 
Quantitation for selected ions - Quantitation approach (e.g. 4plex-iTRAQ, ICAT, cICAT, COFRADIC)Out of scopePlanned for mzQuantML          
Quantitation for selected ions - Quantity measurement (e.g. integration of signals, use of signal intensity)Out of scopePlanned for mzQuantML          
Quantitation for selected ions - Data transformation and normalisation technique (description of method and software)Out of scopePlanned for mzQuantML          
Quantitation for selected ions - Number of replicates (biological and technical)Out of scopePlanned for mzQuantML          
Quantitation for selected ions - Acceptance criteria (including measure of errors)Out of scopePlanned for mzQuantML          
Quantitation for selected ions - Estimates of uncertainty and the methods for the error analysis, including the treatment of relevant systematic error effects and the treatment of random error issues. Results from controls (when described)Out of scopePlanned for mzQuantML          
4Assessment and confidence given to the identification and quantitation (description of methods, thresholds, values, etc,)AnalysisProtocolCollection/SpectrumIdentificationProtocol/Threshold
and
ProteinDetectionProtocol/Threshold
For example, MS:1001316, mascot:SigThreshold  
4Results of statistical analysis or determination of false positive rate in case of large scale experiments          
4Inclusion/exclusion of the output of the software are provided (description of what part of the output has been kept, what part has been rejected)DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/ @passThreshold 
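
To show how a few of the paths in this table might be queried in practice, here is a minimal sketch using Python's standard xml.etree.ElementTree. The embedded document is a hypothetical, heavily trimmed fragment that is not schema-valid, and it assumes the mzIdentML 1.1 layout in which name, version and passThreshold are XML attributes. The function name miape_summary and all sample values are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace used by mzIdentML 1.1 documents (1.2 files use ".../mzIdentML/1.2").
MZID_NS = "http://psidev.info/psi/pi/mzIdentML/1.1"
NS = {"m": MZID_NS}

# Hypothetical, trimmed fragment for illustration only; not a valid instance document.
SAMPLE = f"""<MzIdentML xmlns="{MZID_NS}" creationDate="2011-08-01T12:00:00">
  <AnalysisSoftwareList>
    <AnalysisSoftware id="AS1" name="ExampleSearch" version="1.0"/>
  </AnalysisSoftwareList>
  <DataCollection>
    <Inputs>
      <SearchDatabase id="DB1" location="file:///data/example.fasta"
                      version="2011_08" numDatabaseSequences="20000"/>
    </Inputs>
    <AnalysisData>
      <SpectrumIdentificationList id="SIL1">
        <SpectrumIdentificationResult id="SIR1" spectrumID="index=0">
          <SpectrumIdentificationItem id="SII1" chargeState="2" passThreshold="true"/>
          <SpectrumIdentificationItem id="SII2" chargeState="3" passThreshold="false"/>
        </SpectrumIdentificationResult>
      </SpectrumIdentificationList>
    </AnalysisData>
  </DataCollection>
</MzIdentML>"""

def miape_summary(root: ET.Element) -> dict:
    """Collect a handful of MIAPE-relevant values from a parsed document root."""
    software = [(s.get("name"), s.get("version"))
                for s in root.findall("m:AnalysisSoftwareList/m:AnalysisSoftware", NS)]
    databases = [(d.get("location"), d.get("version"), d.get("numDatabaseSequences"))
                 for d in root.findall("m:DataCollection/m:Inputs/m:SearchDatabase", NS)]
    items = root.findall(
        "m:DataCollection/m:AnalysisData/m:SpectrumIdentificationList/"
        "m:SpectrumIdentificationResult/m:SpectrumIdentificationItem", NS)
    passing = [i.get("id") for i in items if i.get("passThreshold") == "true"]
    return {
        "creation_date": root.get("creationDate"),  # MIAPE section 1: date stamp
        "software": software,                       # software name and version
        "databases": databases,                     # databases queried
        "passing_items": passing,                   # items flagged as passing the threshold
    }

summary = miape_summary(ET.fromstring(SAMPLE))
print(summary)
```

Real files are much larger, so a streaming parser or one of the dedicated libraries listed in the tools table would usually be a better fit than loading the whole tree.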


mzIdentML Conformance to MCP Guidelines

This table lists each point in the Molecular and Cellular Proteomics Publication guidelines for the analysis and documentation of peptide and protein identifications. It states the xpath/CV available to provide the required information.


The MCP document is available here

MCP SectionItemxPath (under mzIdentML)Notesabcdefghij
1The method and/or program (including version number) used to create the "peak list" from the raw data and the parameters used in the creation of this peak list. This is outside the scope of mzIdentML. If the source data is in mzML or mzData format, then this information should be available from there.
The name and version of the program(s) used for database searching and the values of search parameters. Examples include precursor-ion mass tolerance, fragment-ion mass tolerance, modifications allowed for, any missed cleavages, protein cleavage chemistry (if any), etc.AnalysisSoftwareList/AnalysisSoftware/name 
AnalysisSoftwareList/AnalysisSoftware/version  
AnalysisSoftwareList/AnalysisSoftware/ContactRole 
AnalysisProtocolCollection/SpectrumIdentificationProtocol/FragmentTolerance
AnalysisProtocolCollection/SpectrumIdentificationProtocol/ParentTolerance
 nana
AnalysisProtocolCollection/SpectrumIdentificationProtocol/Enzymesmissed cleavages, protein cleavage chemistry
AnalysisProtocolCollection/SpectrumIdentificationProtocol/ModificationParams/SearchModificationUsing the PSI-MS names available from Unimodnananana
The name and version of the sequence database(s) used. If a database was compiled in-house, a complete description of the source of the sequences is required. The number of entries actually searched from each database should be included. Authors should justify the use of a very small database or database that excludes common contaminants, since this may generate misleading assignments.DataCollection/Inputs/SearchDatabase/DatabaseName
and/or
DataCollection/Inputs/SearchDatabase/location
  
DataCollection/Inputs/SearchDatabase/version   
DataCollection/Inputs/SearchDatabase/numDatabaseSequences na
Methods used to interpret MS/MS data, thresholds and values specific to judging certainty of identification, whether any statistical analysis was applied to validate the results, and a description of how applied.AnalysisProtocolCollection/SpectrumIdentificationProtocol/AdditionalSearchParams/cvParame.g. Minimum scores or expect values         
AnalysisProtocolCollection/SpectrumIdentificationProtocol/Threshold
and
ProteinDetectionProtocol/Threshold
For example, MS:1001316, mascot:SigThreshold  
AnalysisProtocolCollection/ProteinDetectionProtocol/AnalysisParams/cvParame.g. Minimum number of peptides per protein.na  na
For large scale experiments, provide the results of any additional statistical analyses that indicate or establish a measure of identification certainty, or allow a determination of the false-positive rate, e.g., the results of randomized database searches or other computational approaches.AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/PeptideEvidence<PeptideEvidence> elements have an isDecoy attribute. A number of controlled vocabulary terms are available for describing the FDR  
2Information for each protein sequence identified should specify the following:
Accession number and database sourceSequenceCollection/DBSequence/accessionEach protein result has an id which identifies the database and accession
score(s) and any associated statistical information obtained for searches conducted;DataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/cvParamScores etc. are different for each search engine.nanana
sequence coverage, expressed as the number of amino acids spanned by the assigned peptides divided by the sequence length;DataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/cvParam
accession="MS:1001093"
   
the total number of peptides assigned to the protein. To compute this number, multiple matches to peptides with the same primary sequence count as one, even if they represent different charge states or modification states;DataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/ProteinDetectionHypothesis/cvParam
accession="MS:1001097" or "MS:1001098"
See definitions for MS:1001097 and MS:1001098  
In addition to the above information, for single peptide-based identifications the following data should be provided:
peptide sequence, noting any deviation from the expected protein cleavage specificity;SequenceCollection/Peptide/peptideSequenceDoes not show deviation from expected cleavage specificity. This would need to be derived.
modificationsSequenceCollection/Peptide/ModificationComplete information about which sites are modified is availablenanana
precursor mass, charge and mass error observed;DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/chargeStateCharge state 
DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/experimentalMassToChargePrecursor mass 
DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/calculatedMassToCharge
-
DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/experimentalMassToCharge
Mass error observed 
score(s) and any associated statistical information;DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/cvParam na
MS/MS spectrum annotated with masses observed as well as fragment assignmentsAnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/Fragmentation/IonType/FragmentArrayThis contains the information required to be able to generate an annotated spectrum  na   
3Additional potentially valuable information could include the retention time of each peptide, the observation of multiple charge states, multiple observations of the same peptide, flanking residues, start and end positions of peptides in proteins, and any platform-specific information. The retention time may be available from the mzML, mzData or MGF file that is referenced.nananananananananana
 Multiple observations are permitted, so it is trivial to extract information about the observation of multiple charge states, multiple observations of the same peptidenana
AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/PeptideEvidence/pre & postFlanking residues  na   
DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/start & endStart and end positions  
4Manuscripts presenting any conclusions citing quantitative proteomic results should contain the following information: Out of scope - will be included in mzQuantML          
5Studies focusing on posttranslational modifications
... require specialized methodology and documentation to assign the presence and the site(s) of modification. Certain modifications are also nominally isobaric (e.g., acetylation vs trimethylation, phosphorylation vs sulfation). If one of these modifications is being reported, then evidence for assigning a specific modification over another must be presented. Examples of methods able to distinguish between these include mass spectrometric approaches such as accurate mass determination, observation of signature fragment ions (e.g. m/z 79 vs m/z 80 in negative ion mode for assignment of phosphorylation over sulfation), or biological or chemical strategies. Not currently supported          
In the tabular presentation of the data, authors are required to show 1) the sequence of the peptide used to make each such assignment, 2) the precursor mass and charge (not just m/z) observed, and 3) the search engine score for this peptide. Frequently more than one possible site of modification exists within a given peptide sequence. Assignment of specific site(s) of modification requires observation of fragment ions that distinguish among the possible sites. When ambiguity with regard to the modification site cannot be resolved, then the ambiguity must be explicitly shown in the tables (e.g., ALEG(sss)YLLK where one of the three Ser residues in parentheses is phosphorylated, but the spectra do not permit assignment of which one). The number of detected modifications in each peptide (e.g., 1, 2 or 3 phosphates) must also be included in the table. As described above. To produce a table, a simple XSLT could, for example, be written.nananana
In all cases involving the assignment of a posttranslational modification(s) , we require that copies of the annotated, mass labeled spectra for those modified peptides be submitted electronically together with the manuscript for review purposes. Authors are required to present representative spectra of posttranslationally modified peptides in the body of the text and the remaining annotated spectra as supplemental material. In addition, authors are encouraged to provide the corresponding peak (m/z and intensity) lists for review.AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/Fragmentation/IonType/FragmentArrayThis contains the information required to be able to generate an annotated spectrum  na   
6While more reliable results for peptide identification are generally produced by MS/MS data, in selected circumstances, such as analysis of 2D gel spots, peptide mass fingerprinting can be an effective choice of technique for protein identification. For each identification, an annotated mass spectrum must be supplied. We also encourage the submission of the peak lists for review. In the tabular presentation of the results the authors must supply: 1) the number of matched peaks, 2) the number of unmatched peaks, and 3) the sequence coverage. Some probability value attached to this identification should be presented; for example, in addition to the score for the top match the score for the highest ranked hit to a non-homologous protein. They must describe the parameters and thresholds used to analyze the data (see guideline 1, above), including mass accuracy, resolution, means of calibrating each spectrum, and exclusion of known contaminant ions (keratin, etc.). Authors are required to use and provide the results of scoring schemes that provide a measure of identification certainty, or perform some measure of the false-positive rate. As described above.
The number of unmatched peaks is recorded using the CV term: accession="MS:1001362" name="number of unmatched peaks"
nanananananananana
7Identical peptide sequences can be included in multiple unique protein sequences due to biological variation such as single amino acid variants, alternative splice forms, homologs, orthologs and paralogs. Other reasons for apparent redundancy in protein sequence database entries are the inclusion of sequence fragments and sequences with errors. Apparent redundancy can also occur due to clerical errors arising from the merger of multiple sequence databases or identical protein sequences appearing under different names or accession numbers.

Experimental strategies based on proteolytic digestion of protein mixtures introduce the complication of loss of connectivity between peptides and their protein precursors. Assignment of peptide sequences results in two outcomes; distinct peptides that map to only one protein sequence or shared peptides that map to more than one protein sequence. Detection of shared peptides introduces an uncertainty between the possibility that a shared peptide can be mapped to more than one protein sequence (bioinformatics redundancy) versus the possibility that more than one precursor is in the original protein mixture (physical redundancy). The apparent ambiguity in peptide assignment requires reporting of a protein group. When assembling peptides into proteins and protein groups, authors should adhere to principles of parsimony, i.e., describe the minimum set of protein sequences that adequately accounts for all observed peptides. While the identification of shared peptides implies that multiple related protein sequences are present, the initial assumption should be that only a single form is being detected. Authors should explain and be able to justify cases where a single protein from a protein group has been singled out or that more than one member of a protein group is present. When reporting a summary list of peptides belonging to each protein group, peptides shared among multiple proteins and those unique to a specific protein should be clearly indicated. In addition, sometimes proteins are identified from a different species than the one being studied. For example, identification of a mouse or human protein in a hamster study. If such an orthologous protein is included, the circumstances should be mentioned and justified.
 mzIdentML does not require that the "parsimony" technique be used. It does allow representation and complete description of protein groups using the ProteinAmbiguityGroup and ProteinDetectionHypothesis elements.
8It is strongly encouraged (but not yet required) that all MS/MS spectra mentioned in the paper be submitted as supplemental material. Journals will vary in their ability to handle this information and authors are encouraged to provide access to raw MS data using other means, including group websites and public repositories, as they emerge, in addition to the journal itself. Not applicable          
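
The SpectrumIdentificationItem attributes listed in the table above (chargeState, experimentalMassToCharge, calculatedMassToCharge) and the score cvParam children can be pulled into a simple table with a few lines of code. The following is a minimal sketch in Python rather than XSLT, under some assumptions: the embedded fragment is hand-written for illustration and is not a complete, valid mzIdentML file, and the function name `psm_rows` and the dictionary keys are invented here. The error is reported as calculated minus experimental m/z, following the order given in the table.

```python
import xml.etree.ElementTree as ET

# Hand-written illustrative fragment (assumption): real files contain many
# more elements and attributes than shown here.
SAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection><AnalysisData><SpectrumIdentificationList id="SIL_1">
    <SpectrumIdentificationResult id="SIR_1" spectrumID="index=0" spectraData_ref="SD_1">
      <SpectrumIdentificationItem id="SII_1" chargeState="2" rank="1"
          passThreshold="true" peptide_ref="pep_1"
          experimentalMassToCharge="500.25" calculatedMassToCharge="500.75">
        <cvParam cvRef="PSI-MS" accession="MS:1001171" name="Mascot:score" value="42.5"/>
      </SpectrumIdentificationItem>
    </SpectrumIdentificationResult>
  </SpectrumIdentificationList></AnalysisData></DataCollection>
</MzIdentML>"""


def local(tag):
    """Return the element name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def psm_rows(xml_text):
    """Collect one row per SpectrumIdentificationItem (hypothetical helper)."""
    rows = []
    for elem in ET.fromstring(xml_text).iter():
        if local(elem.tag) != "SpectrumIdentificationItem":
            continue
        exp_mz = float(elem.get("experimentalMassToCharge"))
        calc_raw = elem.get("calculatedMassToCharge")  # may be absent
        calc_mz = float(calc_raw) if calc_raw is not None else None
        # Scores are reported as cvParam children of the item.
        scores = {c.get("name"): c.get("value")
                  for c in elem if local(c.tag) == "cvParam"}
        rows.append({
            "id": elem.get("id"),
            "charge": int(elem.get("chargeState")),
            "experimental_mz": exp_mz,
            "calculated_mz": calc_mz,
            # calculated - experimental, in m/z units
            "mz_error": None if calc_mz is None else calc_mz - exp_mz,
            "scores": scores,
        })
    return rows


if __name__ == "__main__":
    for row in psm_rows(SAMPLE):
        print(row)
```

Matching on local element names keeps the sketch independent of the exact namespace URI, which differs between schema versions.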


Semantic Validation

The PSI Validator generic framework

The PSI semantic validator tackles the issue of automatically checking that experimental data reported using a specific format and various semantic resources are indeed compliant with the MIAPE recommendations. The semantic validator not only checks the XML syntax but also enforces many rules as to how controlled vocabulary (CV) terms and term classes are used: it verifies that each term mentioned exists in its source CV (and is not just a random string reported in the XML document) and, more importantly, that the correct terms are used in the correct location of a document. Moreover, the semantic validator framework is extremely flexible and can be adapted to any PSI workgroup standard just by customizing the three input files:

  1. a list of ontologies or CVs necessary to annotate exchanged data in a MIAPE compliant way

  2. a mapping file formalizing how the necessary CVs and an exchange format are interrelated (see documentation)

  3. a list of object rules to be run by the validator.
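
The two central checks described above (does the term exist in its source CV, and is it used in an allowed location) can be illustrated with a toy example. This is only a sketch of the idea in Python, not the PSI validator itself, which is written in Java: the `EX:` accessions, the `KNOWN_TERMS` and `ALLOWED` dictionaries standing in for the CV list and mapping file, the sample document, and the function `validate` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for the CV list: accession -> term name (toy values).
KNOWN_TERMS = {
    "EX:0001": "example score",
    "EX:0002": "example sample type",
}

# Stand-in for the mapping file: element name -> accessions allowed there.
ALLOWED = {
    "Score": {"EX:0001"},
    "Sample": {"EX:0002"},
}

# Toy document: one valid term, one unknown term, one misplaced term.
SAMPLE_DOC = """<Doc>
  <Score><cvParam accession="EX:0001" name="example score"/></Score>
  <Score><cvParam accession="EX:9999" name="made up"/></Score>
  <Sample><cvParam accession="EX:0001" name="example score"/></Sample>
</Doc>"""


def local(tag):
    """Return the element name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def validate(xml_text, known_terms, allowed):
    """Return a list of human-readable messages, empty if all checks pass."""
    messages = []
    for parent in ET.fromstring(xml_text).iter():
        where = local(parent.tag)
        for child in parent:
            if local(child.tag) != "cvParam":
                continue
            acc = child.get("accession")
            if acc not in known_terms:
                # The term does not exist in the source CV.
                messages.append(f"{where}: unknown term {acc}")
            elif known_terms[acc] != child.get("name"):
                # The accession exists but the reported name disagrees.
                messages.append(f"{where}: name mismatch for {acc}")
            elif acc not in allowed.get(where, set()):
                # The term exists but is used in the wrong location.
                messages.append(f"{where}: term {acc} not allowed here")
    return messages


if __name__ == "__main__":
    for message in validate(SAMPLE_DOC, KNOWN_TERMS, ALLOWED):
        print(message)
```

Object rules, the third input, would be additional checks that cannot be expressed as a simple term-to-location mapping; they are not covered by this sketch.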

The Java source code can be accessed here. The generic framework and dependencies can be downloaded from here.

A tutorial has been made available to guide users writing their own validator. 

 

Current implementations of the PSI validator in specific workgroups

 

PSI WorkgroupFormatStandardValidator Web ApplicationSource codeConfiguration files
Molecular InteractionMIF25MIMIxMIMIx validatorcontact

MI-mapping   MI-CvSourceList

 MIF25IMEx  IMEx validator  contactMI-mapping-Imex
 MIF25PAR  PAR validator  contactPAR-mapping
PAR-CvSourceList
Mass SpectrometrymzMLMIAPE-MS

MS validator

Java mzML validator (Java Web Start)

Java code available here

MS-mapping MS-CvSourceList MS-ObjectRules

miape-ms-rules miape-object-rules

Beta implementations  draft validatorscode in C++ 
Proteomics InformaticsmzIdentML 1.1.1MIAPE-MSI

Java mzIdentML validator

 

Java code here

mzid-mapping

miape-msi-rules

object-rules

miape-object-rules

Protein SeparationGelML (version 1.1 candidate)MIAPE-GEGelML validator Available hereGelML-mapping (draft)

 

 


 

 


Proteomics Informatics Workgroup

 



HUPO Proteomics Standards Initiative
Proteomics Informatics Working Group (PSI-PI)

Contents

  1. Group Charter
  2. Group Structure
  3. Obtaining the current Documents and Getting Involved
  4. mzIdentML
  5. mzQuantML
  6. Request new CV terms to the PSI-MS Controlled Vocabulary
  7. Meetings and Logistics
  8. Mailing List and Issue Discussion

 

 

The PSI Proteomics Informatics standards group is one of the working groups of the Proteomics Standards Initiative.

 

Proteomics Informatics Group Charter

Please see here.


 

Group Structure (2013)

Role
Current Incumbent
ChairAndy Jones
Co-chairMartin Eisenacher/Juan Antonio Vizcaino
MIAPE Co-ordinatorPierre-Alain Binz
Ontology Co-ordinatorGerhard Mayer
EditorGerhard Mayer
SecretaryJuan Antonio Vizcaino

 

 


 

Obtaining the Current Documents and Getting Involved

The deliverables (XML schema, example documents, specification document etc.) of the standards released by the PSI-PI working group can be found on different pages linked below.

All ongoing work, e.g. the deliverables and issue lists of the versions under development, is managed in GitHub (see the URLs on the pages linked above).

GitHub makes use of Git to allow versions of documents to be managed. You can get read-only access to all of the files in the repository anonymously, or, if you are a member of the group, you can check out the contents of the repository, commit changes to the files and add new ones. There are many different clients that you can use to access and write to the Git repository. If you are using Windows, the SourceTree client comes highly recommended.

 


mzIdentML

The main current deliverable of the Proteomics Informatics working group is the mzIdentML data exchange standard (previously known as analysisXML). Please see use cases for mzIdentML to get a flavour of its scope and purpose (mainly storing parameters and results of a spectrum identification search).

Version 1.0.0 of mzIdentML was formally released on 20th August 2009.
Work on an update to version 1.1.0 of mzIdentML was finished in August 2011.
A very minor update to version 1.1.1 was done in July 2015.

More information about mzIdentML is available HERE.

The current status of tools that write and import mzIdentML is on this page.

 


 

mzQuantML

The other current deliverable of the Proteomics Informatics working group is the mzQuantML data exchange standard. It is intended to store parameters and results of quantification workflows.

Version 1.0.0 of mzQuantML was released in Feb 2013. More information about mzQuantML is available HERE.

 


 

Request new CV terms to the PSI-MS Controlled Vocabulary

To request new CV terms to be added to the PSI-MS Controlled Vocabulary, please use the psidev-vocab mailing list.

 


 

Meetings and Logistics

On-going communication between all participants will be achieved via the mailing list psidev-pi-dev@lists.sourceforge.net.

Regular weekly telephone conferences are organised to allow discussion of the progress and direction of the working group when appropriate. Details of the next meeting will be posted here as soon as they are available.

The group meets face to face at least once a year at the Spring PSI conference.

 


 


Mailing list and Issue Discussion

This is the discussion list for proteomics informatics (mzIdentML and mzQuantML) development.

You can subscribe to the psidev-pi-dev list.

You can view the existing posts to the psidev-pi-dev list.

To post a message to the list, send an email to psidev-pi-dev@lists.sourceforge.net

 

