The ProteomeXchange consortium allows researchers to easily deposit and retrieve proteomics data. It includes repositories like PRIDE, PeptideAtlas, and recently MassIVE. The goal is to standardize submission and access across repositories through common identifiers and supported workflows. Over 1,300 datasets have been submitted, with many tools now supporting standard formats like mzIdentML for complete submissions. The most accessed datasets include large reference maps of the human proteome. Open source tools are improving submission and analysis of ProteomeXchange data.
ProteomeXchange: data deposition and data retrieval made easy
1. ProteomeXchange: data deposition and data retrieval made easy
Juan Antonio VIZCAINO, Ph.D.
PRIDE Group coordinator
Proteomics Services Group
European Bioinformatics Institute
Hinxton, Cambridge
United Kingdom
juan@ebi.ac.uk
2. Overview
• The ProteomeXchange (PX) consortium
• Highlights in the last year
• PRIME-XS datasets
3. ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org
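To make the common identifier space concrete, here is a minimal sketch of recognising PXD identifiers in free text. It assumes the "PXD" plus six digits shape seen in the accessions quoted in this deck (for example PXD000764). The function names and the ProteomeCentral lookup URL pattern are illustrative assumptions, not part of the slides, and should be checked against the consortium site.

```python
import re
from typing import List

# Shape assumed from the accessions in this deck: "PXD" followed by six digits.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")

# Assumed lookup URL pattern for ProteomeCentral; verify before relying on it.
PROTEOMECENTRAL_URL = (
    "http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID={accession}"
)


def is_pxd_accession(text: str) -> bool:
    """Return True if the whole string is a single PXD accession."""
    return PXD_PATTERN.fullmatch(text.strip()) is not None


def find_pxd_accessions(text: str) -> List[str]:
    """Return the distinct PXD accessions in text, in order of first appearance."""
    seen: List[str] = []
    for accession in PXD_PATTERN.findall(text):
        if accession not in seen:
            seen.append(accession)
    return seen


def dataset_url(accession: str) -> str:
    """Build the (assumed) ProteomeCentral URL for one accession."""
    if not is_pxd_accession(accession):
        raise ValueError(f"not a PXD accession: {accession!r}")
    return PROTEOMECENTRAL_URL.format(accession=accession.strip())
```

A helper like this could be used to pull dataset accessions out of a manuscript before linking them.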
4. ProteomeXchange data workflow
[Diagram labels. Submitted data: researcher’s results, raw data*, metadata / manuscript. Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data). Central hub: ProteomeCentral. Connected resources: journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs. Returned data: reprocessed results, raw data*, metadata.]
Vizcaíno et al., Nat Biotechnol, 2014
7. PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files: Optional files:
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
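The file categories above can be sketched as a small data model that decides whether a submission would count as complete or partial. This is only an illustration: the tags QUANT, PEAK, GEL, OTHER, FASTA and SP_LIBRARY come from the slide, while the RAW, RESULT and SEARCH tags, the `SubmissionFile` class and the `submission_kind` function are assumptions made for this example and are not the PX submission tool's API.

```python
from dataclasses import dataclass
from typing import List

# QUANT, PEAK, GEL, OTHER, FASTA and SP_LIBRARY appear on the slide.
# RAW, RESULT and SEARCH are assumed tags for items 1, 2a and 2b.
FILE_TYPES = {
    "RAW", "RESULT", "SEARCH",
    "QUANT", "PEAK", "GEL", "OTHER", "FASTA", "SP_LIBRARY",
}


@dataclass(frozen=True)
class SubmissionFile:
    name: str
    file_type: str


def submission_kind(files: List[SubmissionFile]) -> str:
    """Return "complete" or "partial" for a list of tagged files."""
    types = {f.file_type for f in files}
    unknown = types - FILE_TYPES
    if unknown:
        raise ValueError(f"unknown file types: {sorted(unknown)}")
    if "RAW" not in types:
        raise ValueError("mass spectrometer output files are required")
    if "RESULT" in types:
        # PRIDE XML or mzIdentML result files make the submission complete.
        return "complete"
    if "SEARCH" in types:
        # Only original search engine output: stored as a partial submission.
        return "partial"
    raise ValueError("no result or search engine output files found")
```

For instance, a raw file plus an mzIdentML file tagged RESULT would be classified as complete, while the same raw file with only a search engine output tagged SEARCH would be partial.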
8. Complete vs Partial submissions: processed results
For complete submissions, the spectra can be connected to the processed identification results, and both can be visualized.
Supported formats: PRIDE XML and mzIdentML; mzTab to come.
[Screenshots: complete vs partial submission]
9. Complete vs Partial submissions: experimental metadata
[Screenshots: complete vs partial submission]
General experimental metadata about the projects is similar. However, assay-level information is less thoroughly annotated in partial submissions.
10. Complete submissions using mzIdentML
[Diagram labels: search engines; search engine results + MS files; mzIdentML]
An increasing number of tools support export to mzIdentML 1.1:
- Mascot
- MSGF+
- MyriMatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab internal pipelines, …
Referenced spectral files need to be submitted as well (all open formats are supported).
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
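Because the referenced spectral files must accompany an mzIdentML result file, it can help to list them before submitting. The sketch below reads the `location` attribute of each `SpectraData` element with Python's standard XML parser. The function name and the trimmed example document are assumptions for illustration (the fragment is not schema-valid), and the whole file is loaded into memory, so a very large mzIdentML file would need a streaming parser instead.

```python
import xml.etree.ElementTree as ET
from typing import List


def referenced_spectra_files(mzid_xml: str) -> List[str]:
    """Return the location of every SpectraData element, in document order."""
    root = ET.fromstring(mzid_xml)
    locations = []
    for element in root.iter():
        # Strip the "{namespace}" prefix so the check works for any mzIdentML version.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "SpectraData":
            location = element.get("location")
            if location:
                locations.append(location)
    return locations


# Trimmed, hypothetical fragment; a real file contains many more elements.
EXAMPLE_MZID = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection>
    <Inputs>
      <SpectraData id="SD_1" location="file:///data/run1.mgf"/>
      <SpectraData id="SD_2" location="run2.mzML"/>
    </Inputs>
  </DataCollection>
</MzIdentML>"""

if __name__ == "__main__":
    for path in referenced_spectra_files(EXAMPLE_MZID):
        print(path)
```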
11. Now: native file export
[Diagram, three columns. Tools: Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others. ‘RESULT’ file generation: native file export. Final ‘RESULT’ file: mzIdentML ‘RESULT’ plus spectra files.]
12. Before: file conversion using PRIDE Converter
[Diagram, three columns. Original data files: search output files, spectra files. ‘RESULT’ file generation: file conversion with PRIDE Converter. Final ‘RESULT’ file: PRIDE XML ‘RESULT’.]
13. PRIDE Inspector 2
Wang et al., Nat. Biotechnology, 2012
PRIDE Inspector 2.0 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab (work in progress)
http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector
14. PX submission tool: data submission
http://www.proteomexchange.org/submission
The PX submission tool is designed to:
• Capture the mappings between the different types of files.
• Add the mandatory metadata annotation.
• Make the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).
• Command line alternative: some scripting is needed.
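The file-mapping step can be pictured as a check that every result file points at files that are actually part of the upload. The sketch below is a hypothetical helper, not the submission tool's own logic or file format: the function name, the dictionary layout and the file names are all assumptions for illustration.

```python
from typing import Dict, List, Set


def unmapped_result_files(
    mappings: Dict[str, List[str]], uploaded: Set[str]
) -> Dict[str, List[str]]:
    """Return result files with no mapping, or with mapped files not uploaded.

    The value for each problem file is the list of its missing mapped files
    (empty when the result file has no mapping at all).
    """
    problems = {}
    for result_file, related in mappings.items():
        missing = [name for name in related if name not in uploaded]
        if not related or missing:
            problems[result_file] = missing
    return problems


if __name__ == "__main__":
    # Hypothetical file names: each result file maps to the raw files it was derived from.
    mappings = {
        "run1.mzid": ["run1.raw"],
        "run2.mzid": ["run2.raw"],
        "run3.mzid": [],
    }
    uploaded = {"run1.mzid", "run1.raw", "run2.mzid", "run3.mzid"}
    print(unmapped_result_files(mappings, uploaded))
```

Running such a check before a scripted command line upload would flag result files whose spectra were left out.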
15. Uploading large datasets: Aspera
- Aspera is the default file transfer protocol for uploads to PRIDE, from both:
  - the PX Submission tool
  - the command line
- Up to 50X faster than FTP
File transfer speed should not be a problem!
16. Tutorial manuscript detailing the process
Example dataset: PXD000764
- Title: “Discovery of new CSF biomarkers for meningitis in children”
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data
http://www.proteomexchange.org/submission
Ternent et al., Proteomics, 2014
17. ProteomeXchange: 1329 datasets up until October 2014
Origin:
271 USA
166 Germany
115 United Kingdom
73 Switzerland
70 China
68 Netherlands
67 France
55 Canada
44 Spain
42 Belgium
33 Sweden
31 Australia
31 Denmark
31 Japan
20 India
20 Norway
19 Taiwan
17 Ireland
16 Austria
14 Finland
14 Italy
12 Republic of Korea
11 Brazil
9 Russia
8 Israel
7 Singapore …
Type:
437 PRIDE complete
792 PRIDE partial
63 PeptideAtlas/PASSEL complete
14 MassIVE
23 reprocessed
Publicly accessible: 691 datasets (52% of all)
86% PRIDE
12% PASSEL
2% MassIVE
Top species (studied in at least 10 datasets):
577 Homo sapiens
165 Mus musculus
56 Saccharomyces cerevisiae
53 Arabidopsis thaliana
29 Rattus norvegicus
22 Escherichia coli
17 Bos taurus
16 Mycobacterium tuberculosis
13 Oryza sativa
13 Drosophila melanogaster
13 Glycine max
~ 290 species in total
Data volume:
Total: ~55 TB
Number of all files: ~131,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4TB
Datasets/year:
2012: 102
2013: 527
2014: 700
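As a quick consistency check on the figures above, the per-type and per-year counts can be summed and compared with the stated total of 1329 datasets, and the public share recomputed. The numbers are copied from this slide; the variable and function names are illustrative.

```python
# Counts copied from the slide (datasets up until October 2014).
TOTAL_DATASETS = 1329
PUBLIC_DATASETS = 691

BY_TYPE = {
    "PRIDE complete": 437,
    "PRIDE partial": 792,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 14,
    "reprocessed": 23,
}

BY_YEAR = {2012: 102, 2013: 527, 2014: 700}


def percent(part: int, whole: int) -> int:
    """Share of part in whole, rounded to the nearest whole percent."""
    return round(100 * part / whole)


if __name__ == "__main__":
    print("sum by type:", sum(BY_TYPE.values()))
    print("sum by year:", sum(BY_YEAR.values()))
    print("public share (%):", percent(PUBLIC_DATASETS, TOTAL_DATASETS))
    for name, count in BY_TYPE.items():
        print(f"{name}: {percent(count, TOTAL_DATASETS)}%")
```

Both breakdowns add up to the stated total, and the public share rounds to the 52% quoted on the slide.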
19. PX submission tool: PRIME-XS tags
37 datasets in total (both public and private at present):
- 20 from the Netherlands
- 4 from the UK
- 2 each from Austria, Belgium, Denmark, Spain and Switzerland
- 1 each from France and the USA
20. PRIME-XS datasets are now tagged in PRIDE
PRIME-XS datasets are now tagged and can be browsed as a group
http://www.ebi.ac.uk/pride/archive/simpleSearch?q=prime-xs
22. Which are the most accessed datasets?
PXD Identifier | Total Hits | Dataset title | Publication
PXD000561 | 153512 | A draft map of the human proteome | Kim et al., Nature, 2014. PMID: 24870542
PXD000851 | 111587 | Membrane proteomic analysis of colorectal cancer tissue | Kume et al., MCP, 2014. PMID: 24687888
PXD000865 | 51639 | Mass spectrometry based draft of the human proteome | Wilhelm et al., Nature, 2014. PMID: 24870543
24. Reshake PRIDE data in PeptideShaker
Find the desired PRIDE project …
… inspect the project details …
… and start re-analyzing the data!
http://peptide-shaker.googlecode.com
Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L, Barsnes H. Nature Biotechnology (in press)
25. A little bit of perspective
Berlin 2011 Mallorca 2012
Annecy 2013 Split 2013
26. A little bit of perspective
[Timeline, 2011 to 2014. Tools: PRIDE web (2011), PRIDE Converter, PRIDE Inspector, PX Submission Tool, PRIDE Converter 2, PRIDE Inspector 2, PRIDE web (2014). Standards: mzIdentML, mzQuantML, qcML, mzTab. Data: PRIDE/PX datasets.]
27. Conclusions
• ProteomeXchange is widely used.
– PRIDE contains most of the MS/MS datasets.
– It now has a new consortium member: MassIVE (UCSD).
– Around half of the datasets are already public.
• Different open source tools are available to facilitate the process:
– File transfer speed should not be a problem (Aspera support)
28. Acknowledgements: People
Attila Csordas
Tobias Ternent
Noemi del Toro
Rui Wang
Florian Reisinger
Jose A. Dianes
Johannes Griss
Steven Lewis
Yasset Perez-Riverol
Henning Hermjakob
All previous team members
ProteomeXchange partners