
Proteomics public data resources: enabling "big data" analysis in proteomics


  1. Proteomics public data resources: enabling “big data” analysis in proteomics. Dr. Juan Antonio Vizcaíno, EMBL-European Bioinformatics Institute, Hinxton, Cambridge, UK
  2. Juan A. Vizcaíno, juan@ebi.ac.uk. International de.NBI Symposium, Heidelberg, 9 November 2016. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  3. “Big data”: definition. Slide from: http://www.ibmbigdatahub.com/
  4. “Big data” in biology. The term has been applied so far mainly to genomics.
  5. One-slide intro to MS-based proteomics. Hein et al., Handbook of Systems Biology, 2012
  6. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  7. Data resources at EMBL-EBI
     • Genes, genomes & variation: Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal, European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive
     • Gene & protein expression: ArrayExpress, Expression Atlas, PRIDE
     • Protein sequences, families & motifs: InterPro, Pfam, UniProt
     • Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
     • Chemical biology: ChEMBL, ChEBI
     • Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
     • Systems: BioModels, Enzyme Portal, BioSamples
     • Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
  8. What is a proteomics publication in 2016? • Proteomics studies generate potentially large amounts of data and results. • Ideally, a proteomics publication needs to: summarize the results of the study; provide supporting information for the reliability of any results reported. • Information in a publication: manuscript; supplementary material; associated data submitted to a public repository.
  9. PRIDE (PRoteomics IDEntifications) Archive, http://www.ebi.ac.uk/pride/archive • PRIDE stores mass spectrometry (MS)-based proteomics data: peptide and protein expression data (identification and quantification); post-translational modifications; mass spectra (raw data and peak lists); technical and biological metadata; any other related information. • Full support for tandem MS approaches. Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2016
  10. ProteomeXchange: a global, distributed proteomics database • Member repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data), jPOST (MS/MS data; new in 2016). • Data types handled: raw data, identification/quantification results (ID/Q), metadata. • Mandatory raw data deposition since July 2015. • Goal: development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories. http://www.proteomexchange.org Vizcaíno et al., Nat Biotechnol, 2014; Deutsch et al., NAR, 2017, in press
  11. ProteomeXchange data workflow (diagram labels) • Research groups: researcher's results, raw data, metadata • Receiving repositories: PRIDE, MassIVE, jPOST (MS/MS data, as complete submissions); PASSEL (SRM data); any other workflow (mainly partial submissions) • Datasets • ProteomeCentral: metadata / manuscript, raw data, results • Journals • Reanalysis of datasets: PeptideAtlas, MassIVE (reprocessed results). Vizcaíno et al., Nat Biotechnol, 2014; Deutsch et al., NAR, 2017, in press
  12. ProteomeCentral: centralised portal for all PX datasets. http://proteomecentral.proteomexchange.org/cgi/GetDataset
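As an illustration of how a single dataset page on the portal above might be addressed, here is a minimal sketch. It assumes that `GetDataset` accepts an `ID` query parameter and that accessions look like `PXD` followed by six digits (as in PXD001471 elsewhere in this talk). Both are assumptions to check against the ProteomeCentral documentation, and `dataset_url` is a hypothetical helper, not part of any official client.

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# Assumed accession pattern: "PXD" followed by six digits (e.g. PXD001471).
ACCESSION_PATTERN = re.compile(r"PXD\d{6}")

def dataset_url(accession):
    """Build the URL of one dataset page (assumes an 'ID' query parameter)."""
    if not ACCESSION_PATTERN.fullmatch(accession):
        raise ValueError(f"not a PXD accession: {accession!r}")
    return f"{BASE_URL}?{urlencode({'ID': accession})}"

print(dataset_url("PXD001471"))
```

The function only builds a string and performs no network request, so it can be combined with whichever HTTP client is preferred.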
  13. ProteomeXchange data workflow (same diagram as slide 11). Vizcaíno et al., Nat Biotechnol, 2014; Deutsch et al., NAR, 2017, in press
  14. ProteomeXchange data workflow (diagram of slide 11 with additional labels) • UniProt/neXtProt, PeptideAtlas, GPMDB, proteomicsDB, other DBs • OmicsDI: integration with other omics datasets. Vizcaíno et al., Nat Biotechnol, 2014; Deutsch et al., NAR, 2017, in press
  15. PRIDE: source of MS proteomics data • PRIDE Archive already provides, or will soon provide, MS proteomics data to other EMBL-EBI resources such as UniProt, Ensembl and the EBI Expression Atlas. http://www.ebi.ac.uk/pride/archive
  16. PRIDE Archive: over 4,500 datasets from over 51 countries and 1,700 groups
     • Datasets by country: USA 814; Germany 528; UK 338; China 328; France 222; Netherlands 175; Canada 137
     • Data volume: total ~275 TB; number of all files ~560,000; PXD000320-324 ~4 TB; PXD002319-26 ~2.4 TB; PXD001471 ~1.6 TB
     • 1,973 datasets, i.e. 52% of all, are publicly accessible
     • ~90% of all ProteomeXchange datasets
     • PRIDE Archive growth (chart: submissions per year, all vs. complete submissions). In the last 12 months: ~165 submitted datasets per month
     • Top species, studied by at least 100 datasets: Homo sapiens 2,010; Mus musculus 604; Saccharomyces cerevisiae 191; Arabidopsis thaliana 140; Rattus norvegicus 127; >900 reported taxa in total
  17. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  18. PRIDE components: data submission process • Tools and formats shown: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool; mzIdentML, PRIDE XML. • In addition to PRIDE Archive, the PRIDE team develops and maintains tools and software libraries that facilitate the handling and visualisation of MS proteomics data and the submission process.
  19. Current PSI standard file formats for MS • mzML: MS data • mzIdentML: identification • mzQuantML: quantitation • mzTab: final results • TraML: SRM
  20. PRIDE Inspector Toolsuite • PRIDE Inspector: standalone tool to enable visualisation and validation of MS data. • Built on top of ms-data-core-api: open source algorithms and libraries for computational proteomics. • Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE XML. • Broad functionality (screenshots: summary and QC charts; peptide spectra annotation and visualization). https://github.com/PRIDE-Utilities/ms-data-core-api https://github.com/PRIDE-Toolsuite/pride-inspector Wang et al., Nat. Biotechnology, 2012; Perez-Riverol et al., Bioinformatics, 2015; Perez-Riverol et al., MCP, 2016
  21. PX Submission Tool • Desktop application for data submissions to ProteomeXchange via PRIDE • Implemented in Java 7 • Streamlines the submission process • Captures mappings between files • Retains metadata • Fast file transfer with Aspera (FASP® transfer technology); FTP also available • Command line option (Submission tool screenshot)
  22. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  23. Datasets are being reused more and more… Data download volume for PRIDE Archive in 2015: 198 TB. (Chart: downloads in TB per year, 2013 to 2016.) Vaudel et al., Proteomics, 2016
  24. Data sharing in proteomics. Vaudel et al., Proteomics, 2016
  25. Draft human proteome papers published in 2014 (Nature cover, 29 May 2014) • Two independent groups claimed to have produced the first complete draft of the human proteome by MS. • Some of their findings are controversial and need further validation, but they generated a lot of discussion and put proteomics in the spotlight. • They used many different tissues. Wilhelm et al., Nature, 2014; Kim et al., Nature, 2014
  26. Draft human proteome papers published in 2014 • Around 60% of the data used for the analysis came from previous experiments, most of them stored in proteomics repositories such as PRIDE/ProteomeXchange, PASSEL or MassIVE. • They complemented that data with “exotic” tissues. Wilhelm et al., Nature, 2014
  27. Data sharing in proteomics. Vaudel et al., Proteomics, 2016
  28. Examples of repurposing in proteogenomics
  29. Data sharing in proteomics
  30. Challenges for data reuse in proteomics • Insufficient technical and biological metadata. • Large computational infrastructure may be needed (e.g. when analysing many datasets together). • Shortage of expertise (people). • Lack of standardisation in the field.
  31. Summary of the talk so far • PRIDE Archive and other ProteomeXchange resources make data sharing possible in the MS proteomics field. • Data sharing is becoming the norm in the field. • Standalone tools: PRIDE Inspector and PX Submission Tool. • Datasets are increasingly reused (many opportunities): example of one of the drafts of the human proteome; proteogenomics approaches. • But there are important challenges as well.
  32. Overview • Intro: Concept of “Big data” in biology and proteomics • PRIDE Archive and ProteomeXchange • PRIDE tools • Reuse of public proteomics data • Working with “Big data”: PRIDE Cluster
  33. Data sharing in proteomics. Vaudel et al., Proteomics, 2016
  34. PRIDE Cluster: initial motivation • Provide a QC-filtered, peptide-centric view of PRIDE. • Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done). • Quality is heterogeneous, which makes the data difficult to compare. • Enable assessment of (published) proteomics data.
  35. PRIDE Cluster • Provide an aggregated, peptide-centric view of PRIDE Archive. • Hypothesis: the same peptide will generate similar MS/MS spectra across experiments. • Enables QC of peptide-spectrum matches (PSMs): reliable identifications are inferred by comparing the submitted identifications of the spectra within a cluster. • After clustering, a representative spectrum is built for all peptides consistently identified across different datasets. Griss et al., Nat. Methods, 2013; Griss et al., Nat. Methods, 2016
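The hypothesis that the same peptide yields similar MS/MS spectra can be made concrete with a similarity score. The sketch below is only an illustration: it assumes spectra are lists of (m/z, intensity) peaks, bins them at a width of 1.0 m/z and compares them with a cosine score. The binning, the score and the function names are assumptions made for this example, not the scoring actually used by PRIDE Cluster.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned

def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine of the angle between two binned spectra (0 = no shared bins)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up peak lists, for illustration only.
spectrum_1 = [(175.1, 100.0), (262.2, 50.0), (375.3, 80.0)]
spectrum_2 = [(175.2, 90.0), (262.1, 60.0), (375.4, 70.0)]
unrelated = [(500.0, 10.0)]

print(cosine_similarity(spectrum_1, spectrum_2))
print(cosine_similarity(spectrum_1, unrelated))
```

Spectra whose peaks fall into the same bins with similar relative intensities score close to 1, while spectra that share no bins score 0.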
  36. PRIDE Cluster: concept. Originally submitted identified spectra (peptide labels in the figure: NMMAACDPR, PPECPDFDPPR) go through spectrum clustering, and a consensus spectrum is built for each cluster. Threshold: at least 3 spectra in a cluster and ratio >70% (the ratio being the share of spectra in the cluster that carry the most common peptide identification).
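The threshold on this slide can be sketched in a few lines. The example assumes a cluster is represented simply as a list with one peptide sequence per spectrum, and that "ratio" means the share of the most frequent sequence. The function names and the example cluster are hypothetical.

```python
from collections import Counter

def most_common_peptide(identifications):
    """Return (peptide, ratio) for the most frequent sequence in a cluster."""
    counts = Counter(identifications)
    peptide, count = counts.most_common(1)[0]
    return peptide, count / len(identifications)

def is_reliable(identifications, min_spectra=3, min_ratio=0.70):
    """Apply the slide's threshold: at least 3 spectra and ratio > 70%."""
    if len(identifications) < min_spectra:
        return False
    _, ratio = most_common_peptide(identifications)
    return ratio > min_ratio

# Illustrative cluster: five spectra agree, one disagrees (ratio = 5/6).
cluster = ["NMMAACDPR"] * 5 + ["PPECPDFDPPR"]
print(most_common_peptide(cluster))
print(is_reliable(cluster))
```

Checking the size first means small clusters are rejected regardless of how well their identifications agree.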
  37. PRIDE Cluster: concept
  38. PRIDE Cluster: implementation • Griss et al., Nat. Methods, 2013 • Clustered all public, identified spectra in PRIDE • EBI compute farm, LSF • 20.7 M identified spectra • 610 CPU days, two calendar weeks • Validation, calibration • Feedback into PRIDE datasets
  39. PRIDE Cluster iteration 2: why? • PRIDE Archive has experienced a huge increase in data since 2013. • We wanted to develop an algorithm that could also work with unidentified spectra. (Chart: PRIDE Archive growth, submissions per year, all vs. complete submissions.)
  40. Parallelizing spectrum clustering: Hadoop • Hadoop is an open-source framework for parallel processing, based on the MapReduce programming model introduced by Google. • Optimizes work distribution among machines. • Solves many general issues of large parallel jobs: scheduling, inter-job communication, failures. https://hadoop.apache.org/
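To show what the map/reduce pattern looks like, here is a toy sketch in plain Python. It does not use the Hadoop API and is not the PRIDE Cluster job. Grouping spectra by a precursor m/z bin is an assumed example of a partitioning key, and all names and values are hypothetical.

```python
from collections import defaultdict

def map_phase(spectra, bin_width=1.0):
    """Emit (key, value) pairs: key = precursor m/z bin, value = spectrum id."""
    for spectrum_id, precursor_mz in spectra:
        yield int(precursor_mz // bin_width), spectrum_id

def shuffle(pairs):
    """Group values by key; a framework such as Hadoop does this between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Process each key independently; on a cluster, keys go to different nodes."""
    return {key: sorted(ids) for key, ids in grouped.items()}

# Made-up (spectrum id, precursor m/z) pairs.
spectra = [("s1", 500.2), ("s2", 500.7), ("s3", 612.3), ("s4", 500.4)]
result = reduce_phase(shuffle(map_phase(spectra)))
print(result)
```

Because each key is reduced independently, the reduce work can be spread over many machines, which is the property that makes the pattern attractive for clustering hundreds of millions of spectra.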
  41. PRIDE Cluster: second implementation
     • First implementation (Griss et al., Nat. Methods, 2013): clustered all public, identified spectra in PRIDE; EBI compute farm, LSF; 20.7 M identified spectra; 610 CPU days, two calendar weeks; validation, calibration; feedback into PRIDE datasets
     • Second implementation (Griss et al., Nat. Methods, 2016): clustered all public spectra in PRIDE by April 2015; Apache Hadoop; starting with 256 M spectra, of which 190 M unidentified (filtered to 111 M spectra that are likely to represent a peptide) and 66 M identified; result: 28 M clusters; 5 calendar days on a 30-node Hadoop cluster, 340 CPU cores
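The figures quoted for the second run can be cross-checked with simple arithmetic. The sketch below assumes that the filtered unidentified spectra plus all identified spectra formed the clustering input, which is one reading of the slide. It also computes only an upper bound on compute time, under the assumption that all 340 cores were busy for the full 5 days.

```python
def cluster_run_summary():
    """Cross-check the spectrum counts quoted on the slide (in millions)."""
    total, unidentified, identified = 256, 190, 66
    filtered_unidentified = 111
    return {
        "parts_match_total": unidentified + identified == total,
        # Assumed clustering input: filtered unidentified + identified spectra.
        "spectra_after_filtering_m": filtered_unidentified + identified,
        "unidentified_kept_fraction": round(filtered_unidentified / unidentified, 2),
        # Upper bound only: assumes every core was busy for the whole run.
        "max_core_days": 5 * 340,
    }

summary = cluster_run_summary()
print(summary)
```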
  42. Examples: one perfect cluster • 880 PSMs give the same peptide ID • 4 species • 28 datasets • Same instruments
  43. Examples: one perfect cluster (2)
  44. PRIDE Cluster (diagram labels): sequence-based search engines; spectrum clustering; incorrectly identified or unidentified spectra
  45. Output of the analysis • 1. Inconsistent spectrum clusters. • 2. Clusters including identified and unidentified spectra. • 3. Clusters containing only unidentified spectra.
  46. (1) Re-analysis of inconsistent clusters. Originally submitted identified spectra (peptide labels in the figure: NMMAACDPR, IGGIGTVPVGR, PPECPDFDPPR, VFDEFKPLVEEPQNLIK) go through spectrum clustering into a consensus spectrum. A cluster is inconsistent when no sequence has a proportion in the cluster >50%.
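The three kinds of output, together with the 50% rule for inconsistent clusters, can be sketched as a small classifier. The sketch assumes a cluster is a list with one entry per spectrum (a peptide sequence, or `None` for an unidentified spectrum) and that the proportion is computed over the identified spectra only. Both the representation and the category names are hypothetical.

```python
from collections import Counter

def classify_cluster(identifications):
    """Classify a cluster given one peptide sequence (or None) per spectrum."""
    identified = [p for p in identifications if p is not None]
    if not identified:
        return "unidentified_only"
    top_count = Counter(identified).most_common(1)[0][1]
    # Inconsistent: no sequence has a proportion above 50%.
    if top_count / len(identified) <= 0.5:
        return "inconsistent"
    if len(identified) < len(identifications):
        return "identified_and_unidentified"
    return "consistent"

# Peptide labels in the order they appear in the slide text (4 of 8 agree).
slide_example = [
    "NMMAACDPR", "NMMAACDPR", "IGGIGTVPVGR", "NMMAACDPR",
    "PPECPDFDPPR", "VFDEFKPLVEEPQNLIK", "NMMAACDPR", "IGGIGTVPVGR",
]
print(classify_cluster(slide_example))
```

With four of eight labels agreeing, the top proportion is exactly 50%, which does not exceed the threshold, so the example is classed as inconsistent.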
  47. (1) Re-analysis of inconsistent clusters • Re-analysed 3,997 large (>100 spectra), inconsistent clusters with PepNovo, SpectraST and X!Tandem. • 453 clusters (11%) were identified as peptides originating from keratins, trypsin, albumin and hemoglobin. • In these cases, it is likely that a contaminants database was not used in the original search.
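A rough way to flag this kind of contaminant is to match protein descriptions against the four protein families named on the slide. The keyword matching below is a deliberately crude assumption made for illustration. Real searches normally include a contaminant sequence database instead, and the function names are hypothetical.

```python
CONTAMINANT_KEYWORDS = ("keratin", "trypsin", "albumin", "hemoglobin")

def is_likely_contaminant(protein_description):
    """Crude check: does the description mention a common contaminant family?"""
    text = protein_description.lower()
    return any(keyword in text for keyword in CONTAMINANT_KEYWORDS)

def contaminant_fraction(descriptions):
    """Fraction of descriptions flagged as likely contaminants."""
    if not descriptions:
        return 0.0
    flagged = sum(is_likely_contaminant(d) for d in descriptions)
    return flagged / len(descriptions)

# The slide's proportion: 453 of 3,997 re-analysed clusters.
slide_fraction = round(453 / 3997, 2)
print(slide_fraction)
```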
  48. Validation
  49. (no text transcribed for this slide)
  50. (no text transcribed for this slide)
  51. (no text transcribed for this slide)
  52. (2) Inferring identifications for originally unidentified spectra • 9.1 M unidentified spectra were contained in clusters with a reliable identification. • These are candidate new identifications (that need to be confirmed), often missed due to search engine settings. • Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
  53. (3) Consistently unidentified clusters • 19 M clusters contain only unidentified spectra. • 41,155 of these clusters have more than 100 spectra (= 12 M spectra). • Most of them are likely to be derived from peptides. • They could correspond to PTMs or variant peptides. • With various methods, we found likely identifications for about 20%. • A vast amount of data mining remains to be done.
  54. (3) Consistently unidentified clusters
  55. PRIDE Cluster as a public data mining resource • http://www.ebi.ac.uk/pride/cluster • Spectral libraries for 16 species. • All clustering results, as well as specific subsets of interest, are available. • Source code (open source) and Java API.
  56. Public datasets from different omics: OmicsDI, http://www.ebi.ac.uk/Tools/omicsdi/ • Aims to integrate ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present). • Source databases shown: PRIDE, MassIVE, jPOST, PASSEL, GPMDB, ArrayExpress, Expression Atlas, MetaboLights, Metabolomics Workbench, GNPS, EGA. Perez-Riverol et al., 2016, bioRxiv
  57. OmicsDI: portal for omics datasets
  58. OmicsDI: portal for omics datasets
  59. OmicsDI: portal for omics datasets
  60. Summary part 2 • Using a “big data” approach we were able to get extra knowledge from all the public data in PRIDE Archive. • Spectrum clustering enables QC in proteomics resources such as PRIDE Archive. • It is possible to detect spectra that are consistently unidentified across hundreds of datasets (maybe peptide variants, or peptides with PTMs not initially considered). • OmicsDI: new platform to identify public datasets coming from different omics technologies (more possibilities for data reuse!)
  61. Acknowledgements: People. Attila Csordas, Tobias Ternent, Gerhard Mayer (de.NBI), Johannes Griss, Yasset Perez-Riverol, Manuel Bernal-Llinares, Andrew Jarnuczak, Enrique Perez. Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob. Acknowledgements: the PRIDE Team; all data submitters! @pride_ebi @proteomexchange
  62. Questions? http://www.slideshare.net/JuanAntonioVizcaino
