Developing open data analysis pipelines
in the cloud: Enabling the ‘big data’
revolution in proteomics
Dr. Juan Antonio Vizcaíno
EMBL-European Bioinformatics Institute (EMBL-EBI)
Hinxton, Cambridge, UK
E-mail: juan@ebi.ac.uk
Juan A. Vizcaíno
juan@ebi.ac.uk
BSPR Annual Scientific Meeting 2018
Bradford, 9 July 2018
Overview
• Short introduction to PRIDE and ProteomeXchange
• Enabling public data re-use
• Open analysis pipelines for proteomics data
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored
• From July 2017, an ELIXIR core resource
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
[Diagram labels: raw data, identification and quantification (ID & Quant), metadata]
ProteomeXchange: A global, distributed proteomics database
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017
http://www.proteomexchange.org
Standard data submission and data dissemination practices between the
main proteomics repositories
New
member
Mandatory data deposition
PRIDE data submissions and data growth
May 2018 (320 datasets) was again a record month in terms of datasets submitted
Datasets submitted per month
PRIDE contains >85% of all ProteomeXchange datasets
Dataset PXD010000 reached on June 1st
Stats: Data growth in EMBL-EBI resources
Sequence data
Micro-array
Metabolomics
Proteomics
We are already in a “Big data” age for proteomics
“Big data” is a term used to describe data collections where processing becomes
problematic due to any combination of their size (Volume), frequency of update
(Velocity), diversity (Variety), or level of inaccuracy (Veracity).
Slide from: http://www.ibmbigdatahub.com/
Overview
• Short introduction to PRIDE and ProteomeXchange
• Enabling public data re-use
• Open analysis pipelines for proteomics data
Public Data Reuse in Proteomics
Vaudel et al., Proteomics, 2016
Data re-use in proteomics keeps increasing
Data download volume for PRIDE in 2017: 295 TB
[Chart: PRIDE data downloads per year in TB, 2013 to 2017]
How can public data be re-used?
Martens & Vizcaíno, Trends Biochem Sci, 2017
Vaudel et al., Proteomics, 2016
Overview
• Short introduction to PRIDE and ProteomeXchange
• Enabling public data re-use
• Open analysis pipelines for proteomics data
Motivation
• Make analysis scalable
• Facilitate re-use of public data (e.g. PRIDE datasets)
• Also applicable to datasets produced by individual labs, since they
are becoming larger and larger
• Make analysis reproducible
• Make pipelines openly available to everyone in the community
Infrastructure as a Service (IaaS):
• Compute power is remote
• Still from compute centres, but on a larger scale
• Large local IT infrastructure is then not required
• Provides scalability
• Will enable wide availability of the pipelines
Moving pipelines to a cloud environment
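One way to read the scalability point: the analysis of each MS run is independent of the others, so the work can be fanned out over however many workers the infrastructure provides. The sketch below illustrates that pattern only. It uses a local thread pool as a stand-in for remote compute, and `analyse_all`, `analyse_one` and the file names are hypothetical, not part of any pipeline described in the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def analyse_all(run_files, analyse_one, max_workers=4):
    """Apply analyse_one to every run file, using up to max_workers workers.

    analyse_one is any callable that takes one file name and returns a
    result. Results are returned as a dict keyed by file name.
    """
    files = list(run_files)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        results = list(pool.map(analyse_one, files))
    return dict(zip(files, results))
```

Raising `max_workers` is the only change needed to use more compute, which is the property a cloud deployment is meant to exploit at a much larger scale.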
Cloud computing is increasingly used in genomics
Simplified workflow launcher:
• For AWS
• Enabling co-analysis with the larger PanCancer dataset
http://icgc.org/working-pancancer-data-aws
Supporting Reproducible Science
http://www.nature.com/nature/focus/reproducibility/
Making data analysis pipelines reproducible
• That means using:
• Exactly the same software (including the same version), in the
same order.
• The same protein sequence database (including the same
version).
• If we use the same files as input to the software, we will get
EXACTLY the same results.
• If that’s not the case, something has gone wrong.
• Computers are much more reliable than people.
Picture from: http://www.drugdesigntech.com/how-maximize-the-reproducibility-of-your-research/
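The conditions above (same software and versions in the same order, same sequence database, same input files) can be written down and checked mechanically. The following is a minimal sketch, assuming each run is described by a small manifest. The function names and the manifest layout are illustrative assumptions, not something taken from the pipelines in this talk.

```python
import hashlib
import json


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(steps, database, inputs):
    """Describe one analysis run.

    steps:    ordered list of (tool_name, version) pairs; order is kept
    database: path to the protein sequence database file
    inputs:   paths to the input files
    """
    return {
        "steps": [{"tool": name, "version": version} for name, version in steps],
        "database_sha256": sha256_of(database),
        "inputs_sha256": sorted(sha256_of(path) for path in inputs),
    }


def manifest_id(manifest):
    """Fingerprint of a manifest; equal fingerprints mean an identical setup."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

If two runs have the same `manifest_id` but different results, then by the argument on this slide something has gone wrong in the execution rather than in the setup.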
Develop exemplary proteomics data analysis workflows and deploy
them in the EMBL-EBI “Embassy Cloud”:
(1) Standard identification workflow
(2) Identification workflow for PTMs
(3) Quantification (label-free/label-based approaches)
(4) Quality Control (to aid data set interpretation/reanalysis evaluation)
(5) Versions of quantification approaches (including PTMs)
→ Connected to public proteomics data from
Starting with DDA data
We are using the framework
Features:
• Tool modularisation
• Solutions for data handover between tools with standardised
(PSI) formats
• Adapters for integrating third-party software (search engines,
LuciPHOr, Fido, Percolator, etc.)
• Integration into various workflow systems as a basis
Analysis software used so far (more to come)
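The idea of modular tools that hand data over in standardised formats can be sketched as follows. This is not the API of the framework named on the slide. `Step`, `check_handover` and `run_pipeline` are hypothetical names, and the format labels are plain strings chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    name: str
    consumes: str              # format expected as input, e.g. "mzML"
    produces: str              # format written as output, e.g. "mzIdentML"
    run: Callable[[str], str]  # takes an input path, returns an output path


def check_handover(steps: List[Step]) -> None:
    """Raise ValueError if a step's output format does not match the next step's input."""
    for upstream, downstream in zip(steps, steps[1:]):
        if upstream.produces != downstream.consumes:
            raise ValueError(
                f"{upstream.name} produces {upstream.produces}, "
                f"but {downstream.name} consumes {downstream.consumes}"
            )


def run_pipeline(steps: List[Step], first_input: str) -> str:
    """Validate the chain, then run each step on the previous step's output."""
    check_handover(steps)
    current = first_input
    for step in steps:
        current = step.run(current)
    return current
```

Because each step only declares the format it consumes and produces, a third-party tool can be added by wrapping it in a `Step` (the adapter role mentioned above) without changing the other steps.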
Open source infrastructure: Galaxy, Docker, Kubernetes, R, OpenStack, etc.
Open source analysis software: OpenMS as a starting point
Simple interface planned for parameter selection
• The workflows have been deployed into the EMBL-EBI
“Embassy Cloud” Portal and fitted with a dashboard as a proof
of concept
• PRIDE data connection into the cloud is being optimised (no
URL available yet)
• Only a first step. Lots of work to do: benchmarking, interface for
users, technical optimisations, etc.
Current status of the DDA pipelines
Open analysis pipelines: DIA approaches
• We have started working on the development of open data analysis
pipelines for DIA experiments (the main focus is on
SWATH-MS experiments).
• In collaboration with the Stoller Centre (Manchester) (B. Graham, T.
Hubbard & P. Townsend).
• Initial pipeline is based on OpenSWATH, developed at the Stoller Centre.
• We are currently migrating it to the EBI “Embassy Cloud” (same
infrastructure developed for the DDA pipelines).
Open analysis pipelines: proteogenomics
• Work in collaboration with J. Choudhary (Institute of Cancer Research,
London). Project just started in June 2018.
• Two types of pipelines to be developed:
• To improve genome annotation.
• To enable personalised medicine approaches.
• Pipelines based on the OpenMS framework.
Weisser et al., JPR, 2016
Summary
• Public proteomics datasets are on the rise! Reliable (widely used)
infrastructure now exists: PRIDE and ProteomeXchange
• A lot of possibilities open for reuse of this data
• Development of data analysis pipelines in the cloud:
• Enable public data reuse, pipelines connected to PRIDE
• Scalability
• Reproducibility
• Make them available to everyone in the community
Acknowledgements: People
Yasset Perez-Riverol
Johannes Griss
Suresh Hewapathirana
Tobias Ternent
Jingwen Bai
Attila Csordas
Deepti Jaiswal
Andrew Jarnuczak
Mathias Walzer
Gerhard Mayer (de.NBI)
Acknowledgements
All data submitters !!!
@pride_ebi
@proteomexchange
Bobby Graham, Simon Hubbard, Paul Brack and others (Stoller Centre)
Jyoti Choudhary & James Wright (ICR)

Mais conteúdo relacionado

Mais procurados

re3data.org – Registry of Research Data Repositories
re3data.org – Registry of Research Data Repositoriesre3data.org – Registry of Research Data Repositories
re3data.org – Registry of Research Data RepositoriesHeinz Pampel
 
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Juan Antonio Vizcaino
 
Role of Data Accessibility During Pandemic
Role of Data Accessibility During PandemicRole of Data Accessibility During Pandemic
Role of Data Accessibility During PandemicDatabricks
 
20200130_Mannocci_OpenAIRE_ResearchGraph
20200130_Mannocci_OpenAIRE_ResearchGraph20200130_Mannocci_OpenAIRE_ResearchGraph
20200130_Mannocci_OpenAIRE_ResearchGraphOpenAIRE
 
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET
 
A Justification-based Semantic Framework for Representing, Evaluating and Uti...
A Justification-based Semantic Framework for Representing, Evaluating and Uti...A Justification-based Semantic Framework for Representing, Evaluating and Uti...
A Justification-based Semantic Framework for Representing, Evaluating and Uti...Kerstin Forsberg
 
Lankade data Vinnova webbinarium
Lankade data Vinnova webbinarium Lankade data Vinnova webbinarium
Lankade data Vinnova webbinarium Kerstin Forsberg
 
Linked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareLinked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareKerstin Forsberg
 
FedCentric_Presentation
FedCentric_PresentationFedCentric_Presentation
FedCentric_PresentationYatpang Cheung
 
PubChem for chemical information literacy training
PubChem for chemical information literacy trainingPubChem for chemical information literacy training
PubChem for chemical information literacy trainingSunghwan Kim
 
re3data - Registry of Research Data Repositories
re3data -  Registry of Research Data Repositoriesre3data -  Registry of Research Data Repositories
re3data - Registry of Research Data RepositoriesHeinz Pampel
 
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...Lisette Giepmans
 
Pushing back, standards and standard organizations in a Semantic Web enabled ...
Pushing back, standards and standard organizations in a Semantic Web enabled ...Pushing back, standards and standard organizations in a Semantic Web enabled ...
Pushing back, standards and standard organizations in a Semantic Web enabled ...Kerstin Forsberg
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?Juan Antonio Vizcaino
 
Implications of the Fourth Paradigm
Implications of the Fourth ParadigmImplications of the Fourth Paradigm
Implications of the Fourth ParadigmPhilip Bourne
 
The Fourth Paradigm - Deltares Data Science Day, 31 October 2014
The Fourth Paradigm - Deltares Data Science Day, 31 October 2014The Fourth Paradigm - Deltares Data Science Day, 31 October 2014
The Fourth Paradigm - Deltares Data Science Day, 31 October 2014Microsoft Azure for Research
 

Mais procurados (20)

re3data.org – Registry of Research Data Repositories
re3data.org – Registry of Research Data Repositoriesre3data.org – Registry of Research Data Repositories
re3data.org – Registry of Research Data Repositories
 
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...Public proteomics data: a (mostly unexploited) gold mine for computational re...
Public proteomics data: a (mostly unexploited) gold mine for computational re...
 
Role of Data Accessibility During Pandemic
Role of Data Accessibility During PandemicRole of Data Accessibility During Pandemic
Role of Data Accessibility During Pandemic
 
Scholze liber 2015-06-25_final
Scholze liber 2015-06-25_finalScholze liber 2015-06-25_final
Scholze liber 2015-06-25_final
 
20200130_Mannocci_OpenAIRE_ResearchGraph
20200130_Mannocci_OpenAIRE_ResearchGraph20200130_Mannocci_OpenAIRE_ResearchGraph
20200130_Mannocci_OpenAIRE_ResearchGraph
 
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
 
A Justification-based Semantic Framework for Representing, Evaluating and Uti...
A Justification-based Semantic Framework for Representing, Evaluating and Uti...A Justification-based Semantic Framework for Representing, Evaluating and Uti...
A Justification-based Semantic Framework for Representing, Evaluating and Uti...
 
Lankade data Vinnova webbinarium
Lankade data Vinnova webbinarium Lankade data Vinnova webbinarium
Lankade data Vinnova webbinarium
 
Linked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareLinked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcare
 
FedCentric_Presentation
FedCentric_PresentationFedCentric_Presentation
FedCentric_Presentation
 
PubChem for chemical information literacy training
PubChem for chemical information literacy trainingPubChem for chemical information literacy training
PubChem for chemical information literacy training
 
Adding complex expert knowledge into chemical database and transforming surfa...
Adding complex expert knowledge into chemical database and transforming surfa...Adding complex expert knowledge into chemical database and transforming surfa...
Adding complex expert knowledge into chemical database and transforming surfa...
 
Structure identification by Mass Spectrometry Non-Targeted Analysis using the...
Structure identification by Mass Spectrometry Non-Targeted Analysis using the...Structure identification by Mass Spectrometry Non-Targeted Analysis using the...
Structure identification by Mass Spectrometry Non-Targeted Analysis using the...
 
re3data - Registry of Research Data Repositories
re3data -  Registry of Research Data Repositoriesre3data -  Registry of Research Data Repositories
re3data - Registry of Research Data Repositories
 
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...
 
Open Science - Global Perspectives/Simon Hodson
Open Science - Global Perspectives/Simon HodsonOpen Science - Global Perspectives/Simon Hodson
Open Science - Global Perspectives/Simon Hodson
 
Pushing back, standards and standard organizations in a Semantic Web enabled ...
Pushing back, standards and standard organizations in a Semantic Web enabled ...Pushing back, standards and standard organizations in a Semantic Web enabled ...
Pushing back, standards and standard organizations in a Semantic Web enabled ...
 
How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?How to run and maintain a popular biological data repository?
How to run and maintain a popular biological data repository?
 
Implications of the Fourth Paradigm
Implications of the Fourth ParadigmImplications of the Fourth Paradigm
Implications of the Fourth Paradigm
 
The Fourth Paradigm - Deltares Data Science Day, 31 October 2014
The Fourth Paradigm - Deltares Data Science Day, 31 October 2014The Fourth Paradigm - Deltares Data Science Day, 31 October 2014
The Fourth Paradigm - Deltares Data Science Day, 31 October 2014
 

Semelhante a Developing open data analysis pipelines in the cloud: Enabling the ‘big data’ revolution in proteomics

An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...Juan Antonio Vizcaino
 
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...Juan Antonio Vizcaino
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsJuan Antonio Vizcaino
 
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics dataPRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics dataJuan Antonio Vizcaino
 
ProteomeXchange_and_PRIDE_Semmeting_2015
ProteomeXchange_and_PRIDE_Semmeting_2015ProteomeXchange_and_PRIDE_Semmeting_2015
ProteomeXchange_and_PRIDE_Semmeting_2015Juan Antonio Vizcaino
 
Big Data in Genomics: Opportunities and Challenges
Big Data in Genomics: Opportunities and ChallengesBig Data in Genomics: Opportunities and Challenges
Big Data in Genomics: Opportunities and ChallengesMatthieu Schapranow
 
Big Data and its Role in Biomedical Research
Big Data and its Role in Biomedical ResearchBig Data and its Role in Biomedical Research
Big Data and its Role in Biomedical ResearchPhilip Bourne
 
cBioPortal Webinar Slides (3/3)
cBioPortal Webinar Slides (3/3)cBioPortal Webinar Slides (3/3)
cBioPortal Webinar Slides (3/3)Pistoia Alliance
 
Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Juan Antonio Vizcaino
 

Semelhante a Developing open data analysis pipelines in the cloud: Enabling the ‘big data’ revolution in proteomics (20)

ProteomeXchange update
ProteomeXchange updateProteomeXchange update
ProteomeXchange update
 
An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...An overview of the PRIDE ecosystem of resources and computational tools for m...
An overview of the PRIDE ecosystem of resources and computational tools for m...
 
Pride and ProteomeXchange
Pride and ProteomeXchangePride and ProteomeXchange
Pride and ProteomeXchange
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasets
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
PRIDE-ProteomeXchange
PRIDE-ProteomeXchangePRIDE-ProteomeXchange
PRIDE-ProteomeXchange
 
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics dataPRIDE and ProteomeXchange: A golden age for working with public proteomics data
PRIDE and ProteomeXchange: A golden age for working with public proteomics data
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016ProteomeXchange update HUPO 2016
ProteomeXchange update HUPO 2016
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
ProteomeXchange_and_PRIDE_Semmeting_2015
ProteomeXchange_and_PRIDE_Semmeting_2015ProteomeXchange_and_PRIDE_Semmeting_2015
ProteomeXchange_and_PRIDE_Semmeting_2015
 
Big Data in Genomics: Opportunities and Challenges
Big Data in Genomics: Opportunities and ChallengesBig Data in Genomics: Opportunities and Challenges
Big Data in Genomics: Opportunities and Challenges
 
Big Data and its Role in Biomedical Research
Big Data and its Role in Biomedical ResearchBig Data and its Role in Biomedical Research
Big Data and its Role in Biomedical Research
 
cBioPortal Webinar Slides (3/3)
cBioPortal Webinar Slides (3/3)cBioPortal Webinar Slides (3/3)
cBioPortal Webinar Slides (3/3)
 
The ELIXIR Proteomics community
The ELIXIR Proteomics community The ELIXIR Proteomics community
The ELIXIR Proteomics community
 
Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...Reusing and integrating public proteomics data to improve our knowledge of th...
Reusing and integrating public proteomics data to improve our knowledge of th...
 
ProteomeXchange update
ProteomeXchange updateProteomeXchange update
ProteomeXchange update
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 

Mais de Juan Antonio Vizcaino

Introduction to the PSI standard data formats
Introduction to the PSI standard data formatsIntroduction to the PSI standard data formats
Introduction to the PSI standard data formatsJuan Antonio Vizcaino
 
Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Juan Antonio Vizcaino
 
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...Juan Antonio Vizcaino
 
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...Juan Antonio Vizcaino
 
The ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateThe ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateJuan Antonio Vizcaino
 
Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Juan Antonio Vizcaino
 
Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?Juan Antonio Vizcaino
 
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...Juan Antonio Vizcaino
 
Enabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataEnabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataJuan Antonio Vizcaino
 
Introduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRIntroduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRJuan Antonio Vizcaino
 
The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)Juan Antonio Vizcaino
 
Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016Juan Antonio Vizcaino
 

Mais de Juan Antonio Vizcaino (20)

Introduction to the PSI standard data formats
Introduction to the PSI standard data formatsIntroduction to the PSI standard data formats
Introduction to the PSI standard data formats
 
PRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchangePRIDE resources and ProteomeXchange
PRIDE resources and ProteomeXchange
 
Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018Introduction to the Proteomics Bioinformatics Course 2018
Introduction to the Proteomics Bioinformatics Course 2018
 
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...
 
PSI-Proteome Informatics update
PSI-Proteome Informatics updatePSI-Proteome Informatics update
PSI-Proteome Informatics update
 
The ELIXIR Proteomics Community
The ELIXIR Proteomics CommunityThe ELIXIR Proteomics Community
The ELIXIR Proteomics Community
 
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...A proteomics data “gold mine” at your disposal: Now that the data is there, w...
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
 
The ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 updateThe ProteomeXchange Consoritum: 2017 update
The ProteomeXchange Consoritum: 2017 update
 
Reuse of public proteomics data
Reuse of public proteomics dataReuse of public proteomics data
Reuse of public proteomics data
 
Proteomics repositories
Proteomics repositoriesProteomics repositories
Proteomics repositories
 
Proteomics data standards
Proteomics data standardsProteomics data standards
Proteomics data standards
 
Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017Introduction to the Proteomics Bioinformatics Course 2017
Introduction to the Proteomics Bioinformatics Course 2017
 
Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?Is it feasible to identify novel biomarkers by mining public proteomics data?
Is it feasible to identify novel biomarkers by mining public proteomics data?
 
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...
 
ProteomeXchange update 2017
ProteomeXchange update 2017ProteomeXchange update 2017
ProteomeXchange update 2017
 
Enabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics dataEnabling automated processing and analysis of large-scale proteomics data
Enabling automated processing and analysis of large-scale proteomics data
 
Introduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIRIntroduction to EBI for Proteomics in ELIXIR
Introduction to EBI for Proteomics in ELIXIR
 
The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)The Proteomics Standards Initiative (PSI)
The Proteomics Standards Initiative (PSI)
 
Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016Introduction to the Proteomics Bioinformatics Course 2016
Introduction to the Proteomics Bioinformatics Course 2016
 
Reuse of public data in proteomics
Reuse of public data in proteomicsReuse of public data in proteomics
Reuse of public data in proteomics
 



Developing open data analysis pipelines in the cloud: Enabling the ‘big data’ revolution in proteomics

  • 1. Developing open data analysis pipelines in the cloud: Enabling the ‘big data’ revolution in proteomics. Dr. Juan Antonio Vizcaíno, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK. E-mail: juan@ebi.ac.uk
  • 2. Juan A. Vizcaíno, juan@ebi.ac.uk. BSPR Annual Scientific Meeting 2018, Bradford, 9 July 2018. Overview:
      • Short introduction to PRIDE and ProteomeXchange
      • Enabling public data re-use
      • Open analysis pipelines for proteomics data
  • 3. PRIDE (PRoteomics IDEntifications) database, http://www.ebi.ac.uk/pride/archive
      • PRIDE stores mass spectrometry (MS)-based proteomics data:
          • Peptide and protein expression data (identification and quantification)
          • Post-translational modifications
          • Mass spectra (raw data and peak lists)
          • Technical and biological metadata
          • Any other related information
      • Full support for tandem MS approaches
      • Any type of data can be stored
      • From July 2017, an ELIXIR core resource
      • References: Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2016
  • 4. ProteomeXchange: A global, distributed proteomics database, http://www.proteomexchange.org
      • Standard data submission and data dissemination practices between the main proteomics repositories
      • Mandatory data deposition
      • Figure labels: raw data, ID & quant, metadata, new member
      • References: Vizcaíno et al., Nat Biotechnol, 2014; Deutsch et al., NAR, 2017
  • 5. PRIDE data submissions and data growth
      • May 2018 (320 datasets) was again a record month in terms of datasets submitted
      • PRIDE contains >85% of all ProteomeXchange datasets
      • Dataset PXD010000 reached on June 1st
      • Figure: datasets submitted per month
  • 6. Stats: Data growth in EMBL-EBI resources
      • Figure series: sequence data, micro-array, metabolomics, proteomics
  • 7. We are already in a “Big data” age for proteomics
      • “Big data” is a term used to describe data collections where processing becomes problematic due to any combination of their size (Volume), frequency of update (Velocity), diversity (Variety), or level of inaccuracy (Veracity).
      • Slide from: http://www.ibmbigdatahub.com/
  • 8. Overview
      • Short introduction to PRIDE and ProteomeXchange
      • Enabling public data re-use
      • Open analysis pipelines for proteomics data
  • 9. Public Data Reuse in Proteomics (Vaudel et al., Proteomics, 2016)
  • 10. Data re-use in proteomics keeps increasing
      • Data download volume for PRIDE in 2017: 295 TB
      • Figure: downloads in TB per year, 2013 to 2017
  • 11. How can public data be re-used? (Martens & Vizcaíno, Trends Biochem Sci, 2017; Vaudel et al., Proteomics, 2016)
  • 12. Overview
      • Short introduction to PRIDE and ProteomeXchange
      • Enabling public data re-use
      • Open analysis pipelines for proteomics data
  • 13. Motivation
      • Make analysis scalable
      • Facilitate re-use of public data (e.g. PRIDE datasets)
      • Also applicable to datasets produced by individual labs, since they are becoming larger and larger
      • Make analysis reproducible
      • Make pipelines openly available to everyone in the community
  • 14. Moving pipelines to a cloud environment
      • Infrastructure as a Service (IaaS):
          • Compute power is remote
          • Still from compute centres, but on a larger scale
          • Large local IT infrastructure is then not required
          • Provides scalability
          • Will enable wide availability of the pipelines
  • 15. Cloud computing is increasingly used in genomics
      • Simplified workflow launcher:
          • For AWS
          • Enabling co-analysis with the larger PanCancer dataset
      • http://icgc.org/working-pancancer-data-aws
  • 16. Supporting Reproducible Science (http://www.nature.com/nature/focus/reproducibility/)
  • 17. Making data analysis pipelines reproducible
      • That means using:
          • Exactly the same software (including the same version), in the same order.
          • The same protein sequence database (including the same version).
      • If we use the same files as input to the software, we will get EXACTLY the same results.
      • If that’s not the case, something has gone wrong.
      • Computers are much more reliable than people.
      • Picture from: http://www.drugdesigntech.com/how-maximize-the-reproducibility-of-your-research/
  • 18. Develop exemplary proteomics data analysis workflows and deploy them in the EMBL-EBI “Embassy Cloud”:
      • (1) Standard identification workflow
      • (2) Identification workflow for PTMs
      • (3) Quantification (label-free/label-based approaches)
      • (4) Quality Control (to aid data set interpretation/reanalysis evaluation)
      • (5) Versions of quantification approaches (including PTMs)
      • → Connected to public proteomics data from PRIDE
      • Starting with DDA data
  • 19. We are using the OpenMS framework
      • Features:
          • Tool modularisation
          • Solutions for data handover between tools with standardised (PSI) formats
          • Adapters for integrating third-party software (search engines, LuciPHOr, FIDO, Percolator, etc.)
          • Integration into various workflow systems as a basis
      • Analysis software used so far (more to come)
  • 20. Infrastructure
      • Open source infrastructure: Galaxy, Docker, Kubernetes, R, OpenStack, etc.
      • Open source analysis software: OpenMS as a starting point
  • 21. Simple interface planned for parameter selection
  • 22. Current status of the DDA pipelines
      • The workflows have been deployed into the EMBL-EBI “Embassy Cloud” Portal and fitted with a dashboard as a proof of concept
      • PRIDE data connection into the cloud is being optimised (no URL available yet)
      • This is only a first step. Lots of work remains: benchmarking, interface for users, technical optimisations, etc.
  • 23. Open analysis pipelines: DIA approaches
      • We have started working on the development of open data analysis pipelines for DIA experiments (the main focus is on SWATH-MS experiments).
      • In collaboration with the Stoller Centre (Manchester) (B. Graham, T. Hubbard & P. Townsend).
      • Initial pipeline is based on OpenSWATH, developed in the Stoller.
      • We are currently migrating it to the EBI “Embassy Cloud” (same infrastructure developed for the DDA pipelines).
  • 24. Open analysis pipelines: proteogenomics
      • Work in collaboration with J. Choudhary (Institute of Cancer Research, London). Project just started in June 2018.
      • Two types of pipelines to be developed:
          • To improve genome annotation.
          • To enable personalised medicine approaches.
      • Pipelines based on the OpenMS framework.
      • Reference: Weisser et al., JPR, 2016
  • 25. Summary
      • Public proteomics datasets are on the rise! Reliable (widely used) infrastructure now exists: PRIDE and ProteomeXchange
      • A lot of possibilities open for reuse of this data
      • Development of data analysis pipelines in the cloud:
          • Enable public data reuse, pipelines connected to PRIDE
          • Scalability
          • Reproducibility
          • Make them available to everyone in the community
  • 26. Acknowledgements
      • People: Yasset Perez-Riverol, Johannes Griss, Suresh Hewapathirana, Tobias Ternent, Jingwen Bai, Attila Csordas, Deepti Jaiswal, Andrew Jarnuczak, Mathias Walzer, Gerhard Mayer (de.NBI)
      • Bobby Graham, Simon Hubbard, Paul Brack and others (Stoller Centre)
      • Jyoti Choudhary & James Wright (ICR)
      • All data submitters!
      • @pride_ebi, @proteomexchange
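
The reproducibility conditions given in the talk (the same software in the same versions and order, the same protein sequence database, the same input files) can be checked mechanically. The following is a minimal sketch of one way to do that, not the mechanism used in the pipelines described here: it records the ordered tool versions together with SHA-256 checksums of the inputs and the sequence database, then reduces that record to a single identifier. The function names `sha256_of`, `build_manifest` and `manifest_id`, and all manifest field names, are hypothetical and chosen only for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(steps, input_files, sequence_db):
    """Describe a run: ordered (tool, version) steps plus checksums of all inputs.

    The field names are hypothetical; the order of `steps` is preserved
    because running the same tools in a different order is a different analysis.
    """
    return {
        "steps": [{"tool": tool, "version": version} for tool, version in steps],
        "inputs": {Path(p).name: sha256_of(p) for p in sorted(input_files)},
        "sequence_db": {
            "name": Path(sequence_db).name,
            "sha256": sha256_of(sequence_db),
        },
    }


def manifest_id(manifest):
    """Hash the manifest itself so two runs can be compared with one string."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

Two runs whose `manifest_id` values match used the same tools, database and input files. If their results still differ, that points to a source of variation the manifest does not capture, such as tool parameters or the execution environment, which would need to be recorded as well.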