Developing open data analysis pipelines in the cloud: Enabling the ‘big data’ revolution in proteomics
Dr. Juan Antonio Vizcaíno
EMBL-European Bioinformatics Institute (EMBL-EBI)
Hinxton, Cambridge, UK
E-mail: juan@ebi.ac.uk
2. Juan A. Vizcaíno
juan@ebi.ac.uk
BSPR Annual Scientific Meeting 2018
Bradford, 9 July 2018
Overview
• Short introduction to PRIDE and ProteomeXchange
• Enabling public data re-use
• Open analysis pipelines for proteomics data
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored
• Since July 2017, an ELIXIR core resource
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
ProteomeXchange: A global, distributed proteomics database
http://www.proteomexchange.org
Standard data submission and data dissemination practices between the main proteomics repositories
[Figure: ProteomeXchange data flow. Labels: Raw data; ID & Quant; Metadata; New member; Mandatory data deposition]
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017
PRIDE data submissions and data growth
May 2018 (320 datasets) was again a record month in terms of datasets submitted
Datasets submitted per month
PRIDE contains >85% of all ProteomeXchange datasets
Dataset PXD010000 reached on June 1st
Stats: Data growth in EMBL-EBI resources
[Figure: data growth by data type. Series: sequence data, microarray, metabolomics, proteomics]
We are already in a “Big data” age for proteomics
“Big data” is a term used to describe data collections where processing becomes problematic due to any combination of their size (Volume), frequency of update (Velocity), diversity (Variety), or level of inaccuracy (Veracity).
Slide from: http://www.ibmbigdatahub.com/
Overview
• Short introduction to PRIDE and ProteomeXchange
• Enabling public data re-use
• Open analysis pipelines for proteomics data
Public Data Reuse in Proteomics
Vaudel et al., Proteomics, 2016
Data re-use in proteomics keeps increasing
Data download volume for PRIDE in 2017: 295 TB
[Figure: PRIDE data download volume per year, 2013 to 2017. Y-axis: downloads in TB, 0 to 350]
How can public data be re-used?
Martens & Vizcaíno, Trends Biochem Sci, 2017
Vaudel et al., Proteomics, 2016
Overview
• Short introduction to PRIDE and ProteomeXchange
• Enabling public data re-use
• Open analysis pipelines for proteomics data
Motivation
• Make analysis scalable
• Facilitate re-use of public data (e.g. PRIDE datasets)
• Also applicable to datasets produced by individual labs, since they are becoming larger and larger
• Make analysis reproducible
• Make pipelines openly available to everyone in the community
Moving pipelines to a cloud environment
Infrastructure as a Service (IaaS):
• Compute power is remote
• Still from compute centres, but on a larger scale
• Large local IT infrastructure is then not required
• Provides scalability
• Will enable wide availability of the pipelines
Cloud computing is increasingly used in genomics
Simplified workflow launcher:
• For AWS
• Enabling co-analysis with the larger PanCancer dataset
http://icgc.org/working-pancancer-data-aws
Supporting Reproducible Science
http://www.nature.com/nature/focus/reproducibility/
Making data analysis pipelines reproducible
• That means using:
• Exactly the same software (including the same version), in the same order.
• The same protein sequence database (including the same version).
• If we use the same files as input to the software, we will get EXACTLY the same results.
• If that’s not the case, something has gone wrong.
• Computers are much more reliable than people.
Picture from: http://www.drugdesigntech.com/how-maximize-the-reproducibility-of-your-research/
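The reproducibility conditions above (same software versions in the same order, same sequence database, same input files) can be illustrated with a small run manifest. This is only a sketch under stated assumptions: the function names `sha256_of` and `build_manifest`, the manifest layout, and the tool names used in the usage note are hypothetical and are not part of the pipelines described in this talk.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Tuple


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(
    tools: List[Tuple[str, str]],  # (tool name, version), in execution order
    inputs: Iterable[Path],        # input files, e.g. spectra
    database: Path,                # protein sequence database file
) -> dict:
    """Hypothetical manifest capturing what must match between two runs."""
    return {
        # A list keeps the execution order, so reordered tools compare unequal.
        "tools": [[name, version] for name, version in tools],
        # Content hashes detect any change in the input files.
        "inputs": {p.name: sha256_of(p) for p in sorted(inputs)},
        # The database is hashed too, so a different release is detected.
        "database": {database.name: sha256_of(database)},
    }
```

Two runs can then be compared with `build_manifest(...) == build_manifest(...)`: equal manifests mean the tool list, tool order, inputs and database all match, which is the precondition the slide gives for expecting identical results. Hashing the output files in the same way would be one way to check that the results really are identical.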
Develop exemplary proteomics data analysis workflows and deploy them in the EMBL-EBI “Embassy Cloud”:
(1) Standard identification workflow
(2) Identification workflow for PTMs
(3) Quantification (label-free/label-based approaches)
(4) Quality Control (to aid data set interpretation/reanalysis evaluation)
(5) Versions of quantification approaches (including PTMs)
→ Connected to public proteomics data from [logo in original slide]
Starting with DDA data
We are using the framework
Features:
• Tool modularisation
• Solutions for data handover between tools with standardised (PSI) formats
• Adapters for integrating third-party software (search engines, LuciPHOr, FIDO, Percolator, etc.)
• Integration into various workflow systems as a basis
Analysis software used so far (more to come)
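The ideas of tool modularisation and data handover through standardised formats can be sketched as a chain of steps in which each step declares the format it reads and the format it writes. This is a minimal illustration, not the framework's actual API: the `Step` class, `validate_chain` function and the step names are hypothetical. The format names (mzML, mzIdentML, mzTab) are existing PSI standards, but their assignment to these steps is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    name: str      # hypothetical tool name
    consumes: str  # file format the tool reads
    produces: str  # file format the tool writes


def validate_chain(steps: List[Step], input_format: str) -> str:
    """Check that each step can read what the previous one wrote.

    Returns the format produced by the last step, or raises ValueError
    at the first step whose input format does not match.
    """
    current = input_format
    for step in steps:
        if step.consumes != current:
            raise ValueError(
                f"{step.name} expects {step.consumes}, "
                f"but the previous step produced {current}"
            )
        current = step.produces
    return current


# Illustrative identification workflow; step names are placeholders.
pipeline = [
    Step("peak_picking", "mzML", "mzML"),
    Step("database_search", "mzML", "mzIdentML"),
    Step("fdr_filtering", "mzIdentML", "mzIdentML"),
    Step("export_summary", "mzIdentML", "mzTab"),
]
final_format = validate_chain(pipeline, "mzML")
```

Because every handover uses a declared format, a step can be swapped for another tool with the same input and output formats without touching the rest of the chain, which is the practical benefit of modular tools with standard interfaces.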
Infrastructure
Open source infrastructure: Galaxy, Docker, Kubernetes, R, OpenStack, etc.
Open source analysis software: OpenMS as a starting point
Current status of the DDA pipelines
• The workflows have been deployed into the EMBL-EBI “Embassy Cloud” Portal and fitted with a dashboard as a proof of concept
• PRIDE data connection into the cloud is being optimised (no URL available yet)
• Only a first step. Lots of work to do: benchmarking, interface for users, technical optimisations, etc.
Open analysis pipelines: DIA approaches
• We have started working on the development of open data analysis pipelines for DIA experiments (the main focus is on SWATH-MS experiments).
• In collaboration with the Stoller Centre (Manchester) (B. Graham, T. Hubbard & P. Townsend).
• The initial pipeline, developed at the Stoller Centre, is based on OpenSWATH.
• We are currently migrating it to the EBI “Embassy Cloud” (same infrastructure developed for the DDA pipelines).
Open analysis pipelines: proteogenomics
• Work in collaboration with J. Choudhary (Institute of Cancer Research, London). The project started in June 2018.
• Two types of pipelines to be developed:
• To improve genome annotation.
• To enable personalised medicine approaches.
• Pipelines based on the OpenMS framework.
Weisser et al., JPR, 2016
Summary
• Public proteomics datasets are on the rise! Reliable (widely used) infrastructure now exists: PRIDE and ProteomeXchange
• Many possibilities are open for reuse of these data
• Development of data analysis pipelines in the cloud:
• Enable public data reuse, pipelines connected to PRIDE
• Scalability
• Reproducibility
• Make them available to everyone in the community
Acknowledgements: People
Yasset Perez-Riverol
Johannes Griss
Suresh Hewapathirana
Tobias Ternent
Jingwen Bai
Attila Csordas
Deepti Jaiswal
Andrew Jarnuczak
Mathias Walzer
Gerhard Mayer (de.NBI)
Acknowledgements
All data submitters !!!
@pride_ebi
@proteomexchange
Bobby Graham, Simon Hubbard, Paul Brack and others (Stoller Centre)
Jyoti Choudhary & James Wright (ICR)