Developing open data analysis pipelines in the cloud: Enabling the ‘big data’ revolution in proteomics

Dr. Juan Antonio Vizcaíno
EMBL-European Bioinformatics Institute (EMBL-EBI)
Hinxton, Cambridge, UK
E-mail: juan@ebi.ac.uk

Juan A. Vizcaíno
juan@ebi.ac.uk
BSPR Annual Scientific Meeting 2018
Bradford, 9 July 2018
Overview
• Short introduction to PRIDE and ProteomeXchange
• Enabling public data re-use
• Open analysis pipelines for proteomics data

• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored
• Since July 2017, an ELIXIR core resource
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016

[Figure labels: raw data; ID & quant; metadata]
ProteomeXchange: A global, distributed proteomics database
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017
http://www.proteomexchange.org
Standard data submission and data dissemination practices between the
main proteomics repositories
[Figure label: new member]
Mandatory data deposition

PRIDE data submissions and data growth
May 2018 (320 datasets) was again a record month in terms of datasets submitted
Datasets submitted per month
PRIDE contains >85% of all ProteomeXchange datasets
Dataset PXD010000 reached on June 1st

Stats: Data growth in EMBL-EBI resources
[Chart series: sequence data, microarray, metabolomics, proteomics]

We are already in a “Big data” age for proteomics
“Big data” is a term used to describe data collections where processing becomes
problematic due to any combination of their size (Volume), frequency of update
(Velocity), diversity (Variety), or level of inaccuracy (Veracity).
Slide from: http://www.ibmbigdatahub.com/

Public Data Reuse in Proteomics
Vaudel et al., Proteomics, 2016

Data re-use in proteomics keeps increasing
Data download volume for PRIDE in 2017: 295 TB
[Bar chart: PRIDE downloads in TB per year, 2013 to 2017]

How can public data be re-used?
Martens & Vizcaíno, Trends Biochem Sci, 2017
Vaudel et al., Proteomics, 2016

Motivation
• Make analysis scalable
• Facilitate re-use of public data (e.g. PRIDE datasets)
• Also applicable to datasets produced by individual labs, since they
are becoming larger and larger
• Make analysis reproducible
• Make pipelines openly available to everyone in the community

Infrastructure as a Service (IaaS):
• Compute power is remote
• Still from compute centres, but on a larger scale
• Large local IT infrastructure is then not required
• Provides scalability
• Will enable wide availability of the pipelines
Moving pipelines to a cloud environment

Cloud computing is increasingly used in genomics
Simplified workflow launcher:
• For AWS
• Enabling co-analysis with the larger PanCancer dataset
http://icgc.org/working-pancancer-data-aws

Supporting Reproducible Science
http://www.nature.com/nature/focus/reproducibility/

Making data analysis pipelines reproducible
• That means using:
• Exactly the same software (including the same version), in the
same order.
• The same protein sequence database (including the same
version).
• If we use the same files as input to the software, we will get
EXACTLY the same results.
• If that’s not the case, something has gone wrong.
• Computers are much more reliable than people.
Picture from: http://www.drugdesigntech.com/how-maximize-the-reproducibility-of-your-research/
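The conditions listed above (same software and versions in the same order, same sequence database, same input files) can be checked mechanically. The following is a minimal sketch, not part of the pipelines described in this talk: it assumes a run can be summarised by an ordered list of tool names and versions plus checksums of the files involved. The function names `sha256_of_file` and `run_fingerprint` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_fingerprint(steps, database, inputs) -> str:
    """Hash everything that should be identical between two runs.

    steps    -- ordered (tool_name, version) pairs; the order is significant
    database -- path to the protein sequence database file
    inputs   -- paths to the input files (assumed here to be an unordered set)
    """
    manifest = {
        "steps": [list(step) for step in steps],
        "database": sha256_of_file(Path(database)),
        # Sorting makes the fingerprint independent of listing order.
        "inputs": sorted(sha256_of_file(Path(p)) for p in inputs),
    }
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

Two runs with equal fingerprints used the same tools, database and inputs, so differing results would point to a problem elsewhere, such as nondeterminism in a tool or the environment.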

Develop exemplary proteomics data analysis workflows and deploy
them in the EMBL-EBI “Embassy Cloud”:
(1) Standard identification workflow
(2) Identification workflow for PTMs
(3) Quantification (label-free/label-based approaches)
(4) Quality Control (to aid data set interpretation/reanalysis evaluation)
(5) Versions of quantification approaches (including PTMs)
→ Connected to public proteomics data from
Starting with DDA data

We are using the framework
Features:
• Tool modularisation
• Solutions for data handover between tools with standardised
(PSI) formats
• Adapters for integrating third-party software (search engines,
LuciPHOr, FIDO, Percolator, etc.)
• Integration into various workflow systems as a basis
Analysis software used so far (more to come)
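The idea of modular tools handing data over in standardised formats can be illustrated with a small sketch. It is an illustration only and does not use the framework's own API: the `Tool` class, `validate_chain` function and the tool names are hypothetical, while mzML, mzIdentML and mzTab are PSI standard formats.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tool:
    """A pipeline module described only by the formats it reads and writes."""
    name: str       # hypothetical tool name
    consumes: str   # input format
    produces: str   # output format


def validate_chain(tools: List[Tool]) -> List[str]:
    """Check that each tool's output format is the next tool's input format.

    Returns the format produced at each step; raises ValueError on a mismatch.
    """
    for upstream, downstream in zip(tools, tools[1:]):
        if upstream.produces != downstream.consumes:
            raise ValueError(
                f"{upstream.name} writes {upstream.produces}, "
                f"but {downstream.name} expects {downstream.consumes}"
            )
    return [tool.produces for tool in tools]


# Hypothetical identification workflow; the step names are placeholders.
example_chain = [
    Tool("peak_picking", "mzML", "mzML"),
    Tool("database_search", "mzML", "mzIdentML"),
    Tool("fdr_filtering", "mzIdentML", "mzIdentML"),
    Tool("export", "mzIdentML", "mzTab"),
]
handovers = validate_chain(example_chain)
```

Because each module declares only its input and output formats, a step can be swapped for another tool with the same formats without touching the rest of the chain.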

Infrastructure
Open source infrastructure: Galaxy, Docker, Kubernetes, R, OpenStack, etc.
Open source analysis software: OpenMS as a starting point

Simple interface planned for parameter selection

• The workflows have been deployed into the EMBL-EBI
“Embassy Cloud” Portal and fitted with a dashboard as a proof
of concept
• PRIDE data connection into the cloud is being optimised (no
URL available yet)
• Only a first step. Lots of work to do: benchmarking, interface for
users, technical optimisations, etc.
Current status of the DDA pipelines

Open analysis pipelines: DIA approaches
• We have started working on the development of open data analysis
pipelines for DIA experiments (the main focus is on
SWATH-MS experiments).
• In collaboration with the Stoller Centre (Manchester) (B. Graham, T.
Hubbard & P. Townsend).
• The initial pipeline, developed at the Stoller Centre, is based on OpenSWATH.
• We are currently migrating it to the EBI “Embassy Cloud” (same
infrastructure developed for the DDA pipelines).

Open analysis pipelines: proteogenomics
• Work in collaboration with J. Choudhary (Institute of Cancer Research,
London). The project started in June 2018.
• Two types of pipelines to be developed:
• To improve genome annotation.
• To enable personalised medicine approaches.
• Pipelines based on the OpenMS framework.
Weisser et al., JPR, 2016

Summary
• Public proteomics datasets are on the rise! Reliable (widely used)
infrastructure now exists: PRIDE and ProteomeXchange
• A lot of possibilities open for reuse of this data
• Development of data analysis pipelines in the cloud:
• Enable public data reuse, pipelines connected to PRIDE
• Scalability
• Reproducibility
• Make them available to everyone in the community

Acknowledgements: People
Yasset Perez-Riverol
Johannes Griss
Suresh Hewapathirana
Tobias Ternent
Jingwen Bai
Attila Csordas
Deepti Jaiswal
Andrew Jarnuczak
Mathias Walzer
Gerhard Mayer (de.NBI)
Acknowledgements
All data submitters !!!
@pride_ebi
@proteomexchange
Bobby Graham, Simon Hubbard, Paul Brack and others (Stoller Centre)
Jyoti Choudhary & James Wright (ICR)
