Continuous modeling - automating model
building on high-performance e-Infrastructures
Ola Spjuth
Department of Pharmaceutical Biosciences and
Science for Life Laboratory, Uppsala, Sweden
Today: We have access to high-throughput technologies to
study biological phenomena
New challenges: Data management and analysis
• Storage
• Analysis methods, pipelines
• Scaling
• Automation
• Data integration, security
• Predictions
• …
My research focus
• Enabling high-throughput biology, from e-infrastructures and up
– Massively parallel sequencing, metabolomics
– Predictive modeling in toxicology and pharmacology
• Particular focus on large-scale predictive modeling
– Tackle large problems
– Evaluate predictive performance
– Easy and secure sharing/consumption of models
– Automate re-building of models
Observations
• Predictive toxicology and pharmacology are becoming data-intensive
– High throughput technologies
• Drug/chemical screening
• Molecular biology (omics)
– More and bigger publicly available data
sources
• Data is continuously updated
QSAR modeling
• Signatures [1] descriptor in CDK [2]
– Canonical representation of atom
environments
• Support Vector Machine (SVM)
– Robust modeling
1. Faulon, J.-L.; Visco, D. P.; Pophale, R. S. Journal of Chemical Information and
Computer Sciences, 2003, 43, 707-720
2. Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E.
Journal of Chemical Information and Computer Sciences, 2003, 43, 493-500.
Lars Carlsson,
AstraZeneca R&D
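To make the modeling step concrete, here is a minimal sketch, assuming the signature descriptors have already been computed (e.g. with the CDK) into a sparse count matrix. scikit-learn's SVC stands in for the SVM implementation, and the data is a random placeholder.

    # Sketch: train an SVM QSAR classifier on precomputed signature
    # descriptors. X holds placeholder signature counts (one row per
    # molecule); y holds placeholder activity labels.
    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = csr_matrix(rng.integers(0, 3, size=(200, 500)).astype(float))
    y = rng.integers(0, 2, size=200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = SVC(kernel="rbf", C=1.0, gamma="scale")  # RBF SVM
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))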
Interpretation of nonlinear QSAR models
• Method
– Compute gradient of decision
function for prediction
– Extract descriptor(s) with largest
component in the gradient
• Demonstrated on RF, SVM, and
PLS
Carlsson, L., Helgee, E. A., and Boyer, S.
Interpretation of Nonlinear QSAR Models Applied to Ames Mutagenicity Data.
J Chem Inf Model 49, 11 (Nov 2009), 2551–2558.
E. Ahlberg, O. Spjuth, C. Hasselgren, and L. Carlsson. Interpretation of Conformal Prediction
Classification Models. In Statistical Learning and Data Sciences, vol. 9047 of Lecture Notes in
Computer Science. Springer International Publishing, 2015, pp. 323–334.
Lars Carlsson,
AstraZeneca R&D
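An illustrative sketch of the interpretation idea above: approximate the gradient of the SVM decision function at a prediction point and report the descriptor with the largest component. Carlsson et al. derive the gradient analytically; the finite-difference approximation and toy data here are stand-ins.

    # Numerically approximate the gradient of the decision function at
    # one example and pick the most influential descriptor.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 20))
    y = (X[:, 3] - X[:, 7] > 0).astype(int)  # toy activity from two descriptors
    model = SVC(kernel="rbf", gamma="scale").fit(X, y)

    def decision_gradient(f, x, eps=1e-4):
        """Central-difference gradient of decision function f at x."""
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = eps
            grad[i] = (f((x + e)[None])[0] - f((x - e)[None])[0]) / (2 * eps)
        return grad

    g = decision_gradient(model.decision_function, X[0])
    print("most influential descriptor:", int(np.argmax(np.abs(g))))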
Bioclipse Decision Support
Modeling a large number of observations on HPC
Aim: Measure predictive performance when
QSAR datasets get larger
Research questions:
• When do we need HPC?
• How can we work efficiently with HPC in
modeling?
• Are nonlinear methods required?
High-Performance Computing
• Computationally expensive problems call for high-performance e-Infrastructures
• High-Performance Computing (HPC)
– Fast interconnect between compute nodes
• High-Throughput Computing (HTC)
– Fast interconnect not needed
• Cloud Computing (CC)
– Infrastructure as a Service (IaaS)
UPPMAX high-performance computing center
(Uppsala, Sweden)
• Get access to multiple nodes
– 16 compute cores per node
• Get access to large memory machines
– we have nodes with 128, 256, 512, or 2000 GB RAM
• OpenStack private cloud
• However on HPC:
– Terminal usage only, no web servers allowed (scripting in Bash, Perl,
and Python is common)
– Queuing system (e.g. SLURM, SGE)
– Limited job length (e.g. 10 days)
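Under these constraints, model building is typically wrapped in batch jobs. A hedged sketch of submitting such a job from Python: write a SLURM batch script and queue it with sbatch. The project account, partition, and build_model.py script are hypothetical placeholders.

    # Write a SLURM batch script and submit it with sbatch.
    import subprocess
    import textwrap

    script = textwrap.dedent("""\
        #!/bin/bash
        #SBATCH -A snic-project-xyz   # hypothetical project account
        #SBATCH -p core -n 16         # one full 16-core node
        #SBATCH -t 10-00:00:00        # stay within the 10-day job limit
        python build_model.py         # hypothetical modeling script
    """)

    with open("train_model.sbatch", "w") as f:
        f.write(script)

    subprocess.run(["sbatch", "train_model.sbatch"], check=True)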
Project growth
[Figure: number of active projects at UPPMAX and UPPNEX, 2004-2013]
Bioinformatics has inefficient HPC usage
Levels of automation in sequence analysis
• Production: Can be fully automated
• Secondary analysis: Partly automated
• Researchers: basic science is not really
useful to automate; flexibility is required
Training a large number of datasets on HPC
Aim: Build models for hundreds
or thousands of targets
– Challenge to automate data
assembly/integration
– Challenge to automate model
building
Hypothesis: Workflow systems
can enable agile large-scale
predictive modeling
Data sources
Samuel Lampa
What is a workflow system?
The workflow landscape
Automating analysis on clusters
• Workflow systems can aid development and
deployment
• We extended the Luigi system into SciLuigi
(https://github.com/samuell/sciluigi)
• Integrates with the batch queuing system on HPC (sketch below)
Train and
assess model
Samuel Lampa
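A minimal SciLuigi-style sketch of such a workflow (prepare data, then train and assess a model), following the conventions in the SciLuigi README; the task logic and file names are hypothetical placeholders.

    import sciluigi as sl

    class PrepareData(sl.Task):
        def out_data(self):
            return sl.TargetInfo(self, 'dataset.csv')
        def run(self):
            with self.out_data().open('w') as f:
                f.write('smiles,activity\n')  # placeholder dataset

    class TrainModel(sl.Task):
        in_data = None  # connected in the workflow below
        def out_model(self):
            return sl.TargetInfo(self, 'model.txt')
        def run(self):
            with self.in_data().open() as fin, self.out_model().open('w') as fout:
                fout.write('model trained on %d lines\n' % len(fin.readlines()))

    class ModelingWorkflow(sl.WorkflowTask):
        def workflow(self):
            prep = self.new_task('prepare', PrepareData)
            train = self.new_task('train', TrainModel)
            train.in_data = prep.out_data  # wire output to input
            return train

    if __name__ == '__main__':
        sl.run_local(main_task_cls=ModelingWorkflow)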
Modeling large datasets on HPC
Jonathan Alvarsson
Publishing models
• Publish models for easy access
and consumption
• We use the P2 (OSGi) provisioning
system
[Diagram: published model versions v. 1.1, v. 1.2, and v. 1.3, consumed by users]
Bioclipse and OpenTox
E. Willighagen, N. Jeliazkova, B. Hardy, R. Grafström, and O. Spjuth.
Computational toxicology using the OpenTox application programming interface and Bioclipse.
BMC Research Notes 2011, 4:487
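A hedged sketch of consuming an OpenTox-style REST service from Python: POST a compound URI to a model resource and read back the URI of the result dataset. The URIs below are hypothetical placeholders; real deployments differ in endpoints, media types, and authentication.

    import requests

    model_uri = "https://opentox.example.org/model/42"       # hypothetical
    compound_uri = "https://opentox.example.org/compound/1"  # hypothetical

    resp = requests.post(model_uri,
                         data={"compound_uri": compound_uri},
                         headers={"Accept": "text/uri-list"})
    resp.raise_for_status()
    print("prediction result:", resp.text.strip())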
Reactive/continuous modeling
[Diagram: the continuous modeling loop. Data sources are coordinated, integrated, and versioned; models are trained and assessed, then published and archived; the pipeline is monitored; users consume the published models through Bioclipse]
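An illustrative skeleton of this loop, with every helper a hypothetical placeholder for the coordinate/integrate, train/assess, and publish/archive steps in the diagram:

    import time

    def new_data_available():
        """Hypothetical monitor step: check the data sources for updates."""
        return False

    def fetch_and_integrate():
        """Hypothetical coordinate/integrate/version step."""
        return []

    def train_and_assess(data):
        """Hypothetical modeling step: rebuild and validate the model."""
        return {"trained_on": len(data)}

    def publish(model, version):
        """Hypothetical publish step; older versions remain archived."""
        print("published model version", version)

    version = 0
    while True:
        if new_data_available():
            data = fetch_and_integrate()
            model = train_and_assess(data)
            version += 1
            publish(model, version)
        time.sleep(3600)  # poll the data sources hourly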
Could cloud computing improve/simplify modeling?
Modeling on Amazon Elastic Cloud
[Figure: time (hours) versus number of cores (1, 2, 4, 8, 16), for four dataset sizes (20k, 75k, 150k, 300k)]
B. T. Moghadam, J. Alvarsson, M. Holm, M.
Eklund, L. Carlsson, and O. Spjuth
Scaling predictive modeling in drug
development with cloud computing.
J. Chem. Inf. Model., 2015, 55 (1), pp. 19-25.
PhenoMeNal
• H2020 infrastructure project (2015-2018)
• Platform for metabolomics data analysis:
studying metabolites primarily in clinical studies
• Integrating data and tools
• Data management, privacy
• Cloud/Microservices architecture
• Predictions
http://phenomenal-h2020.eu/
Could Big Data frameworks improve/simplify modeling?
• Map/Reduce, Hadoop, Spark, HDFS/distributed file
systems and others…
• Recently received a lot of attention
• Allow for massively parallel analysis
• How useful are they in pharmaceutical
bioinformatics?
Hadoop (MapReduce) for massively parallel analysis
Evaluating Hadoop for sequence analysis
• Compare Hadoop and HPC
– Create pipelines that are as identical as possible
– Investigate scaling and performance
– Show the bottlenecks of current HPC
Alexey Siretskiy,
former Postdoc
A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth.
A quantitative assessment of the Hadoop framework for
analyzing massively parallel DNA sequencing data.
Gigascience. 2015; 4:26.
Distributed modeling with Spark
• Appealing programming methodology
• Built-in data locality and in-memory
computing
– RDD (Resilient Distributed Dataset):
distributed large-scale dataset
abstraction
– MLlib: Spark-based distributed
implementation of many ML algorithms
[Figure: logistic regression in Hadoop and Spark]
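A minimal PySpark sketch of distributed logistic regression with MLlib's RDD-based API (current at the time); the input path is a hypothetical placeholder and the script would be launched with spark-submit.

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.util import MLUtils

    sc = SparkContext(appName="lr-sketch")
    data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")  # hypothetical path
    train, test = data.randomSplit([0.8, 0.2], seed=42)

    model = LogisticRegressionWithLBFGS.train(train)  # distributed training
    accuracy = test.map(
        lambda p: float(model.predict(p.features) == p.label)).mean()
    print("test accuracy:", accuracy)
    sc.stop()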
Parallel Virtual Screening with Spark
Hypothesis: The Spark framework can be used for trivially
parallelizable problems in pharmaceutical bioinformatics
• Demonstrate on Virtual Screening
• Used OpenEye suite
Preliminary results:
• Spark API allows for simple programmatic parallelization
• Good scalability in terms of speedup
• Lack of documentation
L. Ahmed, A. Edlund, E. Laure, O. Spjuth.
Using Iterative MapReduce for Parallel Virtual Screening.
In Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th
International Conference on, vol. 2, pp. 27-32, 2013.
Laeeq Ahmed,
PhD Student
Valentin Georgiev,
Researcher
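A hedged sketch of the trivially parallel pattern: distribute a molecule library over an RDD and score each molecule independently. The study used the proprietary OpenEye suite; score_molecule below is a hypothetical stand-in for the real docking/scoring call.

    from pyspark import SparkContext

    def score_molecule(smiles):
        # hypothetical placeholder for an OpenEye docking/scoring call
        return (smiles, float(len(smiles)))

    sc = SparkContext(appName="vs-sketch")
    library = sc.textFile("hdfs:///data/library.smi")    # hypothetical SMILES file
    scores = library.map(score_molecule)                 # embarrassingly parallel
    top = scores.takeOrdered(10, key=lambda kv: -kv[1])  # ten best-scoring molecules
    for smiles, score in top:
        print(score, smiles)
    sc.stop()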
Conformal Prediction in Spark
• Evaluate confidence in predictions
• We implemented Inductive Conformal
Prediction (ICP) in Spark, extending MLlib
• Tested on 2 large data sets
– HIGGS: 11M examples. Task: distinguish between
Higgs boson signal process and background
process
– SUSY: 5M examples. Task: distinguish between
supersymmetric particle signal process and
background process
POSTER P-33
Marco Capuccini
PhD Student
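A single-machine numpy/scikit-learn sketch of the ICP statistics (the paper's contribution is the distributed Spark implementation, which is not reproduced here): train on a proper training set, collect nonconformity scores on a calibration set, and return every label whose conformal p-value exceeds the significance level.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 10))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy binary task

    X_tr, y_tr = X[:1500], y[:1500]    # proper training set
    X_cal, y_cal = X[1500:], y[1500:]  # calibration set

    clf = LogisticRegression().fit(X_tr, y_tr)
    proba_cal = clf.predict_proba(X_cal)
    # Nonconformity: 1 - estimated probability of the true label
    alphas = 1.0 - proba_cal[np.arange(len(y_cal)), y_cal]

    def predict_set(x, epsilon=0.1):
        """Return all labels whose conformal p-value exceeds epsilon."""
        proba = clf.predict_proba(x[None])[0]
        region = []
        for label in (0, 1):
            a = 1.0 - proba[label]
            p = (np.sum(alphas >= a) + 1) / (len(alphas) + 1)
            if p > epsilon:
                region.append(label)
        return region

    print(predict_set(X[0]))  # prediction set at 90% confidence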
Results:
• Valid predictions
• Good scalability
Conformal Prediction in Spark
M. Capuccini, L. Carlsson, U. Norinder and O. Spjuth.
Conformal Prediction in Spark: Large-Scale Machine Learning with Confidence.
Accepted in IEEE Transactions on Cloud Computing, 2015.
POSTER P-33
Marco Capuccini
PhD Student
Some conclusions
• Automation/continuous modeling is not trivial
– Data management, modeling, model management/governance
• Conformal prediction
– Predictions with confidence
• Large-scale problems require computational power
– Cloud computing vs High-Performance Computing
• Workflows and Big Data frameworks
– Immature technologies, not well documented
– Can be useful for large-scale analysis in pharmaceutical
bioinformatics, especially for automation
Some ongoing projects
• Augment parallel virtual screening with machine
learning
• Further develop conformal predictions in distributed
settings
• Large-scale target predictions
• Continue evaluating Spark vs. workflows and cloud vs. HPC
– We have not yet reached a good agile system, but we are
getting closer
• The group is open to collaborations.
Thank you
Ola Spjuth
ola.spjuth@farmbio.uu.se
Speaker notes

• #5 Clusters and cloud; workflows; reactive modeling.
• #6 Keeping predictive models up to date is challenging; versioning of models is not trivial.
• #9 Advantages: fast (runs on local computers); interpretable results (can be used for hypothesis generation); general (can integrate any modeling technique and be applied to any data set); extensible (very easy to add new components).
• #24 European project creating an interoperable framework for toxicity predictions, spanning academia and industry. Parts: ontology and API; query and invocation of predictive services; methods and algorithms; authentication and authorization.