Agile large-scale machine-learning pipelines in drug discovery
1. Agile large-scale machine-learning
pipelines in drug discovery
Ola Spjuth
Department of Pharmaceutical Biosciences and Science for Life Laboratory
Uppsala University, Sweden
ola.spjuth@farmbio.uu.se
2. Outline
• My research in perspective
• Our approach to machine learning in ligand-based
modeling
• Challenges when data grows
• Automation workflows/pipelines
• HPC, Cloud Computing and Big Data Analytics
3. From data to insights
• We have access to a wealth
of information
• Data mining and predictive
modeling can be useful
4. History: Bioclipse – an open source
workbench for the life sciences
O. Spjuth, J. Alvarsson, A. Berg, M. Eklund, S. Kuhn, C. Mäsak, G. Torrance, J. Wagener, E.L. Willighagen, C. Steinbeck, and J.E.S. Wikberg.
Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 2009, 10:397
Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES: Bioclipse: an open source
workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.
5. How is the compound
metabolized?
Are any of its metabolites
reactive/toxic?
Here?
Here?
Is it toxic?
Chemical liabilities (drug safety, alerts)
Adverse effects?
Can we, based on existing experimental studies, IT,
and statistical models, predict the outcome for new
compounds?
6. Starting out in 2008 with a challenge:
• Build a system with predictive models which runs on
the client
– Initial problem: Site-of-metabolism prediction
7. Site-of-metabolism (SOM) predictions – MetaPrint2D
L. Carlsson, O. Spjuth, S. Adams, R. C. Glen, and S. Boyer. Use of historic metabolic biotransformation data as a means
of anticipating metabolic sites using MetaPrint2D and Bioclipse. BMC Bioinformatics 2010, 11:362
Boyer S, Arnby CH, Carlsson L, Smith J, Stein V, Glen RC. Reaction site mapping of xenobiotic biotransformations. J
Chem Inf Model. 2007 Mar-Apr;47(2):583-90.
[Figure: MetaPrint2D overview. Labels: reaction database, circular fingerprints, MetaPrint2D database, mapping. Legend: highest, medium, and low probability of metabolism.]
9. Next challenge: Extend to general predictive models
• Fast predictive models, allow for instant updates
upon structural changes
• Span from virtual screening to lead optimization
10. Bioclipse Decision Support
• Integrate various predictive methods
– Similarity searches (InChI, signatures, fingerprints)
– Structural alerts (toxicophores)
– QSAR models (classification, regression)
• Visual interpretation
– Highlight important substructures
O. Spjuth, L. Carlsson, M. Eklund, E. Ahlberg Helgee, and Scott Boyer. Integrated decision support
for assessing chemical liabilities. Accepted in J. Chem. Inf. Model, 2011.
11. Ligand-based predictive modeling
Quantitative Structure-Activity
Relationship (QSAR)
– Start with a dataset of
chemical structures with
measured property to model
(inhibition, toxicity, etc)
– Describe chemicals using
descriptors
– Make use of statistical
modeling to relate chemical
structures to a response
13. QSAR modeling
• Signatures (ref. 1) descriptor in CDK (ref. 2)
– Canonical representation of atom
environments
• Support Vector Machine (SVM)
– Robust modeling
1. Faulon, J.-L.; Visco, D. P.; Pophale, R. S. Journal of Chemical Information and
Computer Sciences, 2003, 43, 707-720
2. Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E.
Journal of Chemical Information and Computer Sciences, 2003, 43, 493-500.
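The idea of describing a molecule by canonical atom environments can be sketched in a few lines. This is a simplified illustration only, not the CDK implementation and not the full signature algorithm of ref. 1: it expands each atom's neighbourhood as a tree up to a given height and sorts the branches so the string does not depend on atom ordering. The function names and the toy molecule encoding (a list of element symbols plus an adjacency dictionary) are assumptions made for this example.

```python
from collections import Counter


def atom_signature(atoms, bonds, root, height, parent=None):
    """Canonical string for the environment of `root`, `height` bonds deep.

    Simplified: branches are sorted alphabetically, bond orders and
    ring closures are ignored.
    """
    if height == 0:
        return atoms[root]
    branches = sorted(
        atom_signature(atoms, bonds, nbr, height - 1, parent=root)
        for nbr in bonds[root]
        if nbr != parent
    )
    if not branches:
        return atoms[root]
    return atoms[root] + "(" + "".join(branches) + ")"


def signature_counts(atoms, bonds, heights=(0, 1, 2)):
    """Count every atom signature at every requested height."""
    counts = Counter()
    for height in heights:
        for atom_index in range(len(atoms)):
            counts[atom_signature(atoms, bonds, atom_index, height)] += 1
    return counts


# Toy input: heavy atoms of ethanol, C-C-O (hydrogens omitted).
atoms = ["C", "C", "O"]
bonds = {0: [1], 1: [0, 2], 2: [1]}
ethanol = signature_counts(atoms, bonds, heights=(0, 1))
```

The resulting counts form a sparse descriptor vector, one entry per distinct signature string, which could then be handed to a learner such as an SVM.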
14. Local interpretation of nonlinear QSAR models
• Method
– Compute gradient of
decision function for
prediction
– Extract descriptor(s) with
largest component in the
gradient
• Demonstrated on RF, SVM,
and PLS
Carlsson, L., Helgee, E. A., and Boyer, S.
Interpretation of nonlinear QSAR models applied to Ames mutagenicity data.
J Chem Inf Model 49, 11 (Nov 2009), 2551–2558.
Lars Carlsson,
AstraZeneca R&D
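The two method steps on this slide can be sketched as follows. As an assumption for illustration, the model is taken to be an RBF-kernel SVM whose decision function is f(x) = b + sum_i a_i exp(-gamma |x - s_i|^2); the support vectors, coefficients, and the query compound below are made-up toy values, and the function names are hypothetical.

```python
import math


def decision_function(x, support_vectors, dual_coefs, gamma, bias):
    """RBF-kernel decision function f(x)."""
    total = bias
    for sv, coef in zip(support_vectors, dual_coefs):
        sq_dist = sum((xj - sj) ** 2 for xj, sj in zip(x, sv))
        total += coef * math.exp(-gamma * sq_dist)
    return total


def decision_gradient(x, support_vectors, dual_coefs, gamma):
    """Analytic gradient of f with respect to each descriptor of x."""
    grad = [0.0] * len(x)
    for sv, coef in zip(support_vectors, dual_coefs):
        sq_dist = sum((xj - sj) ** 2 for xj, sj in zip(x, sv))
        weight = coef * math.exp(-gamma * sq_dist)
        for j, (xj, sj) in enumerate(zip(x, sv)):
            grad[j] += weight * (-2.0 * gamma) * (xj - sj)
    return grad


def most_important_descriptor(x, support_vectors, dual_coefs, gamma):
    """Index and value of the gradient component with largest magnitude."""
    grad = decision_gradient(x, support_vectors, dual_coefs, gamma)
    j = max(range(len(grad)), key=lambda k: abs(grad[k]))
    return j, grad[j]


# Toy model and query compound (illustrative values only).
support_vectors = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
dual_coefs = [1.0, -1.0]
gamma, bias = 0.5, 0.0
x = [1.0, 0.5, 0.0]

index, component = most_important_descriptor(x, support_vectors, dual_coefs, gamma)
```

With signature descriptors, the selected index corresponds to a substructure, which is what makes it possible to highlight it in the drawn molecule.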
16. Next challenge: Simple model building
• Build a solution where:
– Scientists can build accurate models without modeling
expertise, in order to aid their decision making
– Combine these models with other models
18. Next challenge: Predict using distributed services
• OpenTox - European project for creating an
interoperable framework for toxicity predictions
• Academia and industry
• Parts
– Ontology and API
– Query and invocation of predictive services
– Methods and algorithms
– Authentication and authorization
22. Summary of Bioclipse Decision Support
• Flexible, general method
– Apply to any collection of molecules
• State-of-the-art machine-learning methods
• Handles large data sets
• Fast predictions
23. Advantages of the DS method
• Fast: Can run on local computer
– “Instant predictions”, “calculate as you draw”
• Interpretable results: Can be used for
hypothesis generation
• General: Apply any modeling technique to any
data set
• Extensible: Very easy to add new components
• Open: Free, open source
24. Observations
• Predictive drug discovery is becoming
data-intensive
– High throughput technologies
• Drug/chemical screening
• Molecular biology (omics)
– More and bigger publicly available data
sources
• Data is continuously updated
We need scalable and automated
methods for predictive modeling
25. Challenges with bigger data sets for machine learning
• Modeling time increases
– Reduce/avoid parameter tuning
– Run on high-performance e-infrastructures
– Use approximate methods
• Not all implementations can handle large data sets
– Use sparse implementations
26. Determine parameter intervals
for modeling (sweetspot)
J. Alvarsson, M. Eklund, C. Andersson, L. Carlsson, O. Spjuth, and Jarl Wikberg.
Benchmarking study of parameter variation when using signature fingerprints together
with support vector machines. J Chem Inf Model. 2014, 54(11), pp 3211–3217.
SVM: Cost and Gamma parameters
Signatures: Heights
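One way to turn a parameter sweep into recommended intervals is sketched below. Everything here is an assumption for illustration: the scoring function is a synthetic placeholder rather than a real cross-validated accuracy, the grid ranges are arbitrary, and "sweet spot" is interpreted as the range of each parameter among settings that score within a tolerance of the best one. It is not the procedure of the benchmarking study cited above.

```python
from itertools import product


def toy_score(log2_cost, log2_gamma, height):
    # Synthetic stand-in for cross-validated accuracy; NOT real results.
    return (1.0
            - 0.05 * abs(log2_cost - 3)
            - 0.05 * abs(log2_gamma + 6)
            - 0.1 * abs(height - 2))


def grid_search(log2_costs, log2_gammas, heights, score=toy_score):
    """Score every combination of SVM cost, SVM gamma, and signature height."""
    return {
        params: score(*params)
        for params in product(log2_costs, log2_gammas, heights)
    }


def sweet_spot(scores, tolerance):
    """Per-parameter (min, max) over settings within `tolerance` of the best."""
    best = max(scores.values())
    good = [params for params, s in scores.items() if s >= best - tolerance]
    names = ("log2_cost", "log2_gamma", "height")
    return {
        name: (min(p[i] for p in good), max(p[i] for p in good))
        for i, name in enumerate(names)
    }


scores = grid_search(range(0, 6), range(-9, -2), range(1, 4))
intervals = sweet_spot(scores, tolerance=0.06)
```

Once such intervals are known for a class of data sets, later models can be tuned over the narrow interval only, or not tuned at all, which is the saving that matters when modeling time grows.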
28. Example 2: Target predictions
Alvarsson J, Eklund M, Engkvist O, Spjuth O, Carlsson L,
Wikberg JE, Noeske T. Ligand-based target prediction
with signature fingerprints.
J Chem Inf Model. 2014 Oct 27;54(10):2647-53
29. Challenge with running on HPC
• Reduce manual work
– Automate data preprocessing and modeling
– Support modeling life cycle (build, validate, document,
version, publish, re-train …)
• Automating model building is not trivial
– Aim: Agile, component-based architecture
30. Example application: Training on a large
number of data sets
Aim: Build models for hundreds
of targets
– Challenge to extract
– Challenge to automate model
building
Data sources
Samuel Lampa
31. Automating analysis on HPC clusters
• Workflow systems can aid
development and deployment
• We used the Luigi workflow system
• Integrate with queuing system
(SLURM)
Train and
assess model
Samuel Lampa
https://github.com/spotify/luigi
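The core idea behind such a workflow system can be shown with a small stand-in. The sketch below is not Luigi and does not use its API; it only mimics the pattern of tasks that declare their dependencies and an output target, so that a task is skipped when its output already exists. The task classes and file names are hypothetical, and on a cluster the run step would typically submit a job to the queuing system (for example via SLURM's sbatch) instead of doing the work in-process.

```python
import os
import tempfile


class Task:
    """Minimal task: declares dependencies, an output path, and a run step."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        return os.path.exists(self.output())


def build(task, log):
    """Run dependencies depth-first; skip any task whose output exists."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


class PrepareData(Task):
    def output(self):
        return os.path.join(self.workdir, "dataset.txt")

    def run(self):
        with open(self.output(), "w") as handle:
            handle.write("placeholder descriptors\n")


class TrainModel(Task):
    def requires(self):
        return [PrepareData(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "model.txt")

    def run(self):
        with open(self.requires()[0].output()) as handle:
            n_rows = sum(1 for _ in handle)
        with open(self.output(), "w") as handle:
            handle.write(f"model trained on {n_rows} rows\n")


workdir = tempfile.mkdtemp()
first_run = []
build(TrainModel(workdir), first_run)   # runs both tasks
second_run = []
build(TrainModel(workdir), second_run)  # nothing to do, outputs exist
```

Because completion is tied to output files, an interrupted run can be restarted without redoing finished steps, which is useful when hundreds of models are trained.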
35. Building modeling workflows (WFs) on HPC is not trivial
• Many workflow systems exist
– DSLs vs APIs
– Dynamic input/output in e.g. cross-validation not
supported out of the box
• Time-consuming to create WFs
• Workflows can be useful but are not (yet) the silver
bullet we sought
O. Spjuth, E. Bongcam-Rudloff, G. C. Hernandez, L. Forer, M. Giovacchini, R. V. Guimera, A. Kallio, E. Korpelainen,
M. Kandula, M. Krachunov, D. P. Kreil, O. Kulev, P. P. Labaj, S. Lampa, L. Pireddu, S. Schönherr, A. Siretskiy, and D.
Vassilev. Experiences with workflows for automating data-intensive bioinformatics. Accepted in Biology Direct.
37. QSAR Modeling on Amazon Elastic Cloud
[Figure: modeling time (hours) against number of cores (1, 2, 4, 8, 16); series labelled 20k, 75k, 150k, and 300k.]
B. Torabi, J. Alvarsson, M. Holm, M. Eklund, L. Carlsson,
and O. Spjuth. “Scaling predictive modeling in drug
development with cloud computing”.
J. Chem. Inf. Model., 2015, 55 (1), pp 19-25
38. Private clouds
• We set up an OpenStack system at UPPMAX (our HPC
center)
• Primarily Infrastructure as a Service (IaaS) – users can
run virtual machines
• Platform-as-a-Service (PaaS): Hadoop and Spark
– Our question: Can this be useful for model building?
39. Managing Virtual Machine Images
• Open catalogue of virtual machine images (VMIs)
• Hosted at Uppsala University
www.bioimg.org
M. Dahlö, F. Haziza, A. Kallio, E. Korpelainen, E. Bongcam-Rudloff, and O. Spjuth.
BioImg.org: A catalogue of virtual machine images for the life sciences. Accepted in
Bioinformatics and Biology Insights.
40. Cloud computing enables Big Data Analytics
• Hadoop
– Open Source Map-Reduce, suited for massively parallel
tasks
– Distributed execution, high availability, fault tolerant, can
be run on commodity hardware
– E.g. Google, Facebook and Twitter use it
• Hadoop Distributed File System (HDFS) distributes data across
nodes; computing is done in parallel
– “bring computations to data”
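The MapReduce pattern itself is small enough to sketch in plain Python. This is a single-process illustration of the three phases (map, shuffle, reduce), not Hadoop code; the data blocks stand in for file blocks stored on different nodes, and the mapper and example reads are made up for this example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to each record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Combine the values of each key into one result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def emit_bases(read):
    """Toy mapper: one (base, 1) pair per nucleotide in a read."""
    for base in read:
        yield base, 1


# Two blocks of reads, standing in for data held on two different nodes.
blocks = [["ACGT", "AAC"], ["GGT"]]

# Each block is mapped where it lives; only the small pairs are moved.
pairs = [pair for block in blocks for pair in map_phase(block, emit_bases)]
counts = reduce_phase(shuffle(pairs), lambda key, values: sum(values))
```

In a real cluster the map calls run on the nodes that already hold each block, which is the sense in which computations are brought to the data.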
42. Evaluating Hadoop for next-generation sequencing
• Compare Hadoop and HPC
– Create as identical pipelines as possible
– Calculate efficiency as function of data size
– Conclusion: Hadoop pipeline scales better
than HPC and is economical for current data
sizes
Alexey Siretskiy, former
postdoc at UPPMAX
A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth.
A quantitative assessment of the Hadoop framework for analyzing
massively parallel DNA sequencing data. Gigascience (2015) Jun 4; 4:26.
A. Siretskiy and O. Spjuth. HTSeq-Hadoop: Extending HTSeq for Massively
Parallel Sequencing Data Analysis using Hadoop. In e-Science, 2014 IEEE
10th International Conference on (2014), vol. 1, pp. 317–323.
43. SPARK
• Adds caching to the Hadoop
(MapReduce) model: in-memory
computing
• Good for iterative
algorithms
• We applied it for ligand-
based virtual screening
With Åke Edlund,
HPCViz, KTH
L. Ahmed, A. Edlund, E. Laure, O. Spjuth. Using Iterative
MapReduce for Parallel Virtual Screening. Cloud Computing
Technology and Science (CloudCom), 2013 IEEE 5th
International Conference on, vol. 2, pp. 27–32, 2013
44. Large-scale machine learning on Spark
• Ongoing project: Create a large-scale machine
learning pipeline for QSAR using Spark ML as an
alternative to the Luigi workflow system
– Apply to large data sets
– Apply to many data sets
– Compare Spark with workflows on a batch system
– Aim: Use for Reactive Modeling
45. Some conclusions so far on cloud computing and
Hadoop/Spark for bioinformatics
• Cloud computing
– Easy provisioning of infrastructures, services and platforms
• Hadoop
– Scalable and efficient, but at the price of software incompatibility
• Spark
– Improves on Hadoop with in-memory computing and a more intuitive
interface
• Current working hypothesis: Spark more advantageous
compared to workflows on batch systems for machine
learning pipelines
46. Conformal prediction
Seek answer to: “How good is your prediction?”
• Traditional machine learning algorithms:
– Simple predictions (e.g. “Class A”, 8.45)
• Conformal predictions
– Prediction intervals for a given confidence level
– Based on a consistent and well-defined mathematical
framework (ref. 1)
1. Vovk, V.; Gammerman, A.; Shafer, G. “Algorithmic learning in a random world”; Springer: New York, 2005.
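How a point prediction becomes a prediction interval can be sketched with the inductive (split) variant for regression. This is one common formulation and an assumption for this example, not necessarily the setup used in the projects described here: absolute errors on a held-out calibration set are ranked, and the error at rank ceil((n + 1) x confidence) is used as the interval half-width. The calibration residuals below are invented toy numbers; 8.45 is the example point prediction from the slide.

```python
import math


def conformal_half_width(calibration_residuals, confidence):
    """Half-width of the interval at the requested confidence level.

    Uses the absolute calibration error at rank ceil((n + 1) * confidence).
    If the calibration set is too small for that confidence, the interval
    is unbounded.
    """
    n = len(calibration_residuals)
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        return math.inf
    return sorted(abs(r) for r in calibration_residuals)[rank - 1]


def prediction_interval(point_prediction, calibration_residuals, confidence):
    """Symmetric interval around a point prediction."""
    half_width = conformal_half_width(calibration_residuals, confidence)
    return point_prediction - half_width, point_prediction + half_width


# Toy calibration residuals (observed minus predicted), illustrative only.
residuals = [0.1, -0.3, 0.2, 0.5, -0.4, 0.05, 0.25, -0.15, 0.35]

low, high = prediction_interval(8.45, residuals, confidence=0.8)
```

The interval widens as the requested confidence increases, and with only nine calibration examples a confidence of 0.95 cannot be supported, so the function returns an unbounded interval in that case.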
47. Conformal predictions
Norinder, U., Carlsson, L., Boyer, S., and Eklund, M. Introducing conformal prediction in predictive modeling. A
transparent and flexible alternative to applicability domain determination. J Chem Inf Model 54, 6 (Jun 2014),
1596–603.
48. Some projects on Conformal Predictions
• CP Feature Highlighting
• CP in Spark
• Large-scale model building in
cheminformatics and virtual
screening
– Ongoing projects
Ahlberg E, Spjuth O, Hasselgren C, Carlsson L. Interpretation of Conformal
Prediction Classification Models. Statistical Learning and Data Sciences. Springer
International Publishing; 2015. pp. 323–334.
Capuccini M, Carlsson L, Norinder U., and Spjuth O. Conformal Prediction in
Spark: Large-Scale Machine Learning with Confidence. Submitted.
52. e-Science (cyberinfrastructure, “big data”)
“Systematic and advanced use of computers in
research”
– High-performance computing
– Distributed data, “Big data”
– Enabling science!
www.e-science.se www.essenceofescience.se
53. Acknowledgements
Workflows
Samuel Lampa
David Kreil
Maciej Kańduła
BioImg.org
Martin Dahlö
Frédéric Haziza
Mentell Design
Hadoop & Spark
Alexey Siretskiy
Åke Edlund
Izhar ul Hassan
Marco Capuccini
Staffan Arvidsson
Cloud computing
Frédéric Haziza
Tore Sundqvist
Behrooz Torabi
Salman Toor
Andreas Hellander
Predictive modeling
Lars Carlsson
Ernst Ahlberg-Helgee
Martin Eklund
Ulf Norinder
Wesley Schaal
Jonathan Alvarsson
Bioclipse
Arvid Berg
Egon Willighagen
All Bioclipse and CDK
contributors
Predicting SOM in silico
Cheap
Reasonably effective
Fast, can be used in earlier steps than optimization
Results on par with other tools
Used a lot at e.g. AstraZeneca
Wet lab experiments are slow, expensive and not exact
Mutagenicity: ability of a substance to induce mutations to DNA
Carcinogenic Potency Database (CPDB)
aryl hydrocarbon receptor (AHR), transcription factor involved in metabolizing enzymes, important target because of a promiscuous ligand binding site
Advantages:
Fast: Run on local computers
Interpretable results: Can be used for hypothesis generation
General: Can integrate any modeling technique and be applied to any data set
Extensible: Very easy to add new components
Rich user interface for OpenTox!
Screenshot with OpenTox predictions run in DSView…
Safety profiles; rich clients allow for a rich, customizable GUI
Keeping predictive models up to date is challenging
Versioning of models not trivial