Agile large-scale machine-learning
pipelines in drug discovery
Ola Spjuth
Department of Pharmaceutical Biosciences and Science for Life Laboratory
Uppsala University, Sweden
ola.spjuth@farmbio.uu.se
Outline
• My research in perspective
• Our approach to machine learning in ligand-based
modeling
• Challenges when data grows
• Automation workflows/pipelines
• HPC, Cloud Computing and Big Data Analytics
From data to insights
• We have access to a wealth
of information
• Data mining and predictive
modeling can be useful
History: Bioclipse – an open source
workbench for the life sciences
O. Spjuth, J. Alvarsson, A. Berg, M. Eklund, S. Kuhn, C. Mäsak, G. Torrance, J. Wagener, E.L. Willighagen, C. Steinbeck, and J.E.S. Wikberg.
Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 2009, 10:397
Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES: Bioclipse: an open source
workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.
How is the compound
metabolized?
Are any of its metabolites
reactive/toxic?
Here?
Here?
Is it toxic?
Chemical liabilities (drug safety, alerts)
Adverse effects?
Can we, based on existing experimental studies, IT,
and statistical models, predict the outcome for new
compounds?
Starting out in 2008 with a challenge:
• Build a system with predictive models which runs on
the client
– Initial problem: Site-of-metabolism prediction
Site-of-metabolism (SOM) predictions – MetaPrint2D
L. Carlsson, O. Spjuth, S. Adams, R. C. Glen, and S. Boyer. Use of historic metabolic biotransformation data as a means
of anticipating metabolic sites using MetaPrint2D and Bioclipse. BMC Bioinformatics 2010, 11:362
Boyer S, Arnby CH, Carlsson L, Smith J, Stein V, Glen RC. Reaction site mapping of xenobiotic biotransformations. J
Chem Inf Model. 2007 Mar-Apr;47(2):583-90.
Reaction
Database
MetaPrint2D
database
Circular
Fingerprints
Highest probability
of metabolism
Low probability of
metabolism
Medium probability
of metabolism
Mapping
Bioclipse and MetaPrint2D
Next challenge: Extend to general predictive models
• Fast predictive models, allow for instant updates
upon structural changes
• Span from virtual screening to lead optimization
Bioclipse Decision Support
• Integrate various predictive methods
– Similarity searches (InChI, signatures, fingerprints)
– Structural alerts (toxicophores)
– QSAR models (classification, regression)
• Visual interpretation
– Highlight important substructures
O. Spjuth, L. Carlsson, M. Eklund, E. Ahlberg Helgee, and Scott Boyer. Integrated decision support
for assessing chemical liabilities. Accepted in J. Chem. Inf. Model, 2011.
Ligand-based predictive modeling
Quantitative Structure-Activity
Relationship (QSAR)
– Start with a dataset of
chemical structures with
measured property to model
(inhibition, toxicity, etc)
– Describe chemicals using
descriptors
– Make use of statistical
modeling to relate chemical
structures to a response
Machine learning pipelines
Preprocessing
Model building
Validation
Reporting
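The four stages above can be sketched as plain functions chained together. This is a minimal illustration only: the single-descriptor data, the least-squares line (a stand-in for an SVM or similar learner), and all function names are assumptions made for the example, not the pipeline used in the talk.

```python
from statistics import mean

def preprocess(records):
    """Drop records without a measured response and scale the descriptor to [0, 1]."""
    clean = [(x, y) for x, y in records if y is not None]
    lo = min(x for x, _ in clean)
    hi = max(x for x, _ in clean)
    span = (hi - lo) or 1.0
    return [((x - lo) / span, y) for x, y in clean]

def build_model(train):
    """Fit y = a*x + b by ordinary least squares (hypothetical stand-in learner)."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in train) / sxx if sxx else 0.0
    b = my - a * mx
    return lambda x: a * x + b

def validate(model, test):
    """Root-mean-square error on held-out data."""
    return mean((model(x) - y) ** 2 for x, y in test) ** 0.5

def report(rmse, n_train, n_test):
    """Summarize the run as one line of text."""
    return f"train={n_train} test={n_test} RMSE={rmse:.3f}"

def run_pipeline(records, test_fraction=0.25):
    """Preprocessing -> model building -> validation -> reporting."""
    data = preprocess(records)
    n_test = max(1, int(len(data) * test_fraction))
    train, test = data[:-n_test], data[-n_test:]
    model = build_model(train)
    return report(validate(model, test), len(train), len(test))
```

Keeping each stage as a separate function is what makes it possible to swap one component (for example the learner) without touching the others.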
QSAR modeling
• Signatures [1] descriptor in CDK [2]
– Canonical representation of atom
environments
• Support Vector Machine (SVM)
– Robust modeling
1. Faulon, J.-L.; Visco, D. P.; Pophale, R. S. Journal of Chemical Information and
Computer Sciences, 2003, 43, 707-720
2. Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E.
Journal of Chemical Information and Computer Sciences, 2003,43, 493-500.
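The idea of a canonical representation of atom environments can be illustrated with a strongly simplified sketch. It is not the CDK implementation and not the full signature algorithm of reference 1: it unfolds the neighbourhood of each atom as a tree, ignores bond orders and hydrogens, and the toy molecule and function names are assumptions for the example.

```python
def atom_signature(atoms, bonds, root, height):
    """Canonical string for the environment of `root`, up to `height` bonds away.

    atoms: {atom_index: element_symbol}
    bonds: {atom_index: [neighbour_index, ...]}
    """
    def expand(atom, parent, depth):
        if depth == 0:
            return atoms[atom]
        # Sorting the child strings makes the result independent of atom order.
        children = sorted(
            expand(nbr, atom, depth - 1)
            for nbr in bonds[atom] if nbr != parent
        )
        return atoms[atom] + "".join(f"({c})" for c in children)
    return expand(root, None, height)

def molecule_signature_counts(atoms, bonds, height):
    """Count how often each atom signature of the given height occurs."""
    counts = {}
    for atom in atoms:
        sig = atom_signature(atoms, bonds, atom, height)
        counts[sig] = counts.get(sig, 0) + 1
    return counts

# Heavy atoms of ethanol: C-C-O
ethanol_atoms = {0: "C", 1: "C", 2: "O"}
ethanol_bonds = {0: [1], 1: [0, 2], 2: [1]}
```

The resulting counts per signature string can serve as the descriptor vector that a learner such as an SVM is trained on.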
Local interpretation of nonlinear QSAR models
• Method
– Compute gradient of
decision function for
prediction
– Extract descriptor(s) with
largest component in the
gradient
• Demonstrated on RF, SVM,
and PLS
Carlsson, L., Helgee, E. A., and Boyer, S.
Interpretation of nonlinear QSAR models applied to Ames mutagenicity data.
J Chem Inf Model 49, 11 (Nov 2009), 2551–2558.
Lars Carlsson,
AstraZeneca R&D
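The two method steps (compute the gradient of the decision function at the predicted compound, then pick the descriptor with the largest gradient component) can be sketched as follows. The toy decision function, the central-difference gradient, and the use of the absolute value to rank components are assumptions for illustration; the cited paper should be consulted for the actual procedure.

```python
import math

def decision_function(x):
    """Toy nonlinear decision function over three descriptors (hypothetical)."""
    return math.tanh(2.0 * x[0] - 0.5 * x[1] ** 2 + 0.1 * x[2])

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def most_important_descriptor(f, x):
    """Return (index, gradient component) of the largest-magnitude component."""
    grad = numerical_gradient(f, x)
    index = max(range(len(grad)), key=lambda i: abs(grad[i]))
    return index, grad[index]
```

Because the gradient is evaluated at one compound, the answer is local: for `[0.1, 0.2, 0.3]` the first descriptor dominates, while for `[0.0, 3.0, 0.0]` the second one does.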
Bioclipse Decision Support
Next challenge: Simple model building
• Build a solution where:
– Scientists can build accurate models without modeling
expertise, in order to aid their decision making
– Combine these models with other models
Simple model building with graphical wizards
Next challenge: Predict using distributed services
• OpenTox - European project for creating an
interoperable framework for toxicity predictions
• Academia and industry
• Parts
– Ontology and API
– Query and invocation of predictive services
– Methods and algorithms
– Authentication and authorization
Bioclipse Decision Support
Model
discovery
predictions
Bioclipse and OpenTox
Collaboration with
OpenTox in Bioclipse
Summary of Bioclipse Decision Support
• Flexible, general method
– Apply to any collection of molecules
• State-of-the-art machine-learning methods
• Handles large data sets
• Fast predictions
Advantages with the DS method
• Fast: Can run on local computer
– “Instant predictions”, “calculate as you draw”
• Interpretable results: Can be used for
hypothesis generation
• General: Apply any modeling technique to any
data set
• Extensible: Very easy to add new components
• Open: Free, open source
Observations
• Predictive drug discovery is becoming
data-intensive
– High throughput technologies
• Drug/chemical screening
• Molecular biology (omics)
– More and bigger publicly available data
sources
• Data is continuously updated
→ We need scalable and automated
methods for predictive modeling
Challenges with bigger data sets for machine learning
• Modeling time increases
– Reduce/avoid parameter tuning
– Run on high-performance e-infrastructures
– Use approximate methods
• Not all implementations can handle dataset sizes
– Use sparse implementations
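As an illustration of why sparse implementations help: a fingerprint-style descriptor vector is mostly zeros, so storing only the non-zero entries saves memory and speeds up the dot products that kernel methods rely on. The dictionary representation and function names below are assumptions for this sketch, not the implementation used in the work presented.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def sparse_dot(a, b):
    """Dot product that only visits entries of the smaller sparse vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)
```

The cost of `sparse_dot` depends on the number of non-zero entries, not on the nominal length of the descriptor vector.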
Determine parameter intervals
for modeling (sweetspot)
J. Alvarsson, M. Eklund, C. Andersson, L. Carlsson, O. Spjuth, and Jarl Wikberg.
Benchmarking study of parameter variation when using signature fingerprints together
with support vector machines. J Chem Inf Model. 2014, 54(11), pp 3211–3217.
SVM: Cost and Gamma parameters
Signatures: Heights
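A small sketch of why known parameter intervals matter: the number of models to train is the product of the candidate values for each parameter, so narrowing each interval shrinks the search quickly. All values below are placeholders chosen for the example and are not the intervals reported in the benchmarking study.

```python
from itertools import product

def parameter_grid(costs, gammas, height_ranges):
    """Enumerate every combination of SVM cost, SVM gamma, and signature heights."""
    return [
        {"cost": c, "gamma": g, "heights": h}
        for c, g, h in product(costs, gammas, height_ranges)
    ]

# Hypothetical candidate values (assumptions, not results from the study).
costs = [10.0 ** e for e in range(-1, 3)]
gammas = [10.0 ** e for e in range(-4, 0)]
height_ranges = [(0, 2), (0, 3), (1, 3)]

grid = parameter_grid(costs, gammas, height_ranges)
```

With four costs, four gammas, and three height ranges, the grid already holds 48 combinations, each of which would need its own cross-validated model.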
Example 1: Modeling large number of observations
Jonathan Alvarsson
Example 2: Target predictions
Alvarsson J, Eklund M, Engkvist O, Spjuth O, Carlsson L,
Wikberg JE, Noeske T. Ligand-based target prediction
with signature fingerprints.
J Chem Inf Model. 2014 Oct 27;54(10):2647-53
Challenge with running on HPC
• Reduce manual work
– Automate data preprocessing and modeling
– Support modeling life cycle (build, validate, document,
version, publish, re-train …)
• Automating model building is not trivial
– Aim: Agile, component-based architecture
Example application: Training large
number of datasets
Aim: Build models for hundreds
of targets
– Challenge to extract
– Challenge to automate model
building
Data sources
Samuel Lampa
Automating analysis on HPC clusters
• Workflow systems can aid
development and deployment
• We used the Luigi workflow system
• Integrate with queuing system
(SLURM)
Train and
assess model
Samuel Lampa
https://github.com/spotify/luigi
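One way a workflow step can hand work to the queuing system is to render a batch script and submit it with `sbatch`. The sketch below only builds the script text; it does not show how the Luigi integration in this work was implemented. The function name, the `train.py` command, and the default values are hypothetical, while the `#SBATCH` options used are standard SLURM options.

```python
def render_sbatch(job_name, command, cores=1, hours=1, account=None):
    """Return the text of a SLURM batch script that runs one command."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={cores}",
        f"#SBATCH --time={hours:02d}:00:00",
    ]
    if account:
        # Project/account to charge; only added when given.
        lines.append(f"#SBATCH --account={account}")
    lines.append(command)
    return "\n".join(lines) + "\n"

# Hypothetical training command for one target.
script = render_sbatch("train_model", "python train.py --target T1", cores=4, hours=2)
```

A workflow task could write this text to a file and submit it, then treat the model file produced by the job as its output.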
Example ML pipeline
(unpublished data)
Publishing models
• Publish models for easy access
and consumption
• We used the P2 (OSGi) provisioning
system
v. 1.3
v. 1.2
v. 1.1
Use models
Reactive/continuous modeling
Data sources
Coordinate
Integrate
Version
Monitor
Publish
models
Archive
models
Train and
assess model
User
Bioclipse
Model building WFs on HPC is not trivial
• Many workflow systems exist
– DSLs vs APIs
– Dynamic input/output in e.g. cross-validation not
supported out of the box
• Time-consuming to create WFs
• Workflows can be useful but are not (yet) the silver
bullet we sought
O. Spjuth, E. Bongcam-Rudloff, G. C. Hernandez, L. Forer, M. Giovacchini, R. V. Guimera, A. Kallio, E. Korpelainen,
M. Kandula, M. Krachunov, D. P. Kreil, O. Kulev, P. P. Labaj, S. Lampa, L. Pireddu, S. Schönherr, A. Siretskiy, and D.
Vassilev. Experiences with workflows for automating data-intensive bioinformatics. Accepted in Biology Direct.
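The point about dynamic input/output in cross-validation can be made concrete: the number of train/test tasks, and therefore the number of intermediate files, is only known once the number of folds is chosen at run time. The sketch below generates the per-fold tasks in plain Python; the function name and the index-based representation are assumptions for illustration.

```python
def k_fold_tasks(n_samples, k):
    """Return one (train_indices, test_indices) pair per fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    tasks, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        tasks.append((train, test))
        start += size
    return tasks
```

A workflow system with a static task graph needs extra machinery to express this, since the list returned here has a length that depends on `k`.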
Could cloud computing improve things?
QSAR Modeling on Amazon Elastic Compute Cloud (EC2)
[Figure: time (hours) against number of cores (1, 2, 4, 8, 16); series labelled 20k, 75k, 150k, and 300k]
B. Torabi, J. Alvarsson, M. Holm, M. Eklund, L. Carlsson,
and O. Spjuth. “Scaling predictive modeling in drug
development with cloud computing”.
J. Chem. Inf. Model., 2015, 55 (1), pp 19-25
Private clouds
• We set up an OpenStack system at UPPMAX (our HPC
center)
• Primarily Infrastructure as a Service (IaaS) – users can
run virtual machines
• Platform-as-a-Service (PaaS): Hadoop and Spark
– Our question: Can this be useful for model building?
• Open catalogue of VMIs
• Hosted at Uppsala University
M. Dahlö, F. Haziza, A. Kallio,
E. Korpelainen, E. Bongcam-
Rudloff, and O. Spjuth.
BioImg.org: A catalogue of
virtual machine images for
the life sciences. Accepted in
Bioinformatics and Biology
Insights.
www.bioimg.org
Managing Virtual Machine Images
Cloud computing enables Big Data Analytics
• Hadoop
– Open Source Map-Reduce, suited for massively parallel
tasks
– Distributed execution, high availability, fault tolerant, can
be run on commodity hardware
– E.g. Google, Facebook and Twitter use it
• Hadoop Distributed File System (HDFS) distributes data on
nodes, computing done in parallel
– “bring computations to data”
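The MapReduce pattern itself is small enough to sketch in plain Python. The example counts records across data partitions; in Hadoop each partition would live on a different node and the map phase would run where the data is stored. The function names and the toy data are assumptions, and nothing here uses the Hadoop API.

```python
from collections import defaultdict

def map_phase(partition):
    """Emit a (key, 1) pair for every record in one data partition."""
    return [(record, 1) for record in partition]

def shuffle(mapped_partitions):
    """Group all emitted values by key, across partitions."""
    groups = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

def map_reduce(partitions):
    # Each map_phase call is independent, so a cluster can run them in parallel.
    mapped = [map_phase(p) for p in partitions]
    return reduce_phase(shuffle(mapped))
```

Only the small (key, value) pairs move between nodes in the shuffle step, which is the sense in which computation is brought to the data.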
Hadoop (MapReduce) for massively parallel analysis
Evaluating Hadoop for next-generation sequencing
• Compare Hadoop and HPC
– Create pipelines that are as identical as possible
– Calculate efficiency as a function of data size
– Conclusion: Hadoop pipeline scales better
than HPC and is economical for current data
sizes
Alexey Siretskiy, former
postdoc at UPPMAX
A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth.
A quantitative assessment of the Hadoop framework for analyzing
massively parallel DNA sequencing data. Gigascience (2015) Jun 4; 4:26.
A. Siretskiy and O. Spjuth. HTSeq-Hadoop: Extending HTSeq for Massively
Parallel Sequencing Data Analysis using Hadoop. In e-Science, 2014 IEEE
10th International Conference on (2014), vol. 1, pp. 317–323.
SPARK
• Add caching to Hadoop
(MapReduce) – in memory
computing
• Good for iterative
algorithms
• We applied it for ligand-
based virtual screening
With Åke Edlund,
HPCViz, KTH
L. Ahmed, A. Edlund, E. Laure, O. Spjuth. Using Iterative
MapReduce for Parallel Virtual Screening. Cloud Computing
Technology and Science (CloudCom), 2013 IEEE 5th
International Conference on, vol. 2, pp. 27-32, 2013
Large-scale machine learning on Spark
• Ongoing project: Create a large-scale machine
learning pipeline for QSAR using Spark ML as
alternative to Luigi workflow system
– Apply to large data sets
– Apply to many data sets
– Compare Spark with workflows on Batch system
– Aim: Use for Reactive Modeling
Some conclusions so far on cloud computing and
Hadoop/Spark for bioinformatics
• Cloud computing
– Easy provisioning of infrastructures, services and platforms
• Hadoop
– Scalable and efficient – but at the price of software incompatibility
• Spark
– Improves on Hadoop with in-memory computing and a more intuitive
interface
• Current working hypothesis: Spark is more advantageous
than workflows on batch systems for machine
learning pipelines
Conformal prediction
Seek answer to: “How good is your prediction?”
• Traditional machine learning algorithms:
– Simple predictions (e.g. “Class A”, 8.45)
• Conformal predictions
– Prediction intervals for a given confidence level
– Based on a consistent and well-defined mathematical
framework [1]
1. Vovk, V.; Gammerman, A.; Shafer, G. “Algorithmic learning in a random world”; Springer: New York, 2005.
Conformal predictions
Norinder, U., Carlsson, L., Boyer, S., and Eklund, M. Introducing conformal prediction in predictive modeling. A
transparent and flexible alternative to applicability domain determination. J Chem Inf Model 54, 6 (Jun 2014),
1596–603.
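How a prediction interval for a given confidence level can be obtained is sketched below for the regression case, using the inductive (split) variant of conformal prediction. The absolute residual as nonconformity measure, the calibration values, and the function name are assumptions for illustration; they are not taken from the cited work.

```python
import math

def conformal_interval(point_prediction, calibration_residuals, confidence):
    """Turn a point prediction into an interval at the given confidence level.

    calibration_residuals: (observed - predicted) on a held-out calibration set.
    """
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Rank of the calibration score that bounds the interval half-width.
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        # Too few calibration examples to support this confidence level.
        half_width = math.inf
    else:
        half_width = scores[rank - 1]
    return point_prediction - half_width, point_prediction + half_width

# Hypothetical calibration residuals from an underlying regression model.
residuals = [0.1, -0.4, 0.3, -0.2, 0.5, 0.6, -0.7, 0.8, -0.9]
low, high = conformal_interval(8.45, residuals, confidence=0.8)
```

Raising the confidence level widens the interval; with only nine calibration examples, asking for 95% confidence already yields an unbounded interval, which signals that more calibration data is needed.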
Some projects on Conformal Predictions
• CP Feature Highlighting
• CP in Spark
• Large-scale model building in
cheminformatics and virtual
screening
– Ongoing projects
Ahlberg E, Spjuth O, Hasselgren C, Carlsson L. Interpretation of Conformal
Prediction Classification Models. Statistical Learning and Data Sciences. Springer
International Publishing; 2015. pp. 323–334.
Capuccini M, Carlsson L, Norinder U., and Spjuth O. Conformal Prediction in
Spark: Large-Scale Machine Learning with Confidence. Submitted.
Two pilots for clinical data management
CML, Lucia Cavelier
MDR, Åsa Melhus
e-Science (cyberinfrastructure, “big data”)
“Systematic and advanced use of computers in
research”
– High-performance computing
– Distributed data, “Big data”
– Enabling science!
www.e-science.se www.essenceofescience.se
Acknowledgements
Workflows
Samuel Lampa
David Kreil
Maciej Kańduła
BioImg.org
Martin Dahlö
Frédéric Haziza
Mentell Design
Hadoop & Spark
Alexey Siretskiy
Åke Edlund
Izhar ul Hassan
Marco Capuccini
Staffan Arvidsson
Cloud computing
Frédéric Haziza
Tore Sundqvist
Behrooz Torabi
Salman Toor
Andreas Hellander
Predictive modeling
Lars Carlsson
Ernst Ahlberg-Helgee
Martin Eklund
Ulf Norinder
Wesley Schaal
Jonathan Alvarsson
Bioclipse
Arvid Berg
Egon Willighagen
All Bioclipse and CDK
contributors
Thank you
ola.spjuth@farmbio.uu.se

Mais conteúdo relacionado

Mais procurados

Assessing Galaxy's ability to express scientific workflows in bioinformatics
Assessing Galaxy's ability to express scientific workflows in bioinformaticsAssessing Galaxy's ability to express scientific workflows in bioinformatics
Assessing Galaxy's ability to express scientific workflows in bioinformaticsPeter van Heusden
 
Databases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyDatabases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyYannick Pouliot
 
Big Data & ML for Clinical Data
Big Data & ML for Clinical DataBig Data & ML for Clinical Data
Big Data & ML for Clinical DataPaul Agapow
 
eNanoMapper - A Database and Ontology Framework for Nanomaterials Design and ...
eNanoMapper - A Database and Ontology Framework for Nanomaterials Design and ...eNanoMapper - A Database and Ontology Framework for Nanomaterials Design and ...
eNanoMapper - A Database and Ontology Framework for Nanomaterials Design and ...Barry Hardy
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...Carole Goble
 
2019 03-11 bio it-world west genepattern notebook slides
2019 03-11 bio it-world west genepattern notebook slides2019 03-11 bio it-world west genepattern notebook slides
2019 03-11 bio it-world west genepattern notebook slidesMichael Reich
 
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017Deborah McGuinness
 
Reproducible research: First steps.
Reproducible research: First steps. Reproducible research: First steps.
Reproducible research: First steps. Richard Layton
 
Usage-Based vs. Citation-Based Recommenders in a Digital Library
Usage-Based vs. Citation-Based Recommenders in a Digital LibraryUsage-Based vs. Citation-Based Recommenders in a Digital Library
Usage-Based vs. Citation-Based Recommenders in a Digital LibraryAndre Vellino
 
Reproducibility and Scientific Research: why, what, where, when, who, how
Reproducibility and Scientific Research: why, what, where, when, who, how Reproducibility and Scientific Research: why, what, where, when, who, how
Reproducibility and Scientific Research: why, what, where, when, who, how Carole Goble
 
resume v 5.0
resume v 5.0resume v 5.0
resume v 5.0Ye Xu
 
Ilya Kupershmidt speaks at the Molecular Medicine Tri-Conference
Ilya Kupershmidt speaks at the Molecular Medicine Tri-ConferenceIlya Kupershmidt speaks at the Molecular Medicine Tri-Conference
Ilya Kupershmidt speaks at the Molecular Medicine Tri-ConferenceNextBio
 

Mais procurados (13)

Assessing Galaxy's ability to express scientific workflows in bioinformatics
Assessing Galaxy's ability to express scientific workflows in bioinformaticsAssessing Galaxy's ability to express scientific workflows in bioinformatics
Assessing Galaxy's ability to express scientific workflows in bioinformatics
 
Databases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyDatabases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems Immunology
 
Big Data & ML for Clinical Data
Big Data & ML for Clinical DataBig Data & ML for Clinical Data
Big Data & ML for Clinical Data
 
eNanoMapper - A Database and Ontology Framework for Nanomaterials Design and ...
eNanoMapper - A Database and Ontology Framework for Nanomaterials Design and ...eNanoMapper - A Database and Ontology Framework for Nanomaterials Design and ...
eNanoMapper - A Database and Ontology Framework for Nanomaterials Design and ...
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
 
2019 03-11 bio it-world west genepattern notebook slides
2019 03-11 bio it-world west genepattern notebook slides2019 03-11 bio it-world west genepattern notebook slides
2019 03-11 bio it-world west genepattern notebook slides
 
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
Ontologies For the Modern Age - McGuinness' Keynote at ISWC 2017
 
Reproducible research: First steps.
Reproducible research: First steps. Reproducible research: First steps.
Reproducible research: First steps.
 
Usage-Based vs. Citation-Based Recommenders in a Digital Library
Usage-Based vs. Citation-Based Recommenders in a Digital LibraryUsage-Based vs. Citation-Based Recommenders in a Digital Library
Usage-Based vs. Citation-Based Recommenders in a Digital Library
 
Reproducibility and Scientific Research: why, what, where, when, who, how
Reproducibility and Scientific Research: why, what, where, when, who, how Reproducibility and Scientific Research: why, what, where, when, who, how
Reproducibility and Scientific Research: why, what, where, when, who, how
 
Resume 2016 detailed
Resume 2016 detailedResume 2016 detailed
Resume 2016 detailed
 
resume v 5.0
resume v 5.0resume v 5.0
resume v 5.0
 
Ilya Kupershmidt speaks at the Molecular Medicine Tri-Conference
Ilya Kupershmidt speaks at the Molecular Medicine Tri-ConferenceIlya Kupershmidt speaks at the Molecular Medicine Tri-Conference
Ilya Kupershmidt speaks at the Molecular Medicine Tri-Conference
 

Destaque

Catalysing Innovation in Pharma IT: Keeping AstraZeneca Ahead of Disruptive T...
Catalysing Innovation in Pharma IT: Keeping AstraZeneca Ahead of Disruptive T...Catalysing Innovation in Pharma IT: Keeping AstraZeneca Ahead of Disruptive T...
Catalysing Innovation in Pharma IT: Keeping AstraZeneca Ahead of Disruptive T...Nick Brown
 
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...Nick Brown
 
Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...Ola Spjuth
 
Python Generators - Talk at PySthlm meetup #15
Python Generators - Talk at PySthlm meetup #15Python Generators - Talk at PySthlm meetup #15
Python Generators - Talk at PySthlm meetup #15Samuel Lampa
 
Improved Predictions in Structure-Based Drug Design Using CART and Bayesian M...
Improved Predictions in Structure-Based Drug Design Using CART and Bayesian M...Improved Predictions in Structure-Based Drug Design Using CART and Bayesian M...
Improved Predictions in Structure-Based Drug Design Using CART and Bayesian M...Salford Systems
 
Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryAnn-Marie Roche
 
2nd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in Bioclipse
2nd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in Bioclipse2nd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in Bioclipse
2nd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in BioclipseSamuel Lampa
 
Hooking up Semantic MediaWiki with external tools via SPARQL
Hooking up Semantic MediaWiki with external tools via SPARQLHooking up Semantic MediaWiki with external tools via SPARQL
Hooking up Semantic MediaWiki with external tools via SPARQLSamuel Lampa
 
Development and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning modelsDevelopment and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning modelsSean Ekins
 
Chemical Spaces: Modeling, Exploration & Understanding
Chemical Spaces: Modeling, Exploration & UnderstandingChemical Spaces: Modeling, Exploration & Understanding
Chemical Spaces: Modeling, Exploration & UnderstandingRajarshi Guha
 
Exploiting bigger data and collaborative tools for predictive drug discovery
Exploiting bigger data and collaborative tools for predictive drug discovery Exploiting bigger data and collaborative tools for predictive drug discovery
Exploiting bigger data and collaborative tools for predictive drug discovery Sean Ekins
 
SciPipe - A light-weight workflow library inspired by flow-based programming
SciPipe - A light-weight workflow library inspired by flow-based programmingSciPipe - A light-weight workflow library inspired by flow-based programming
SciPipe - A light-weight workflow library inspired by flow-based programmingSamuel Lampa
 
iRODS Rule Language Cheat Sheet
iRODS Rule Language Cheat SheetiRODS Rule Language Cheat Sheet
iRODS Rule Language Cheat SheetSamuel Lampa
 
Flow based programming an overview
Flow based programming   an overviewFlow based programming   an overview
Flow based programming an overviewSamuel Lampa
 
Cheminformatics in drug design
Cheminformatics in drug designCheminformatics in drug design
Cheminformatics in drug designSurmil Shah
 
Chemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleChemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleRajarshi Guha
 
AddisDev Meetup ii: Golang and Flow-based Programming
AddisDev Meetup ii: Golang and Flow-based ProgrammingAddisDev Meetup ii: Golang and Flow-based Programming
AddisDev Meetup ii: Golang and Flow-based ProgrammingSamuel Lampa
 
Reproducibility in Scientific Data Analysis - BioScience Seminar
Reproducibility in Scientific Data Analysis - BioScience SeminarReproducibility in Scientific Data Analysis - BioScience Seminar
Reproducibility in Scientific Data Analysis - BioScience SeminarSamuel Lampa
 
Introduction to the drug discovery process
Introduction to the drug discovery processIntroduction to the drug discovery process
Introduction to the drug discovery processThanh Truong
 

Destaque (20)

Catalysing Innovation in Pharma IT: Keeping AstraZeneca Ahead of Disruptive T...
Catalysing Innovation in Pharma IT: Keeping AstraZeneca Ahead of Disruptive T...Catalysing Innovation in Pharma IT: Keeping AstraZeneca Ahead of Disruptive T...
Catalysing Innovation in Pharma IT: Keeping AstraZeneca Ahead of Disruptive T...
 
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...
Artificial Intelligence, Predictive Modelling and Chatbots: Applications in P...
 
Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...
 
Python Generators - Talk at PySthlm meetup #15
Python Generators - Talk at PySthlm meetup #15Python Generators - Talk at PySthlm meetup #15
Python Generators - Talk at PySthlm meetup #15
 
Improved Predictions in Structure-Based Drug Design Using CART and Bayesian M...
Improved Predictions in Structure-Based Drug Design Using CART and Bayesian M...Improved Predictions in Structure-Based Drug Design Using CART and Bayesian M...
Improved Predictions in Structure-Based Drug Design Using CART and Bayesian M...
 
Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistry
 
2nd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in Bioclipse
2nd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in Bioclipse2nd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in Bioclipse
2nd Proj. Update: Integrating SWI-Prolog for Semantic Reasoning in Bioclipse
 
Hooking up Semantic MediaWiki with external tools via SPARQL
Hooking up Semantic MediaWiki with external tools via SPARQLHooking up Semantic MediaWiki with external tools via SPARQL
Hooking up Semantic MediaWiki with external tools via SPARQL
 
Development and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning modelsDevelopment and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning models
 
Chemical Spaces: Modeling, Exploration & Understanding
Chemical Spaces: Modeling, Exploration & UnderstandingChemical Spaces: Modeling, Exploration & Understanding
Chemical Spaces: Modeling, Exploration & Understanding
 
Exploiting bigger data and collaborative tools for predictive drug discovery
Exploiting bigger data and collaborative tools for predictive drug discovery Exploiting bigger data and collaborative tools for predictive drug discovery
Exploiting bigger data and collaborative tools for predictive drug discovery
 
SciPipe - A light-weight workflow library inspired by flow-based programming
SciPipe - A light-weight workflow library inspired by flow-based programmingSciPipe - A light-weight workflow library inspired by flow-based programming
SciPipe - A light-weight workflow library inspired by flow-based programming
 
iRODS Rule Language Cheat Sheet
iRODS Rule Language Cheat SheetiRODS Rule Language Cheat Sheet
iRODS Rule Language Cheat Sheet
 
Flow based programming an overview
Flow based programming   an overviewFlow based programming   an overview
Flow based programming an overview
 
Dispensing error
Dispensing errorDispensing error
Dispensing error
 
Cheminformatics in drug design
Cheminformatics in drug designCheminformatics in drug design
Cheminformatics in drug design
 
Chemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleChemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & Reproducible
 
AddisDev Meetup ii: Golang and Flow-based Programming
AddisDev Meetup ii: Golang and Flow-based ProgrammingAddisDev Meetup ii: Golang and Flow-based Programming
AddisDev Meetup ii: Golang and Flow-based Programming
 
Reproducibility in Scientific Data Analysis - BioScience Seminar
Reproducibility in Scientific Data Analysis - BioScience SeminarReproducibility in Scientific Data Analysis - BioScience Seminar
Reproducibility in Scientific Data Analysis - BioScience Seminar
 
Introduction to the drug discovery process
Introduction to the drug discovery processIntroduction to the drug discovery process
Introduction to the drug discovery process
 

Semelhante a Agile large-scale machine-learning pipelines in drug discovery

Chemical decision support in toxicology and pharmacology (OpenToxEU 2013)
Chemical decision support in toxicology and pharmacology (OpenToxEU 2013)Chemical decision support in toxicology and pharmacology (OpenToxEU 2013)
Chemical decision support in toxicology and pharmacology (OpenToxEU 2013)Ola Spjuth
 
The case for cloud computing in Life Sciences
The case for cloud computing in Life SciencesThe case for cloud computing in Life Sciences
The case for cloud computing in Life SciencesOla Spjuth
 
Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Ola Spjuth
 
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker, Inc.
 
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford MedMachine Learning in Modern Medicine with Erin LeDell at Stanford Med
Agile large-scale machine-learning pipelines in drug discovery

  • 1. Agile large-scale machine-learning pipelines in drug discovery Ola Spjuth Department of Pharmaceutical Biosciences and Science for Life Laboratory Uppsala University, Sweden ola.spjuth@farmbio.uu.se
  • 2. Outline • My research in perspective • Our approach to machine learning in ligand-based modeling • Challenges when data grows • Automation workflows/pipelines • HPC, Cloud Computing and Big Data Analytics
  • 3. From data to insights • We have access to a wealth of information • Data mining and predictive modeling can be useful
  • 4. History: Bioclipse – an open source workbench for the life sciences O. Spjuth, J. Alvarsson, A. Berg, M. Eklund, S. Kuhn, C. Mäsak, G. Torrance, J. Wagener, E.L. Willighagen, C. Steinbeck, and J.E.S. Wikberg. Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 2009, 10:397 Spjuth O, Helmus T, Willighagen EL, Kuhn S, Eklund M, Wagener J, Murray-Rust P, Steinbeck C, Wikberg JES: Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.
  • 5. How is the compound metabolized? Are any of its metabolites reactive/toxic? Here? Here? Is it toxic? Chemical liabilities (drug safety, alerts) Adverse effects? Can we, based on existing experimental studies, IT, and statistical models, predict the outcome for new compounds?
  • 6. Starting out in 2008 with a challenge: • Build a system with predictive models which runs on the client – Initial problem: Site-of-metabolism prediction
  • 7. Site-of-metabolism (SOM) predictions – MetaPrint2D L. Carlsson, O. Spjuth, S. Adams, R. C. Glen, and S. Boyer. Use of historic metabolic biotransformation data as a means of anticipating metabolic sites using MetaPrint2D and Bioclipse. BMC Bioinformatics 2010, 11:362 Boyer S, Arnby CH, Carlsson L, Smith J, Stein V, Glen RC. Reaction site mapping of xenobiotic biotransformations. J Chem Inf Model. 2007 Mar-Apr;47(2):583-90. Reaction Database MetaPrint2D database Circular Fingerprints Highest probability of metabolism Low probability of metabolism Medium probability of metabolism Mapping
  • 9. Next challenge: Extend to general predictive models • Fast predictive models, allow for instant updates upon structural changes • Span from virtual screening to lead optimization
  • 10. Bioclipse Decision Support • Integrate various predictive methods – Similarity searches (InChI, signatures, fingerprints) – Structural alerts (toxicophores) – QSAR models (classification, regression) • Visual interpretation – Highlight important substructures O. Spjuth, L. Carlsson, M. Eklund, E. Ahlberg Helgee, and Scott Boyer. Integrated decision support for assessing chemical liabilities. Accepted in J. Chem. Inf. Model, 2011.
  • 11. Ligand-based predictive modeling Quantitative Structure-Activity Relationship (QSAR) – Start with a dataset of chemical structures with measured property to model (inhibition, toxicity, etc) – Describe chemicals using descriptors – Make use of statistical modeling to relate chemical structures to a response
  • 12. Machine learning pipelines Preprocessing Model building Validation Reporting
  • 13. QSAR modeling • Signatures [1] descriptor in CDK [2] – Canonical representation of atom environments • Support Vector Machine (SVM) – Robust modeling [1] Faulon, J.-L.; Visco, D. P.; Pophale, R. S. Journal of Chemical Information and Computer Sciences, 2003, 43, 707-720 [2] Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. Journal of Chemical Information and Computer Sciences, 2003, 43, 493-500.
  • 14. Local interpretation of nonlinear QSAR models • Method – Compute gradient of decision function for prediction – Extract descriptor(s) with largest component in the gradient • Demonstrated on RF, SVM, and PLS Carlsson, L., Helgee, E. A., and Boyer, S. Interpretation of nonlinear QSAR models applied to Ames mutagenicity data. J Chem Inf Model 49, 11 (Nov 2009), 2551–2558. Lars Carlsson, AstraZeneca R&D
  • 16. Next challenge: Simple model building • Build a solution where: – Scientists can build accurate models without modeling expertise, in order to aid their decision making – Combine these models with other models
  • 17. Simple model building with graphical wizards
  • 18. Next challenge: Predict using distributed services • OpenTox - European project for creating an interoperable framework for toxicity predictions • Academia and industry • Parts – Ontology and API – Query and invocation of predictive services – Methods and algorithms – Authentication and authorization
  • 22. Summary of Bioclipse Decision Support • Flexible, general method – Apply to any collection of molecules • State-of-the-art machine-learning methods • Handles large data sets • Fast predictions
  • 23. Advantages with the DS method • Fast: Can run on local computer – “Instant predictions”, “calculate as you draw” • Interpretable results: Can be used for hypothesis generation • General: Apply any modeling technique to any data set • Extensible: Very easy to add new components • Open: Free, open source
  • 24. Observations • Predictive drug discovery is becoming data-intensive – High throughput technologies • Drug/chemical screening • Molecular biology (omics) – More and bigger publicly available data sources • Data is continuously updated → We need scalable and automated methods for predictive modeling
  • 25. Challenges with bigger data sets for machine learning • Modeling time increases – Reduce/avoid parameter tuning – Run on high-performance e-infrastructures – Use approximate methods • Not all implementations can handle dataset sizes – Use sparse implementations
  • 26. Determine parameter intervals for modeling (sweetspot) J. Alvarsson, M. Eklund, C. Andersson, L. Carlsson, O. Spjuth, and Jarl Wikberg. Benchmarking study of parameter variation when using signature fingerprints together with support vector machines. J Chem Inf Model. 2014, 54(11), pp 3211–3217. SVM: Cost and Gamma parameters Signatures: Heights
  • 27. Example 1: Modeling large number of observations Jonathan Alvarsson
  • 28. Example 2: Target predictions Alvarsson J, Eklund M, Engkvist O, Spjuth O, Carlsson L, Wikberg JE, Noeske T. Ligand-based target prediction with signature fingerprints. J Chem Inf Model. 2014 Oct 27;54(10):2647-53
  • 29. Challenge with running on HPC • Reduce manual work – Automate data preprocessing and modeling – Support modeling life cycle (build, validate, document, version, publish, re-train …) • Automating model building is not trivial – Aim: Agile, component-based architecture
  • 30. Example application: Training large number of datasets Aim: Build models for hundreds of targets – Challenge to extract – Challenge to automate model building Data sources Samuel Lampa
  • 31. Automating analysis on HPC clusters • Workflow systems can aid development and deployment • We used Luigi system • Integrate with queuing system (SLURM) Train and assess model Samuel Lampa https://github.com/spotify/luigi
  • 33. Publishing models • Publish models for easy access and consumption • We used P2 (OSGi) provisioning system v. 1.3 v. 1.2 v. 1.1 Use models
  • 35. Model building WFs on HPC is not trivial • Many workflow systems exist – DSLs vs APIs – Dynamic input/output in e.g. cross-validation not supported out of the box • Time-consuming to create WFs • Workflows can be useful but are not (yet) the silver bullet we sought O. Spjuth, E. Bongcam-Rudloff, G. C. Hernandez, L. Forer, M. Giovacchini, R. V. Guimera, A. Kallio, E. Korpelainen, M. Kandula, M. Krachunov, D. P. Kreil, O. Kulev, P. P. Labaj, S. Lampa, L. Pireddu, S. Schönherr, A. Siretskiy, and D. Vassilev. Experiences with workflows for automating data-intensive bioinformatics. Accepted in Biology Direct.
  • 36. Could cloud computing improve things?
  • 37. QSAR Modeling on Amazon Elastic Cloud [Figure: time (hours) versus number of cores (1, 2, 4, 8, 16); series labelled 20k, 75k, 150k, 300k] B. Torabi, J. Alvarsson, M. Holm, M. Eklund, L. Carlsson, and O. Spjuth. “Scaling predictive modeling in drug development with cloud computing”. J. Chem. Inf. Model., 2015, 55 (1), pp 19-25
  • 38. Private clouds • We set up an OpenStack system at UPPMAX (our HPC center) • Primarily Infrastructure as a Service (IaaS) – users can run virtual machines • Platform-as-a-Service (PaaS): Hadoop and Spark – Our question: Can this be useful for model building?
  • 39. Managing Virtual Machine Images • Open catalogue of virtual machine images (VMIs) • Hosted at Uppsala University www.bioimg.org M. Dahlö, F. Haziza, A. Kallio, E. Korpelainen, E. Bongcam-Rudloff, and O. Spjuth. BioImg.org: A catalogue of virtual machine images for the life sciences. Accepted in Bioinformatics and Biology Insights.
  • 40. Cloud computing enables Big Data Analytics • Hadoop – Open Source Map-Reduce, suited for massively parallel tasks – Distributed execution, high availability, fault tolerant, can be run on commodity hardware – E.g. Google, Facebook and Twitter use it • Hadoop File System (HDFS) distributes data on nodes, computing done in parallel – “bring computations to data”
  • 41. Hadoop (MapReduce) for massively parallel analysis
  • 42. Evaluating Hadoop for next-generation sequencing • Compare Hadoop and HPC – Create as identical pipelines as possible – Calculate efficiency as function of data size – Conclusion: Hadoop pipeline scales better than HPC and is economical for current data sizes Alexey Siretskiy, former postdoc at UPPMAX A. Siretskiy, L. Pireddu, T. Sundqvist, and O. Spjuth. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. Gigascience (2015) Jun 4; 4:26. A. Siretskiy and O. Spjuth. HTSeq-Hadoop: Extending HTSeq for Massively Parallel Sequencing Data Analysis using Hadoop. In e-Science, 2014 IEEE 10th International Conference on (2014), vol. 1, pp. 317–323.
  • 43. SPARK • Adds caching to Hadoop (MapReduce) – in-memory computing • Good for iterative algorithms • We applied it for ligand-based virtual screening With Åke Edlund, HPCViz, KTH L. Ahmed, A. Edlund, E. Laure, O. Spjuth. Using Iterative MapReduce for Parallel Virtual Screening. Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on, vol. 2, pp. 27-32, 2-5, 2013
  • 44. Large-scale machine learning on Spark • Ongoing project: Create a large-scale machine learning pipeline for QSAR using Spark ML as alternative to Luigi workflow system – Apply to large data sets – Apply to many data sets – Compare Spark with workflows on Batch system – Aim: Use for Reactive Modeling
  • 45. Some conclusions so far on cloud computing and Hadoop/Spark for bioinformatics • Cloud computing – Easy provisioning of infrastructures, services and platforms • Hadoop – Scalable and efficient – but at the price of software incompatibility • Spark – improves over Hadoop with in-memory computing and a more intuitive interface • Current working hypothesis: Spark is more advantageous than workflows on batch systems for machine learning pipelines
  • 46. Conformal prediction Seek answer to: “How good is your prediction?” • Traditional machine learning algorithms: – Simple predictions (e.g. “Class A”, 8.45) • Conformal predictions – Prediction intervals for a given confidence level – based on a consistent and well-defined mathematical framework [1] [1] Vovk, V.; Gammerman, A.; Shafer, G. “Algorithmic learning in a random world”; Springer: New York, 2005.
  • 47. Conformal predictions Norinder, U., Carlsson, L., Boyer, S., and Eklund, M. Introducing conformal prediction in predictive modeling. A transparent and flexible alternative to applicability domain determination. J Chem Inf Model 54, 6 (Jun 2014), 1596–603.
  • 48. Some projects on Conformal Predictions • CP Feature Highlighting • CP in Spark • Large-scale model building in cheminformatics and virtual screening – Ongoing projects Ahlberg E, Spjuth O, Hasselgren C, Carlsson L. Interpretation of Conformal Prediction Classification Models. Statistical Learning and Data Sciences. Springer International Publishing; 2015. pp. 323–334. Capuccini M, Carlsson L, Norinder U., and Spjuth O. Conformal Prediction in Spark: Large-Scale Machine Learning with Confidence. Submitted.
  • 49. Two pilots for clinical data management
  • 52. e-Science (cyberinfrastructure, “big data”) “Systematic and advanced use of computers in research” – High-performance computing – Distributed data, “Big data” – Enabling science! www.e-science.se www.essenceofescience.se
  • 53. Acknowledgements Workflows: Samuel Lampa, David Kreil, Maciej Kańduła. BioImg.org: Martin Dahlö, Frédéric Haziza, Mentell Design. Hadoop & Spark: Alexey Siretskiy, Åke Edlund, Izhar ul Hassan, Marco Capuccini, Staffan Arvidsson. Cloud computing: Frédéric Haziza, Tore Sundqvist, Behrooz Torabi, Salman Toor, Andreas Hellander. Predictive modeling: Lars Carlsson, Ernst Ahlberg-Helgee, Martin Eklund, Ulf Norinder, Wesley Schaal, Jonathan Alvarsson. Bioclipse: Arvid Berg, Egon Willighagen, all Bioclipse and CDK contributors.
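
The conformal prediction idea on slide 46 (prediction intervals for a chosen confidence level) can be illustrated with a small sketch. It assumes an inductive (split) conformal regressor with absolute residuals as nonconformity scores. The toy data, the least-squares line and every function name are illustrative assumptions, not the method or code used in the cited work.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (stand-in for any model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def conformal_halfwidth(scores, confidence):
    """Interval half-width from calibration nonconformity scores.

    Takes the k-th smallest score with k = ceil((n + 1) * confidence).
    If k exceeds n there are too few calibration points for the requested
    confidence, so the interval is unbounded.
    """
    n = len(scores)
    k = math.ceil((n + 1) * confidence)
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def predict_interval(model, halfwidth, x):
    """Point prediction plus or minus the calibrated half-width."""
    a, b = model
    y_hat = a + b * x
    return y_hat - halfwidth, y_hat + halfwidth


# Hypothetical toy data: a noisy linear response.
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(400)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

# Proper training set, calibration set and test set must not overlap.
train_x, train_y = xs[:200], ys[:200]
cal_x, cal_y = xs[200:300], ys[200:300]
test_x, test_y = xs[300:], ys[300:]

model = fit_line(train_x, train_y)
a, b = model

# Nonconformity score: absolute error on the calibration set.
cal_scores = [abs(y - (a + b * x)) for x, y in zip(cal_x, cal_y)]
halfwidth = conformal_halfwidth(cal_scores, confidence=0.8)

# Empirical coverage on held-out data should be close to the confidence level.
hits = 0
for x, y in zip(test_x, test_y):
    low, high = predict_interval(model, halfwidth, x)
    if low <= y <= high:
        hits += 1
coverage = hits / len(test_x)
print(f"half-width={halfwidth:.2f}, empirical coverage={coverage:.2f}")
```

The underlying model is interchangeable here: only the calibration scores determine the interval width, which is why the same wrapper can sit on top of SVM or random forest models.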

Editor's Notes

  1. Open source Eclipse plugin architecture
  2. Predicting SOM in silico Cheap Reasonably effective Fast, can be used in earlier steps than optimization Results on par with other tools Used a lot at e.g. AstraZeneca Wet lab experiments are slow, expensive and not exact
  3. Mutagenicity: ability of a substance to induce mutations to DNA Carcinogenic Potency Database (CPDB) aryl hydrocarbon receptor (AHR), transcription factor involved in metabolizing enzymes, important target because of a promiscuous ligand binding site
  4. Advantages: Fast: Run on local computers Interpretable results: Can be used for hypothesis generation General: Can integrate any modeling technique and be applied to any data set Extensible: Very easy to add new components
  5. European project for creating an interoperable framework for toxicity predictions Academia and industry Parts Ontology and API Query and invocation of predictive services Methods and algorithms Authentication and authorization
  6. Rich user interface for OpenTox! Screenshot with OpenTox predictions run in DSView… Safety profiles, rich clients allow for a rich, customizable GUI
  7. Keeping predictive models up to date is challenging Versioning of models not trivial