SlideShare uma empresa Scribd logo
1 de 25
Baixar para ler offline
Leveling up chemical information
for the era of big data
Alex M. Clark
chemical informatics
Background
◈ Big Data: more like Large Data (+ inconvenience)
◈ We can have aspirations
◈ For any established subset of chemistry: vast
amounts of data exists...
◈ ... but not in a way computers can use it
2
chemical informatics
History
◈ Early 20th century: science was a small town,
everyone knew each other
◈ The dead tree model of publishing scaled nicely
◈ 1990s: my grad advisor (W.R. Roper) knew all the
players in organometallics, read everything
◈ 2000s: my postdoc advisor (C.A. Reed) opined we
were drowning in peer reviewed literature
3
chemical informatics
Grains of sand on a beach
◈ From my graduate
research...
◈ ... and a hundred
more much like it.
4
◈ Pure academic research: to see if we can
◈ Published in respectable journals: a few citations,
mostly forgotten in dusty tomes & PDF files
◈ All that time & effort & funding... for what?
chemical informatics
To be a datapoint
◈ We need to rethink the target audience: it's not
humans... it's machines
◈ They can handle the scale...
◈ ... but they cannot interpret the input data.
5
◈ Every unit of science has
potential value, but no way to be
sure when or to whom
◈ Nobody can read it all, but the
solution is not to publish less
chemical informatics
Digital trees
◈ Computers for writing & reading since 1990s
◈ Yet we use tech to pretend we're with Newton &
friends in early days of the Royal Society
◈ Each step is digital, but the process is modelled on
typewriters and wood carvings
◈ What little data existed in the intermediate documents
is destroyed during the final production
6
chemical informatics
Progress?
◈ Most chemical data is silo'd and opaque
◈ Drug discovery is very motivated: there are structure-
activity datasets that are free, large, useful and
accurate (previously: pick zero)
◈ Now: ChEMBL, PubChem & others (SAR data)
◈ Improved accessibility:
▷ open access is growing (still not data though)
▷ APIs to the patent literature
▷ FAIR data movement: building steam
7
chemical informatics
Dark data
◈ Almost all published chemical research is visible only to
humans, and effectively dark to machines
◈ This is almost as true now as it was in 1700
◈ Solutions have distinct categories:
▷ fix the past: curate old publications using descriptive
data formats - machine learning and text mining can
help, but still need expensive expert humans
▷ fix the present (+ future): publish new data using file
formats that target machines and humans - need
better tools and different attitudes
8
http://molnote.com
Case 1: Figures
◈ Chemists painstakingly draw molecules & reactions
9
◈ Interpretability is high for other chemists...
◈ ... and surprisingly low for machines.
http://molnote.com
Cheminformatics first
◈ Similar amount of effort to
draw fully defined content
◈ e.g. Molecular Notebook
◈ Draw structures & reaction
components for computers
◈ Let algorithms do the
rendering...
10
http://molpress.com
Internet-era
manuscripts
◈ Using standard wordprocessor
is not good for data
◈ WordPress plugin MolPress:
▷ insert & retain raw data
▷ plugin handles rendering
▷ open source
◈ Use blogging software as an
electronic lab notebook
11
http://assaycentral.org
Case 2: Assay Central
◈ Rare & neglected diseases: long tail, few resources
◈ Premise: the data is out there
◈ Pragmatism: grab & fix, by any means necessary
◈ Sources:
▷ ChEMBL, PubChem (parts thereof)
▷ curation, collaboration, importing, scraping
◈ Goal: seed all targets with only good data
12
http://assaycentral.org
Provenance
◈ Some data is good: just have to label (e.g. target),
others must sample, validate by flagging
◈ Merging: standardise, group by, build models
◈ Project ➔ GitHub repository: a molecular build system
13
http://assaycentral.org
Visualisation
◈ Deployable web app, data & models:
14
http://assaycentral.org
Virtuous cycle
◈ Initial data integration is labour intensive
◈ Build models & other decision support
◈ Can generate prescribed molecules
▷ high predictions for desired target(s)
▷ low predictions for undesirable off-targets
◈ Collaborators can order & test these candidates, and
return the data in model-ready format
15
http://collaborativedrug.com
Case 3: CDD Vault
◈ Entry point is an import process: effort is up front...
16
◈ ... after that, everything else just works.
http://collaborativedrug.com
Case 3: CDD Vault
◈ Entry point is an import process: effort is up front...
16
◈ ... after that, everything else just works.
http://collaborativedrug.com
Case 4: Mixtures
◈ Mixtures of compounds: ad hoc is the way, e.g.
▷ osmium tetroxide 2.5 wt. % in tert-butanol
▷ n-butyllithium 2.5M in hexanes
◈ Long overdue for a well defined open format
▷ Mixfile
◈ High priority: text-extraction importer
◈ Working with IUPAC: data feedstock
17
Molfile ➔ InChI
Mixfile ➔ MInChI
http://collaborativedrug.com
Mixfile editor
◈ Verbose
◈ Hierarchical
◈ Cross-platform
◈ Open source
◈ JSON-based
18
http://bioassayexpress.com
Case 5: Bioassay protocols
◈ Assay protocols usually just text: enrich by describing
them with semantic web terms (e.g. BAO, DTO, CLO...)
19
◈ Curated assays are
machine readable
◈ Legacy: ML-support
◈ Current: ELN-style
◈ Began 2014
◈ See Wednesday
chemical informatics
Case 6: Data are valuable
◈ The only way forward: machine readability at source
◈ Scientists must be “persuaded” to get on board
◈ Better tools can help, but mostly social engineering
◈ Who controls much of the culture of science?
20
chemical informatics
Journal publishers
◈ Only the human-readable parts of a manuscript are
peer reviewed (text & figures)
◈ What if the data were part of the formal process...
◈ ... if an algorithm can't understand it, send it back.
21
◈ Supplementary information becomes
the most important part of the article
◈ Open it to everyone's model
chemical informatics
Example
◈ Can start small, e.g. with organic/drug discovery
22
◈ Author provides the files, middleware does basic
validation, reviewer opens and verified
1.mol
2.mol
3.mol
ID,ALK2IC50,LE,LiPE,TPSA
-,uM,uM,uM,uM
1,8.20,0.45,4.79,7014
2,18.2,0.39,4.43,59.28
3,2.30,0.46,5.38,70.14
data.csv
chemical informatics
What are you waiting for?
◈ A whole industry & community exists for support
◈ Best of breed tools have open source options
◈ One publisher has to go first...
◈ ... who?
23
chemical informatics
Acknowledgments
24
alex@collaborativedrug.com
@aclarkxyz
cheminf20.org
◈ Royal Society of Chemistry & ACS CINF
◈ Barry Bunin (CDD)
◈ Peter Gedeck, Hande Küçük-McGinty (BAE)
◈ Sean Ekins, Kim Zorn (CPI)
◈ Leah McEwen (IUPAC)

Mais conteúdo relacionado

Mais procurados

Introducción a la web semántica - Linkatu - irekia 2012
Introducción a la web semántica - Linkatu - irekia 2012Introducción a la web semántica - Linkatu - irekia 2012
Introducción a la web semántica - Linkatu - irekia 2012Alberto Labarga
 
How links can make your open data even greater
How links can make your open data even greaterHow links can make your open data even greater
How links can make your open data even greaterCristina Sarasua
 
DBpedia: A Public Data Infrastructure for the Web of Data
DBpedia: A Public Data Infrastructure for the Web of DataDBpedia: A Public Data Infrastructure for the Web of Data
DBpedia: A Public Data Infrastructure for the Web of DataSebastian Hellmann
 
KESW2012 Hackathon St Petersburg
KESW2012 Hackathon St PetersburgKESW2012 Hackathon St Petersburg
KESW2012 Hackathon St PetersburgAI4BD GmbH
 
Aallbibframe em-20130714
Aallbibframe em-20130714Aallbibframe em-20130714
Aallbibframe em-20130714zepheiraorg
 
Biblissima's prototype on Medieval Manuscripts Illuminations and their Context
Biblissima's prototype on Medieval Manuscripts Illuminations and their ContextBiblissima's prototype on Medieval Manuscripts Illuminations and their Context
Biblissima's prototype on Medieval Manuscripts Illuminations and their ContextEquipex Biblissima
 
Data, Data Everywhere 2010 Annual Meeting
Data, Data Everywhere 2010 Annual MeetingData, Data Everywhere 2010 Annual Meeting
Data, Data Everywhere 2010 Annual MeetingCrossref
 

Mais procurados (8)

Introducción a la web semántica - Linkatu - irekia 2012
Introducción a la web semántica - Linkatu - irekia 2012Introducción a la web semántica - Linkatu - irekia 2012
Introducción a la web semántica - Linkatu - irekia 2012
 
How links can make your open data even greater
How links can make your open data even greaterHow links can make your open data even greater
How links can make your open data even greater
 
Linking knowledge spaces
Linking knowledge spacesLinking knowledge spaces
Linking knowledge spaces
 
DBpedia: A Public Data Infrastructure for the Web of Data
DBpedia: A Public Data Infrastructure for the Web of DataDBpedia: A Public Data Infrastructure for the Web of Data
DBpedia: A Public Data Infrastructure for the Web of Data
 
KESW2012 Hackathon St Petersburg
KESW2012 Hackathon St PetersburgKESW2012 Hackathon St Petersburg
KESW2012 Hackathon St Petersburg
 
Aallbibframe em-20130714
Aallbibframe em-20130714Aallbibframe em-20130714
Aallbibframe em-20130714
 
Biblissima's prototype on Medieval Manuscripts Illuminations and their Context
Biblissima's prototype on Medieval Manuscripts Illuminations and their ContextBiblissima's prototype on Medieval Manuscripts Illuminations and their Context
Biblissima's prototype on Medieval Manuscripts Illuminations and their Context
 
Data, Data Everywhere 2010 Annual Meeting
Data, Data Everywhere 2010 Annual MeetingData, Data Everywhere 2010 Annual Meeting
Data, Data Everywhere 2010 Annual Meeting
 

Semelhante a ACS CINF Luncheon talk (Boston 2018)

BioTeam Trends from the Trenches - NIH, April 2014
BioTeam Trends from the Trenches - NIH, April 2014BioTeam Trends from the Trenches - NIH, April 2014
BioTeam Trends from the Trenches - NIH, April 2014Ari Berman
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Portland Common Data Model (PCDM): Creating and Sharing Complex Digital Objects
Portland Common Data Model (PCDM): Creating and Sharing Complex Digital ObjectsPortland Common Data Model (PCDM): Creating and Sharing Complex Digital Objects
Portland Common Data Model (PCDM): Creating and Sharing Complex Digital ObjectsKaren Estlund
 
From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life ...
From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life ...From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life ...
From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life ...Ari Berman
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchDatapetermurrayrust
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData TheContentMine
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustLEARN Project
 
Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinityPeterMorrell4
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data CommonsSimon Twigger
 
Defrosting the Digital Library: A survey of bibliographic tools for the next ...
Defrosting the Digital Library: A survey of bibliographic tools for the next ...Defrosting the Digital Library: A survey of bibliographic tools for the next ...
Defrosting the Digital Library: A survey of bibliographic tools for the next ...Duncan Hull
 
From the Benchtop to the Datacenter: HPC Requirements in Life Science Research
From the Benchtop to the Datacenter: HPC Requirements in Life Science ResearchFrom the Benchtop to the Datacenter: HPC Requirements in Life Science Research
From the Benchtop to the Datacenter: HPC Requirements in Life Science ResearchAri Berman
 
Is a Biological Database Really Different than a Biological Journal?
Is a Biological Database Really Different than a Biological Journal?Is a Biological Database Really Different than a Biological Journal?
Is a Biological Database Really Different than a Biological Journal?Philip Bourne
 
From Open Access to Open Data
From Open Access to Open DataFrom Open Access to Open Data
From Open Access to Open DataBrian Hole
 
I want to know more about compuerized text analysis
I want to know more about   compuerized text analysisI want to know more about   compuerized text analysis
I want to know more about compuerized text analysisLuke Czarnecki
 
Overview of the data pilot and OpenAIRE tools, Elly Dijk and Marjan Grootveld...
Overview of the data pilot and OpenAIRE tools, Elly Dijk and Marjan Grootveld...Overview of the data pilot and OpenAIRE tools, Elly Dijk and Marjan Grootveld...
Overview of the data pilot and OpenAIRE tools, Elly Dijk and Marjan Grootveld...OpenAIRE
 
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Chris Dagdigian
 

Semelhante a ACS CINF Luncheon talk (Boston 2018) (20)

BioTeam Trends from the Trenches - NIH, April 2014
BioTeam Trends from the Trenches - NIH, April 2014BioTeam Trends from the Trenches - NIH, April 2014
BioTeam Trends from the Trenches - NIH, April 2014
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Portland Common Data Model (PCDM): Creating and Sharing Complex Digital Objects
Portland Common Data Model (PCDM): Creating and Sharing Complex Digital ObjectsPortland Common Data Model (PCDM): Creating and Sharing Complex Digital Objects
Portland Common Data Model (PCDM): Creating and Sharing Complex Digital Objects
 
Intro to Web Science (Fall 2013)
Intro to Web Science (Fall 2013)Intro to Web Science (Fall 2013)
Intro to Web Science (Fall 2013)
 
From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life ...
From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life ...From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life ...
From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life ...
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchData
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-Rust
 
Reproducible research - to infinity
Reproducible research - to infinityReproducible research - to infinity
Reproducible research - to infinity
 
Converged IT and Data Commons
Converged IT and Data CommonsConverged IT and Data Commons
Converged IT and Data Commons
 
Defrosting the Digital Library: A survey of bibliographic tools for the next ...
Defrosting the Digital Library: A survey of bibliographic tools for the next ...Defrosting the Digital Library: A survey of bibliographic tools for the next ...
Defrosting the Digital Library: A survey of bibliographic tools for the next ...
 
From the Benchtop to the Datacenter: HPC Requirements in Life Science Research
From the Benchtop to the Datacenter: HPC Requirements in Life Science ResearchFrom the Benchtop to the Datacenter: HPC Requirements in Life Science Research
From the Benchtop to the Datacenter: HPC Requirements in Life Science Research
 
When?
When?When?
When?
 
Is a Biological Database Really Different than a Biological Journal?
Is a Biological Database Really Different than a Biological Journal?Is a Biological Database Really Different than a Biological Journal?
Is a Biological Database Really Different than a Biological Journal?
 
Kohlmeier "Innovations in Academic Search & Discovery - A Case Study From the...
Kohlmeier "Innovations in Academic Search & Discovery - A Case Study From the...Kohlmeier "Innovations in Academic Search & Discovery - A Case Study From the...
Kohlmeier "Innovations in Academic Search & Discovery - A Case Study From the...
 
Wang "Library support for text and data mining - A case study at Carnegie Mel...
Wang "Library support for text and data mining - A case study at Carnegie Mel...Wang "Library support for text and data mining - A case study at Carnegie Mel...
Wang "Library support for text and data mining - A case study at Carnegie Mel...
 
From Open Access to Open Data
From Open Access to Open DataFrom Open Access to Open Data
From Open Access to Open Data
 
I want to know more about compuerized text analysis
I want to know more about   compuerized text analysisI want to know more about   compuerized text analysis
I want to know more about compuerized text analysis
 
Overview of the data pilot and OpenAIRE tools, Elly Dijk and Marjan Grootveld...
Overview of the data pilot and OpenAIRE tools, Elly Dijk and Marjan Grootveld...Overview of the data pilot and OpenAIRE tools, Elly Dijk and Marjan Grootveld...
Overview of the data pilot and OpenAIRE tools, Elly Dijk and Marjan Grootveld...
 
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
 

Mais de Alex Clark

Mixtures QSAR: modelling collections of chemicals
Mixtures QSAR: modelling collections of chemicalsMixtures QSAR: modelling collections of chemicals
Mixtures QSAR: modelling collections of chemicalsAlex Clark
 
Mixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream productsMixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream productsAlex Clark
 
Mixtures as first class citizens in the realm of informatics
Mixtures as first class citizens in the realm of informaticsMixtures as first class citizens in the realm of informatics
Mixtures as first class citizens in the realm of informaticsAlex Clark
 
Mixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer productsMixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer productsAlex Clark
 
Coordination InChI (2019)
Coordination InChI (2019)Coordination InChI (2019)
Coordination InChI (2019)Alex Clark
 
Chemical mixtures: File format, open source tools, example data, and mixtures...
Chemical mixtures: File format, open source tools, example data, and mixtures...Chemical mixtures: File format, open source tools, example data, and mixtures...
Chemical mixtures: File format, open source tools, example data, and mixtures...Alex Clark
 
Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...Alex Clark
 
Autonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocolsAutonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocolsAlex Clark
 
Representing molecules with minimalism: A solution to the entropy of informatics
Representing molecules with minimalism: A solution to the entropy of informaticsRepresenting molecules with minimalism: A solution to the entropy of informatics
Representing molecules with minimalism: A solution to the entropy of informaticsAlex Clark
 
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...Alex Clark
 
BioAssay Express
BioAssay ExpressBioAssay Express
BioAssay ExpressAlex Clark
 
SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?Alex Clark
 
The anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithmsThe anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithmsAlex Clark
 
Compact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile appsCompact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile appsAlex Clark
 
Green chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designGreen chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designAlex Clark
 
ICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab NotebookICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab NotebookAlex Clark
 
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)Alex Clark
 
Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)Alex Clark
 
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013Alex Clark
 
Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013Alex Clark
 

Mais de Alex Clark (20)

Mixtures QSAR: modelling collections of chemicals
Mixtures QSAR: modelling collections of chemicalsMixtures QSAR: modelling collections of chemicals
Mixtures QSAR: modelling collections of chemicals
 
Mixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream productsMixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream products
 
Mixtures as first class citizens in the realm of informatics
Mixtures as first class citizens in the realm of informaticsMixtures as first class citizens in the realm of informatics
Mixtures as first class citizens in the realm of informatics
 
Mixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer productsMixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer products
 
Coordination InChI (2019)
Coordination InChI (2019)Coordination InChI (2019)
Coordination InChI (2019)
 
Chemical mixtures: File format, open source tools, example data, and mixtures...
Chemical mixtures: File format, open source tools, example data, and mixtures...Chemical mixtures: File format, open source tools, example data, and mixtures...
Chemical mixtures: File format, open source tools, example data, and mixtures...
 
Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...
 
Autonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocolsAutonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocols
 
Representing molecules with minimalism: A solution to the entropy of informatics
Representing molecules with minimalism: A solution to the entropy of informaticsRepresenting molecules with minimalism: A solution to the entropy of informatics
Representing molecules with minimalism: A solution to the entropy of informatics
 
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
 
BioAssay Express
BioAssay ExpressBioAssay Express
BioAssay Express
 
SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?
 
The anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithmsThe anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithms
 
Compact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile appsCompact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile apps
 
Green chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designGreen chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by design
 
ICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab NotebookICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab Notebook
 
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
 
Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)
 
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
 
Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013
 

Último

Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx023NiWayanAnggiSriWa
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptJoemSTuliba
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptArshadWarsi13
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayupadhyaymani499
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 

Último (20)

Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.ppt
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.ppt
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort ServiceHot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyay
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 

ACS CINF Luncheon talk (Boston 2018)

  • 1. Leveling up chemical information for the era of big data Alex M. Clark
  • 2. chemical informatics Background ◈ Big Data: more like Large Data (+ inconvenience) ◈ We can have aspirations ◈ For any established subset of chemistry: vast amounts of data exists... ◈ ... but not in a way computers can use it 2
  • 3. chemical informatics History ◈ Early 20th century: science was a small town, everyone knew each other ◈ The dead tree model of publishing scaled nicely ◈ 1990s: my grad advisor (W.R. Roper) knew all the players in organometallics, read everything ◈ 2000s: my postdoc advisor (C.A. Reed) opined we were drowning in peer reviewed literature 3
  • 4. chemical informatics Grains of sand on a beach ◈ From my graduate research... ◈ ... and a hundred more much like it. 4 ◈ Pure academic research: to see if we can ◈ Published in respectable journals: a few citations, mostly forgotten in dusty tomes & PDF files ◈ All that time & effort & funding... for what?
  • 5. chemical informatics To be a datapoint ◈ We need to rethink the target audience: it's not humans... it's machines ◈ They can handle the scale... ◈ ... but they cannot interpret the input data. 5 ◈ Every unit of science has potential value, but no way to be sure when or to whom ◈ Nobody can read it all, but the solution is not to publish less
  • 6. chemical informatics Digital trees ◈ Computers for writing & reading since 1990s ◈ Yet we use tech to pretend we're with Newton & friends in early days of the Royal Society ◈ Each step is digital, but the process is modelled on typewriters and wood carvings ◈ What little data existed in the intermediate documents is destroyed during the final production 6
  • 7. chemical informatics Progress? ◈ Most chemical data is silo'd and opaque ◈ Drug discovery is very motivated: there are structure- activity datasets that are free, large, useful and accurate (previously: pick zero) ◈ Now: ChEMBL, PubChem & others (SAR data) ◈ Improved accessibility: ▷ open access is growing (still not data though) ▷ APIs to the patent literature ▷ FAIR data movement: building steam 7
  • 8. chemical informatics Dark data ◈ Almost all published chemical research is visible only to humans, and effectively dark to machines ◈ This is almost as true now as it was in 1700 ◈ Solutions have distinct categories: ▷ fix the past: curate old publications using descriptive data formats - machine learning and text mining can help, but still need expensive expert humans ▷ fix the present (+ future): publish new data using file formats that target machines and humans - need better tools and different attitudes 8
  • 9. http://molnote.com Case 1: Figures ◈ Chemists painstakingly draw molecules & reactions 9 ◈ Interpretability is high for other chemists... ◈ ... and surprisingly low for machines.
  • 10. http://molnote.com Cheminformatics first ◈ Similar amount of effort to draw fully defined content ◈ e.g. Molecular Notebook ◈ Draw structures & reaction components for computers ◈ Let algorithms do the rendering... 10
  • 11. http://molpress.com Internet-era manuscripts ◈ Using standard wordprocessor is not good for data ◈ WordPress plugin MolPress: ▷ insert & retain raw data ▷ plugin handles rendering ▷ open source ◈ Use blogging software as an electronic lab notebook 11
  • 12. http://assaycentral.org Case 2: Assay Central ◈ Rare & neglected diseases: long tail, few resources ◈ Premise: the data is out there ◈ Pragmatism: grab & fix, by any means necessary ◈ Sources: ▷ ChEMBL, PubChem (parts thereof) ▷ curation, collaboration, importing, scraping ◈ Goal: seed all targets with only good data 12
  • 13. http://assaycentral.org Provenance ◈ Some data is good: just have to label (e.g. target), others must sample, validate by flagging ◈ Merging: standardise, group by, build models ◈ Project ➔ GitHub repository: a molecular build system 13
  • 15. http://assaycentral.org Virtuous cycle ◈ Initial data integration is labour intensive ◈ Build models & other decision support ◈ Can generate prescribed molecules ▷ high predictions for desired target(s) ▷ low predictions for undesirable off-targets ◈ Collaborators can order & test these candidates, and return the data in model-ready format 15
  • 16. http://collaborativedrug.com Case 3: CDD Vault ◈ Entry point is an import process: effort is up front... 16 ◈ ... after that, everything else just works.
  • 17. http://collaborativedrug.com Case 3: CDD Vault ◈ Entry point is an import process: effort is up front... 16 ◈ ... after that, everything else just works.
  • 18. http://collaborativedrug.com Case 4: Mixtures ◈ Mixtures of compounds: ad hoc is the way, e.g. ▷ osmium tetroxide 2.5 wt. % in tert-butanol ▷ n-butyllithium 2.5M in hexanes ◈ Long overdue for a well defined open format ▷ Mixfile ◈ High priority: text-extraction importer ◈ Working with IUPAC: data feedstock 17 Molfile ➔ InChI Mixfile ➔ MInChI
  • 19. http://collaborativedrug.com Mixfile editor ◈ Verbose ◈ Hierarchical ◈ Cross-platform ◈ Open source ◈ JSON-based 18
  • 20. http://bioassayexpress.com Case 5: Bioassay protocols ◈ Assay protocols usually just text: enrich by describing them with semantic web terms (e.g. BAO, DTO, CLO...) 19 ◈ Curated assays are machine readable ◈ Legacy: ML-support ◈ Current: ELN-style ◈ Began 2014 ◈ See Wednesday
  • 21. chemical informatics Case 6: Data are valuable ◈ The only way forward: machine readability at source ◈ Scientists must be “persuaded” to get on board ◈ Better tools can help, but mostly social engineering ◈ Who controls much of the culture of science? 20
  • 22. chemical informatics Journal publishers ◈ Only the human-readable parts of a manuscript are peer reviewed (text & figures) ◈ What if the data were part of the formal process... ◈ ... if an algorithm can't understand it, send it back. 21 ◈ Supplementary information becomes the most important part of the article ◈ Open it to everyone's model
  • 23. chemical informatics Example ◈ Can start small, e.g. with organic/drug discovery 22 ◈ Author provides the files, middleware does basic validation, reviewer opens and verified 1.mol 2.mol 3.mol ID,ALK2IC50,LE,LiPE,TPSA -,uM,uM,uM,uM 1,8.20,0.45,4.79,7014 2,18.2,0.39,4.43,59.28 3,2.30,0.46,5.38,70.14 data.csv
  • 24. chemical informatics What are you waiting for? ◈ A whole industry & community exists for support ◈ Best of breed tools have open source options ◈ One publisher has to go first... ◈ ... who? 23
  • 25. chemical informatics Acknowledgments 24 alex@collaborativedrug.com @aclarkxyz cheminf20.org ◈ Royal Society of Chemistry & ACS CINF ◈ Barry Bunin (CDD) ◈ Peter Gedeck, Hande Küçük-McGinty (BAE) ◈ Sean Ekins, Kim Zorn (CPI) ◈ Leah McEwen (IUPAC)