Making hidden data discoverable: How to build effective drug discovery engines?
Sebastian Radestock (Elsevier, Germany)
In a complex IT environment comprising dozens if not hundreds of databases and likely as many user interfaces it becomes difficult if not impossible to find all the relevant information needed to make informed decisions. Historical data get lost, not normalized data cannot be compared and maintenance becomes a nightmare. We will discuss a new approach to address this issue by showing various examples and use cases on how in-house data and public data can be integrated in various ways to address the unique and individual needs of companies to keep the competitive edge.
ICIC 2013 Conference Proceedings Sebastian Radestock
1. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
DRUG DISCOVERY TODAY
ELN
Known ligands
Knowledge
survey
Therapeutic
target
Chemistry
Generate
chemistry ideas
Analyze
SAR
DBs
Check chemical
feasibility
In-house
Synthesize
or buy
Report
Test
Check
ADME/Tox
Journals &
Patents
ELN
Flatfiles
Biology
Docs
Journals
1
2. Dr. Sebastian
Radestock
Product Manager Reaxys
Elsevier Information
Systems GmbH
Frankfurt am Main
Germany
MAKING HIDDEN DATA
DISCOVERABLE:
HOW TO BUILD
EFFECTIVE DRUG
DISCOVERY ENGINES?
2
3. In-house
DRUG DISCOVERY TOMORROW
DBs
Chemistry
Federated
search system
Known ligands
Knowledge
survey
Biology
integrator
Therapeutic
target
Generate
chemistry ideas
Check chemical
feasibility
Analyze
SAR
Journals &
Patents
ELN
Synthesize
or buy
Report
Literature
management
tool
Chemistry
integrator
Test
Check
ADME/Tox
Biology
Integrator for
biomedical
data
Integrator for
drug safety
Docs
ELN
3
Flatfiles
4. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
DRUG DISCOVERY TOMORROW
TWO APPROACHES TO SOLVING THE CHALLENGE OF DATA ACCESS
FEDERATED MODEL
WAREHOUSE APPROACH
CHEMISTRY
INTEGRATOR
Storage and capture
ELN
Analysis
system
User
Capture
Storage
ELN
4
5. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
EXAMPLE IMPLEMENTATION OF THE FEDERATED MODEL
FEDERATED MODEL
CHEMISTRY
INTEGRATOR
Storage and capture
In-house
Analysis
system
In-house
Expansion of the
existing system by
integrating Reaxys
Existing system for
consolidating in-house
structure and
bioactivity data
User
5
6. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
THE REAXYS DATA BASE
• Compounds, substance property
data, preparations, reactions, and
bibliographic information…
…from 400 core chemistry journals
…from relevant chemistry patents
• Manual extraction of all the data
• Coverage of 500 property data fields,
from basics like boiling point or
melting point, via crystal data and
magnetic properties, to spectra
• All together 750 million data points
Scientific data model
CONTAINS ALL PUBLISHED AND HISTORICAL CHEMISTRY DATA
• Historical chemistry data…
…dating back to 1771
6
7. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
EXPANDED BIBLIOGRAPHIC CONTENT IN REAXYS
SUPPORTING MULTI-DISCIPLINARY RESEARCH
AGRICULTURAL AND
BIOLOGICAL SCIENCES
BIOCHEMISTRY, GENETICS
AND MOLECULAR BIOLOGY
CHEMICAL ENGINEERING
DENTISTRY
EARTH AND PLANETARY SCIENCES
• Bibliographic data from 16.000
periodicals covering chemistry and
related sciences have been loaded
into Reaxys
• This goes beyond journals and
patents, it includes conference
proceedings, business articles,
reviews etc.
old
new
ENERGY
ENGINEERING
ENVIRONMENTAL SCIENCE
IMMUNOLOGY AND MICROBIOLOGY
MATERIALS SCIENCE
MEDICINE
NEUROSCIENCE
PHARMACOLOGY, TOXICOLOGY
AND PHARMACEUTICS
PHYSICS AND ASTRONOMY
VETERINARY
ETC.
7
8. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
REAXYS-TREE AND AUTOMATIC INDEXING
THE NEXT STEP… COMING 2014
Title/Abstract
Step 1
Index for
chemistry terms
Step 2
Add
chemistry relevant
keywords and
identified
chemical entities
The Biosynthesis of Aristeromycin. Conversion of Neplanocin A to
Aristeromycin by a Novel Enzymatic Reduction
Partially purified cell-free extracts of the aristeromycin producer
Streptomyces citricolor have been shown to catalyze the NADPHdependent reduction of neplanocin A to aristeromycin. Stereochemical
studies revealed that the reduction proceeds with anti-geometry and
involves transfer of the 4 pro-R hydrogen atom of NADPH to the 6'β
position of aristeromycin.
Reaxys Keywords: Aristeromycin – biosynthesis, enzymatic
reduction, Neplanocin A
Step 3
Translate chemical
names into
structures and make
them searchable
8
9. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
FEDERATED SEARCH SYSTEM WITH REAXYS LOOK-UP
ACCESS TO REAXYS IS VIA THE REAXYS APPLICATION PROGRAMMING INTERFACE (API)
How should
the API look
like?
9
10. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
THE REAXYS APPLICATION PROGRAMMING INTERFACE (API)
LESSONS LEARNED
• Customers want to have access to all data in Reaxys
• Substances and substance property data
• Reactions and reaction details
• Citations
• Customers want to have access to all functionality of Reaxys
• Exact structure and reaction searching, similarity and substructure searching
• Factual queries
• Further processing of hitsets
• The Reaxys API was designed to be based on exchanging XML code between the user
and the Reaxys server via HTML POST requests
• Security and usage tracking is an issue
• Secure communication via HTTPS POST is supported
• The Reaxys API is stateful, and login is required
10
11. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
LIMITATIONS OF THE FEDERATED MODEL
SOME DISADVANTAGES TO CONSIDER
FEDERATED MODEL
CHEMISTRY
INTEGRATOR
Storage and capture
In-house
Analysis
system
Performance and
availability of the
system is dependent on
the source systems
No clean-up and
normalization of the
data
In-house
ELN
User
Architecture allows easy
expansion when new
data source becomes
available
11
12. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
EXAMPLE IMPLEMENTATION OF THE WAREHOUSE APPROACH
Bioactivity data was
normalized, and
structures were deduplicated
WAREHOUSE APPROACH
CHEMISTRY
INTEGRATOR
Structures
A customer set up a
system that contains
structure and bioactivity
data, and IP information
All data was extracted,
translated and loaded to
fit into one unified data
model (UDM)
Analysis
system
Capture
Storage
User
In-house
12
13. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
PLATFORM FOR STRUCTURE AND BIOACTIVITY DATA
STRUCTURE DATA FROM REAXYS COMES FROM THE REAXYS STRUCTURE FLAT FILE
13
14. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
LIMITATIONS OF THE OF THE WAREHOUSE APPROACH
SOME DISADVANTAGES TO CONSIDER
WAREHOUSE APPROACH
CHEMISTRY
INTEGRATOR
Structures
Long implementation
time and associated high
cost
Difficult to
accommodate
differences or changes
in data types
Analysis
system
Capture
Storage
User
In-house
14
15. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
EXAMPLE IMPLEMENTATION OF A FLEXIBLE APPROACH
A CONTENT INTEGRATION SOLUTION THAT IS NOW AVAILABLE TO ALL REAXYS USERS
FEDERATED MODEL
Real-time commercial
availability and pricing
information comes via
an eMolecules API
WAREHOUSE APPROACH
CHEMISTRY
INTEGRATOR
Capture
Storage and capture
Storage
Structures
Pricing
Structures
Structures from
PubChem and
eMolecules have been
integrated into Reaxys
User
Using multiple
storage systems
eliminates the
need for one UDM
15
16. Results are available from
different data source tabs
The substance crosslinking
icon allows to switch to
corresponding substances in
other data sources
16
17. Results are available from
different data source tabs
Filter for substances that
are/aren’t contained in
other data sources
17
18. Results are available from
different data source tabs
Filter for substances that
are/aren’t contained in
other data sources
18
21. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
FLEXIBLE APPROACH FOR INTEGRATION
LESSONS LEARNED
• Reaxys has proven to be extremely powerful as analysis and database system
• Separation of the data from different data sources into multiple storage
systems is the way to go…
• … if a powerful crosslinking mechanism is in place
• Some pieces of information that are subject to frequent updates should be
integrated using the federated model
21
22. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
EXAMPLE IMPLEMENTATION OF A FLEXIBLE APPROACH
A CONTENT INTEGRATION SOLUTION THAT ELSEVIER BUILT FOR ROCHE
FEDERATED MODEL
Reaxys with PubChem
and eMolecules
integrated
WAREHOUSE APPROACH
CHEMISTRY
INTEGRATOR
Capture
Storage and capture
Storage
Structures
Pricing
Structures
Integration of Roche
proprietary data on
chemistry experiments
User
ELN
In-house
Reactions
22
23. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
INTEGRATION OF ROCHE IN-HOUSE DATA
THE SITUATION AT ROCHE YESTERDAY… AND TODAY
23
24. https://reaxys.roche.com
Up to four Roche reaction
data sources are supported
Roche-specific data is included in
the Output (PDF, MS-Word etc.)
Filters on
Roche-specific
data fields
Data on references
and/or experiments,
including PDF links
Normalized
Roche-specific
reaction data
24
25. https://reaxys.roche.com
A Roche icon allows
to switch to a Roche
in-house repository
The reaction crosslinking icon
allows to switch to
corresponding reactions in other
data sources
25
31. DR. S. RADESTOCK | 14 OCTOBER 2013 | MAKING HIDDEN DATA DISCOVERABLE
INTEGRATION OF ROCHE IN-HOUSE DATA
CUSTOMER FEEDBACK
• Usability and acceptance tests by Roche
showed:
• Increased productivity of researchers at
Roche
• Increased discoverability of the Roche
reaction content
• Reduced maintenance effort for Roche:
• Legacy systems were decommissioned
• Roche gets on-going maintenance and
functionality improvements by Elsevier
• Not compromise in security
• Flexible approach:
• Additional data sources have been
added
31
32. In-house
DRUG DISCOVERY TOMORROW
DBs
Chemistry
Federated
search system
Known ligands
Knowledge
survey
Therapeutic
target
Generate
chemistry ideas
Check chemical
feasibility
Analyze
SAR
ELN
Synthesize
or buy
Report
Test
Check
ADME/Tox
Biology
MedScan
Journals &
Patents
Docs
ELN
32
Flatfiles
33. THANK YOU – QUESTIONS?
Dr. Sebastian Radestock
Product Manager Reaxys
Elsevier Information Systems GmbH
Frankfurt am Main, Germany
s.radestock@elsevier.com
33