Code camp 2014 Talk Scientific Thinking
1. Geek Meets Science: ChemIDplus,
an Example of Scientific Thinking
Mitch Miller
Scientific Thinking
2. Overview
➲Introduce myself
➲My definition of a scientific geek consultant
➲A fast overview of Cheminformatics
➲Overview of the ChemIDplus project
➲The scientific geek's role in ChemIDplus
3. Introduction: who am I?
➲Ph.D. chemist with 20+ years of experience in
scientific information management
➲Currently independent consultant
➲Application developer, database person,
requirements analyst, application first-aid
➲Main areas of focus:
●Chemical structure database management
●Managing data from high-throughput
research
4. The perspective of the scientific-geek-
consultant
➲Is the scientific-geek-consultant's perspective on
technology different from other geeks'?
➲Learn new technologies/frameworks/paradigms
and take them in stride
➲What gets me excited is seeing a user able to do
something that the user could not do yesterday
➲This talk is about one project in scientific
information management and what I've done to
give users access to what they could not do
before
6. Representing Chemical Structures
➲This discussion is restricted to 2 dimensional (2D)
structures which establish identity
➲Chemical structures can be represented graphically
in a variety of ways.
➲To make structures searchable, you need a
mathematical representation of the atoms and bonds:
a connection table
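As a rough illustration of what a connection table holds, the sketch below encodes ethanol (CH3-CH2-OH, hydrogens implied) as a list of atoms plus a list of bonds. The names `atoms`, `bonds`, and `neighbors` are invented for this example; real file formats carry more fields (coordinates, charges, stereochemistry).

```python
# Heavy atoms of ethanol, numbered from 0: C, C, O (hydrogens are implied).
atoms = ["C", "C", "O"]

# Each bond is (atom index, atom index, bond order).
bonds = [(0, 1, 1), (1, 2, 1)]


def neighbors(atom_index):
    """Return the indices of the atoms bonded to the given atom."""
    found = []
    for a, b, _order in bonds:
        if a == atom_index:
            found.append(b)
        elif b == atom_index:
            found.append(a)
    return found


# The middle carbon (index 1) is bonded to the other carbon and to the oxygen.
print([atoms[i] for i in neighbors(1)])
```

Because the atoms and bonds are plain data, a search program can walk them as a graph instead of interpreting a picture.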
7. Searching for structures
➲Search for matches based on a graphic
chemical system
●Start with a chemical of interest
●Find others like it
➲Several definitions of what makes one structure
like another
●Exact match: find the same molecule the user entered
●Substructure
●'Similarity' fuzzy match
➲Analogy: Word search for 'store'
8. Substructure matches for Aspirin
➲Each of these structures contains the query structure
➲Word analogy results:
●Store
●drugstore
●stores
●stored
●restore
9. Non-matching structure
➲4-(acetyloxy)-benzoic acid is not a substructure match
for aspirin because it does not contain the same
arrangement of atoms and bonds
➲Non-hits for Word search analogy:
●story
●storm
●'stoor'
➲Can be found using similarity search
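The word analogy from the last three slides can be tried directly. This is only a sketch of the analogy, not of chemical searching: substring containment stands in for substructure search, and the standard library's `difflib.SequenceMatcher` stands in for a similarity score. The 0.7 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher

QUERY = "store"
words = ["store", "drugstore", "stores", "stored", "restore",
         "story", "storm", "stoor"]


def is_exact(word):
    """Exact match: the word is the query itself."""
    return word == QUERY


def is_substructure_hit(word):
    """Substructure analogy: the word contains the query unchanged."""
    return QUERY in word


def similarity(word):
    """Fuzzy score between 0 and 1; higher means more alike."""
    return SequenceMatcher(None, QUERY, word).ratio()


exact_hits = [w for w in words if is_exact(w)]
substructure_hits = [w for w in words if is_substructure_hit(w)]
# Words that fail the containment test but still score as similar.
similar_only = [w for w in words
                if not is_substructure_hit(w) and similarity(w) >= 0.7]

print(exact_hits)
print(substructure_hits)
print(similar_only)
```

The non-hits from the substructure search ('story', 'storm', 'stoor') turn up only in the similarity list, which mirrors the point made for 4-(acetyloxy)-benzoic acid.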
10. Structure search software
➲Standalone programs
●Run on a server or desktop
➲Client-server architectures
➲Database cartridges
●Provide chemical structure searching within a relational
database
●Commercially available
●Add operators to store, search, retrieve and transform
chemical structures within SQL
●e.g. SELECT ID, MOLDEPICTION(STRUCT) FROM OUR_STRUCTURE_TABLE
WHERE SUBSTRUCT(STRUCT, 'CC(=O)Oc1ccccc1C(=O)O') = 1
●Client application must have a tool that can display connection
tables as graphic chemical structures
11. Structure database operations
➲Data stored in tables
➲Data loading typically requires specialized
software
➲Indexing is atypical
➲Search operators are specific to the cartridge
12. How can you search a million
chemical structures in seconds?
➲Chemical databases hold hundreds of thousands to millions
of structures
➲Comparing atoms and bonds takes time!
➲Users want answers quickly.
➲Solution: a rapid screen-out step before looking at atoms and
bonds.
●Based on structure 'fingerprints'
●Analyze input structures for features such as rings,
atoms, connection patterns (O-X-X-N).
●Create a bit string
●Compare bit string of query structure with bit strings in
database.
●Bit string comparisons are very fast
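A minimal sketch of the screen-out idea follows. The feature list, the three database entries, and the function names are all hypothetical; production fingerprints use hundreds or thousands of bits and derive them from the structure automatically. Here each fingerprint is packed into a Python integer, and a candidate survives the screen only if it has every bit the query has.

```python
# Hypothetical feature keys, one bit each.
FEATURES = ["ring", "O", "N", "C=O", "O-X-X-N"]


def fingerprint(features_present):
    """Pack a set of feature names into an integer used as a bit string."""
    bits = 0
    for position, name in enumerate(FEATURES):
        if name in features_present:
            bits |= 1 << position
    return bits


def may_contain(candidate_bits, query_bits):
    """True if the candidate has every feature bit that the query has."""
    return (candidate_bits & query_bits) == query_bits


# Made-up database entries with their feature sets.
database = {
    "A": fingerprint({"ring", "O", "C=O"}),
    "B": fingerprint({"O", "N", "O-X-X-N"}),
    "C": fingerprint({"ring", "O", "N", "C=O"}),
}

query = fingerprint({"ring", "C=O"})
survivors = [name for name, bits in database.items()
             if may_contain(bits, query)]
print(survivors)
```

Only the survivors go on to the slow atom-by-atom comparison. The screen can let through a structure that later fails the full comparison, but it should never discard a true match, since a structure that contains the query must also contain the query's features.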
14. ChemIDplus
➲“Dictionary of over 400,000 chemicals (names,
synonyms, and structures) … (with) links to
NLM and other databases and resources”
➲Maintained by the Division of Specialized
Information Services within the National Library
of Medicine within National Institutes of Health
➲Used by people in industry, academia and
government who handle drugs and chemicals
and access environmental and safety data plus
other biomedical information
15. ChemIDplus
➲Part of a system of databases called 'Toxnet' at
National Library of Medicine http://toxnet.nlm.nih.gov/
➲Focus:
●Chemical Information
●Environmental Health and Toxicology
●HIV / AIDS
●Disaster Information
➲Available on the web in 3 'flavors':
●Full: http://chem.sis.nlm.nih.gov/chemidplus/
●'Lite:' http://chem.sis.nlm.nih.gov/chemidplus/chemidlite.jsp
●Ultralite: http://druginfo.nlm.nih.gov/drugportal/drugportal.jsp
16. ChemIDplus Team
➲George (Mike) Hazard – team leader
➲Shannon Jordan
➲Michael Chambers - developer
➲Chuchu Lan – system administrator/DBA
➲Jenny Fang
➲Stefanie Publicker
➲Larry Callahan, Frank Switzer – FDA liaisons
17. Historical Note
➲ChemIDplus was one of the first structure-searchable
databases on the worldwide web
➲Started in 1998
➲Original developer
18. Server Architecture
➲Tomcat server
●Servlets, JSPs, JS libraries
➲Database server
●Chemical data cartridge
●Database (Oracle): structures, names, links, properties
20. Scientific Geek's role in ChemIDplus
➲Developer of the original system in 1998-9 in a since-retired technology
➲Database administrator for structures
●Upgrade between versions of the chemical search software
●Periodic reindexing of the structures for performance
●Batch updates
●Help clean up invalid data
➲Tester
●Performed load testing when the application was migrated to Java
servlets
➲Liaison with other governmental agencies
●Share structures with NCI, PubChem
➲Structure orientation application
●Tool to help ensure that series of chemical compounds look similar
21. Structure table synchronization
The old way
➲Monthly manual process
●Query structures recently added or
changed
●Extract to disk files
●Generate data based on structure: InChI,
SMILES, 3D coordinates
●Register each item separately
➲Took a couple of hours each month
➲This was repetitious work
22. New system
➲Database trigger fires when a value is inserted into or
updated in a chemical structure field
➲Computes and stores InChI and SMILES immediately
➲Submits a batch job (DBMS_JOB package) for 3D
●Deletes old 3D structure
●Writes 2D structure to disk
●Invokes Corina (Molecular Networks) program
to generate 3D structure
●Reads 3D structure into separate table
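The flow on this slide can be mimicked in a few lines, purely as an in-memory sketch. Nothing here is the production code: the dictionaries stand in for database tables, the `deque` stands in for jobs submitted through DBMS_JOB, and `derive_identifier` and `generate_3d` are placeholders for the real InChI/SMILES generation and the external 3D program.

```python
from collections import deque

structures = {}      # record id -> 2D structure plus derived identifiers
three_d_table = {}   # record id -> generated 3D structure (separate table)
job_queue = deque()  # stand-in for submitted batch jobs


def derive_identifier(kind, structure_2d):
    """Placeholder for real InChI or SMILES generation."""
    return f"{kind}({structure_2d})"


def generate_3d(structure_2d):
    """Placeholder for writing the 2D file and running the 3D program."""
    return f"3D({structure_2d})"


def on_structure_change(record_id, structure_2d):
    """Plays the role of the trigger: runs on every insert or update."""
    structures[record_id] = {
        "structure_2d": structure_2d,
        "inchi": derive_identifier("InChI", structure_2d),
        "smiles": derive_identifier("SMILES", structure_2d),
    }
    job_queue.append(record_id)  # the slower 3D step is deferred


def run_pending_jobs():
    """Plays the role of the batch job that refreshes 3D structures."""
    while job_queue:
        record_id = job_queue.popleft()
        three_d_table.pop(record_id, None)  # delete the old 3D structure
        structure_2d = structures[record_id]["structure_2d"]
        three_d_table[record_id] = generate_3d(structure_2d)


on_structure_change(1, "mol-v1")
print(structures[1]["smiles"], 1 in three_d_table)  # identifiers now, 3D later
run_pending_jobs()
print(three_d_table[1])
```

The split reflects the slide: the cheap identifiers are computed inside the change handler, while the expensive 3D generation is queued so the insert or update does not have to wait for it.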
23. Orienting Structures Consistently
➲Databases often contain 'families' of related
compounds
➲Example molecule and hits
➲Manually manipulating
structures takes time!
24. Solution: 'StructClean' Utility
➲Accepts a template structure + molecular weight
●Locates all molecules in the DB that contain
the template under the molecular weight cutoff
●Without the cutoff, you might get huge
molecules that contain a small template
➲All hits are oriented to match the template
➲User reviews hits
●Selects/deselects items
●Commits changes
➲Utility is a Java servlet
25. Conclusion
➲ChemIDplus is a valuable resource to those
looking for chemical information on the web
➲Scientific-geek-consultants use a variety of
technologies to provide service to research clients
➲We are similar to regular geeks in many ways
➲The differences are interesting!