Geek Meets Science: ChemIDplus, 
an Example of Scientific Thinking 
Mitch Miller 
Scientific Thinking
Overview 
➲Introduce myself 
➲My definition of a scientific geek consultant 
➲A fast overview of Cheminformatics 
➲Overview of the ChemIDplus project 
➲The scientific geek's role in ChemIDplus
Introduction: who am I? 
➲Ph.D. chemist with 20+ years of experience in 
scientific information management 
➲Currently independent consultant 
➲Application developer, database person, 
requirements analyst, application first-aid 
➲Main areas of focus: 
●Chemical structure database management 
●Managing data from high-throughput 
research
The perspective of the scientific-geek-consultant 
➲Is the scientific-geek-consultant's perspective on 
technology different from other geeks'? 
➲Learn new technologies/frameworks/paradigms 
and take them in stride 
➲What gets me excited is seeing a user able to do 
something that the user could not do yesterday 
➲This talk is about one project in scientific 
information management and what I've done to 
give users access to what they could not do 
before
Quick Introduction to Chemical Databases
Representing Chemical Structures 
➲This discussion is restricted to two-dimensional (2D) 
structures, which establish identity 
➲Chemical structures can be represented graphically 
in a variety of ways. 
➲To make structures searchable, you need a 
mathematical representation of the atoms and bonds: 
a connection table
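As a rough illustration of what a connection table holds, the sketch below stores the atoms of a small molecule as a list and the bonds as pairs of atom indices with a bond order. The `Bond` class, the `neighbors` helper, and the choice of acetic acid are assumptions made for this example; real connection-table formats carry more detail (charges, stereochemistry, coordinates).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bond:
    a: int      # index of the first atom
    b: int      # index of the second atom
    order: int  # 1 = single bond, 2 = double bond


# Acetic acid, CH3-C(=O)-OH, heavy atoms only (hydrogens are implied).
atoms = ["C", "C", "O", "O"]
bonds = [Bond(0, 1, 1), Bond(1, 2, 2), Bond(1, 3, 1)]


def neighbors(atom_index, bond_list):
    """Return the sorted indices of atoms bonded to atom_index."""
    found = []
    for bond in bond_list:
        if bond.a == atom_index:
            found.append(bond.b)
        elif bond.b == atom_index:
            found.append(bond.a)
    return sorted(found)


print(neighbors(1, bonds))  # [0, 2, 3]
```

Because the molecule is now plain data (a graph of atoms and bonds), a program can walk it and compare it with other molecules, which a picture alone does not allow.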
Searching for structures 
➲Search for matches based on a graphic 
chemical system 
●Start with a chemical of interest 
●Find others like it 
➲Several definitions of what makes one structure 
like another 
●Exact match: find the same molecule the user input 
●Substructure 
●'Similarity' fuzzy match 
➲Analogy: Word search for 'store'
Substructure matches for Aspirin 
➲Each of these structures contains the query structure 
➲Word analogy results: 
●Store 
●drugstore 
●stores 
●stored 
●restore
Non-matching structure 
➲4-(acetyloxy)-benzoic acid is not a substructure match 
for aspirin because it does not contain the same 
arrangement of atoms and bonds 
➲Non-hits for Word search analogy: 
●story 
●storm 
●'stoor' 
➲Can be found using similarity search
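The word analogy can be tried directly. In this sketch a plain substring test stands in for substructure search, and `difflib.SequenceMatcher` from the Python standard library stands in for a similarity search. Both are stand-ins for illustration only: chemical search software compares atoms and bonds, not characters. The function names and the 0.7 threshold are assumptions for this example.

```python
from difflib import SequenceMatcher

WORDS = ["store", "drugstore", "stores", "stored", "restore",
         "story", "storm", "stoor"]
QUERY = "store"


def substring_hits(query, words):
    """'Substructure' analogy: keep words that contain the query intact."""
    return [w for w in words if query in w]


def similar_hits(query, words, threshold=0.7):
    """'Similarity' analogy: keep words whose match ratio reaches the threshold."""
    return [w for w in words
            if SequenceMatcher(None, query, w).ratio() >= threshold]


print(substring_hits(QUERY, WORDS))
print(similar_hits(QUERY, WORDS))
```

The first list leaves out 'story', 'storm' and 'stoor'; the fuzzy comparison brings them back, at the cost of having to pick a threshold.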
Structure search software 
➲Standalone programs 
●Run on server or desktop 
➲Client-server architectures 
➲Database cartridges 
●Provide chemical structure searching within a relational 
database 
●Commercially available 
●Add operators to store, search, retrieve and transform 
chemical structures within SQL 
●e.g. SELECT ID, MOLDEPICTION(STRUCT) FROM 
OUR_STRUCTURE_TABLE WHERE SUBSTRUCT(STRUCT, 
'CC(=O)Oc1ccccc1C(=O)O') =1 
●Client application must have a tool that can display connection 
tables as graphic chemical structures
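To get a feel for how a cartridge adds an operator to SQL, the sketch below registers a custom `SUBSTRUCT` function in an in-memory SQLite database using Python's standard `sqlite3` module. This is an analogy only, not how a commercial cartridge is implemented: the matcher here is a plain text test on SMILES strings, and the table contents are chosen for this example.

```python
import sqlite3


def substruct(stored, query):
    """Stand-in matcher: a substring test on text, NOT real substructure search."""
    return 1 if query in stored else 0


conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL under the name SUBSTRUCT.
conn.create_function("SUBSTRUCT", 2, substruct)

conn.execute(
    "CREATE TABLE our_structure_table (id INTEGER PRIMARY KEY, struct TEXT)"
)
conn.executemany(
    "INSERT INTO our_structure_table (id, struct) VALUES (?, ?)",
    [
        (1, "CC(=O)Oc1ccccc1C(=O)O"),  # aspirin
        (2, "CCO"),                    # ethanol
        (3, "OC(=O)c1ccccc1O"),        # salicylic acid
    ],
)

rows = conn.execute(
    "SELECT id FROM our_structure_table "
    "WHERE SUBSTRUCT(struct, ?) = 1 ORDER BY id",
    ("C(=O)O",),
).fetchall()
hit_ids = [row[0] for row in rows]
print(hit_ids)  # [1]
```

Salicylic acid also has a carboxylic acid group, but its SMILES string happens to write that group in a different order, so the text test misses it. That gap is the reason a real cartridge compares atoms and bonds rather than characters.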
Structure database operations 
➲Data stored in tables 
➲Data loading typically requires specialized 
software 
➲Indexing is non-typical 
➲Search operators are specific to the cartridge
How can you search a million 
chemical structures in seconds? 
➲Chemical databases hold hundreds of thousands or 
millions of structures 
➲Comparing atoms and bonds takes time! 
➲Users want answers quickly. 
➲Solution: a rapid screen-out step before looking at atoms and 
bonds. 
●Based on structure 'fingerprints' 
●Analyze input structures for features such as rings, 
atoms, connection patterns (O-X-X-N). 
●Create a bit string 
●Compare bit string of query structure with bit strings in 
database. 
●Bit string comparisons are very fast
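The screen-out step can be sketched with Python integers used as bit strings. The feature list and the per-molecule feature sets below are hand-assigned for illustration; a real system derives them from the connection table and uses far more bits. The screen only rules molecules out: anything that passes still needs the slower atom-by-atom comparison.

```python
# Hypothetical feature keys; each one owns a bit position in the fingerprint.
FEATURES = ["ring", "N", "O", "S", "C=O", "O-X-X-N"]


def fingerprint(features_present):
    """Build a bit string (as an int) with one bit set per feature present."""
    bits = 0
    for position, name in enumerate(FEATURES):
        if name in features_present:
            bits |= 1 << position
    return bits


def may_contain(candidate_fp, query_fp):
    """True if every bit set in the query is also set in the candidate."""
    return (candidate_fp & query_fp) == query_fp


# Hand-assigned feature sets, for illustration only.
database = {
    "aspirin": fingerprint({"ring", "O", "C=O"}),
    "ethanol": fingerprint({"O"}),
    "paracetamol": fingerprint({"ring", "N", "O", "C=O"}),
}

query_fp = fingerprint({"ring", "C=O"})
candidates = [name for name, fp in database.items()
              if may_contain(fp, query_fp)]
print(candidates)  # ['aspirin', 'paracetamol']
```

Each comparison is a single AND plus an equality test, which is why a large table can be screened quickly before any detailed matching starts.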
The ChemIDplus project
ChemIDplus 
➲“Dictionary of over 400,000 chemicals (names, 
synonyms, and structures) … (with) links to 
NLM and other databases and resources” 
➲Maintained by the Division of Specialized 
Information Services within the National Library 
of Medicine within the National Institutes of Health 
➲Used by people in industry, academia and 
government who handle drugs and chemicals 
and access environmental and safety data plus 
other biomedical information
ChemIDplus 
➲Part of a system of databases called 'Toxnet' at the 
National Library of Medicine http://toxnet.nlm.nih.gov/ 
➲Focus: 
●Chemical Information 
●Environmental Health and Toxicology 
●HIV / AIDS 
●Disaster Information 
➲Available on the web in 3 'flavors': 
●Full: http://chem.sis.nlm.nih.gov/chemidplus/ 
●'Lite': http://chem.sis.nlm.nih.gov/chemidplus/chemidlite.jsp 
●Ultralite: http://druginfo.nlm.nih.gov/drugportal/drugportal.jsp 
ChemIDplus Team 
➲George (Mike) Hazard – team leader 
➲Shannon Jordan 
➲Michael Chambers – developer 
➲Chuchu Lan – system administrator/DBA 
➲Jenny Fang 
➲Stefanie Publicker 
➲Larry Callahan, Frank Switzer – FDA liaisons
Historical Note 
➲ChemIDplus was one of the first structure-searchable 
databases on the World Wide Web 
➲Started in 1998 
➲Original developer
Server Architecture 
➲Tomcat Server 
●Servlets, JSPs, JS libraries 
➲Database Server 
●Chemical Data Cartridge 
●Database (Oracle): Structures, Names, Links, Properties
And now, the demo...
Scientific Geek's role in ChemIDplus 
➲Developer of the original system in 1998-9 in a since-retired technology 
➲Database administrator for structures 
●Upgrade between versions of the chemical search software 
●Periodic reindexing of the structures for performance 
●Batch updates 
●Help clean up invalid data 
➲Tester 
●Performed load testing when the application was migrated to Java 
servlets 
➲Liaison with other governmental agencies 
●Share structures with NCI, PubChem 
➲Structure orientation application 
●Tool to help ensure that series of chemical compounds look similar 
Structure table synchronization 
The old way 
➲Monthly manual process 
●Query structures recently added or 
changed 
●Extract to disk files 
●Generate data based on structure: InChI, 
SMILES, 3D coordinates 
●Register each item separately 
➲Took a couple of hours each month 
➲This was repetitious work
New system 
➲Database trigger detects a change when a value is 
inserted into or updated in a chemical structure field 
➲Computes and stores InChI and SMILES immediately 
➲Submits a batch job (DBMS_JOB package) for 3D 
●Deletes old 3D structure 
●Writes 2D structure to disk 
●Invokes Corina (Molecular Networks) program 
to generate 3D structure 
●Reads 3D structure into separate table
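The flow above splits the work in two: cheap identifiers are computed at once, and the slow 3D generation is deferred to a background job. The Python sketch below imitates only that control flow. Every name in it is hypothetical: the dictionaries stand in for database tables, the `deque` stands in for jobs submitted through DBMS_JOB, and the `compute_*` and `generate_3d` functions are placeholders that return labelled strings instead of calling any real chemistry software.

```python
from collections import deque

structure_table = {}  # id -> 2D structure plus derived identifiers
table_3d = {}         # id -> generated 3D structure
job_queue = deque()   # stand-in for the database job queue


def compute_inchi(structure_2d):
    return "INCHI(" + structure_2d + ")"    # placeholder, not a real InChI


def compute_smiles(structure_2d):
    return "SMILES(" + structure_2d + ")"   # placeholder, not a real SMILES


def generate_3d(structure_2d):
    return "3D(" + structure_2d + ")"       # placeholder for the external 3D program


def on_structure_change(struct_id, structure_2d):
    """Plays the role of the trigger: fast work now, slow work queued."""
    structure_table[struct_id] = {
        "structure_2d": structure_2d,
        "inchi": compute_inchi(structure_2d),
        "smiles": compute_smiles(structure_2d),
    }
    job_queue.append(struct_id)


def run_pending_jobs():
    """Plays the role of the batch job that rebuilds 3D structures."""
    while job_queue:
        struct_id = job_queue.popleft()
        table_3d.pop(struct_id, None)  # delete the old 3D structure, if any
        structure_2d = structure_table[struct_id]["structure_2d"]
        table_3d[struct_id] = generate_3d(structure_2d)
```

A caller would invoke `on_structure_change(...)` for each insert or update and let `run_pending_jobs()` run later, so the write itself is never held up by 3D generation.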
Orienting Structures Consistently 
➲Databases often contain 'families' of related 
compounds 
➲Example molecule and hits 
➲Manually manipulating 
structures takes time! 
Solution: 'StructClean' Utility 
➲Accepts a template structure + molecular weight 
●Locates all molecules in the DB that contain 
the template under the molecular weight cutoff 
●Without the cutoff, you might get huge 
molecules that contain a small template 
➲All hits are oriented to match the template 
➲User reviews hits 
●Selects/deselects items 
●Commits changes 
➲Utility is a Java servlet
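The selection step (template plus molecular weight cutoff) can be sketched as a simple filter. This is not the StructClean code, which the slides describe as a Java servlet. The `Molecule` class and `select_hits` function are hypothetical, the `contains` test is a text stand-in for a real substructure match, and the last record is an invented large molecule included only to show the cutoff at work.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    name: str
    structure: str     # SMILES-like text, used only by the stand-in matcher
    mol_weight: float


def contains(structure, template):
    """Stand-in for a substructure match: plain text containment."""
    return template in structure


def select_hits(molecules, template, max_weight):
    """Keep molecules that contain the template and fall under the cutoff."""
    return [m for m in molecules
            if m.mol_weight < max_weight and contains(m.structure, template)]


molecules = [
    Molecule("aspirin", "CC(=O)Oc1ccccc1C(=O)O", 180.16),
    Molecule("salicylic acid", "OC(=O)c1ccccc1O", 138.12),
    Molecule("ethanol", "CCO", 46.07),
    # Invented record: a large molecule that merely contains the template.
    Molecule("hypothetical large molecule", "c1ccccc1" + "C" * 60, 1000.0),
]

hits = select_hits(molecules, template="c1ccccc1", max_weight=300.0)
print([m.name for m in hits])  # ['aspirin', 'salicylic acid']
```

Only the selected hits would then go on to the orientation and user review steps.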
Conclusion 
➲ChemIDplus is a valuable resource to those 
looking for chemical information on the web 
➲Scientific-geek-consultants use a variety of 
technologies to provide service to research clients 
➲We are similar to regular geeks in many ways 
➲The differences are interesting!

Code camp 2014 Talk Scientific Thinking

  • 1. Geek Meets Science: ChemIDplus, an Example of Scientific Thinking Mitch Miller Scientific Thinking
  • 2. Overview ➲Introduce myself ➲My definition of a scientific geek consultant ➲An fast overview of Cheminformatics ➲Overview of the ChemIDplus project ➲The scientific geek's role in ChemIDplus
  • 3. Introduction: who am I? ➲Ph.D. chemist with 20+ years of experience in scientific information management ➲Currently independent consultant ➲Application developer, database person, requirements analyst, application first-aid ➲Main areas of focus: ●Chemical structure database management ●Managing data from high-throughput research
  • 4. The perspective of the scientific-geek- consultant ➲Is the scientific-geek-consultant's perspective on technology different from other geeks'? ➲Learn new technologies/frameworks/paradigms and take them in stride ➲What gets me excited is seeing a user able to do something that the user could not do yesterday ➲This talk is about one project in scientific information management and what I've done to give users access to what they could not do before
  • 5. Quick Introduction to Chemical Databases
  • 6. Representing Chemical Structures ➲This discussion is restricted to 2 dimensional (2D) structures which establish identity ➲Chemical structures can be represented graphically in a variety of ways. ➲ ➲ ➲ ➲ ➲ ➲To make structures searchable, you need a mathematical representation of the atoms and bonds: a connection table
  • 7. Searching for structures ➲Search for matches based on a graphic chemical system ●Start with a chemical of interest ●Find others like it ➲Several definitions of what makes one structure like another ●Exact match: find same molecule user input ●Substructure ●'Similarity' fuzzy match ➲Analogy: Word search for 'store'
  • 8. Substructure matches for Aspirin ➲Each of these ➲structures contains ➲the query structure ➲ ➲ ➲ ➲ ➲Word analogy results: ●Store ●drugstore ●stores ●stored ●restore
  • 9. Non-matching structure ➲4-(acetyloxy)-benzoic acid is not a substructure match for aspirin because it does not contain the same arrangement of atoms and bonds ➲ ➲ ➲ ➲ ➲ ➲Non-hits for Word search analogy: ●story ●storm ●'stoor' ➲Can be found using similarity search
  • 10. Structure search software ➲Standalone programs ●Ran on server or desktop ➲Client-server architectures ➲Database cartridges ●Provide chemical structure searching within a relational database ●Commercially available ●Add operators to store, search, retrieve and transform chemical structures within SQL ●e.g. SELECT ID, MOLDEPICTION(STRUCT) FROM OUR_STRUCTURE_TABLE WHERE SUBSTRUCT(STRUCT, 'CC(=O)Oc1ccccc1C(=O)O') =1 ●Client application must have a tool that can display connection tables as graphic chemical structures
  • 11. Structure database operations ➲Data stored in tables ➲Data loading typically requires specialized software ➲Indexing is non-typical ➲Search operators are specific to the cartridge
  • 12. How can you search a million chemical structures in seconds? ➲Chemical databases hold hundreds of thousands or millions of structures ➲Comparing atoms and bonds takes time! ➲Users want answers quickly. ➲Solution: rapid screen-out step before looking at atoms and bonds. ●Based on structure 'fingerprints' ●Analyze input structures for features such as rings, atoms, connection patterns (O-X-X-N). ●Create a bit string ●Compare bit string of query structure with bit strings in database. ●Bit string comparisons are very fast
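A minimal sketch of the screen-out step, assuming a tiny hand-picked feature list (real fingerprints use hundreds or thousands of bits, and the feature names here are hypothetical). Each structure gets one bit per feature; a candidate can only contain the query if every bit set in the query is also set in the candidate.

```python
# Hypothetical feature list; the position in the list is the bit position.
FEATURES = ["benzene_ring", "carboxylic_acid", "ester", "nitrogen", "chlorine"]


def fingerprint(features):
    """Build an integer bit string with one bit set per feature present."""
    bits = 0
    for feature in features:
        bits |= 1 << FEATURES.index(feature)
    return bits


def passes_screen(query_fp, candidate_fp):
    """True if the candidate has every feature the query has."""
    return (query_fp & candidate_fp) == query_fp


database = {
    "aspirin": fingerprint(["benzene_ring", "carboxylic_acid", "ester"]),
    "benzoic acid": fingerprint(["benzene_ring", "carboxylic_acid"]),
    "aniline": fingerprint(["benzene_ring", "nitrogen"]),
    "4-(acetyloxy)-benzoic acid": fingerprint(
        ["benzene_ring", "carboxylic_acid", "ester"]),
}

query_fp = fingerprint(["benzene_ring", "carboxylic_acid", "ester"])

# Only the survivors go on to the slow atom-by-atom comparison.
survivors = [name for name, fp in database.items()
             if passes_screen(query_fp, fp)]
```

Note that 4-(acetyloxy)-benzoic acid survives the screen even though it is not a substructure match for aspirin: the screen only rules candidates out, so the atom-and-bond comparison is still needed for whatever remains.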
  • 14. ChemIDplus ➲“Dictionary of over 400,000 chemicals (names, synonyms, and structures) … (with) links to NLM and other databases and resources” ➲Maintained by the Division of Specialized Information Services within the National Library of Medicine, part of the National Institutes of Health ➲Used by people in industry, academia and government who handle drugs and chemicals and access environmental and safety data plus other biomedical information
  • 15. ChemIDplus ➲Part of a system of databases called 'Toxnet' at National Library of Medicine http://toxnet.nlm.nih.gov/ ➲Focus: ●Chemical Information ●Environmental Health and Toxicology ●HIV / AIDS ●Disaster Information ➲Available on the web in 3 'flavors': ●Full: http://chem.sis.nlm.nih.gov/chemidplus/ ●'Lite': http://chem.sis.nlm.nih.gov/chemidplus/chemidlite.jsp ●Ultralite: http://druginfo.nlm.nih.gov/drugportal/drugportal.jsp
  • 16. ChemIDplus Team ➲George (Mike) Hazard – team leader ➲Shannon Jordan ➲Michael Chambers - developer ➲Chuchu Lan – system administrator/DBA ➲Jenny Fang ➲Stefanie Publicker ➲Larry Callahan, Frank Switzer – FDA liaisons
  • 17. Historical Note ➲ChemIDplus was one of the first structure-searchable databases on the worldwide web ➲Started in 1998 ➲Original developer
  • 18. Server Architecture ➲Tomcat Server ●Servlets, JSPs, JS libraries ➲Database Server ●Chemical Data Cartridge ●Database (Oracle): Structures, Names, Links, Properties
  • 19. And now, the demo...
  • 20. Scientific Geek's role in ChemIDplus ➲Developer of the original system in 1998-99 in a since-retired technology ➲Database administrator for structures ●Upgrade between versions of the chemical search software ●Periodic reindexing of the structures for performance ●Batch updates ●Help clean up invalid data ➲Tester ●Performed load testing when the application was migrated to Java servlets ➲Liaison with other governmental agencies ●Share structures with NCI, PubChem ➲Structure orientation application ●Tool to help ensure that series of chemical compounds look similar
  • 21. Structure table synchronization: the old way ➲Monthly manual process ●Query structures recently added or changed ●Extract to disk files ●Generate data based on structure: InChI, SMILES, 3D coordinates ●Register each item separately ➲Took a couple of hours each month ➲This was repetitious work
  • 22. New system ➲Database trigger detects a change when a value is inserted into or updated in a chemical structure field ➲Computes and stores InChI and SMILES immediately ➲Submits a batch job (DBMS_JOB package) for 3D ●Deletes old 3D structure ●Writes 2D structure to disk ●Invokes Corina (Molecular Networks) program to generate 3D structure ●Reads 3D structure into separate table
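The control flow of the new system can be sketched outside the database. This is not the actual trigger or DBMS_JOB code: the in-memory dictionaries stand in for tables, the queue stands in for the batch job mechanism, and `compute_inchi`, `compute_smiles` and `generate_3d` are hypothetical placeholders for the real generators (including the external 3D program).

```python
from collections import deque

structures = {}     # stand-in for the structure table: id -> record
structures_3d = {}  # stand-in for the separate 3D table: id -> 3D record
job_queue = deque() # stand-in for the submitted batch jobs


def compute_inchi(mol2d):
    return f"inchi-for:{mol2d}"   # placeholder, not a real InChI


def compute_smiles(mol2d):
    return f"smiles-for:{mol2d}"  # placeholder, not a real SMILES


def generate_3d(mol2d):
    return f"3d-for:{mol2d}"      # placeholder for the external 3D program


def on_structure_change(struct_id, mol2d):
    """Plays the role of the trigger: fast work now, slow work queued."""
    structures[struct_id] = {
        "mol2d": mol2d,
        "inchi": compute_inchi(mol2d),
        "smiles": compute_smiles(mol2d),
    }
    job_queue.append(struct_id)


def run_pending_jobs():
    """Plays the role of the batch job: replace each stale 3D structure."""
    while job_queue:
        struct_id = job_queue.popleft()
        structures_3d.pop(struct_id, None)  # delete old 3D structure
        mol2d = structures[struct_id]["mol2d"]
        structures_3d[struct_id] = generate_3d(mol2d)
```

The split matters because the quick identifiers are available as soon as the row changes, while the slower 3D generation runs later without holding up the insert or update.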
  • 23. Orienting Structures Consistently ➲Databases often contain 'families' of related compounds ➲Example molecule and hits ➲Manually manipulating structures takes time!
  • 24. Solution: 'StructClean' Utility ➲Accepts a template structure + molecular weight ●Locates all molecules in the DB that contain the template and fall under the molecular weight cutoff ●Without the cutoff, you might get huge molecules that contain a small template ➲All hits are oriented to match the template ➲User reviews hits ●Selects/deselects items ●Commits changes ➲Utility is a Java servlet
  • 25. Conclusion ➲ChemIDplus is a valuable resource to those looking for chemical information on the web ➲Scientific-geek-consultants use a variety of technologies to provide service to research clients ➲We are similar to regular geeks in many ways ➲The differences are interesting!