RDKit: where did we come from and where are we going?
Greg Landrum (@dr_greg_landrum)
12th International Conference on Chemical Structures
12 June, 2022
The Trustees of the CSA Trust are pleased to announce that
Greg Landrum has been awarded the 2022 Mike Lynch
Award, in recognition of his work on the development of
RDKit and his fostering of the community around it, a
transformative software resource for cheminformatics and
machine learning. https://csa-trust.org/2022/05/13/mike-lynch-award-2022-greg-landrum/
The purpose of the Award is to recognise and encourage outstanding
accomplishments in education, research and development activities that are
related to the systems and methods used to store, process and retrieve
information about chemical structures, reactions and properties.
The Mike Lynch Award will be presented at a prestigious, relevant conference
to be identified prior to each presentation and the awardee will be asked to
give a presentation at the conference. https://csa-trust.org/awards-and-grants/awards/
3
The RDKit
4
Acknowledgements
● Everyone who has contributed code, questions,
answers, bug reports, etc
● The people who manage RDKit packaging
● The organizers and sponsors of the RDKit
UGMs
● People who have funded RDKit development
(directly or indirectly)
● The others in our community who've been
pushing the idea and adoption of open source
5
An open source toolkit for cheminformatics
● Business-friendly BSD license
● Core data structures and algorithms in
C++
● Python 3.x wrapper generated using
Boost.Python
● Java and C# wrappers generated with
SWIG
● JavaScript wrappers
● CFFI wrapper for usage from other
languages
● 2D and 3D molecular operations
● Descriptor generation for machine
learning
● Molecular database cartridge for
PostgreSQL
● Cheminformatics nodes for KNIME
(distributed from the KNIME
community site:
http://www.knime.org/rdkit)
6
Ecosystem
Exact same implementation regardless of where you are using it from
7
Releases, reproducibility, and citability
● 2 feature releases per year
● ~monthly patch releases with bug fixes
● Every release is assigned a DOI and archived on Zenodo
https://zenodo.org/record/6483170
8
Packaging
- conda-forge: conda install -c conda-forge rdkit
- pypi: pip install rdkit-pypi
- npm: npm i @rdkit/rdkit
- apt: apt install python3-rdkit postgresql-14-rdkit
9
Sustainability: the bus problem
https://commons.wikimedia.org/wiki/File:Postauto_susten.jpg
RDKit maintainers:
- Greg
- Brian Kelley (Relay Therapeutics)
- Ricardo Rodriguez (Schrödinger)
- Paolo Tosco (Novartis)
Regular code contributors:
- David Cosgrove
- Peter Gedeck
- Gareth Jones
- Eisuke Kawashima
- Dan Nealschneider
- Sereina Riniker
- Roger Sayle
- Riccardo Vianello
The RDKit community
How it started…
The RDKit community
How it’s going…
Where we came from, where we’re going
14
The early days
● 2000-2006: initial development work at Rational Discovery
● 2006: code open sourced and released on sourceforge.net
15
Aside: some motivations for open-sourcing scientific code
● Recognition
● Helping the scientific community
● Feedback and help from others
● You get to keep using the code when you move on
to your next position
16
Some history
● 2000-2006: initial development work at Rational Discovery
● 2006: code open sourced and released on sourceforge.net
● 2007: First NIBR contribution (chemical reaction handling); Noel discovers the RDKit
● 2008: first POC of Java wrapper; Mac support added; SLN and Mol2 parsers;
● 2009: Morgan fingerprints; switch to cmake; switch to VF2 for SSS
● 2010: PostgreSQL cartridge; First iteration of the KNIME nodes; $RDBASE/Contrib appears;
SaltRemover and FunctionalGroups code
● 2011: New Java wrappers; more functionality moved to C++; InChI support; AvalonTools
integration
● 2012: First UGM; Speed improvements; MCS implementation; IPython integration; “RDKit
Cookbook” appears
● 2013: Move to GitHub; Pandas integration; MMFF and Open3DAlign support; PDB support;
RDKit blog started
17
Some history, continued
● 2014: Python 3 support; conda integration; experimental Lucene integration; MCS implementation in
C++
● 2015: new drawing code; improved canonicalization algorithm; ETKDG; reduced memory usage
● 2016: Regular patch releases; easier builds; performance improvements; KNIME nodes move to
GitHub
● 2017: Modern C++; R-group decomposition, first GSoC participation, conda-forge packages
● 2018: CoordGen integration; molecular standardization
● 2019: Azure DevOps, substructure speedup, new molecule hashing code, Neo4J integration, new JS
wrappers
● 2020: new CIP implementation, scaffold network, abbreviations, tautomer-insensitive substructure
search
● 2021: rdkit-cffi, more drawing improvements, R-group decomposition improvements
● 2022: C++17, generics for searching, non-tetrahedral symmetry…
An aside…
19
Looking forward
20
Longer term RDKit objectives
● Improved support for other classes of molecules
■ Polymers
■ Organometallics
● Ensuring that the PostgreSQL cartridge is a plausible
candidate for use in a corporate “data warehouse” (or
whatever such things are called these days)
● Ensuring all the pieces are in place to make it easy to
write a compound registration system
21
Future directions: the cartridge
Ensuring that the PostgreSQL cartridge is a plausible candidate
for use in a corporate “data warehouse”
- Integration of tautomer insensitive search
- Integration of the MolStandardize code
- Improvements to the chemical reaction handling
- Integration of the generics for searching
Further ideas
- Adding some 3D search capabilities
22
Future directions: registration systems
First: what is a chemical registration system?
23
Aside: Goals of a compound registration system
We want to be able to answer these questions:
- Have we seen this compound before?
- Give me a key for this compound
- Give me the structure for this key
So what do we need to be able to do?
- Standardize molecules
- Generate hashes/keys for standardized molecules
- Store structures
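The three questions above, and the three capabilities they require, can be sketched as a toy in-memory registry. This is only an illustration under stated assumptions: `Registry`, `toy_standardize`, and the use of SHA-1 are hypothetical choices, not part of the RDKit, and the whitespace-stripping "standardizer" is a stand-in for real chemical standardization.

```python
import hashlib


def toy_standardize(structure: str) -> str:
    """Placeholder for real standardization (assumption: a SMILES-like string)."""
    return structure.strip()


class Registry:
    """Toy in-memory compound registry mapping key -> standardized structure."""

    def __init__(self):
        self._structures = {}

    def key_for(self, structure: str) -> str:
        """Give me a key for this compound."""
        standardized = toy_standardize(structure)
        return hashlib.sha1(standardized.encode("utf-8")).hexdigest()

    def seen(self, structure: str) -> bool:
        """Have we seen this compound before?"""
        return self.key_for(structure) in self._structures

    def register(self, structure: str) -> str:
        """Store the structure if it is new and return its key."""
        key = self.key_for(structure)
        self._structures.setdefault(key, toy_standardize(structure))
        return key

    def structure_for(self, key: str) -> str:
        """Give me the structure for this key."""
        return self._structures[key]


registry = Registry()
key = registry.register("CCO")
print(registry.seen(" CCO "))       # True: identical after the toy standardization
print(registry.seen("OCC"))         # False: a real system would canonicalize first
print(registry.structure_for(key))  # CCO
```

The second lookup fails because the toy standardizer does no chemistry, which is why real standardization and hashing matter here.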
25
Using keys for registration
Idea: use a hash to combine:
- The molecular structure (via a fixed-H InChI)
- A stereo code
- A stereo comment
https://github.com/rdkit/UGM_2015/blob/8f562e70add17bab35f43823af0f03673f8a1f2d/Presentations/KeyToRegistration.GregLandrum.pdf
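A minimal sketch of the idea, assuming the three pieces are already available as strings. The function name `registration_key`, the JSON encoding, and SHA-1 are illustrative choices rather than what the linked presentation uses, and the structure string below is a placeholder, not a real InChI.

```python
import hashlib
import json


def registration_key(structure: str, stereo_code: str, stereo_comment: str) -> str:
    """Combine the three parts into one fixed-length key.

    JSON-encoding the list keeps the field boundaries unambiguous, so
    ("ab", "c", "") and ("a", "bc", "") cannot collide through concatenation.
    """
    payload = json.dumps([structure, stereo_code, stereo_comment])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
k1 = registration_key("<fixed-H InChI>", "racemic", "")
k2 = registration_key("<fixed-H InChI>", "single enantiomer", "absolute configuration unknown")
print(k1 == k2)  # False: same structure string, different stereo information
```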
26
Future directions: registration systems
Ensuring all the pieces are in place to make it easy to write a compound registration system
- Improvements to MolStandardize code
- Improvements to the molecular hashing code
- Support for other classes of molecules
27
Let’s talk about molecular identity
This isn’t just a topic for standard compound registration systems.
28
Molecular identity and computational questions
● Which molecules were used to generate this
result?
● Have I already done a calculation using this
molecule?
● Was this molecule part of my training set?
All of these require us to be able to answer
the question
“are these two molecules the same?”
Here be dragons…
29
Some things making molecular identity nontrivial
● Counterions, solvents
● Resonance forms
● Charges
● Tautomers
● Stereochemistry
Sometimes we care about these differences, sometimes we don’t. It depends on the context
in which we ask the question “are these two molecules the same?”
This is not a comprehensive list
31
Identity hashes for molecules
Idea: convert the molecule into some form which allows us to test whether or not it’s
identical to other molecules via a simple string (or numerical) comparison.
What “identical” means will be determined by the identity hash used.
Familiar examples:
- Canonical SMILES
- InChI
32
Contextual identity
Instead of having a single key/hash for a molecule, store a collection of layers with different
levels of detail/types of information. When searching, choose the layers which are relevant
for the current use case
● Store molecules using some relatively lossless format (e.g. v3000 SDF)
● Use molecular hashes capturing different levels of information to establish whether or
not duplicates exist
Note: it’s possible to do a limited version of this via careful manipulation of InChI strings
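As a sketch of layer selection, the records below are plain dictionaries and `same_on` is a hypothetical helper. The first record reuses layer values from the registration hash example in this deck; the second, without stereo, is an assumed record constructed for illustration.

```python
def same_on(record_a: dict, record_b: dict, layers: list) -> bool:
    """Return True if the two records agree on every requested layer."""
    return all(record_a[name] == record_b[name] for name in layers)


with_stereo = {
    "FORMULA": "C17H19N3O3S",
    "CANONICAL_SMILES": "COc1ccc2[nH]c([S@@](=O)Cc3ncc(C)c(OC)c3C)nc2c1",
    "NO_STEREO_SMILES": "COc1ccc2[nH]c(S(=O)Cc3ncc(C)c(OC)c3C)nc2c1",
}
# Assumed record: the same structure registered without stereo information.
without_stereo = {
    "FORMULA": "C17H19N3O3S",
    "CANONICAL_SMILES": "COc1ccc2[nH]c(S(=O)Cc3ncc(C)c(OC)c3C)nc2c1",
    "NO_STEREO_SMILES": "COc1ccc2[nH]c(S(=O)Cc3ncc(C)c(OC)c3C)nc2c1",
}

# Stereo-insensitive question: the two records count as the same.
print(same_on(with_stereo, without_stereo, ["FORMULA", "NO_STEREO_SMILES"]))  # True
# Stereo-sensitive question: they do not.
print(same_on(with_stereo, without_stereo, ["CANONICAL_SMILES"]))             # False
```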
33
Some more identity hashes
https://www.nextmovesoftware.com/talks/OBoyle_MolHash_ACS_201908.pdf
Available in the RDKit since the 2019.09 release
34
Some of the basic identity hashes in rdMolHash
● Molecular formula
● Anonymous graph
● Element graph
● Murcko scaffold
● Tautomer
● Canonical SMILES
There are many others
35
Hashes for registration
The team at Schrödinger (Chris Von Bargen, Hussein Faara, Dan Nealschneider, Ricardo
Rodriguez, Rachel Walker) have contributed a new RDKit module for calculating layered
hashes which are useful for compound identity testing and registration. This will be in the
2022.09 release.
Layers it currently supports:
- Formula
- Canonical SMILES: with and without stereo
- Tautomer hash: with and without stereo
- Sgroup data (for some help with polymers and things like atropisomers)
- “Escape layer” (free text allowing a structure to be different even if everything else says
it’s the same)
36
Registration hash example
{<HashLayer.CANONICAL_SMILES: 1>: 'COc1ccc2[nH]c([S@@](=O)Cc3ncc(C)c(OC)c3C)nc2c1',
<HashLayer.ESCAPE: 2>: '',
<HashLayer.FORMULA: 3>: 'C17H19N3O3S',
<HashLayer.NO_STEREO_SMILES: 4>: 'COc1ccc2[nH]c(S(=O)Cc3ncc(C)c(OC)c3C)nc2c1',
<HashLayer.NO_STEREO_TAUTOMER_HASH: 5>:
'CO[C]1[CH][CH][C]2[N][C]([S]([O])C[C]3[N][CH][C](C)[C](OC)[C]3C)[N][C]2[CH]1_1_0',
<HashLayer.SGROUP_DATA: 6>: '[]',
<HashLayer.TAUTOMER_HASH: 7>:
'CO[C]1[CH][CH][C]2[N][C]([S@@]([O])C[C]3[N][CH][C](C)[C](OC)[C]3C)[N][C]2[CH]1_1_0'}
37
Handling tautomers
{<HashLayer.CANONICAL_SMILES: 1>:
'CCCS(=O)(=O)Nc1ccc(F)c(C(=O)c2c[nH]c3ncc(-c4ccc(Cl)cc4)cc23)c1F',
<HashLayer.ESCAPE: 2>: '',
<HashLayer.FORMULA: 3>: 'C23H18ClF2N3O3S',
…
<HashLayer.TAUTOMER_HASH: 7>:
'CCCS([O])([O])[N][C]1[CH][CH][C](F)[C]([C]([O])[C]2[CH][N][C]3[N][CH][C]([C]4[CH][CH][C](Cl)[CH][CH]4)[CH][C]32)[C]1F_2_0'}
{<HashLayer.CANONICAL_SMILES: 1>:
'CCCS(=O)(=O)Nc1ccc(F)c(C(=O)c2cnc3[nH]cc(-c4ccc(Cl)cc4)cc2-3)c1F',
<HashLayer.ESCAPE: 2>: '',
<HashLayer.FORMULA: 3>: 'C23H18ClF2N3O3S',
…
<HashLayer.TAUTOMER_HASH: 7>:
'CCCS([O])([O])[N][C]1[CH][CH][C](F)[C]([C]([O])[C]2[CH][N][C]3[N][CH][C]([C]4[CH][CH][C](Cl)[CH][CH]4)[CH][C]32)[C]1F_2_0'}
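The effect can be checked directly on the two records above. A small sketch, with the layer strings copied from the slide and `group_by_layer` a hypothetical helper: grouping by canonical SMILES keeps the two tautomers apart, while grouping by tautomer hash merges them.

```python
from collections import defaultdict

# Tautomer hash shared by both records on the slide (split only for line length).
TAUTOMER_HASH = (
    "CCCS([O])([O])[N][C]1[CH][CH][C](F)[C]([C]([O])[C]2[CH][N][C]3[N][CH]"
    "[C]([C]4[CH][CH][C](Cl)[CH][CH]4)[CH][C]32)[C]1F_2_0"
)

records = [
    {
        "name": "tautomer A",
        "CANONICAL_SMILES": "CCCS(=O)(=O)Nc1ccc(F)c(C(=O)c2c[nH]c3ncc(-c4ccc(Cl)cc4)cc23)c1F",
        "TAUTOMER_HASH": TAUTOMER_HASH,
    },
    {
        "name": "tautomer B",
        "CANONICAL_SMILES": "CCCS(=O)(=O)Nc1ccc(F)c(C(=O)c2cnc3[nH]cc(-c4ccc(Cl)cc4)cc2-3)c1F",
        "TAUTOMER_HASH": TAUTOMER_HASH,
    },
]


def group_by_layer(records, layer):
    """Group record names by the value of one layer, keeping first-seen order."""
    groups = defaultdict(list)
    for record in records:
        groups[record[layer]].append(record["name"])
    return list(groups.values())


print(group_by_layer(records, "CANONICAL_SMILES"))  # [['tautomer A'], ['tautomer B']]
print(group_by_layer(records, "TAUTOMER_HASH"))     # [['tautomer A', 'tautomer B']]
```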
38
Handling atropisomers
Structures from: https://doi.org/10.1016/j.xphs.2021.10.011
The bold and hashed bonds are just drawing features and don’t survive translation
to things like CXSMILES or mol files. But we can use S groups to indicate the
stereochemistry
{<HashLayer.CANONICAL_SMILES: 1>:
'COc1cc2ncc3c(c2cc1-c1cn(C)nc1C)n(-c1c(F)cncc1OC)c(=O)n3C',
<HashLayer.ESCAPE: 2>: '',
<HashLayer.FORMULA: 3>: 'C23H21FN6O3',
…
<HashLayer.SGROUP_DATA: 6>:
'[{"fieldName": "atropisomer", "atom": [19, 20], "bonds": [], "value": "M"}]',
…}
{<HashLayer.CANONICAL_SMILES: 1>:
'COc1cc2ncc3c(c2cc1-c1cn(C)nc1C)n(-c1c(F)cncc1OC)c(=O)n3C',
<HashLayer.ESCAPE: 2>: '',
<HashLayer.FORMULA: 3>: 'C23H21FN6O3',
…
<HashLayer.SGROUP_DATA: 6>:
'[{"fieldName": "atropisomer", "atom": [19, 20], "bonds": [], "value": "P"}]',
…}
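A sketch of how such records might be compared, assuming the layers are held as plain dictionaries with the values shown above. `make_record` and `differing_layers` are hypothetical helpers; the Sgroup layer is JSON text, so it can be decoded with the standard library.

```python
import json

SMILES = "COc1cc2ncc3c(c2cc1-c1cn(C)nc1C)n(-c1c(F)cncc1OC)c(=O)n3C"


def make_record(label: str) -> dict:
    """Build a layer dictionary whose Sgroup layer carries the atropisomer label."""
    sgroup = [{"fieldName": "atropisomer", "atom": [19, 20], "bonds": [], "value": label}]
    return {
        "CANONICAL_SMILES": SMILES,
        "ESCAPE": "",
        "FORMULA": "C23H21FN6O3",
        "SGROUP_DATA": json.dumps(sgroup),
    }


def differing_layers(a: dict, b: dict) -> list:
    """Names of the layers on which two records disagree."""
    return sorted(name for name in a if a[name] != b[name])


m_form, p_form = make_record("M"), make_record("P")
print(differing_layers(m_form, p_form))               # ['SGROUP_DATA']
print(json.loads(m_form["SGROUP_DATA"])[0]["value"])  # M
```

Only the Sgroup layer separates the two atropisomers; every other layer shown is identical.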
41
Handling polymers
{<HashLayer.CANONICAL_SMILES: 1>: '*c1cnc(*)s1',
…,
<HashLayer.SGROUP_DATA: 6>:
'[{"type": "SRU", "atoms": [1, 2, 3, 4, 6], "bonds": [[0, 1], [4, 5]], "index": 1, "connect": "HT", "label": "n"}]',
…}
{<HashLayer.CANONICAL_SMILES: 1>: '*c1cnc(*)s1',
…,
<HashLayer.SGROUP_DATA: 6>:
'[{"type": "SRU", "atoms": [1, 2, 3, 4, 6], "bonds": [[0, 1], [4, 5]], "index": 1, "connect": "HH", "label": "n"}]',
…}
42
Handling enhanced stereochemistry
Ethambutol
These two describe the same racemic mixture
{<HashLayer.CANONICAL_SMILES: 1>:
'CC[C@@H](CO)NCCN[C@@H](CC)CO',
…,
<HashLayer.NO_STEREO_SMILES: 4>:
'CCC(CO)NCCNC(CC)CO',
…}
{<HashLayer.CANONICAL_SMILES: 1>:
'CC[C@@H](CO)NCCN[C@@H](CC)CO |&1:2,9|',
…,
<HashLayer.NO_STEREO_SMILES: 4>:
'CCC(CO)NCCNC(CC)CO',
…}
We get the same hash if the molecule is drawn with
wedged bonds.
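The `|&1:2,9|` suffix is a CXSMILES extension: `&1` marks an "and" stereo group and `2,9` are the atom indices it covers. A rough sketch of pulling that information out with the standard library follows; `parse_stereo_groups` is a hypothetical helper that only handles this simple case and is not a general CXSMILES parser.

```python
import re


def parse_stereo_groups(cxsmiles: str):
    """Split 'SMILES |ext|' into the SMILES and a list of
    (kind, group_id, atom_indices) tuples for '&' (and) / 'o' (or) groups."""
    smiles, _, extension = cxsmiles.partition(" ")
    groups = []
    for kind, group_id, atoms in re.findall(r"([&o])(\d+):([\d,]+)", extension.strip("|")):
        indices = [int(a) for a in atoms.split(",") if a]
        groups.append((kind, int(group_id), indices))
    return smiles, groups


smiles, groups = parse_stereo_groups("CC[C@@H](CO)NCCN[C@@H](CC)CO |&1:2,9|")
print(smiles)  # CC[C@@H](CO)NCCN[C@@H](CC)CO
print(groups)  # [('&', 1, [2, 9])]
```

Atoms 2 and 9 (counting from 0) are the two stereocentres in the SMILES string.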
44
Using the escape layer
Suppose I start with the racemic mixture, run it through a chiral column, and
collect the two fractions
I want to register the two fractions separately without determining the absolute
stereochemistry
{<HashLayer.CANONICAL_SMILES: 1>:
'CC[C@@H](CO)NCCN[C@@H](CC)CO |o1:2,9|',
<HashLayer.ESCAPE: 2>: 'first fraction',
…}
{<HashLayer.CANONICAL_SMILES: 1>:
'CC[C@@H](CO)NCCN[C@@H](CC)CO |o1:2,9|',
<HashLayer.ESCAPE: 2>: 'second fraction',
…}
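To see why this works, here is a sketch that treats each record's layers as a hashable tuple (an assumed representation; `layer_key` is a hypothetical helper). With empty escape layers the two fractions collapse into one entry; with the free-text labels they stay distinct.

```python
def layer_key(layers: dict) -> tuple:
    """Order-independent, hashable view of a layer dictionary."""
    return tuple(sorted(layers.items()))


SMILES = "CC[C@@H](CO)NCCN[C@@H](CC)CO |o1:2,9|"

# Two fractions registered without using the escape layer.
unlabeled = [{"CANONICAL_SMILES": SMILES, "ESCAPE": ""} for _ in range(2)]
# The same two fractions, distinguished only by free text in the escape layer.
labeled = [
    {"CANONICAL_SMILES": SMILES, "ESCAPE": "first fraction"},
    {"CANONICAL_SMILES": SMILES, "ESCAPE": "second fraction"},
]

print(len({layer_key(r) for r in unlabeled}))  # 1
print(len({layer_key(r) for r in labeled}))    # 2
```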
46
Aside: using the escape layer for comp chem
Suppose I want to store multiple conformers/poses of the same molecule
{…
<HashLayer.ESCAPE: 2>: 'conformer 1',
…}
{…
<HashLayer.ESCAPE: 2>: 'conformer 2',
…}
47
Wrapping up: molecular identity
● For many computational tasks we want to be
able to figure out whether or not we have
seen/used a particular molecule
● The definition of “same” for molecules
depends on the context/question being asked
● Layered registration hashes make it easy (and
cheap) to store sets of molecules and answer
the context-dependent “are these the same?”
question
48
Thanks!

Mike Lynch Award Lecture, ICCS 2022

  • 1. RDKit: where did we come from and where are we going? Greg Landrum (@dr_greg_landrum) 12th International Conference on Chemical Structures 12 June, 2022
  • 2. The Trustees of the CSA Trust are pleased to announce that Greg Landrum has been awarded the 2022 Mike Lynch Award, in recognition of his work on the development of RDKit and his fostering of the community around it, a transformative software resource for cheminformatics and machine learning. https://csa-trust.org/2022/05/13/mike-lynch-award-2022-greg-landrum/ The purpose of the Award is to recognise and encourage outstanding accomplishments in education, research and development activities that are related to the systems and methods used to store, process and retrieve information about chemical structures, reactions and properties. The Mike Lynch Award will be presented at a prestigious, relevant conference to be identified prior to each presentation and the awardee will be asked to give a presentation at the conference. https://csa-trust.org/awards-and-grants/awards/
  • 4. Acknowledgements ● Everyone who has contributed code, questions, answers, bug reports, etc. ● The people who manage RDKit packaging ● The organizers and sponsors of the RDKit UGMs ● People who have funded RDKit development (directly or indirectly) ● The others in our community who've been pushing the idea and adoption of open source
  • 5. 5 An open source toolkit for cheminformatics ● Business-friendly BSD license ● Core data structures and algorithms in C++ ● Python 3.x wrapper generated using Boost.Python ● Java and C# wrappers generated with SWIG ● JavaScript wrappers ● CFFI wrapper for usage from other languages ● 2D and 3D molecular operations ● Descriptor generation for machine learning ● Molecular database cartridge for PostgreSQL ● Cheminformatics nodes for KNIME (distributed from the KNIME community site: http://www.knime.org/rdkit)
  • 6. 6 Ecodesystem Exact same implementation regardless of where you are using it from
  • 7. 7 Releases, reproducibility, and citability ● 2 feature releases per year ● ~monthly patch releases with bug fixes ● Every release is assigned a DOI and archived on Zenodo https://zenodo.org/record/6483170
  • 8. 8 Packaging - conda-forge: conda install -c conda-forge rdkit - pypi: pip install rdkit-pypi - npm: npm i @rdkit/rdkit - apt: apt install python3-rdkit postgresql-14-rdkit
  • 9. 9 Sustainability: the bus problem https://commons.wikimedia.org/wiki/File:Postauto_susten.jpg
  • 10. 10 Sustainability: the bus problem RDKit maintainers: - Greg - Brian Kelley (Relay Therapeutics) - Ricardo Rodriguez (Schrödinger) - Paolo Tosco (Novartis) Regular code contributors: - David Cosgrove - Peter Gedeck - Gareth Jones - Eisuke Kawashima - Dan Nealschneider - Sereina Riniker - Roger Sayle - Riccardo Vianello
  • 11. The RDKit community How it started…
  • 12. The RDKit community How it’s going…
  • 13. Where we came from, where we’re going
  • 14. 14 The early days ● 2000-2006: initial development work at Rational Discovery ● 2006: code open sourced and released on sourceforge.net
  • 15. 15 Aside: some motivations for open-sourcing scientific code ● Recognition ● Helping the scientific community ● Feedback and help from others ● You get to keep using the code when you move on to your next position
  • 16. Some history ● 2000-2006: initial development work at Rational Discovery ● 2006: code open sourced and released on sourceforge.net ● 2007: First NIBR contribution (chemical reaction handling); Noel discovers the RDKit ● 2008: first POC of Java wrapper; Mac support added; SLN and Mol2 parsers ● 2009: Morgan fingerprints; switch to cmake; switch to VF2 for SSS ● 2010: PostgreSQL cartridge; First iteration of the KNIME nodes; $RDBASE/Contrib appears; SaltRemover and FunctionalGroups code ● 2011: New Java wrappers; more functionality moved to C++; InChI support; AvalonTools integration ● 2012: First UGM; Speed improvements; MCS implementation; IPython integration; “RDKit Cookbook” appears ● 2013: Move to github; Pandas integration; MMFF and Open3DAlign support; PDB support; rdkit blog started
  • 17. Some history, cont'd ● 2014: python3 support; conda integration; experimental lucene integration; MCS implementation in C++ ● 2015: new drawing code; improved canonicalization algorithm; ETKDG; reduced memory usage ● 2016: Regular patch releases; easier builds; performance improvements; KNIME nodes move to Github ● 2017: Modern C++; R-group decomposition, first GSoC participation, conda-forge packages ● 2018: CoordGen integration; molecular standardization ● 2019: Azure DevOps, substructure speedup, new molecule hashing code, Neo4J integration, new JS wrappers ● 2020: new CIP implementation, scaffold network, abbreviations, tautomer-insensitive substructure search ● 2021: rdkit-cffi, more drawing improvements, R-group decomposition improvements ● 2022: C++17, generics for searching, non-tetrahedral symmetry…
  • 20. Longer term RDKit objectives ● Improved support for other classes of molecules ■ Polymers ■ Organometallics ● Ensuring that the PostgreSQL cartridge is a plausible candidate for use in a corporate “data warehouse” (or whatever such things are called these days) ● Ensuring all the pieces are in place to make it easy to write a compound registration system
  • 21. 21 Future directions: the cartridge Ensuring that the PostgreSQL cartridge is a plausible candidate for use in a corporate “data warehouse” - Integration of tautomer insensitive search - Integration of the MolStandardize code - Improvements to the chemical reaction handling - Integration of the generics for searching Further ideas - Adding some 3D search capabilities
  • 22. 22 Future directions: registration systems First: what is a chemical registration system?
  • 23. 23 Aside: Goals of a compound registration system We want to be able to answer these questions: - Have we seen this compound before? - Give me a key for this compound - Give me the structure for this key
  • 24. 24 Aside: Goals of a compound registration system We want to be able to answer these questions: - Have we seen this compound before? - Give me a key for this compound - Give me the structure for this key So what do we need to be able to do? - Standardize molecules - Generate hashes/keys for standardized molecules - Store structures
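The three questions on this slide can be sketched as a toy in-memory registry. This is only an illustration, not RDKit code: the class name `CompoundRegistry`, its methods, and the SHA-1 key scheme are hypothetical, and the sketch assumes that the structure strings passed in have already been standardized and canonicalized elsewhere (for example as canonical SMILES).

```python
import hashlib


class CompoundRegistry:
    """Toy in-memory registry keyed by a hash of a standardized structure.

    Hypothetical illustration: a real system would standardize the molecule
    first and persist the records in a database.
    """

    def __init__(self):
        self._structure_by_key = {}

    @staticmethod
    def make_key(standardized_structure):
        # Assumed key scheme: SHA-1 hex digest of the structure string.
        return hashlib.sha1(standardized_structure.encode("utf-8")).hexdigest()

    def has_seen(self, standardized_structure):
        # "Have we seen this compound before?"
        return self.make_key(standardized_structure) in self._structure_by_key

    def register(self, standardized_structure):
        # "Give me a key for this compound" (stores it on first sight).
        key = self.make_key(standardized_structure)
        self._structure_by_key.setdefault(key, standardized_structure)
        return key

    def structure_for(self, key):
        # "Give me the structure for this key"
        return self._structure_by_key[key]


registry = CompoundRegistry()
ethanol_key = registry.register("CCO")
```

Registering the same standardized string twice returns the same key, which is what makes the "have we seen this before?" question a simple dictionary lookup.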
  • 25. Using keys for registration Idea: use a hash to combine: - The molecular structure (via a fixed-H InChI) - A stereo code - A stereo comment https://github.com/rdkit/UGM_2015/blob/8f562e70add17bab35f43823af0f03673f8a1f2d/Presentations/KeyToRegistration.GregLandrum.pdf
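A minimal sketch of the idea of combining the three parts into one key, assuming SHA-1 as the hash. The function name `registration_key`, the length-prefix framing, and the example values are assumptions made for illustration; the input strings are placeholders, not output verified against any InChI tool, and this is not the scheme from the linked presentation.

```python
import hashlib


def registration_key(fixed_h_inchi, stereo_code, stereo_comment):
    """Combine three text components into a single hex key (hypothetical)."""
    digest = hashlib.sha1()
    for part in (fixed_h_inchi, stereo_code, stereo_comment):
        data = part.encode("utf-8")
        # Prefix each part with its length so that ("ab", "c", "") and
        # ("a", "bc", "") cannot collapse to the same byte stream.
        digest.update(len(data).to_bytes(4, "big"))
        digest.update(data)
    return digest.hexdigest()


# Placeholder inputs, for illustration only.
key_unknown = registration_key("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3", "NONE", "")
key_commented = registration_key(
    "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3", "NONE", "batch from supplier X"
)
```

Because the stereo comment is part of the hashed input, two records with the same structure but different comments receive different keys.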
  • 26. Future directions: registration systems Ensuring all the pieces are in place to make it easy to write a compound registration system - Improvements to MolStandardize code - Improvements to the molecular hashing code - Support for other classes of molecules
  • 27. 27 Let’s talk about molecular identity This isn’t just a topic for standard compound registration systems.
  • 28. 28 Molecular identity and computational questions ● Which molecules were used to generate this result? ● Have I already done a calculation using this molecule? ● Was this molecule part of my training set? All of these require us to be able to answer the question “are these two molecules the same?” Here be dragons…
  • 29. 29 Some things making molecular identity nontrivial
  • 30. Some things making molecular identity nontrivial ● Counterions, solvents ● Resonance forms ● Charges ● Tautomers ● Stereochemistry Sometimes we care about these differences, sometimes we don’t. It depends on the context in which we ask the question “are these two molecules the same?” This is not a comprehensive list.
  • 31. 31 Identity hashes for molecules Idea: convert the molecule into some form which allows us to test whether or not it’s identical to other molecules via a simple string (or numerical) comparison. What “identical” means will be determined by the identity hash used. Familiar examples: - Canonical SMILES - InChI
  • 32. 32 Contextual identity Instead of having a single key/hash for a molecule, store a collection of layers with different levels of detail/types of information. When searching, choose the layers which are relevant for the current use case ● Store molecules using some relatively lossless format (e.g. v3000 SDF) ● Use molecular hashes capturing different levels of information to establish whether or not duplicates exist Note: it’s possible to do a limited version of this via careful manipulation of InChI strings
  • 33. 33 Some more identity hashes https://www.nextmovesoftware.com/talks/OBoyle_MolHash_ACS_201908.pdf Available in the RDKit since the 2019.09 release
  • 34. Some of the basic identity hashes in rdMolHash ● Molecular formula ● Anonymous graph ● Element graph ● Murcko scaffold ● Tautomer ● Canonical SMILES There are many others
  • 35. Hashes for registration The team at Schrödinger (Chris Von Bargen, Hussein Faara, Dan Nealschneider, Ricardo Rodriguez, Rachel Walker) have contributed a new RDKit module for calculating layered hashes which are useful for compound identity testing and registration. This will be in the 2022.09 release. Layers it currently supports: - Formula - Canonical SMILES: with and without stereo - Tautomer hash: with and without stereo - Sgroup data (for some help with polymers and things like atropisomers) - “Escape layer” (free text allowing a structure to be different even if everything else says it’s the same)
  • 36. 36 Registration hash example {<HashLayer.CANONICAL_SMILES: 1>: 'COc1ccc2[nH]c([S@@](=O)Cc3ncc(C)c(OC)c3C)nc2c1', <HashLayer.ESCAPE: 2>: '', <HashLayer.FORMULA: 3>: 'C17H19N3O3S', <HashLayer.NO_STEREO_SMILES: 4>: 'COc1ccc2[nH]c(S(=O)Cc3ncc(C)c(OC)c3C)nc2c1', <HashLayer.NO_STEREO_TAUTOMER_HASH: 5>: 'CO[C]1[CH][CH][C]2[N][C]([S]([O])C[C]3[N][CH][C](C)[C](OC)[C]3C)[N][C]2[CH]1_1_0', <HashLayer.SGROUP_DATA: 6>: '[]', <HashLayer.TAUTOMER_HASH: 7>: 'CO[C]1[CH][CH][C]2[N][C]([S@@]([O])C[C]3[N][CH][C](C)[C](OC)[C]3C)[N][C]2[CH]1_1_0'}
  • 37. Handling tautomers {<HashLayer.CANONICAL_SMILES: 1>: 'CCCS(=O)(=O)Nc1ccc(F)c(C(=O)c2c[nH]c3ncc(-c4ccc(Cl)cc4)cc23)c1F', <HashLayer.ESCAPE: 2>: '', <HashLayer.FORMULA: 3>: 'C23H18ClF2N3O3S', … <HashLayer.TAUTOMER_HASH: 7>: 'CCCS([O])([O])[N][C]1[CH][CH][C](F)[C]([C]([O])[C]2[CH][N][C]3[N][CH][C]([C]4[CH][CH][C](Cl)[CH][CH]4)[CH][C]32)[C]1F_2_0'} {<HashLayer.CANONICAL_SMILES: 1>: 'CCCS(=O)(=O)Nc1ccc(F)c(C(=O)c2cnc3[nH]cc(-c4ccc(Cl)cc4)cc2-3)c1F', <HashLayer.ESCAPE: 2>: '', <HashLayer.FORMULA: 3>: 'C23H18ClF2N3O3S', … <HashLayer.TAUTOMER_HASH: 7>: 'CCCS([O])([O])[N][C]1[CH][CH][C](F)[C]([C]([O])[C]2[CH][N][C]3[N][CH][C]([C]4[CH][CH][C](Cl)[CH][CH]4)[CH][C]32)[C]1F_2_0'}
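The two layer sets on this slide show why layers are useful: the canonical SMILES differ, but the tautomer hash is the same. A rough sketch of choosing which layers to compare, using plain dictionaries with the values from the slide (whitespace introduced by text extraction removed). The helper `same_molecule` and the string layer names are hypothetical stand-ins, not the RDKit API.

```python
# Tautomer hash shown for both structures on the slide (identical strings).
SHARED_TAUTOMER_HASH = (
    "CCCS([O])([O])[N][C]1[CH][CH][C](F)[C]([C]([O])[C]2[CH][N][C]3[N][CH]"
    "[C]([C]4[CH][CH][C](Cl)[CH][CH]4)[CH][C]32)[C]1F_2_0"
)

tautomer_a = {
    "CANONICAL_SMILES": "CCCS(=O)(=O)Nc1ccc(F)c(C(=O)c2c[nH]c3ncc(-c4ccc(Cl)cc4)cc23)c1F",
    "FORMULA": "C23H18ClF2N3O3S",
    "TAUTOMER_HASH": SHARED_TAUTOMER_HASH,
}
tautomer_b = {
    "CANONICAL_SMILES": "CCCS(=O)(=O)Nc1ccc(F)c(C(=O)c2cnc3[nH]cc(-c4ccc(Cl)cc4)cc2-3)c1F",
    "FORMULA": "C23H18ClF2N3O3S",
    "TAUTOMER_HASH": SHARED_TAUTOMER_HASH,
}


def same_molecule(layers_a, layers_b, layer_names):
    """Return True if both records agree on every requested layer."""
    return all(layers_a[name] == layers_b[name] for name in layer_names)


# Tautomer-sensitive question: the two records are different.
strict_match = same_molecule(tautomer_a, tautomer_b, ("FORMULA", "CANONICAL_SMILES"))
# Tautomer-insensitive question: the two records are the same.
loose_match = same_molecule(tautomer_a, tautomer_b, ("FORMULA", "TAUTOMER_HASH"))
```

Here `strict_match` is `False` and `loose_match` is `True`: the same stored data answers both versions of "are these the same?" depending on the layers selected.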
  • 38. 38 Handling atropisomers Structures from: https://doi.org/10.1016/j.xphs.2021.10.011
  • 39. 39 Handling atropisomers Structures from: https://doi.org/10.1016/j.xphs.2021.10.011 The bold and hashed bonds are just drawing features and don’t survive translation to things like CXSMILES or mol files. But we can use S groups to indicate the stereochemistry
  • 40. Handling atropisomers Structures from: https://doi.org/10.1016/j.xphs.2021.10.011 {<HashLayer.CANONICAL_SMILES: 1>: 'COc1cc2ncc3c(c2cc1-c1cn(C)nc1C)n(-c1c(F)cncc1OC)c(=O)n3C', <HashLayer.ESCAPE: 2>: '', <HashLayer.FORMULA: 3>: 'C23H21FN6O3', … <HashLayer.SGROUP_DATA: 6>: '[{"fieldName": "atropisomer", "atom": [19, 20], "bonds": [], "value": "M"}]', …} {<HashLayer.CANONICAL_SMILES: 1>: 'COc1cc2ncc3c(c2cc1-c1cn(C)nc1C)n(-c1c(F)cncc1OC)c(=O)n3C', <HashLayer.ESCAPE: 2>: '', <HashLayer.FORMULA: 3>: 'C23H21FN6O3', … <HashLayer.SGROUP_DATA: 6>: '[{"fieldName": "atropisomer", "atom": [19, 20], "bonds": [], "value": "P"}]', …}
  • 41. 41 Handling polymers {<HashLayer.CANONICAL_SMILES: 1>: '*c1cnc(*)s1', …, <HashLayer.SGROUP_DATA: 6>: '[{"type": "SRU", "atoms": [1, 2, 3, 4, 6], "bonds": [[0, 1], [4, 5]], "index": 1, "connect": "HT", "label": "n"}]', …} {<HashLayer.CANONICAL_SMILES: 1>: '*c1cnc(*)s1', …, <HashLayer.SGROUP_DATA: 6>: '[{"type": "SRU", "atoms": [1, 2, 3, 4, 6], "bonds": [[0, 1], [4, 5]], "index": 1, "connect": "HH", "label": "n"}]', …}
  • 42. 42 Handling enhanced stereochemistry Ethambutol These two describe the same racemic mixture
  • 43. 43 Handling enhanced stereochemistry {<HashLayer.CANONICAL_SMILES: 1>: 'CC[C@@H](CO)NCCN[C@@H](CC)CO', …, <HashLayer.NO_STEREO_SMILES: 4>: 'CCC(CO)NCCNC(CC)CO', …} {<HashLayer.CANONICAL_SMILES: 1>: 'CC[C@@H](CO)NCCN[C@@H](CC)CO |&1:2,9|', …, <HashLayer.NO_STEREO_SMILES: 4>: 'CCC(CO)NCCNC(CC)CO', …} We get the same hash if the molecule is drawn with wedged bonds.
  • 44. 44 Using the escape layer Suppose I start with the racemic mixture, run it through a chiral column, and collect the two fractions I want to register the two fractions separately without determining the absolute stereochemistry
  • 44. Using the escape layer Suppose I start with the racemic mixture, run it through a chiral column, and collect the two fractions. I want to register the two fractions separately without determining the absolute stereochemistry.
  • 46. 46 Aside: using the escape layer for comp chem {… <HashLayer.ESCAPE: 2>: ‘conformer 1', …} {… <HashLayer.ESCAPE: 2>: ‘conformer 2', …} Suppose I want to store multiple conformers/poses of the same molecule
  • 47. 47 Wrapping up: molecular identity ● For many computational tasks we want to be able to figure out whether or not we have seen/used a particular molecule ● The definition of “same” for molecules depends on the context/question being asked ● Layered registration hashes make it easy (and cheap) to store sets of molecules and answer the context-dependent “are these the same?” question