SlideShare uma empresa Scribd logo
1 de 35
Baixar para ler offline
RDkit Gems
Roger Sayle, PhD
NextMove Software LTD
cambridge, uk
introduction
• This presentation is intended as a short talktorial
describing real world code examples, recent
features and commonly used RDKit idioms.
• Although many refect personal/NextMove
Software experiences developing with the C++
APIs, most (though not all) observations apply
equally to the Java and Python APIs.
looping over molecule atoms
• Poor
for (RWMol::AtomIterator atomIt=mol->beginAtoms();
atomIt != mol->endAtoms(); atomIt++) …
• Better
for (RWMol::AtomIterator atomIt=mol->beginAtoms();
atomIt != mol->endAtoms(); ++atomIt) …
• The C++ post increment operator returns
the original value prior to the increment,
requiring a copy to made.
looping over aTOM's bonds
• Documented idiom:
... molPtr is a const ROMol * ...
... atomPtr is a const Atom * ...
ROMol::OEDGE_ITER beg,end;
boost::tie(beg,end) = molPtr->getAtomBonds(atomPtr);
while(beg!=end){
const Bond * bond=(*molPtr)[*beg].get();
... do something with the Bond ...
++beg;
}
looping over aTOM's bonds
• Proposed replacement:
... molPtr is a const ROMol * ...
... atomPtr is a const Atom * ...
ROMol::OBOND_ITER_PAIR bondIt;
bondIt = molPtr->getAtomBonds(atomPtr);
while(bondIt.first != bondIt.second){
const Bond * bond=(*molPtr)[*bondIt.first].get();
... do something with the Bond ...
++bondIt;
}
Accepting string arguments
• Good practice to accept both C++
constant literals and STL std::strings.
• Poor
void foo(const char *ptr) { foo(std::string(ptr)); }
void foo(const std::string str) { … }
• Better
void foo(const std::string &str) { foo(str.c_str()); }
void foo(const char *ptr) { … }
isoelectronic valences
• Supporting PF6
-
in RDKit currently requires
adding 7 to the neutral valences of P.
Instead, P-1
should be isoelectronic with S.
• Poor
valence(atomicNo,charge) ~= valence(atomicNo,0)-charge
• Better
valence(atomicNo,charge) ~= valence(atomicNo-charge,0)
representing reactions
• Two possible representations
– A new object containing vectors of reactant
and product (and agent?) molecules.
– Annotation of existing molecule with reaction
roles annotated on each atom.
• Both are representations are equivalent
and one may be converted to the other
and vice versa.
common ancestry/base class
• A significant advantage of avoiding new
object types is the common requirement in
cheminformatics of reading reactions,
molecules, queries, peptides, transforms,
and pharmacophores for the same file
format.
• OpenBabel and GGA's Indigo require you
to know the contents of a file/record (mol
vs. rxn) before reading it!?
Proposed rdkit interface
• Atom::setProp(“rxnRole”,(int)role)
– RXN_ROLE_NONE 0
– RXN_ROLE_REACTANT 1
– RXN_ROLE_AGENT 2
– RXN_ROLE_PRODUCT 3
• Atom::setProp(“rxnGroup”,(int)group)
• Atom::setProp(“molAtomMapNumber”,idx)
• RWMol::setProp(“isRxn”,isrxn?1:0)
and where this gets used...
• Code/GraphMol/FileParser/MolFileWriter.cpp:
• Function GetMolFileAtomLine contains
• Currently unused local variables
– rxnComponentType
– rxnComponentNumber
• This would allow support for CPSS-style
reactions [i.e. reactions in SD files]
inverted protein hierarchy
• Optimize data structures for common ops.
• Poor
for (modelIt mi=mol->getModels(); mi; ++mi)
for (chainIt ci=(*mi)->getChains(); ci; ++ci)
for (resIt ri=(*ci)->getResidues(); ri; ++ri)
for (atomIt ai=(*ri)->getAtoms(); ai; ++ai)
… /* Do something! */
• Better
for (atomIt ai=mol->getAtoms(); ai; ++ai)
… /* Do something! */
rdkit pdb file writer features
• By default
– Atoms are written as HETAM records
– Bonds are written as CONECT records
– Bond orders encoded by popular extension
– All atoms are uniquely named
– Molecule title written as COMPND record
– Formal charges are preserved.
– Default residue is “UNL” (not “LIG”).
example pdb format output
COMPND benzene
HETATM 1 C1 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 2 C2 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 3 C3 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 4 C4 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 5 C5 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 6 C6 UNL 1 0.000 0.000 0.000 1.00 0.00 C
CONECT 1 2 2 6
CONECT 2 3
CONECT 3 4 4
CONECT 4 5
CONECT 5 6 6
END
unique atom names
• Counts unique per element type
• C1..C99,CA0..CA9,CB0..CZ9,CAA...CZZ
• This allows for up to 1036 indices.
beware openbabel's lig residue
• PDB residue LIG is reserved for 3-pyridin-
4-yl-2,4-dihydro-indeno[1,2-c]pyrazole.
• Use of residue UNL (for UNknown Ligand)
is much preferred.
PDB file readers
• Not all connectivity is explicit
• Two options for determining bonds
– Template methods
– Proximity methods
• Two options for determining bond orders
– Template methods
– 3D geometry methods
• See Sayle, “PDB Cruft to Content”, MUG 2001
template-based bond orders
• Need only specify the double bonds
• Most (11/20) amino acids only need C=O.
• Arg: C=O,CZ=NH2
• Asn/Asp: C=O,CG=OD1
• Gln/Glu: C=O,CD=OE1
• His: C=O,CG=CD2,ND1=CE1
• Phe/Tyr: C=O,CG=CD1,CD2=CE2,CE1=CZ
• Trp: C=O,CG=CD1,CD2=CE2,CE3=CZ3,CZ2=CH2
switching on short strings
// These are macros to allow their use in C++ constants
#define BCNAM(A,B,C) (((A)<<16) | ((B)<<8) | (C))
#define BCATM(A,B,C,D) (((A)<<24) | ((B)<<16) | ((C)<<8) | (D))
switch (rescode) {
case BCNAM('A','L','A'):
case BCNAM('C','Y','S'):
case BCNAM('G','L','Y'):
case BCNAM('I','L','E'):
case BCNAM('L','E','U'):
...
if (atm1==BCATM(' ','C',' ',' ') && atm2==BCATM(' ','O',' ',' '))
return true;
break;
proximity bonding
• The ideal distance between two bonded
atoms is the sum of their covalent radii.
• The ideal distance between two non-
bonded atoms is the sum of their van der
Waal's radii.
• Assume bonding between all atoms closer
than the sum of Rcov plus a delta (0.45Å).
• Impose minimum distance threshold (0.4Å)
compare the square
• A rate limiting step in many (poorly written)
proximity calculations is the calculation of
square roots.
• Poor
if(sqrt(dx*dx+dy*dy+dz*dz) < radius) ...
• Better
if(dx*dx + dy*dy + dz*dz < radius*radius) …
c.f. http://gcc.gnu.org/ml/gcc-patches/2003-03/msg01683.html
fail early, fail fast (like pharma)
• Poor
return dx*dx + dy*dy + dz*dz <= radius2
• Better
dist2 = dx*dx
if (dist > radius2)
return false
dist2 += dy*dy
if (dist > radius2)
return false
dist2 += dz*dz
return dist2 <= radius2
Proximity algorithms
• Brute Force [FreON] N2
• Z (Ordinate) Sort [OpenBabel] NlogN
• Binary Space Partition [CDK] NlogN..N2
• Fixed Dimension Grid [RasMol] ~N
• Fixed Spacing Grid [Coot] ~N
• Grid Hashing [RDKit/OEChem] ~N
proximity comparison
proximity comparison
proximity comparison
Brute force
z (ordinate) sort
binary space partition (bsp)
fixed dimension grid
fixed spacing grid
finite fixed spacing grid
infinite fixed spacing grid
grid hashing
future work/wish list
• Atom reordering in RDKit.
• C++ compressed (gzip) file support.
• Non-forward PDB supplier.
• In-place removeHs/addHs.

Mais conteúdo relacionado

Destaque

Real estate industry in chittagong
Real estate industry in chittagongReal estate industry in chittagong
Real estate industry in chittagongAlexander Decker
 
10分でわかるRandom forest
10分でわかるRandom forest10分でわかるRandom forest
10分でわかるRandom forestYasunori Ozaki
 
機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論Taiji Suzuki
 
Частотный преобразователь
Частотный преобразовательЧастотный преобразователь
Частотный преобразовательkulibin
 
AQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organismsAQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organismssherinshaju
 
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelonaJournalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelonawww.patkane.global
 
Mobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows PhoneMobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows PhoneEdgewater
 
Content Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored UpdatesContent Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored UpdatesRebecca Feldman
 
22 of the best marketing quotes
22 of the best marketing quotes22 of the best marketing quotes
22 of the best marketing quotessherinshaju
 
PDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the policePDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the policeThe Star Newspaper
 

Destaque (18)

Real estate industry in chittagong
Real estate industry in chittagongReal estate industry in chittagong
Real estate industry in chittagong
 
Retailer 01-2013-preview
Retailer 01-2013-previewRetailer 01-2013-preview
Retailer 01-2013-preview
 
Mishimasyk141025
Mishimasyk141025Mishimasyk141025
Mishimasyk141025
 
R -> Python
R -> PythonR -> Python
R -> Python
 
Rdkitの紹介
Rdkitの紹介Rdkitの紹介
Rdkitの紹介
 
10分でわかるRandom forest
10分でわかるRandom forest10分でわかるRandom forest
10分でわかるRandom forest
 
機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論
 
Частотный преобразователь
Частотный преобразовательЧастотный преобразователь
Частотный преобразователь
 
AQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organismsAQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organisms
 
Tech
TechTech
Tech
 
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelonaJournalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
 
Cómo hacer presentaciones exitosas
Cómo hacer presentaciones exitosasCómo hacer presentaciones exitosas
Cómo hacer presentaciones exitosas
 
Mobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows PhoneMobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows Phone
 
Content Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored UpdatesContent Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored Updates
 
Diana maria morales hernandez actividad1 mapa_c
Diana maria morales hernandez actividad1 mapa_cDiana maria morales hernandez actividad1 mapa_c
Diana maria morales hernandez actividad1 mapa_c
 
BIOSTER Technology Research Institute
BIOSTER Technology Research InstituteBIOSTER Technology Research Institute
BIOSTER Technology Research Institute
 
22 of the best marketing quotes
22 of the best marketing quotes22 of the best marketing quotes
22 of the best marketing quotes
 
PDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the policePDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the police
 

Semelhante a RDKit Gems

MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...moiz89
 
Nptel cad2-06 capcitances
Nptel cad2-06 capcitancesNptel cad2-06 capcitances
Nptel cad2-06 capcitanceschenna_kesava
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structureBITS
 
Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14cjfoss
 
Design and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applicationsDesign and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applicationscuashok07
 
Proposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and ApplicationsProposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and ApplicationsBlaise Mouttet
 
90981041 control-system-lab-manual
90981041 control-system-lab-manual90981041 control-system-lab-manual
90981041 control-system-lab-manualGopinath.B.L Naidu
 
MetroScientific Week 1.pptx
MetroScientific Week 1.pptxMetroScientific Week 1.pptx
MetroScientific Week 1.pptxBipin Saha
 
Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1Kshitij Singh
 
write equations and waveforms for Variable Frequency converter and v.pdf
write equations and waveforms for Variable Frequency converter and v.pdfwrite equations and waveforms for Variable Frequency converter and v.pdf
write equations and waveforms for Variable Frequency converter and v.pdfsuresh640714
 
Benchmark Calculations of Atomic Data for Modelling Applications
 Benchmark Calculations of Atomic Data for Modelling Applications Benchmark Calculations of Atomic Data for Modelling Applications
Benchmark Calculations of Atomic Data for Modelling ApplicationsAstroAtom
 
PPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).pptPPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).ppttariqqureshi33
 

Semelhante a RDKit Gems (20)

En528032
En528032En528032
En528032
 
Lp seminar
Lp seminarLp seminar
Lp seminar
 
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
 
Nptel cad2-06 capcitances
Nptel cad2-06 capcitancesNptel cad2-06 capcitances
Nptel cad2-06 capcitances
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structure
 
Pot.ppt.pdf
Pot.ppt.pdfPot.ppt.pdf
Pot.ppt.pdf
 
Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14
 
Design and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applicationsDesign and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applications
 
Making a peaking filter by Julio Marqués
Making a peaking filter by Julio MarquésMaking a peaking filter by Julio Marqués
Making a peaking filter by Julio Marqués
 
Proposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and ApplicationsProposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and Applications
 
90981041 control-system-lab-manual
90981041 control-system-lab-manual90981041 control-system-lab-manual
90981041 control-system-lab-manual
 
MetroScientific Week 1.pptx
MetroScientific Week 1.pptxMetroScientific Week 1.pptx
MetroScientific Week 1.pptx
 
OpenMP And C++
OpenMP And C++OpenMP And C++
OpenMP And C++
 
Basic execution
Basic executionBasic execution
Basic execution
 
Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1
 
write equations and waveforms for Variable Frequency converter and v.pdf
write equations and waveforms for Variable Frequency converter and v.pdfwrite equations and waveforms for Variable Frequency converter and v.pdf
write equations and waveforms for Variable Frequency converter and v.pdf
 
Buck converter design
Buck converter designBuck converter design
Buck converter design
 
Benchmark Calculations of Atomic Data for Modelling Applications
 Benchmark Calculations of Atomic Data for Modelling Applications Benchmark Calculations of Atomic Data for Modelling Applications
Benchmark Calculations of Atomic Data for Modelling Applications
 
Wien2k getting started
Wien2k getting startedWien2k getting started
Wien2k getting started
 
PPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).pptPPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).ppt
 

Mais de NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedNextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKitNextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics DatabaseNextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 

Mais de NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 

Último

Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 

Último (20)

Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 

RDKit Gems

  • 1. RDkit Gems Roger Sayle, PhD NextMove Software LTD cambridge, uk
  • 2. introduction • This presentation is intended as a short talktorial describing real world code examples, recent features and commonly used RDKit idioms. • Although many refect personal/NextMove Software experiences developing with the C++ APIs, most (though not all) observations apply equally to the Java and Python APIs.
  • 3. looping over molecule atoms • Poor for (RWMol::AtomIterator atomIt=mol->beginAtoms(); atomIt != mol->endAtoms(); atomIt++) … • Better for (RWMol::AtomIterator atomIt=mol->beginAtoms(); atomIt != mol->endAtoms(); ++atomIt) … • The C++ post increment operator returns the original value prior to the increment, requiring a copy to made.
  • 4. looping over aTOM's bonds • Documented idiom: ... molPtr is a const ROMol * ... ... atomPtr is a const Atom * ... ROMol::OEDGE_ITER beg,end; boost::tie(beg,end) = molPtr->getAtomBonds(atomPtr); while(beg!=end){ const Bond * bond=(*molPtr)[*beg].get(); ... do something with the Bond ... ++beg; }
  • 5. looping over aTOM's bonds • Proposed replacement: ... molPtr is a const ROMol * ... ... atomPtr is a const Atom * ... ROMol::OBOND_ITER_PAIR bondIt; bondIt = molPtr->getAtomBonds(atomPtr); while(bondIt.first != bondIt.second){ const Bond * bond=(*molPtr)[*bondIt.first].get(); ... do something with the Bond ... ++bondIt; }
  • 6. Accepting string arguments • Good practice to accept both C++ constant literals and STL std::strings. • Poor void foo(const char *ptr) { foo(std::string(ptr)); } void foo(const std::string str) { … } • Better void foo(const std::string &str) { foo(str.c_str()); } void foo(const char *ptr) { … }
  • 7. isoelectronic valences • Supporting PF6 - in RDKit currently requires adding 7 to the neutral valences of P. Instead, P-1 should be isoelectronic with S. • Poor valence(atomicNo,charge) ~= valence(atomicNo,0)-charge • Better valence(atomicNo,charge) ~= valence(atomicNo-charge,0)
  • 8. representing reactions • Two possible representations – A new object containing vectors of reactant and product (and agent?) molecules. – Annotation of existing molecule with reaction roles annotated on each atom. • Both are representations are equivalent and one may be converted to the other and vice versa.
  • 9. common ancestry/base class • A significant advantage of avoiding new object types is the common requirement in cheminformatics of reading reactions, molecules, queries, peptides, transforms, and pharmacophores for the same file format. • OpenBabel and GGA's Indigo require you to know the contents of a file/record (mol vs. rxn) before reading it!?
  • 10. Proposed rdkit interface • Atom::setProp(“rxnRole”,(int)role) – RXN_ROLE_NONE 0 – RXN_ROLE_REACTANT 1 – RXN_ROLE_AGENT 2 – RXN_ROLE_PRODUCT 3 • Atom::setProp(“rxnGroup”,(int)group) • Atom::setProp(“molAtomMapNumber”,idx) • RWMol::setProp(“isRxn”,isrxn?1:0)
  • 11. and where this gets used... • Code/GraphMol/FileParser/MolFileWriter.cpp: • Function GetMolFileAtomLine contains • Currently unused local variables – rxnComponentType – rxnComponentNumber • This would allow support for CPSS-style reactions [i.e. reactions in SD files]
  • 12. inverted protein hierarchy • Optimize data structures for common ops. • Poor for (modelIt mi=mol->getModels(); mi; ++mi) for (chainIt ci=(*mi)->getChains(); ci; ++ci) for (resIt ri=(*ci)->getResidues(); ri; ++ri) for (atomIt ai=(*ri)->getAtoms(); ai; ++ai) … /* Do something! */ • Better for (atomIt ai=mol->getAtoms(); ai; ++ai) … /* Do something! */
  • 13. rdkit pdb file writer features • By default – Atoms are written as HETAM records – Bonds are written as CONECT records – Bond orders encoded by popular extension – All atoms are uniquely named – Molecule title written as COMPND record – Formal charges are preserved. – Default residue is “UNL” (not “LIG”).
  • 14. example pdb format output COMPND benzene HETATM 1 C1 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 2 C2 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 3 C3 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 4 C4 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 5 C5 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 6 C6 UNL 1 0.000 0.000 0.000 1.00 0.00 C CONECT 1 2 2 6 CONECT 2 3 CONECT 3 4 4 CONECT 4 5 CONECT 5 6 6 END
  • 15. unique atom names • Counts unique per element type • C1..C99,CA0..CA9,CB0..CZ9,CAA...CZZ • This allows for up to 1036 indices.
  • 16. beware openbabel's lig residue • PDB residue LIG is reserved for 3-pyridin- 4-yl-2,4-dihydro-indeno[1,2-c]pyrazole. • Use of residue UNL (for UNknown Ligand) is much preferred.
  • 17. PDB file readers • Not all connectivity is explicit • Two options for determining bonds – Template methods – Proximity methods • Two options for determining bond orders – Template methods – 3D geometry methods • See Sayle, “PDB Cruft to Content”, MUG 2001
  • 18. template-based bond orders • Need only specify the double bonds • Most (11/20) amino acids only need C=O. • Arg: C=O,CZ=NH2 • Asn/Asp: C=O,CG=OD1 • Gln/Glu: C=O,CD=OE1 • His: C=O,CG=CD2,ND1=CE1 • Phe/Tyr: C=O,CG=CD1,CD2=CE2,CE1=CZ • Trp: C=O,CG=CD1,CD2=CE2,CE3=CZ3,CZ2=CH2
  • 19. switching on short strings // These are macros to allow their use in C++ constants #define BCNAM(A,B,C) (((A)<<16) | ((B)<<8) | (C)) #define BCATM(A,B,C,D) (((A)<<24) | ((B)<<16) | ((C)<<8) | (D)) switch (rescode) { case BCNAM('A','L','A'): case BCNAM('C','Y','S'): case BCNAM('G','L','Y'): case BCNAM('I','L','E'): case BCNAM('L','E','U'): ... if (atm1==BCATM(' ','C',' ',' ') && atm2==BCATM(' ','O',' ',' ')) return true; break;
  • 20. proximity bonding • The ideal distance between two bonded atoms is the sum of their covalent radii. • The ideal distance between two non- bonded atoms is the sum of their van der Waal's radii. • Assume bonding between all atoms closer than the sum of Rcov plus a delta (0.45Å). • Impose minimum distance threshold (0.4Å)
  • 21. compare the square • A rate limiting step in many (poorly written) proximity calculations is the calculation of square roots. • Poor if(sqrt(dx*dx+dy*dy+dz*dz) < radius) ... • Better if(dx*dx + dy*dy + dz*dz < radius*radius) … c.f. http://gcc.gnu.org/ml/gcc-patches/2003-03/msg01683.html
  • 22. fail early, fail fast (like pharma) • Poor return dx*dx + dy*dy + dz*dz <= radius2 • Better dist2 = dx*dx if (dist > radius2) return false dist2 += dy*dy if (dist > radius2) return false dist2 += dz*dz return dist2 <= radius2
  • 23. Proximity algorithms • Brute Force [FreON] N2 • Z (Ordinate) Sort [OpenBabel] NlogN • Binary Space Partition [CDK] NlogN..N2 • Fixed Dimension Grid [RasMol] ~N • Fixed Spacing Grid [Coot] ~N • Grid Hashing [RDKit/OEChem] ~N
  • 35. future work/wish list • Atom reordering in RDKit. • C++ compressed (gzip) file support. • Non-forward PDB supplier. • In-place removeHs/addHs.