Lessons from ChEMBL Willem P. van Hoorn Senior Solutions Consultant Willem.vanhoorn@accelrys.com
Contents. Those who cannot remember the past are condemned to repeat it.
‘Nasties’: things you would not like to see in your hits. Specifically: reactive/labile chemical groups. Is the compound still on the plate? Is the activity due to (selective) non-covalent binding? Some overlap with frequent hitters/aggregators. Peroxides, aldehydes, etc. Not ‘structural alerts’: off-target toxicity; toxic compounds after metabolic activation; hERG binders, anilines, etc.
This is not a new concept. If you are a chemist you know many of these. If you have been working in pharma you know more of these. Pharma companies probably all have their in-house list of ‘forbidden/risky/ugly’ structures. Some publications but no definitive public list. Thus reinvention of the wheel, wasted effort.
ChEMBL as a teacher. ChEMBL: “the most comprehensive ever seen in a public database” (Wikipedia); “…cover a significant fraction of the SAR and discovery of modern drugs” (ChEMBL website). This must be a good source to learn what goes: experienced scientists who cared enough about compounds to measure the activity and submit the results to peer-reviewed journals.
ChEMBL as a teacher. To learn we also need to know what not to do: compound vendor catalogues. Fewer constraints on reactivity / stability. Drive for diversity. More customers than just pharma. Should be enriched in nasties compared to ChEMBL.
Lesson 1. ChEMBL Release 7: dump all compounds, keep largest fragment; unique canonical SMILES: 597,255. Vendor reagents: Pipeline Pilot examples, Maybridge + Asinex; 186,967 unique compounds. Build Bayesian model ‘reagentlike’: vendor “good” v. ChEMBL “baseline”. What do reagents have in common that ChEMBL compounds don’t?
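The slides do not give the formula behind the ‘reagentlike’ Bayesian model. As a rough illustration only, the sketch below assumes the Laplacian-corrected naive Bayes weighting that is commonly described for fingerprint-based Bayesian models: a feature seen in N compounds, A of them in the “good” class, gets weight log((A + 1) / (N·P + 1)), where P is the overall fraction of “good” compounds. The function names and the toy feature labels are hypothetical; real input would be fingerprint features of the vendor (“good”) and ChEMBL (“baseline”) compounds.

```python
import math
from collections import Counter


def train_laplacian_bayes(samples):
    """samples: iterable of (feature_set, is_good). Returns {feature: weight}."""
    samples = list(samples)
    base_rate = sum(1 for _, good in samples if good) / len(samples)
    seen = Counter()       # number of compounds containing each feature
    seen_good = Counter()  # same, restricted to the "good" class
    for feats, good in samples:
        for f in set(feats):
            seen[f] += 1
            if good:
                seen_good[f] += 1
    # Assumed Laplacian-corrected weight: log((A + 1) / (N * P + 1))
    return {f: math.log((seen_good[f] + 1) / (seen[f] * base_rate + 1)) for f in seen}


def score(weights, feats):
    """Sum the weights of the features present; unseen features contribute 0."""
    return sum(weights.get(f, 0.0) for f in set(feats))


# Toy data with made-up feature labels (True = vendor reagent, False = ChEMBL).
toy = [
    ({"phenyl", "aldehyde"}, True),
    ({"phenyl", "peroxide"}, True),
    ({"phenyl", "amide"}, False),
    ({"amide", "sugar"}, False),
]
weights = train_laplacian_bayes(toy)
```

Features that occur mostly in the vendor set end up with positive weights and features that occur mostly in ChEMBL with negative ones, which is what the atom colouring on the later slides visualises.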
Reagentlike model. Training/test: random 80% / 20%. Excellent separation ChEMBL / reagent. Plots: leave-one-out enrichment; test set enrichment.
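A minimal sketch of the evaluation setup, assuming a plain shuffled 80/20 split and reading “enrichment” as the fraction of all positives recovered in the top-scoring fraction of a ranked list. Both function names are hypothetical, and the slides do not say how the enrichment plots were computed.

```python
import random


def split_80_20(items, seed=0):
    """Shuffle with a fixed seed and return (training 80%, test 20%)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def enrichment_at(scored, fraction):
    """scored: list of (score, is_positive).

    Returns the share of all positives found in the top `fraction`
    of the list when ranked by descending score.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(fraction * len(ranked))))
    total_pos = sum(1 for _, pos in ranked if pos)
    found = sum(1 for _, pos in ranked[:n_top] if pos)
    return found / total_pos if total_pos else 0.0


train, test = split_80_20(range(100))
```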
Done?
A look at high and low scoring compounds. Colour atoms by contribution to Bayesian score. Red: high contribution, reagent-like. Blue: low contribution, not reagent-like. Colour gradient over set of molecules.
High scoring molecules
More high scoring molecules. They do contain ‘nasty’ groups… but they don’t stand out against the rest of the molecules (all red).
Low scoring molecules, etc.
High and low scoring reagent features (panels: high scoring features; low scoring features). Seen 1029 times, of which in reagent set 1024 times. Many variations of peptide bond and other polypeptide features: 635 out of 639 in reagent set.
Conclusions from lesson 1. Learning the difference was too easy: small organic vs large polypeptide. Both sets contain many series; the model learns the common core instead of (nasty) decorations. Metric: compounds / Murcko frames: ChEMBL ~6.7, reagent ~9.0. Number of frames / in common: ~81k / ~6k. I need to resit this class.
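The compounds-per-frame metric is simple to compute once each compound has been reduced to its Murcko frame. The sketch below assumes the frames are already available as strings (for example from a cheminformatics toolkit); the helper names and the toy frame strings are made up for illustration.

```python
def compounds_per_frame(frames):
    """frames: one frame identifier per compound. Higher values mean more series."""
    return len(frames) / len(set(frames))


def shared_frames(frames_a, frames_b):
    """Frames that occur in both sets."""
    return set(frames_a) & set(frames_b)


# Toy frame strings, one per compound (not real data).
chembl_frames = ["c1ccccc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1"]
reagent_frames = ["c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccoc1"]

ratio_chembl = compounds_per_frame(chembl_frames)
ratio_reagent = compounds_per_frame(reagent_frames)
common = shared_frames(chembl_frames, reagent_frames)
```

A high ratio together with a small overlap in frames is the situation the slide describes: the model can separate the two sets by recognising cores rather than decorations.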
Lesson 2: Rebalancing the training set. Restrict to organic small molecules: AlogP < 6, Mw < 600, organic compound filter. Bayesian model with ECFP_2 (smaller features compared to ECFP_6): less likely to capture whole core.
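The property cut-offs can be expressed as a simple record filter. This is a sketch under stated assumptions: the record keys (`alogp`, `mw`, `elements`) are hypothetical, the property values are made up, and the element whitelist stands in for the “organic compound filter”, which the slides do not define.

```python
# Assumption: a typical "organic" element set; the slides do not define the filter.
ORGANIC_ELEMENTS = {"C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"}


def is_small_organic(record, max_alogp=6.0, max_mw=600.0):
    """True if the record passes AlogP < 6, Mw < 600 and the assumed element check."""
    elements = set(record["elements"])
    return (
        record["alogp"] < max_alogp
        and record["mw"] < max_mw
        and "C" in elements
        and elements <= ORGANIC_ELEMENTS
    )


# Toy records with invented values.
records = [
    {"name": "toy_small", "alogp": 2.1, "mw": 310.0, "elements": ["C", "H", "N", "O"]},
    {"name": "toy_peptide", "alogp": -3.0, "mw": 1200.0, "elements": ["C", "H", "N", "O", "S"]},
    {"name": "toy_metal", "alogp": 1.0, "mw": 300.0, "elements": ["C", "H", "N", "Pt"]},
]
kept = [r["name"] for r in records if is_small_organic(r)]
```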
Still a predictive model
A typical high scoring compound: ~neutral score for parts presumed common to both sets, like phenyl; ~positive score for nasty parts.
Low scoring example: many sugars, phosphates, steroids, etc.
Some ECFP_2 features (panels: high scoring features; low scoring features).
Conclusions from lesson 2. Less learning of “series by template”, but it still happens: no need to capture the whole ring to capture sugar, steroid, etc. Some of the expected nasty features found, but many are not. Better training set needed: series similar in both clean/nasty training sets, so that the difference is not the template. Many ChEMBL compounds are odd. I have still not learned the lesson.
Lesson 3: Learning from (big) pharma. ChEMBL, what I should have started with: all compounds with IC50 or Ki expressed in nM, against human target, include reference (journal, volume, year, page). 569,569 activities; 223,896 compounds; 14,383 references.
Looking up author affiliation in PubMed: NCBI Entrez Utilities Web Service (Text Analytics component collection). This takes ~4 hours in a weekend (PubMed usage restriction). 564,422 activities; 214,747 compounds.
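The slides used a Pipeline Pilot component for the lookup. For readers without it, the same information can be requested from the NCBI E-utilities `efetch` endpoint; the sketch below only builds the request URL and parses affiliations out of a response, without making a network call. The helper names and the sample XML (modelled on the PubMed record layout, with an invented affiliation) are assumptions for illustration. A real run would also need to respect NCBI's usage limits.

```python
import urllib.parse
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmids):
    """Build an efetch URL that returns PubMed records as XML."""
    query = urllib.parse.urlencode(
        {"db": "pubmed", "id": ",".join(str(p) for p in pmids), "retmode": "xml"}
    )
    return f"{EFETCH}?{query}"


def affiliations(xml_text):
    """Map PMID -> list of affiliation strings found in an efetch XML response."""
    root = ET.fromstring(xml_text)
    result = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        result[pmid] = [a.text for a in article.iter("Affiliation") if a.text]
    return result


# Hand-written sample response (not a real record).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <AuthorList>
          <Author>
            <AffiliationInfo>
              <Affiliation>Department of Chemistry, Example University</Affiliation>
            </AffiliationInfo>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

parsed = affiliations(SAMPLE_XML)
```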
Top 10 affiliations
Where is Pfizer? And 318 more… Similar for other contributors
Merging affiliations
if DocAuthorsAffiliation rlike 'univers|Faculty|hospital|National.*Institute.*Health|Polytechnic' then
    Published_by := 'Academic';
elsif DocAuthorsAffiliation rlike 'Pfizer' then
    Published_by := 'Pfizer';
elsif DocAuthorsAffiliation rlike 'warner.*lambert|parke.*davis' then
    Published_by := 'Warner-Lambert';
elsif DocAuthorsAffiliation rlike 'Pharmacia|Upjohn' then
    Published_by := 'Pharmacia';
elsif DocAuthorsAffiliation rlike 'Wyeth' then
    Published_by := 'Wyeth';
elsif DocAuthorsAffiliation rlike 'Merck' then
    Published_by := 'Merck';
…
else
    Published_by := 'Other';
end if;
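The same first-match-wins rule chain can be written in Python with the `re` module. This is a translation sketch, not the author's code: the patterns are copied from the slide, the rules elided by “…” stay elided, and case-insensitive matching is an assumption based on the mixed-case patterns.

```python
import re

# Order matters: the first matching rule wins, as in the if/elsif chain.
RULES = [
    ("Academic", r"univers|Faculty|hospital|National.*Institute.*Health|Polytechnic"),
    ("Pfizer", r"Pfizer"),
    ("Warner-Lambert", r"warner.*lambert|parke.*davis"),
    ("Pharmacia", r"Pharmacia|Upjohn"),
    ("Wyeth", r"Wyeth"),
    ("Merck", r"Merck"),
    # ... the slide elides further rules here
]
# Assumption: matching is case-insensitive.
COMPILED = [(label, re.compile(pattern, re.IGNORECASE)) for label, pattern in RULES]


def published_by(affiliation):
    """Return the label of the first rule that matches, else 'Other'."""
    for label, pattern in COMPILED:
        if pattern.search(affiliation):
            return label
    return "Other"
```

Because the academic rule comes first, an affiliation string that mentions both a university and a company is labelled ‘Academic’.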
Ranked contributors to ChEMBL
Creating balanced training/test sets. Affiliation: Pharma, Other, Academic. Keep 602 targets for which measured activities are available for all 3 affiliations. Same target, same pharmacophore, some me-too work: less series learning.
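Selecting the targets that have data from all three affiliation classes amounts to a group-by followed by a set comparison. A minimal sketch, assuming activities are available as (target, affiliation class) pairs; the function name and the toy target identifiers are hypothetical.

```python
from collections import defaultdict

CLASSES = {"Pharma", "Academic", "Other"}


def targets_with_all_classes(activities):
    """activities: iterable of (target_id, affiliation_class) pairs."""
    seen = defaultdict(set)
    for target, cls in activities:
        seen[target].add(cls)
    # Keep a target only if every affiliation class has measured it.
    return {target for target, classes in seen.items() if CLASSES <= classes}


# Toy activity list with invented target identifiers.
toy_activities = [
    ("T1", "Pharma"), ("T1", "Academic"), ("T1", "Other"),
    ("T2", "Pharma"), ("T2", "Academic"),
    ("T3", "Other"),
]
balanced_targets = targets_with_all_classes(toy_activities)
```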
Bayesian model based on <= 2005 data. Descriptors: ECFP_6 + Ro5 physical properties. Categorical model: Pharma/Academic/Other.
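Training on data up to 2005 and predicting later publications is a temporal split. A minimal sketch, assuming each record carries a publication `year` (a hypothetical key, as are the toy records):

```python
def temporal_split(records, last_training_year=2005):
    """Train on records published up to the cut-off year, test on later ones."""
    train = [r for r in records if r["year"] <= last_training_year]
    test = [r for r in records if r["year"] > last_training_year]
    return train, test


# Toy records with invented years and labels.
toy_records = [
    {"id": "a", "year": 1999, "published_by": "Pharma"},
    {"id": "b", "year": 2005, "published_by": "Academic"},
    {"id": "c", "year": 2008, "published_by": "Other"},
]
train_set, test_set = temporal_split(toy_records)
```

Unlike the random 80/20 split of lesson 1, this checks whether the model generalises to compounds published after it was built.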
Predicting affiliation post 2005 (panels: Academic, Pharma, Other). Academic/Pharma distinct.
What makes a compound ‘Pharma’: aromatic rings, aromatic rings, aromatic rings. IP? Absence of decorations means these are not distinctive. Number of times feature observed / how many times in academic / pharma.
What makes a compound ‘Academic’: aliphatic, single rings, bold usage of F and other decorations, etc. Maybe not nasty but not very druglike. Number of times feature observed / how many times in academic / pharma.
Most Pharma-like compounds For each target, compound with highest ‘Pharma’ score and true origin

ChEMBL UGM May 2011

  • 37. Most Academic-like compounds For each target, compound with highest ‘Academic’ score and true origin
  • 38. Conclusions. Set out to learn a nasty model, ended up with a (non)drug-like model. Pharma is ‘a bit’ underrepresented: 10% of MDDR is in ChEMBL (Dave Rogers). ChEMBL c/should include patent literature. Over the years (big) pharma has delivered the goods and learned what does (not) work in a structure. Some of this knowledge can be extracted from ChEMBL. Ignore this at your peril.