Modeling Chemical Datasets
with a focus on regression-based methods

dsdht.wikispaces.com
Aims

•  How does the dynamic range of the data being modeled impact the apparent performance of the model?
•  How does experimental error impact the apparent predictivity of a model?
•  How can we determine whether a model is applicable to a new dataset?
•  How should we compare the performance of regression models?

http://media.johnwiley.com.au/product_data/excerpt/00/11181391/1118139100-4.pdf
Example

Examine a number of datasets containing measured values for aqueous solubility, and use these datasets to build and evaluate predictive models.
Challenges in modeling solubility

The aqueous solubility of a compound can vary depending on a number of factors:
•  Temperature
•  Purity
•  Polymorph
Datasets under study

•  The Huuskonen dataset: 1274 experimental solubility values; one of the first large solubility datasets.
•  The JCIM dataset: 94 experimental solubility values (2008).
•  The PubChem dataset (AID1996): a randomly selected subset of 1000 measured solubility values, drawn from a set of 58,000 values that were experimentally determined using chemiluminescent nitrogen detection (CLND).
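Formula

LogS = log10((solubility in µg/ml) / (1000.0 × MW))

Here MW is the molecular weight in g/mol, so LogS is the base-10 logarithm of the molar solubility in mol/L.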
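As a minimal sketch of the conversion above, the following Python function applies the formula directly. The function and parameter names are illustrative and do not come from the slides; the sketch assumes the molecular weight is given in g/mol.

```python
import math


def log_s(solubility_ug_per_ml: float, mol_weight: float) -> float:
    """Return LogS, the base-10 log of the molar solubility (mol/L).

    Assumes solubility is in µg/ml and mol_weight is in g/mol
    (hypothetical parameter names, not taken from the slides).
    """
    if solubility_ug_per_ml <= 0 or mol_weight <= 0:
        raise ValueError("solubility and molecular weight must be positive")
    # µg/ml equals mg/L; dividing by 1000 gives g/L, and by MW gives mol/L.
    return math.log10(solubility_ug_per_ml / (1000.0 * mol_weight))
```

For example, a compound with a molecular weight of 180 g/mol and a measured solubility of 180 µg/ml has a molar solubility of 0.001 mol/L, so `log_s(180.0, 180.0)` is approximately −3.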
  
 	
  	
  	
  	
  	
Solubility Comparison

[Figure: a boxplot comparison of LogS for the three datasets]
Requirements for a predictive model

•  Reliable experimental data
•  Sets of molecular descriptors
•  Statistical or machine-learning methods
Types of Models

Classification model:

•  Choosing cutoff points introduces "edge effects". Consider a case where we have a two-class system with a cutoff of 100 μM. A value of 99 μM will be considered insoluble, while a value of 101 μM will be considered soluble, even though the two measurements are nearly identical.
•  Another difficulty with classification models is that they provide limited direction for improving the properties of a compound.
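The edge effect can be illustrated with a small sketch. The `classify` helper, and the choice to treat a value exactly at the cutoff as soluble, are assumptions made for illustration; the slides only state the 100 μM cutoff.

```python
import math


def classify(solubility_um: float, cutoff_um: float = 100.0) -> str:
    """Two-class label; values at or above the cutoff count as soluble (assumed)."""
    return "soluble" if solubility_um >= cutoff_um else "insoluble"


# Two measurements that straddle the cutoff receive different labels...
label_low = classify(99.0)
label_high = classify(101.0)

# ...even though they differ by well under 0.01 log units.
log_gap = math.log10(101.0 / 99.0)

print(label_low, label_high, round(log_gap, 4))
```

A gap this small is far below the 0.3 log unit experimental error used as an example elsewhere in these slides, so the two class labels do not reflect a meaningful difference between the compounds.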
  
	
  
Types of Models

Regression model:

•  It is difficult to create a regression model from data with a limited dynamic range.
•  A limited dynamic range leads to an unreliable model.
Evaluating a predictive model

•  Pearson’s r: the correlation is commonly reported as Pearson’s r or its square, r^2. Values of r can vary between −1 and 1.
•  Kendall’s tau: a drawback of Pearson’s r is that it is sensitive to outliers and to the distribution of the underlying data. Kendall’s tau instead uses the rank order of the values.
•  RMSD: for N paired values X and Y, the root-mean-square deviation is RMSD = sqrt((1/N) × Σ (X_i − Y_i)^2).
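The three measures can be sketched in plain Python as follows. This is an illustrative implementation rather than the one used for the slides: the function names are hypothetical, the Kendall variant shown is tau-a (tied pairs count as neither concordant nor discordant), and the example values are invented. In practice a statistics library such as SciPy would typically be used instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if product > 0:
                score += 1  # pair ranked in the same order
            elif product < 0:
                score -= 1  # pair ranked in the opposite order
    return score / (n * (n - 1) / 2)


def rmsd(xs, ys):
    """Root-mean-square deviation between paired values."""
    n = len(xs)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / n)


# Hypothetical values: the last prediction is a gross outlier.
experimental = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 2.0, 3.0, 4.0, 50.0]

print(pearson_r(experimental, predicted))
print(kendall_tau(experimental, predicted))
print(rmsd(experimental, predicted))
```

In this invented example the single outlier pulls Pearson’s r well below 1, while Kendall’s tau stays at 1 because the rank order is unchanged; the RMSD shows the size of the error in the units of the data.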
  
Steps involved in building a predictive model

•  Integrate the experimental data and molecular descriptors
•  Divide the data into training and test sets
•  Build a model from the training set
•  Use this model to predict the test set
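The four steps above can be sketched end to end in a few lines. This is only an outline under stated assumptions: the compound IDs and values are synthetic and noise-free, a single-descriptor least-squares line stands in for whatever modeling method is actually used, and all function names are hypothetical.

```python
import random

# Step 1: integrate experimental data and descriptors by compound ID.
# Both dictionaries are synthetic, noise-free placeholders (LogS = 0.5 - x).
descriptors = {f"cmpd{i}": i * 0.1 for i in range(40)}
experimental = {f"cmpd{i}": 0.5 - i * 0.1 for i in range(40)}
rows = [(descriptors[cid], experimental[cid])
        for cid in experimental if cid in descriptors]


def train_test_split(data, test_fraction=0.25, seed=0):
    """Step 2: shuffle a copy of the data and split it into train/test."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Step 3: least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


train, test = train_test_split(rows)
slope, intercept = fit_line(train)

# Step 4: predict the held-out test set and compare with experiment.
predictions = [slope * x + intercept for x, _ in test]
max_abs_error = max(abs(p - y) for p, (_, y) in zip(predictions, test))
print(len(train), len(test), max_abs_error)
```

Keeping the test compounds out of the fitting step is the point of the split: the error on the test set is then an estimate of how the model behaves on compounds it has not seen.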
  
Random forest model

The dynamic range in a dataset can have a large impact on the apparent correlation between experimental and predicted activity.
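One way to see why dynamic range matters is a short worked calculation. It rests on an assumption not stated in the slides: each predicted value equals the experimental value plus an independent, zero-mean error. The symbols $\sigma_y$ (standard deviation of the experimental values) and $\sigma_\varepsilon$ (standard deviation of the error) are introduced here for illustration, and the numbers are hypothetical.

```latex
% Squared correlation between y and y + \varepsilon under the assumption above
\[
r^2 = \frac{\sigma_y^2}{\sigma_y^2 + \sigma_\varepsilon^2}
\]

% Wide dynamic range: \sigma_y = 2.0, \sigma_\varepsilon = 0.5
\[
r^2 = \frac{4.0}{4.0 + 0.25} \approx 0.94
\]

% Narrow dynamic range: \sigma_y = 0.5, \sigma_\varepsilon = 0.5
\[
r^2 = \frac{0.25}{0.25 + 0.25} = 0.50
\]
```

The error is identical in both cases; only the spread of the data changes, yet the apparent correlation drops sharply for the narrow-range dataset.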
  
Experimental Error and Model Performance

•  Every experimental data point has an error associated with it. If we measure the LogS of a compound as −6 and that data point has an error of 0.3 log units, the actual value could be anywhere between −6.3 and −5.7.
•  Brown examined the relationship between experimental error and model performance.
•  Gaussian-distributed random values were added to the data to simulate experimental errors.
•  The correlation between the measured values and the same values with simulated error was then measured.
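The simulation described above can be sketched as follows. This is an illustration of the idea rather than a reproduction of the published procedure: the LogS values are synthetic, and the number of trials, the random seeds, and the function names are assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_with_simulated_error(values, error_sd, n_trials=50, seed=0):
    """Mean Pearson r between values and the same values plus Gaussian noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        noisy = [v + rng.gauss(0.0, error_sd) for v in values]
        total += pearson_r(values, noisy)
    return total / n_trials


# Hypothetical LogS values; a real analysis would use the measured data.
data_rng = random.Random(42)
log_s_values = [data_rng.gauss(-3.0, 2.0) for _ in range(500)]

results = {sd: correlation_with_simulated_error(log_s_values, sd)
           for sd in (0.3, 0.5, 1.0)}
for sd, r in results.items():
    print(f"error {sd} log units -> mean r {r:.3f}")
```

Because the noisy values are compared with the very data they were derived from, the resulting correlation is an upper bound: a model trained on data with that level of error cannot be expected to correlate better with experiment than the experiment correlates with itself.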
  
Experimental Error and Model Performance

•  The table shows the maximum possible correlation for each of the three solubility datasets we have been examining when experimental errors of 0.3, 0.5, and 1.0 log units are considered.
•  The impact of experimental error is greater for a dataset like PubChem.
Model Applicability

•  Models often perform poorly on molecules that bear little resemblance to those in the training set.
Similarity of Each Test Set

Dataset          Mean   Median
Huuskonen_Test   0.76   0.78
JCIM             0.74   0.62
Pubchem          0.56   0.56
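Summary figures like those in the similarity table could be produced along the following lines. The slides do not say how similarity was measured, so this sketch assumes a common choice: the Tanimoto coefficient between fingerprints, taking for each test compound its most similar training compound. The fingerprints here are invented sets of "on" bit indices, and the function names are hypothetical.

```python
from statistics import mean, median


def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = fp_a | fp_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(union)


def nearest_training_similarity(fp, training_fps):
    """Similarity of one compound to its most similar training compound."""
    return max(tanimoto(fp, train_fp) for train_fp in training_fps)


# Hypothetical fingerprints; real ones would come from a cheminformatics toolkit.
training_fps = [{1, 2, 3, 4}, {2, 3, 5, 8}, {1, 4, 9}]
test_fps = [{1, 2, 3}, {5, 8, 13}, {20, 21}]

similarities = [nearest_training_similarity(fp, training_fps)
                for fp in test_fps]
print(mean(similarities), median(similarities))
```

A test set whose mean and median nearest-neighbor similarity are low contains many compounds unlike anything the model was trained on, which is one signal that the model may not be applicable to it.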
  
Dataset          R2     Kendall   RMS Error
Huuskonen_Test   0.92   0.82      0.58
JCIM             0.58   0.59      0.83
Pubchem          0.11   0.22      1.12
Comparing Predictive Models

•  When comparing correlation coefficients, we must consider not only the value of each correlation coefficient but also the confidence interval around it.
•  If the confidence intervals of two correlations overlap, we cannot claim that one predictive model is superior to another.
•  For a subset of 25 compounds, the confidence intervals overlap, so we cannot say that one correlation is superior to the other.
•  For a subset of 50 compounds, there is a very small difference between the upper bound of the 95% confidence interval.
•  For a subset of 100 compounds, there is clear separation between the confidence intervals, which implies a clear separation between the correlation coefficients.
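The effect of sample size on the confidence interval can be sketched as below. The slides do not state how the intervals were computed; the Fisher z-transformation used here is one standard approach and is an assumption, as are the two correlation values (0.75 and 0.90) and the function names.

```python
import math


def pearson_confidence_interval(r, n, z_crit=1.959964):
    """Approximate CI for Pearson's r via the Fisher z-transformation.

    z_crit = 1.959964 is the two-sided 95% quantile of the normal distribution.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical correlations for two models on the same number of compounds.
r_model_a, r_model_b = 0.75, 0.90
for n in (25, 50, 100):
    ci_a = pearson_confidence_interval(r_model_a, n)
    ci_b = pearson_confidence_interval(r_model_b, n)
    print(n, ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

The interval width shrinks roughly with the square root of the number of compounds, so the same pair of correlation coefficients can be indistinguishable on a small test set and clearly separated on a larger one.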
  
	
  
	
  
References

•  http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118139100.html
•  https://github.com/PatWalters/cheminformaticsbook
  
	
  

Modeling Chemical Datasets

  • 1. Modeling  Chemical  datasets     with  a  focus  on  regression  based  methods   dsdht.wikispaces.com  
  • 2. Aims •  How does the dynamic range of the data being modeled impact the apparent performance of the model? " •  How does experimental error impact the apparent predictivity of a model? " •  How can we determine whether a model is applicable to a new dataset?" •  How should we compare the performance of regression models? "                                                                                                                                                                                                                                h0p://media.johnwiley.com.au/product_data/excerpt/00/11181391/1118139100-­‐4.pdf  
  • 3. Example: Examine a number of datasets containing measured values for aqueous solubility, and use these datasets to build and evaluate predictive models.
  • 4. Challenges in modeling solubility
      The aqueous solubility of a compound can vary depending on a number of factors:
      •  Temperature
      •  Purity
      •  Polymorph
  • 5. Datasets under study
      •  The Huuskonen Dataset: 1274 experimental solubility values; one of the first large solubility datasets.
      •  The JCIM Dataset: 94 experimental solubility values (2008).
      •  The PubChem Dataset (AID1996): a randomly selected subset of 1000 measured solubility values, drawn from a set of 58,000 values that were experimentally determined using chemiluminescent nitrogen detection (CLND).
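  • 6. Formula: LogS = log10((solubility in µg/mL) / (1000.0 × MW)), where MW is the molecular weight in g/mol, so LogS is the log10 of the molar solubility (mol/L).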
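The conversion on this slide can be sketched as a small Python helper. The function name `log_s` and the example compound are illustrative assumptions, not taken from the deck.

```python
import math


def log_s(solubility_ug_per_ml: float, mol_weight: float) -> float:
    """Convert a solubility in µg/mL to LogS (log10 of mol/L).

    `mol_weight` is the molecular weight in g/mol.
    """
    if solubility_ug_per_ml <= 0 or mol_weight <= 0:
        raise ValueError("solubility and molecular weight must be positive")
    # µg/mL is the same as mg/L; dividing by 1000 gives g/L, and
    # dividing by MW (g/mol) gives mol/L.
    molar_solubility = solubility_ug_per_ml / (1000.0 * mol_weight)
    return math.log10(molar_solubility)


# Hypothetical compound: 50 µg/mL with MW = 250 g/mol.
print(round(log_s(50.0, 250.0), 2))  # -3.7
```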
  • 7. Solubility Comparison [figure: boxplot comparison of Log S for the three datasets]
  • 8. Requirements for a predictive model
      •  Reliable experimental data
      •  Sets of molecular descriptors
      •  Statistical or machine-learning methods
  • 9. Types of Models: Classification Model
      •  Choosing cutoff points introduces “edge effects”. Consider a two-class system with a cutoff of 100 μM: a value of 99 μM will be considered insoluble while a value of 101 μM will be considered soluble, even though the two values differ by only 2 μM.
      •  Another difficulty with classification models is that they provide limited direction for improving the properties of a compound.
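The edge effect can be made concrete with a minimal two-class sketch. The function name `classify` is hypothetical, and treating a value exactly at the cutoff as soluble is an assumption the slide does not specify.

```python
def classify(solubility_um: float, cutoff_um: float = 100.0) -> str:
    """Two-class label from a solubility in µM.

    Assumption: values at or above the cutoff count as soluble.
    """
    return "soluble" if solubility_um >= cutoff_um else "insoluble"


# Two nearly identical measurements land in different classes.
for value in (99.0, 101.0):
    print(value, classify(value))
```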
  • 10. Types of Models: Regression Model
      •  It is difficult to create a regression model from data with a limited dynamic range.
      •  A limited dynamic range tends to produce an unreliable model.
  • 11. Evaluating a predictive model
      •  Pearson’s r: the Pearson correlation coefficient, commonly reported as r or as its square, r^2. Values of r can vary between −1 and 1.
      •  Kendall’s tau: a drawback of Pearson’s r is that it is sensitive to outliers and to the distribution of the underlying data. Kendall’s tau instead uses the rank order of the values.
      •  RMSD: for N paired values X and Y, the root-mean-square deviation is RMSD = sqrt((1/N) Σ (X_i − Y_i)^2).
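The three metrics on this slide can be sketched in plain Python as follows. The function names and the five-point example are illustrative assumptions; `kendall_tau` implements the simple tau-a variant, which does not correct for ties, and the deck does not say which variant it used.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


def rmsd(x, y):
    """Root-mean-square deviation between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up experimental and predicted LogS values, for illustration only.
experimental = [-6.0, -4.5, -3.2, -2.1, -1.0]
predicted = [-5.4, -4.9, -3.0, -2.6, -0.7]
print(pearson_r(experimental, predicted))
print(kendall_tau(experimental, predicted))
print(rmsd(experimental, predicted))
```

Because the predicted values here preserve the rank order of the experimental ones, Kendall's tau is 1 even though the RMSD is not zero, which is one reason to report more than one metric.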
  • 12. Steps involved in building a predictive model
      •  Integrate the experimental data and molecular descriptors
      •  Divide the data into training and test sets
      •  Build a model from the training set
      •  Use this model to predict the test set
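The four steps above can be sketched end to end in plain Python. Everything in this example is an assumption made for illustration: the data are synthetic, each compound has a single made-up descriptor, and a one-variable least-squares line stands in for the model (the deck itself uses a random forest on real descriptors).

```python
import math
import random

rng = random.Random(0)

# Step 1: integrate experimental data and descriptors.
# Both dictionaries are synthetic and keyed by a made-up compound ID.
descriptor = {f"cmpd{i}": rng.uniform(0.0, 6.0) for i in range(200)}
log_s = {cid: -1.0 * x - 0.5 + rng.gauss(0.0, 0.5)
         for cid, x in descriptor.items()}
rows = [(descriptor[cid], log_s[cid]) for cid in descriptor if cid in log_s]

# Step 2: divide the data into training and test sets (80/20 split).
rng.shuffle(rows)
split = int(0.8 * len(rows))
train, test = rows[:split], rows[split:]


# Step 3: build a model from the training set only.
def fit_line(pairs):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)

# Step 4: use the model to predict the held-out test set.
predicted = [slope * x + intercept for x, _ in test]
observed = [y for _, y in test]
test_rmse = math.sqrt(
    sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(test)
)
print(f"slope={slope:.2f} intercept={intercept:.2f} test RMSE={test_rmse:.2f}")
```

Keeping the test set out of the fitting step is the point of the split: the test RMSE then estimates performance on compounds the model has not seen.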
  • 13. Random forest model: The dynamic range in a dataset can have a large impact on the apparent correlation between experimental and predicted activity.
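A small simulation illustrates the dynamic-range effect. It is a sketch under stated assumptions: the "experimental" values are drawn uniformly from a chosen LogS range, and the "model" is simulated by adding Gaussian error with the same standard deviation in both cases, so only the range differs. The function names and parameter values are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def apparent_r(low, high, model_error_sd, n=1000, seed=1):
    """Correlation between synthetic values and error-perturbed 'predictions'."""
    rng = random.Random(seed)
    experimental = [rng.uniform(low, high) for _ in range(n)]
    predicted = [v + rng.gauss(0.0, model_error_sd) for v in experimental]
    return pearson_r(experimental, predicted)


# Same simulated model error (0.6 log units), different dynamic range.
wide = apparent_r(-10.0, 0.0, 0.6)   # 10 log units of range
narrow = apparent_r(-5.0, -4.0, 0.6)  # 1 log unit of range
print(f"wide range r = {wide:.2f}, narrow range r = {narrow:.2f}")
```

The prediction error is identical in both runs, yet the correlation is much lower for the narrow range, so a correlation coefficient cannot be interpreted without knowing the spread of the data.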
  • 14. Experimental Error and Model Performance
      •  Every experimental data point has an error associated with it. If we measure the Log S of a compound as −6 and that data point has an error of 0.3 log units, the actual value could be anywhere between −6.3 and −5.7.
      •  Brown examined the relationship between experimental error and model performance.
      •  Gaussian-distributed random values were added to the data to simulate experimental errors.
      •  The correlation between the measured values and the same values with simulated error was then measured.
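The simulation described on this slide can be sketched as follows. This is an illustration under assumptions, not the procedure from the cited work: the LogS values are synthetic (normally distributed), the number of trials is arbitrary, and the function name `max_correlation` is hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_correlation(values, error_sd, trials=50, seed=0):
    """Mean correlation between values and copies with Gaussian error added."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [v + rng.gauss(0.0, error_sd) for v in values]
        total += pearson_r(values, noisy)
    return total / trials


# Synthetic stand-in for a set of measured LogS values.
data_rng = random.Random(42)
log_s_values = [data_rng.gauss(-3.0, 2.0) for _ in range(500)]

ceilings = {sd: max_correlation(log_s_values, sd) for sd in (0.3, 0.5, 1.0)}
for sd, r in ceilings.items():
    print(f"error sd {sd:.1f}: mean r = {r:.3f}, r^2 = {r * r:.3f}")
```

The resulting correlation acts as a ceiling: a model trained on data with that much experimental error cannot be expected to correlate with the measurements better than the measurements correlate with themselves.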
  • 15. Experimental Error and Model Performance
      •  The table shows the maximum possible correlation for each of the three solubility datasets we have been examining when experimental errors of 0.3, 0.5, and 1.0 log units are considered.
      •  The impact of experimental error is larger for a dataset like PubChem.
  • 16. Model Applicability
      •  Models often perform poorly on molecules that bear little resemblance to those in the training set.

      Similarity of Each Test Set
      Dataset          Mean   Median
      Huuskonen_Test   0.76   0.78
      JCIM             0.74   0.62
      Pubchem          0.56   0.56

      Dataset          R2     Kendall   RMS Error
      Huuskonen_Test   0.92   0.82      0.58
      JCIM             0.58   0.59      0.83
      Pubchem          0.11   0.22      1.12
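One way to quantify resemblance to the training set is sketched below. The slide does not say how similarity was computed, so the choices here are assumptions: each molecule is represented as a set of fingerprint feature indices (made-up values), similarity is the Tanimoto coefficient, and each test molecule is scored by its most similar training molecule.

```python
from statistics import mean, median


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared features / total distinct features."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_training_similarity(test_fps, train_fps):
    """For each test fingerprint, similarity to its closest training one."""
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]


# Hypothetical fingerprints, written as sets of "on" bit indices.
train = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5, 8}), frozenset({1, 4, 6, 7})]
test = [frozenset({1, 2, 3, 9}), frozenset({10, 11, 12}), frozenset({2, 3, 5, 8})]

sims = nearest_training_similarity(test, train)
print("per-molecule:", sims)
print("mean:", round(mean(sims), 2), "median:", round(median(sims), 2))
```

A low mean or median for a new dataset suggests that many of its molecules fall outside the chemistry the model was trained on, so its predictions there deserve less trust.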
  • 17. Comparing Predictive Models
      •  When comparing correlation coefficients, we must consider not only the value of each coefficient but also the confidence interval around it.
      •  If the confidence intervals of two correlations overlap, we cannot claim that one predictive model is superior to another.
      •  For a subset of 25 compounds, the confidence intervals overlap, so we cannot say that one correlation is superior to the other.
      •  For a subset of 50 compounds, there is only a very small difference between the upper bound of one 95% confidence interval and the lower bound of the other.
      •  For a subset of 100 compounds, there is clear separation between the confidence intervals, which implies a clear separation between the correlation coefficients.
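A confidence interval for a correlation coefficient can be sketched with the Fisher z-transformation. This is one common approach and is assumed here; the deck does not state which method it used. The correlation values 0.75 and 0.90 are hypothetical and are not the values behind the slide.

```python
import math


def pearson_ci(r: float, n: int, z_crit: float = 1.96):
    """Approximate 95% confidence interval for Pearson's r (Fisher z)."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                      # Fisher transform of r
    half_width = z_crit / math.sqrt(n - 3)  # z_crit times the standard error
    return math.tanh(z - half_width), math.tanh(z + half_width)


def intervals_overlap(a, b) -> bool:
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Two hypothetical models compared on subsets of increasing size.
for n in (25, 50, 100):
    ci_a = pearson_ci(0.75, n)
    ci_b = pearson_ci(0.90, n)
    print(n,
          tuple(round(v, 2) for v in ci_a),
          tuple(round(v, 2) for v in ci_b),
          "overlap" if intervals_overlap(ci_a, ci_b) else "separated")
```

The intervals narrow as the number of compounds grows, so the same pair of correlations can be indistinguishable on a small subset and clearly separated on a larger one.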
  • 18. References
      •  http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118139100.html
      •  https://github.com/PatWalters/cheminformaticsbook