Introduction
          Methodologies
              Summary




Making the Most of Predictive Models

               Rajarshi Guha

             School of Informatics
              Indiana University


          28th February, 2007
           CUP 8, Santa Fe


Outline



  1   Introduction


  2   Methodologies
       Linear Methods
       Nonlinear Methods
       Feature Selection


  3   Summary


All Models Predict Something . . .




    Physics based models
        Docking
        Force fields
        MD
    Statistical/ML models
        Indirect description of the physical situation
        Not always clear as to how a prediction is made


The Scope of Predictive Models


  Predictive models can be used for
      filtering
      analysis
  We can use predictive models in
      chemometrics
      bioinformatics
      QSAR
      ...
  What do we look for in a model?
      Validity
      Accuracy
      Applicability
      Interpretability


Why Interpret a Model?




   A model encodes relationships between features and the property
   Understanding these relationships allows us to
       understand why a molecule is active or not
       suggest structural modifications
       explain anomalous observations


How Much Detail Can We Extract?




   We can look at a model very broadly
       Which descriptors are important to its predictive ability?
   We can consider a more detailed analysis
       What is the effect of descriptor $X_i$ on $\hat{Y}$?
       Which observations highlight this relationship?
   Depends on how much effort you want to put in


The Accuracy - Interpretability Tradeoff




  OLS models are generally easier to
  interpret but not always accurate
  Neural networks give better
  accuracy, but are black boxes
  Some lie in between, such as
  random forests


Model Types That We Can Interpret

     Depends on the
             Level of interpretation desired
             Nature of the problem (classification and regression)
     Some models are interpretable by design
             Decision trees
             Bayesian networks
     Others require an interpretation protocol
             Linear regression
             Random forests
             Neural networks
             Support vector machines


  Guha, R.; Jurs, P.C., J. Chem. Inf. Comput. Sci., 2004, 44, 2179–2189
  Guha, R. et al., J. Chem. Inf. Model., 2005, 46, 321–333
  Do, T.N.; Poulet, F., Enhancing SVM with Visualization in Discovery Science, Springer, 2004, pp. 183–194


Linear Regression Interpretations


            $\mathrm{p}K_a = -37.54\,Q_{\sigma,o} + 12.27\,A_{\mathrm{access},o} + 0.11\,\chi_{\pi,\alpha c} - 1.02\,\alpha_o - 1.89\,I_{\mathrm{amino}} + 19.10$

     The magnitudes of the coefficients indicate which descriptors
     play an important role
     The signs of the coefficients indicate the direction of each
     descriptor’s effect on the predicted property
     Interpretation is still quite broad
             We’d like to see more detailed SARs applied to individual
             molecules



  Zhang, J. et al., J. Chem. Inf. Model., 2006, 46, 2256–2266
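Coefficient magnitudes are only comparable when the descriptors share a scale. The following is a minimal sketch, assuming synthetic data and a hypothetical helper `standardized_coefficients` (neither comes from the talk): it z-scores each descriptor before the OLS fit, so that magnitudes can be ranked and signs read off directly.

```python
import numpy as np

def standardized_coefficients(X, y):
    """Fit OLS on z-scored descriptors and return (coefficients, intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # put every descriptor on unit scale
    A = np.column_stack([Z, np.ones(len(y))])     # last column models the intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[:-1], beta[-1]

# Synthetic, hypothetical data: three descriptors on very different scales.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.normal(0.0, 0.01, n),    # small-scale descriptor, strong effect
    rng.normal(0.0, 100.0, n),   # large-scale descriptor, weak effect
    rng.normal(0.0, 1.0, n),     # unit-scale descriptor, negative effect
])
y = 300.0 * X[:, 0] + 0.005 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(0.0, 0.1, n)

coefs, intercept = standardized_coefficients(X, y)
ranking = np.argsort(-np.abs(coefs))   # most influential descriptor first
signs = np.sign(coefs)                 # direction of each descriptor's effect
```

On the raw scale the first descriptor has the largest coefficient only because its values are tiny; after standardization the ranking reflects the effect of a typical change in each descriptor.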


Linear Regression Interpretations via PLS


     PLS overview
             Creates a model with latent variables
             Latent variables (components) are linear combinations of the
             original variables (X’s)
             Each latent variable is used to predict a pseudo dependent
             variable (Y’s)
     Interpretation
             The linear model is subjected to PLS analysis
             This also validates the model
             Choose the number of components to use
             Interpretation uses the X-weights, X-scores & Y-scores




  Stanton, D.T., J. Chem. Inf. Comput. Sci., 2003, 43, 1423–1433
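The PLS steps above can be sketched for a single response (PLS1) using the NIPALS iteration. This is an illustration under stated assumptions, not the cited protocol: the function names are hypothetical, the data are synthetic, and Q² is taken to be 1 − PRESS/SS_total under leave-one-out cross-validation, which the slides do not specify.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 via NIPALS; returns (x_mean, y_mean, coefficients, X-weights)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean              # residual (deflated) blocks
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)                # X-weights of this component
        t = Xr @ w                               # X-scores
        tt = t @ t
        p = Xr.T @ t / tt                        # X-loadings
        c = (yr @ t) / tt                        # regression of y on the score
        Xr = Xr - np.outer(t, p)                 # deflate X
        yr = yr - c * t                          # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)       # coefficients in original X space
    return x_mean, y_mean, coef, W

def pls1_predict(model, X):
    x_mean, y_mean, coef, _ = model
    return (X - x_mean) @ coef + y_mean

def r2_and_q2(X, y, n_components):
    """Training R^2 and leave-one-out Q^2 for a given number of components."""
    ss_tot = np.sum((y - y.mean()) ** 2)
    fitted = pls1_predict(pls1_fit(X, y, n_components), X)
    r2 = 1.0 - np.sum((y - fitted) ** 2) / ss_tot
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i            # leave observation i out
        model = pls1_fit(X[keep], y[keep], n_components)
        press += (y[i] - pls1_predict(model, X[i:i + 1])[0]) ** 2
    return r2, 1.0 - press / ss_tot

# Synthetic, hypothetical data with two correlated descriptors.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
X[:, 2] += 0.8 * X[:, 0]
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(0.0, 0.3, 40)
stats = [r2_and_q2(X, y, a) for a in (1, 2, 3)]
```

Each entry of `stats` is an (R², Q²) pair for one, two and three components; a component count is usually kept only while Q² still improves. The fourth element returned by `pls1_fit` holds the X-weights used for interpretation.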


Linear Regression Interpretations via PLS


 Choosing components
       Q² allows us to choose how many components
       For a valid model cumulative variance should be 1

       Component   X variance   R²     Q²
       C1          0.51         0.52   0.45
       C2          0.78         0.60   0.56
       C3          1.00         0.61   0.56

 Descriptor weights
       Descriptors are ranked by their weights
       Sign of weight indicates how the descriptor correlates to
       predicted activity

       Descriptor   C1      C2      C3
       MDEN-23      -0.16    0.93   0.30
       RNHS-3        0.55   -0.17   0.81
       SURR-5       -0.82   -0.29   0.48



  Guha, R.; Jurs, P.C., J. Chem. Inf. Comput. Sci., 2004, 44, 2179–2189
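Ranking by weight can be made concrete with the values from the descriptor-weight table on this slide. The helper `rank_descriptors` is a hypothetical name introduced for illustration; only the numbers come from the slide.

```python
# X-weights copied from the table on this slide (rows: descriptors, columns: C1..C3).
weights = {
    "MDEN-23": (-0.16, 0.93, 0.30),
    "RNHS-3": (0.55, -0.17, 0.81),
    "SURR-5": (-0.82, -0.29, 0.48),
}

def rank_descriptors(weights, component):
    """Return (name, weight) pairs for one component, largest |weight| first."""
    column = [(name, w[component]) for name, w in weights.items()]
    return sorted(column, key=lambda item: abs(item[1]), reverse=True)

for k in range(3):
    name, w = rank_descriptors(weights, k)[0]
    direction = "positively" if w > 0 else "negatively"
    print(f"C{k + 1}: {name} (weight {w:+.2f}) is most weighted, {direction} correlated")
```

Sorting on the absolute value keeps strongly negative weights at the top of the ranking, while the sign is retained for reading the direction of the correlation.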


Linear Regression Interpretations via PLS



  Component 1
      SURR-5 is most weighted
      Low values of SURR-5 ⇒ high values of predicted activity
  Interpretation
      Active compounds have high absolute values of SURR-5
      Indicates large hydrophobic surface area
      Consistent with cell based assay which depends on cell
      membrane transport

      Component   MDEN-23   RNHS-3   SURR-5
      C1          -0.16     0.55     -0.82


Random Forest Interpretations



  A RF provides a measure of
  descriptor importance
          Utilizes the whole descriptor
          pool and ranks the descriptors
          Based on randomization
  We can also get more information
  on individual trees
          Find the most important trees
          Consider a tree-space and find
          clusters of trees




  Chipman, H.A. et al., Making Sense of a Forest of Trees in Proc. 30th Symposium on the Interface, Weisberg, S. Ed
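A randomization-based importance measure is commonly realized as permutation importance: shuffle one descriptor, re-predict, and record how much the error grows. The sketch below is model-agnostic and rests on assumptions: a random forest would normally score out-of-bag samples, whereas here an OLS fit on synthetic data stands in for the model so the example needs only NumPy. All names are hypothetical.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each descriptor column is shuffled."""
    rng = np.random.default_rng(seed)
    base = np.mean((y - predict(X)) ** 2)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(X[:, j])   # break the link between column j and y
            importance[j] += np.mean((y - predict(Xp)) ** 2) - base
    return importance / n_repeats

# Synthetic, hypothetical data: column 0 matters most, column 2 not at all.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.1, 200)

# Stand-in model (assumption): ordinary least squares instead of a random forest.
A = np.column_stack([X, np.ones(len(y))])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
predict = lambda M: np.column_stack([M, np.ones(len(M))]) @ beta

imp = permutation_importance(predict, X, y)
order = np.argsort(-imp)   # descriptors from most to least important
```

Because the function only needs a `predict` callable, the same loop can be pointed at any fitted model, including a forest.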


Neural Network Interpretations


  Effective Weight Matrix

                     Hidden Neuron
  Descriptor         1         2
  Desc 1          52.41     29.30
  Desc 2          37.65     22.14
  Desc 3         −10.50    −16.85




  Linearizes the network and consequently loses some details of the
  encoded SARs


  Guha, R. et al., J. Chem. Inf. Model., 2005, 46, 321–333
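One way to build such a matrix for a network with a single hidden layer and a single output is sketched below. It assumes that the effective weight of descriptor i through hidden neuron j is the product of the input-to-hidden weight and that neuron's hidden-to-output weight; the slide does not spell out the formula, and the weights shown are hypothetical, not those of the talk.

```python
import numpy as np

def effective_weights(w_input_hidden, w_hidden_output):
    """Scale each input->hidden weight by its hidden neuron's output weight."""
    w_ih = np.asarray(w_input_hidden, dtype=float)    # shape (descriptors, hidden)
    w_ho = np.asarray(w_hidden_output, dtype=float)   # shape (hidden,)
    return w_ih * w_ho                                # broadcasts over the hidden axis

# Hypothetical weights for 3 descriptors and 2 hidden neurons (not from the talk).
w_ih = [[2.0, -1.0],
        [0.5, 3.0],
        [-1.5, 0.0]]
w_ho = [1.5, -2.0]
E = effective_weights(w_ih, w_ho)   # rows: descriptors, columns: hidden neurons
```

The product ignores the nonlinear transfer function of each neuron, which is why this view is a linearization and drops part of the encoded relationship.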


Neural Network Interpretations
    The most weighted descriptors are very similar to those in the
    OLS model
    The signs of the effective weights match those from the OLS
    model as well as chemical reasoning

                    Hidden Neuron
  Descriptor      1       3       2       4
  PNSA-3       -1.80   -6.57    0.39   -1.43
  RSHM-1        4.03    6.15    1.50    1.01
  V4P-5         9.45    2.15    3.24    0.60
  S4PC-12       3.36    2.73    1.99    0.56
  MW            3.94    8.42    1.94    0.76
  WTPT-2        1.71    2.61    1.17   -0.13
  DPHS-1        0.66    0.44    0.33    1.65
  SCV           0.52    0.33    0.13    0.01

  Slide annotations on the table: "Size effects, higher MW" and
  "H-bonding, HSA, polar surface area"


Stepping Back . . .




    We’ve been focusing on single model types
    Different models for different purposes
    Descriptors are optimal for a specific model
    Are we sure that the different models encode the same SARs?


Getting the Best of Both Worlds




    Why not force multiple models to have the same descriptors?
         The descriptor set will not be optimal for either model
         Degradation in accuracy
    Is it really that bad?


Eating Our Cake . . .

     Ensemble feature selection
             Select a descriptor subset that is simultaneously optimal for two
             different model types
     Allows us to build an OLS model and a CNN (computational neural
     network) model using the same set of descriptors
     We use a genetic algorithm, where the objective function is of
     the form
                         $\mathrm{RMSE}_{\mathrm{OLS}} + \mathrm{RMSE}_{\mathrm{CNN}}$


  We have one model for interpretability and one for accuracy, but
  they should now incorporate the same SARs


  Dutta, D. et al., J. Chem. Inf. Model., submitted
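The summed-error objective can be sketched as follows. Two substitutions are made to keep the example short and should be read as assumptions: a k-nearest-neighbour regressor stands in for the neural network, and an exhaustive search over two-descriptor subsets stands in for the genetic algorithm. The data are synthetic and every function name is hypothetical.

```python
import itertools
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def ols_rmse(Xtr, ytr, Xte, yte):
    """Held-out RMSE of an ordinary least squares fit with intercept."""
    A = np.column_stack([Xtr, np.ones(len(ytr))])
    beta, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return rmse(yte, np.column_stack([Xte, np.ones(len(yte))]) @ beta)

def knn_rmse(Xtr, ytr, Xte, yte, k=3):
    # Stand-in nonlinear model (assumption): k-nearest-neighbour regression, not a CNN.
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return rmse(yte, ytr[nearest].mean(axis=1))

def ensemble_objective(cols, Xtr, ytr, Xte, yte):
    """Summed error of both model types on one descriptor subset."""
    cols = list(cols)
    return (ols_rmse(Xtr[:, cols], ytr, Xte[:, cols], yte)
            + knn_rmse(Xtr[:, cols], ytr, Xte[:, cols], yte))

# Synthetic, hypothetical data: only columns 0 and 1 carry signal.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.05, 200)
Xtr, Xte, ytr, yte = X[:100], X[100:], y[:100], y[100:]

# Exhaustive search over 2-descriptor subsets stands in for the genetic algorithm.
best = min(itertools.combinations(range(4), 2),
           key=lambda c: ensemble_objective(c, Xtr, ytr, Xte, yte))
```

Because both errors enter the same sum, a subset that suits only one model type is penalized by the other, which is what pushes the search towards descriptors both models can use.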


. . . and Having it Too


  [Figure: bar charts comparing Individual and Ensemble feature selection
  modes for the CNN and OLS models. Left panel: PDGFR Inhibitors, y-axis
  RMSE (0.0 to 0.8). Right panel: HIV Integrase, y-axis % Misclassification
  (0 to 40).]


Selecting for Interpretability




    Until now we’ve considered the models themselves
    But it’s the descriptors we interpret
    Can we build models out of interpretable descriptors?
        Design descriptors so that they have physical meaning
        Exclude uninterpretable descriptors from the pool
        Add semantic annotation to descriptors and modify feature
        selection algorithms to take this into account
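The last two options can be sketched with a small annotated descriptor pool. Everything here is hypothetical, including the descriptor names, the annotations and the penalty value; it only illustrates how a semantic annotation could be used either to filter the pool or to bias a feature-selection objective.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Descriptor:
    name: str
    meaning: str = ""   # semantic annotation; empty means no physical interpretation recorded

# Hypothetical pool; names and annotations are illustrative only.
pool = [
    Descriptor("hydrophobic_surface_area", "size of the hydrophobic surface"),
    Descriptor("hbond_donor_count", "number of hydrogen-bond donors"),
    Descriptor("topological_index_17"),
    Descriptor("molecular_weight", "molecular size"),
]

def interpretable_only(pool):
    """Drop descriptors without a semantic annotation before feature selection."""
    return [d for d in pool if d.meaning]

def penalized_score(error, subset, penalty=0.1):
    """Objective that adds a cost per unannotated descriptor instead of excluding it."""
    return error + penalty * sum(1 for d in subset if not d.meaning)
```

Filtering is the stricter choice; the penalty variant lets an unannotated descriptor through when it improves the error by more than the penalty.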


Summary




   Using models just for prediction is fine, but there’s lots of extra
   information we can extract from them
   The extent of interpretability is guided by the nature of the
   problem, the choice of model and descriptors
   Interpretation of 2D-QSAR models allows us to get a little closer
   to the real physical problem


PLS Interpretation - Understanding Outliers




  Compound 55 is mispredicted by each
  component
  It is also an outlier in both linear & CNN
  models
  Has high absolute value of SURR-5 but low
  measured activity

Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
 
When the whole is better than the parts
When the whole is better than the partsWhen the whole is better than the parts
When the whole is better than the parts
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
 
Pushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesPushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the Pipes
 
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...
 
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research Database
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of Cheminformatics
 
Chemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleChemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & Reproducible
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?
 
Quantifying Text Sentiment in R
Quantifying Text Sentiment in RQuantifying Text Sentiment in R
Quantifying Text Sentiment in R
 

Último

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxCeline George
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxPooja Bhuva
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxmarlenawright1
 

Último (20)

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 

Making the Most of Predictive Models

  • 1. Introduction Methodologies Summary Making the Most of Predictive Models Rajarshi Guha School of Informatics Indiana University 28th February, 2007 CUP 8, Santa Fe
  • 2. Outline: 1. Introduction; 2. Methodologies (Linear Methods, Nonlinear Methods, Feature Selection); 3. Summary
  • 4. All Models Predict Something . . . Physics-based models: docking, force fields, MD. Statistical/ML models: an indirect description of the physical situation; it is not always clear how a prediction is made.
  • 5. The Scope of Predictive Models. Predictive models can be used for: filtering, analysis. We can use predictive models in: chemometrics, bioinformatics, QSAR, ... What do we look for in a model? Validity, accuracy, applicability, interpretability.
  • 8. Why Interpret a Model? A model encodes relationships between features and the property. Understanding these relationships allows us to: understand why a molecule is active or not; suggest structural modifications; explain anomalous observations.
  • 9. How Much Detail Can We Extract? We can look at a model very broadly: which descriptors are important to its predictive ability? We can consider a more detailed analysis: what is the effect of descriptor $X_i$ on $\hat{Y}$? Which observations highlight this relationship? It depends on how much effort you want to put in.
  • 10. The Accuracy - Interpretability Tradeoff. OLS models are generally easier to interpret but not always accurate. Neural networks give better accuracy, but are black boxes. Some lie in between, such as random forests.
  • 12. Model Types That We Can Interpret. Depends on: the level of interpretation desired; the nature of the problem (classification and regression). Some models are interpretable by design: decision trees, Bayesian networks. Others require an interpretation protocol: linear regression, random forests, neural networks, support vector machines. Guha, R.; Jurs, P.C., J. Chem. Inf. Comput. Sci., 2004, 44, 2179–2189. Guha, R. et al., J. Chem. Inf. Model., 2005, 46, 321–333. Do, T.N.; Poulet, F., Enhancing SVM with Visualization, in Discovery Science, Springer, 2004, pp. 183–194
  • 13. Linear Regression Interpretations. $\mathrm{p}K_a = -37.54\,Q_{\sigma,o} + 12.27\,A_{\mathrm{access},o} + 0.11\,\chi_{\pi,\alpha c} - 1.02\,\alpha_o - 1.89\,I_{\mathrm{amino}} + 19.10$. Simply looking at the magnitudes of the coefficients shows which descriptors play an important role. The signs of the coefficients indicate the effect of each descriptor on the predicted property. The interpretation is still quite broad; we’d like to see more detailed SARs applied to individual molecules. Zhang, J. et al., J. Chem. Inf. Model., 2006, 46, 2256–2266
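The magnitude-and-sign reading described on this slide can be sketched in a few lines. The coefficients are copied from the pKa equation; the plain-text descriptor labels and the helper `rank_by_magnitude` are illustrative names, not from the talk. Note the assumption that the descriptors are on comparable scales: the slide does not say whether they were standardized, and raw magnitudes are only comparable if they were.

```python
# Coefficients copied from the pKa model on the slide; the descriptor labels
# are plain-text stand-ins for the subscripted symbols.
coefficients = {
    "Q_sigma_o": -37.54,
    "A_access_o": 12.27,
    "chi_pi_alpha_c": 0.11,
    "alpha_o": -1.02,
    "I_amino": -1.89,
}
intercept = 19.10


def rank_by_magnitude(coefs):
    """Return (name, coefficient, direction) tuples, largest |coefficient| first.

    Assumes the descriptors share a comparable scale; otherwise the raw
    magnitudes are not directly comparable.
    """
    ranked = sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)
    return [
        (name, value, "raises" if value > 0 else "lowers")
        for name, value in ranked
    ]


ranking = rank_by_magnitude(coefficients)
for name, value, direction in ranking:
    print(f"{name:>15}  {value:8.2f}  {direction} predicted pKa")
```

This gives only the broad view the slide mentions: it says which terms dominate and in which direction, not how the relationship plays out for an individual molecule.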
  • 14. Linear Regression Interpretations via PLS. PLS overview: PLS creates a model with latent variables. Latent variables (components) are linear combinations of the original variables (X’s). Each latent variable is used to predict a pseudo dependent variable (Y’s). Interpretation: the linear model is subjected to PLS analysis, which also validates the model. Choose the number of components to use. Interpretation uses the X-weights, X-scores and Y-scores. Stanton, D.T., J. Chem. Inf. Comput. Sci., 2003, 43, 1423–1433
  • 15. Linear Regression Interpretations via PLS
    Choosing components: Q² allows us to choose how many components to use. For a valid model the cumulative variance should be 1.
        Component   X variance   R²     Q²
        C1          0.51         0.52   0.45
        C2          0.78         0.60   0.56
        C3          1.00         0.61   0.56
    Descriptor weights: descriptors are ranked by their weights. The sign of a weight indicates how the descriptor correlates with predicted activity.
        Desc      C1      C2      C3
        MDEN-23   -0.16   0.93    0.30
        RNHS-3    0.55    -0.17   0.81
        SURR-5    -0.82   -0.29   0.48
    Guha, R.; Jurs, P.C., J. Chem. Inf. Comput. Sci., 2004, 44, 2179–2189
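As a rough illustration of the two steps on this slide, the sketch below picks the number of components from the Q² column and ranks descriptors by absolute weight on a component. The numbers are the slide's own; the stopping rule (stop when Q² improves by less than `min_gain`) and the function names are assumptions made for the example, since the slide only says that Q² guides the choice.

```python
# Values copied from the slide's two tables.
components = [  # (label, cumulative X variance, R2, Q2)
    ("C1", 0.51, 0.52, 0.45),
    ("C2", 0.78, 0.60, 0.56),
    ("C3", 1.00, 0.61, 0.56),
]
weights = {  # descriptor -> X-weights on (C1, C2, C3)
    "MDEN-23": (-0.16, 0.93, 0.30),
    "RNHS-3": (0.55, -0.17, 0.81),
    "SURR-5": (-0.82, -0.29, 0.48),
}


def choose_n_components(rows, min_gain=0.01):
    """Keep adding components while Q2 improves by at least min_gain.

    The threshold is an assumed rule of thumb, not taken from the talk.
    """
    n = 1
    for previous, current in zip(rows, rows[1:]):
        if current[3] - previous[3] < min_gain:
            break
        n += 1
    return n


def rank_descriptors(weight_table, component_index):
    """Order descriptors by |weight| on one component, largest first."""
    return sorted(
        weight_table,
        key=lambda name: abs(weight_table[name][component_index]),
        reverse=True,
    )


n_components = choose_n_components(components)
c1_ranking = rank_descriptors(weights, 0)
c2_ranking = rank_descriptors(weights, 1)
print(n_components, c1_ranking, c2_ranking)
```

With the slide's numbers, Q² does not improve from C2 to C3, so this rule stops at two components, and SURR-5 carries the largest weight on the first component.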
  • 16. Linear Regression Interpretations via PLS. Component 1: SURR-5 is the most heavily weighted descriptor. Low values of SURR-5 ⇒ high values of predicted activity. Interpretation: active compounds have high absolute values of SURR-5, which indicates a large hydrophobic surface area. This is consistent with a cell-based assay, which depends on cell membrane transport. C1 weights: MDEN-23 -0.16, RNHS-3 0.55, SURR-5 -0.82.
  • 17. Random Forest Interpretations. A RF provides a measure of descriptor importance: it utilizes the whole descriptor pool and ranks the descriptors, based on randomization. We can also get more information on individual trees: find the most important trees; consider a tree-space and find clusters of trees. Chipman, H.A. et al., Making Sense of a Forest of Trees in Proc. 30th Symposium on the Interface, Weisberg, S. Ed
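The randomization idea behind this kind of importance measure can be sketched without any forest library: shuffle one descriptor column, re-predict, and see how much the error grows. This is a simplified, model-agnostic version; a random forest would typically do this per tree on out-of-bag samples, which is not reproduced here. The toy data and `toy_model` are hypothetical stand-ins for a fitted model.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of model(row) against the targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_features, seed=0):
    """Increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with only column j replaced by its shuffled value.
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(model, shuffled, targets) - baseline)
    return importances


# Toy data: the target depends only on feature 0; feature 1 is irrelevant.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(50)]
targets = [3.0 * r[0] for r in rows]


def toy_model(row):
    """Hypothetical stand-in for a fitted model's predict function."""
    return 3.0 * row[0]


importances = permutation_importance(toy_model, rows, targets, n_features=2)
print(importances)
```

A descriptor the model never uses gets an importance of zero, while shuffling a descriptor the model relies on degrades the predictions.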
  • 18. Neural Network Interpretations
    Effective Weight Matrix
        Descriptor   Hidden Neuron 1   Hidden Neuron 2
        Desc 1       52.41             29.30
        Desc 2       37.65             22.14
        Desc 3       -10.50            -16.85
    Linearizes the network and consequently loses some details of the encoded SARs.
    Guha, R. et al., J. Chem. Inf. Model., 2005, 46, 321–333
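One way to read "linearizes the network" is sketched below, under a stated assumption: the effective weight of descriptor i through hidden neuron j is taken to be the input-to-hidden weight multiplied by that neuron's hidden-to-output weight, ignoring the transfer function. The slide does not give the formula, so this is an illustration rather than the cited method, and all weights in the example are made up.

```python
def effective_weights(input_hidden, hidden_output):
    """Multiply each input->hidden weight by its neuron's hidden->output weight.

    input_hidden[i][j] is the weight from descriptor i to hidden neuron j;
    hidden_output[j] is the weight from hidden neuron j to the output.
    The product rule is an assumption; nonlinear transfer functions are ignored.
    """
    return [
        [w_ij * hidden_output[j] for j, w_ij in enumerate(row)]
        for row in input_hidden
    ]


# Hypothetical weights for 3 descriptors and 2 hidden neurons.
input_hidden = [
    [2.0, -1.0],
    [0.5, 4.0],
    [-3.0, 0.0],
]
hidden_output = [10.0, -2.0]

matrix = effective_weights(input_hidden, hidden_output)
for i, row in enumerate(matrix, start=1):
    print(f"Desc {i}", row)
```

Each column of the result can then be read like a set of regression coefficients for one hidden neuron, which is where some of the nonlinear detail is lost.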
  • 19. Neural Network Interpretations. The most weighted descriptors are very similar to those in the OLS model. The signs of the effective weights match those from the OLS model as well as chemical reasoning. Annotations: size effects, higher MW; H-bonding, HSA, polar surface area.
        Descriptor   Hidden Neuron 1   3       2      4
        PNSA-3       -1.80             -6.57   0.39   -1.43
        RSHM-1       4.03              6.15    1.50   1.01
        V4P-5        9.45              2.15    3.24   0.60
        S4PC-12      3.36              2.73    1.99   0.56
        MW           3.94              8.42    1.94   0.76
        WTPT-2       1.71              2.61    1.17   -0.13
        DPHS-1       0.66              0.44    0.33   1.65
        SCV          0.52              0.33    0.13   0.01
  • 20. Stepping Back . . . We’ve been focusing on single model types. Different models serve different purposes. Descriptors are optimal for a specific model. Are we sure that the different models encode the same SARs?
  • 21. Getting the Best of Both Worlds. Why not force multiple models to have the same descriptors? The descriptor set will not be optimal for either model, so there is a degradation in accuracy. Is it really that bad?
  • 22. Eating Our Cake . . . Ensemble feature selection: select a descriptor subset that is simultaneously optimal for two different model types. This allows us to build an OLS model and a CNN model using the same set of descriptors. We use a genetic algorithm, where the objective function is of the form $\mathrm{RMSE}_{\mathrm{OLS}} + \mathrm{RMSE}_{\mathrm{CNN}}$. We have one model for interpretability and one for accuracy, but they should now incorporate the same SARs. Dutta, D. et al., J. Chem. Inf. Model., submitted
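A minimal sketch of the idea on this slide: a small genetic algorithm searches descriptor subsets and scores each one by the sum of the two models' RMSE values. Only the form of the objective comes from the slide. The GA operators, the parameter values, and the two toy scoring functions are assumptions made for illustration; in real use each scoring function would fit and validate the corresponding model on the candidate subset.

```python
import random


def combined_rmse(subset, rmse_ols, rmse_cnn):
    """Objective from the slide: RMSE of the OLS model plus RMSE of the CNN."""
    return rmse_ols(subset) + rmse_cnn(subset)


def mutate(subset, n_descriptors, rng):
    """Swap one selected descriptor for one that is not selected."""
    chosen = list(subset)
    unused = [d for d in range(n_descriptors) if d not in subset]
    chosen[rng.randrange(len(chosen))] = rng.choice(unused)
    return tuple(sorted(chosen))


def crossover(a, b, rng):
    """Child draws its descriptors from the union of both parents."""
    pool = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(pool, len(a))))


def select_descriptors(n_descriptors, k, rmse_ols, rmse_cnn,
                       pop_size=20, generations=30, seed=0):
    """Tiny GA over descriptor subsets of size k (requires k < n_descriptors)."""
    rng = random.Random(seed)

    def score(subset):
        return combined_rmse(subset, rmse_ols, rmse_cnn)

    population = [tuple(sorted(rng.sample(range(n_descriptors), k)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score)
        parents = population[:pop_size // 2]  # truncation selection keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            child = crossover(rng.choice(parents), rng.choice(parents), rng)
            if rng.random() < 0.3:
                child = mutate(child, n_descriptors, rng)
            children.append(child)
        population = parents + children
    best = min(population, key=score)
    return best, score(best)


# Hypothetical stand-ins for "fit the model on this subset and return its RMSE".
def toy_rmse_ols(subset):
    return sum(abs(d - 2) for d in subset) / 10.0


def toy_rmse_cnn(subset):
    return sum(abs(d - 5) for d in subset) / 10.0


best, best_score = select_descriptors(10, 3, toy_rmse_ols, toy_rmse_cnn)
print(best, best_score)
```

Because both models are scored on the same subset, whatever the search returns is by construction a shared descriptor set, which is the point of the approach: neither model gets its individually optimal descriptors, but both describe the same structure-activity trends.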
  • 24. . . . and Having it Too. [Figure: bar charts comparing individual and ensemble feature selection modes for the CNN and OLS models. Panels: PDGFR Inhibitors, HIV Integrase. Y-axes: RMSE, % Misclassification.]
  • 25. Selecting for Interpretability. Up until now we’ve considered the models themselves, but it’s the descriptors we interpret. Can we build models out of interpretable descriptors? Design descriptors so that they have physical meaning. Exclude uninterpretable descriptors from the pool. Add semantic annotation to descriptors and modify feature selection algorithms to take this into account.
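The "exclude uninterpretable descriptors from the pool" step could look like the sketch below, assuming each descriptor carries a short annotation of its physical meaning. The annotation table, the `TOPO-X` name, and the helper function are hypothetical; only the meanings given for MW and SURR-5 echo earlier slides.

```python
# Hypothetical annotations: descriptor name -> short physical meaning, or None
# when no straightforward interpretation is available.
annotations = {
    "MW": "molecular weight (size)",
    "SURR-5": "hydrophobic surface area",
    "TOPO-X": None,  # made-up name standing in for an opaque descriptor
}


def interpretable_pool(pool, notes):
    """Keep only descriptors that carry a non-empty annotation."""
    return [name for name in pool if notes.get(name)]


pool = ["MW", "TOPO-X", "SURR-5", "UNLISTED-1"]
filtered = interpretable_pool(pool, annotations)
print(filtered)
```

The filtered pool would then be handed to whatever feature selection routine is in use, so that only annotated descriptors can end up in the final model.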
  • 27. Summary. Using models just for prediction is fine, but there’s lots of extra information we can extract from them. The extent of interpretability is guided by the nature of the problem and the choice of model and descriptors. Interpretation of 2D-QSAR models allows us to get a little closer to the real physical problem.
  • 29. PLS Interpretation - Understanding Outliers. Compound 55 is mispredicted by each component. It is also an outlier in both the linear and CNN models. It has a high absolute value of SURR-5 but low measured activity.