Introduction R & CDK R & PubChem Conclusions




                      Integrating R and the CDK
                       Enhanced Chemical Data Mining


                                    Rajarshi Guha

                                   School of Informatics
                                    Indiana University


                                    21st May, 2007
                                    Covington, KY

Outline



  1   Introduction


  2   R & CDK


  3   R & PubChem


  4   Conclusions

Cheminformatics & Modeling



        Chemical information is available from multiple sources
               Databases
               Algorithms
        The information must be used in some way
               Summarized in the form of a report
               Modeled (numerically / statistically)
        A variety of software is available for both cheminformatics and
        modeling
               Can we get something that allows us the best of both worlds?

Modeling Environments


        Many cheminformatics applications include statistical
        functionality
        Useful, but has a number of drawbacks
               Their focus is on cheminformatics, not necessarily statistics
               Usually limited in the statistics offerings
        Makes sense to use software designed for statistics
               SAS, SPSS
               Minitab, R
               ...
        However, these focus on statistics and not cheminformatics

     Combine a statistics package with a cheminformatics package

Why Use R?



        Open source statistical programming environment
        Full-fledged programming language
        A wide variety of statistical methods: traditional, well-tested
        methods as well as state-of-the-art methods
        Extensive and active community
        Easy to integrate into other environments
               Workflow tools (Pipeline Pilot, KNIME)
               Java programs (CDK)

Why Use the CDK?




          Open source Java library for cheminformatics
          Covers basic cheminformatics functionality
          Also provides descriptors, 3D coordinate generation, rigid
          alignment etc.
          Active development community




  Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2111–2120

Goals of the Integration




        Allow one to use cheminformatics tools within R
        Ensure that cheminformatics functionality follows R idioms
        Provide access to PubChem data - compound and bioassay

Using the CDK in R
  Overview
        Based on the rJava package
        Access to basic cheminformatics
        View 3D structures via Jmol
        View and edit 2D structures via JChemPaint
        Single R package to install, no extra downloads

rcdk Functionality

        Category                 Function                   Description

        Editing and viewing      draw.molecule              Draw 2D structures
                                 view.molecule              View a molecule in 3D
                                 view.molecule.2d           View a molecule in 2D
                                 view.molecule.table        View 3D molecules along with data

        IO                       load.molecules             Load arbitrary molecular file formats
                                 write.molecules            Write molecules to MDL SD formatted files

        Descriptors              eval.desc                  Evaluate a single descriptor
                                 get.desc.engine            Get the descriptor engine, which can be used to automatically generate all descriptors
                                 get.desc.classnames        Get a list of available descriptors

        Miscellaneous            get.fingerprint            Generate binary fingerprints
                                 get.smiles                 Obtain SMILES representations
                                 parse.smiles               Parse a SMILES representation into a CDK molecule object
                                 get.property               Retrieve arbitrary properties on molecules
                                 set.property               Set arbitrary properties on molecules
                                 remove.property            Remove a property from a molecule
                                 get.totalcharge            Get the total charge of a molecule
                                 get.total.hydrogen.count   Get the count of (implicit and explicit) hydrogens
                                 remove.hydrogens           Remove hydrogens from a molecule




  Guha, R., J. Stat. Soft., 2007, 18(6)

Clustering Molecules



  Generating fingerprints
    > library(rcdk)
    > mols <- load.molecules('dhfr_3d.sd')
    > fp.list <- lapply(mols, get.fingerprint)

        The fingerprints are hashed, 1024-bit
        Gives us a list of numeric vectors, indicating the bit
        positions that are on
        The next step is to evaluate a distance matrix

Clustering Molecules



  Fingerprint manipulation
    > library(fingerprint)
    > fp.dist <- fp.sim.matrix(fp.list)
    > fp.dist <- as.dist(1-fp.dist)

        The fingerprint package for R allows one to easily handle
        fingerprint data
        Recognizes CDK, BCI and MOE fingerprint formats
        We can now perform the clustering
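
  To make the 1 - similarity step concrete, the following is a minimal
  base R sketch. It assumes the Tanimoto coefficient as the similarity
  measure and uses toy vectors of "on" bit positions in place of real
  CDK fingerprints. The helper tanimoto and the data in fp.toy are
  hypothetical and are not part of rcdk or the fingerprint package.

```r
# Tanimoto similarity between two fingerprints, each given as a
# vector of "on" bit positions (hypothetical helper for illustration).
tanimoto <- function(a, b) {
  shared <- length(intersect(a, b))
  total <- length(union(a, b))
  if (total == 0) return(1)  # convention chosen here for two empty fingerprints
  shared / total
}

# Toy fingerprints standing in for the output of get.fingerprint
fp.toy <- list(
  c(1, 5, 9, 200),
  c(1, 5, 9, 300),
  c(7, 8, 1000)
)

# Pairwise similarity matrix, then the same 1 - similarity conversion
n <- length(fp.toy)
sim <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- tanimoto(fp.toy[[i]], fp.toy[[j]])
  }
}
toy.dist <- as.dist(1 - sim)
```

  The resulting dist object has the form that hclust expects, so it can
  be passed on to clustering in the same way as fp.dist.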

Clustering Molecules

  Hierarchical clustering
    > clus.hier <- hclust(fp.dist, method='ward')


  [Figure: two panels. "Hierarchical Clustering of the Sutherland DHFR Dataset" (y-axis: Height, 0 to 35) and "Average Dissimilarity per Cluster" (x-axis: Cluster Number, 1 to 4; y-axis: Dissimilarity, 0.0 to 0.7)]
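
  The per-cluster average dissimilarity shown alongside the dendrogram
  can be obtained by cutting the tree and averaging the pairwise
  distances within each cluster. The sketch below is one way to do this
  in base R. It uses synthetic 2D points rather than the DHFR
  fingerprints, and the names pts, toy.clus, membership and avg.dissim
  are hypothetical. It also uses method "ward.D", the name that R 3.1.0
  and later give to the method previously called "ward".

```r
set.seed(1)
# Synthetic 2D points standing in for molecules: two well-separated groups
pts <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
toy.dist <- dist(pts)

# Hierarchical clustering, then cut the tree into k clusters
toy.clus <- hclust(toy.dist, method = "ward.D")
membership <- cutree(toy.clus, k = 2)

# Average pairwise dissimilarity among the members of each cluster
d <- as.matrix(toy.dist)
avg.dissim <- sapply(sort(unique(membership)), function(k) {
  idx <- which(membership == k)
  if (length(idx) < 2) return(0)  # a singleton has no pairs
  sub <- d[idx, idx]
  mean(sub[upper.tri(sub)])
})
```

  For the slide's example, the same steps would apply with fp.dist and
  clus.hier in place of the toy objects and k = 4.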

QSAR Modeling



  Representation
        Molecules must be represented in some machine-readable
        format
        SD files are commonly used
        The draw.molecule function can be used to get a 2D
        structure editor
        The resultant molecule can be accessed in the R workspace
        Currently can’t generate 3D structures

QSAR Modeling
  Calculating descriptors
   > mols <- load.molecules('bp.sdf')
   > desc.names <- c(
   + 'org.openscience.cdk.qsar.descriptors.molecular.BCUTDescriptor',
   + 'org.openscience.cdk.qsar.descriptors.molecular.CPSADescriptor',
   + 'org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor',
   + 'org.openscience.cdk.qsar.descriptors.molecular.KappaShapeIndicesDescriptor')
   > # Column-name prefixes, one per descriptor (assumed values; not shown on the original slide)
   > prefix <- c('BCUT', 'CPSA', 'XLogP', 'Kappa')
   > data.names <- c()
   > desc.list <- list()
   > for (i in seq_along(mols)) {
   +   tmp <- c()
   +   for (j in seq_along(desc.names)) {
   +     # A single descriptor may return several values per molecule
   +     values <- eval.desc(mols[[i]], desc.names[j])
   +     tmp <- c(tmp, values)
   +     # Build the column names once, from the first molecule
   +     if (i == 1) data.names <- c(data.names, paste(prefix[j], seq_along(values), sep='.'))
   +   }
   +   desc.list[[i]] <- tmp
   + }
   > desc.data <- as.data.frame(do.call('rbind', desc.list))
   > names(desc.data) <- data.names



         Descriptors currently have to be specified, and column names
         generated, by hand
         This will be automated in the next release
         The result is a data matrix which can then be modeled using
         the various methods available in R
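
  As an illustration of modeling such a data matrix, here is a minimal
  sketch of an ordinary least squares fit with a held-out prediction
  set. The data are synthetic: the descriptor columns d1 to d3, the
  response y and all object names are made up for this example and are
  not the boiling point dataset used in the talk.

```r
set.seed(42)
# Synthetic descriptor matrix: 120 "molecules", 3 made-up descriptors
n <- 120
desc.toy <- data.frame(d1 = rnorm(n), d2 = rnorm(n), d3 = rnorm(n))
# Made-up response with a known linear dependence plus noise
desc.toy$y <- 300 + 40 * desc.toy$d1 - 25 * desc.toy$d2 + rnorm(n, sd = 5)

# Split into a training set and a prediction set
train.idx <- sample(n, 90)
train <- desc.toy[train.idx, ]
pred <- desc.toy[-train.idx, ]

# Fit the OLS model on the training set only
model.toy <- lm(y ~ d1 + d2 + d3, data = train)

# Predict the held-out molecules and summarize the fit
pred.y <- predict(model.toy, newdata = pred)
r2 <- summary(model.toy)$r.squared
rmse.pred <- sqrt(mean((pred$y - pred.y)^2))
```

  Calling summary(model.toy) prints a coefficient table in the same
  layout as the output of summary(model) for any lm fit.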

QSAR Modeling

  The results for an OLS model
    > summary(model)

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)      25.611     13.743   1.864   0.0639 .
    BCUT.6           26.155      1.503  17.406  < 2e-16 ***
    CPSA.12        2565.552    181.668  14.122  < 2e-16 ***
    WeightedPath.1    7.645      0.608  12.573  < 2e-16 ***
    MDE.11           91.355     17.916   5.099 8.05e-07 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 28.88 on 195 degrees of freedom
    Multiple R-Squared: 0.8311,     Adjusted R-squared: 0.8276
    F-statistic: 239.9 on 4 and 195 DF, p-value: < 2.2e-16

  [Figure: two panels. Observed Boiling Point (K) vs. Predicted Boiling Point (K), with Training Set and Prediction Set points; and a Standardized Residual plot (truncated in the source)]
                                                                                                                                                             q                           q
                                                                                                   q                                                q           q                    q
                                                                                                    q




                                                                                             −2
                                                                                                                                                      q
                                                                                                                                        q

                                                                                                                 q




                                                                                             −3
                                                                                                            q
                                                                                                                                                               q



                                                                                                   0                               50                 100            150                   200

                                                                                                                                            Training Set Index

Molecular Visualization

Outline



  1   Introduction


  2   R & CDK


  3   R & PubChem


  4   Conclusions

Accessing PubChem

  What can we get?
        2D structures for ∼10M compounds
        Some precomputed properties
        ∼500 bioassay datasets ranging from 100 to 80,000 observations
        Searchable by keyword

  What’s the data like?
        Structure data can be retrieved in the form of SD or XML files
        Structures are represented as 2D coordinates as well as SMILES
        Bioassay data consists of XML or CSV files
               Column descriptions etc. are located in a separate XML file

Accessing PubChem

  What’s the problem?
        Combining bioassay data & metadata by hand is painful
        The XML descriptions are readable but verbose
        Getting data can be a multi-step process, involving search,
        download and processing

  How does rpubchem help?
        Automated download of data & metadata files for compounds
        and bioassays
        Extracts data into a data.frame
        Assigns appropriate column names and adds extra information
        (such as units) as attributes
        Allows keyword searches for bioassays

  Guha, R., J. Stat. Soft., 2007, 18(6)
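
  The manual step of combining data and metadata can be sketched in base R. This is only an illustration under stated assumptions: the CSV text, the `descriptions` list and the helper `attach_metadata` are hypothetical stand-ins, not rpubchem internals or PubChem's real file layout.

```r
# Hypothetical stand-in for a downloaded bioassay CSV file
csv_text <- "sid,cid,col1,col2
101,11,10.0,85.2
102,12,10.0,40.1"

# Hypothetical stand-in for the parsed description file:
# one entry per assay-specific column (new name, description, unit)
descriptions <- list(
  col1 = list(name = "conc",   desc = "concentration tested",     unit = "uM"),
  col2 = list(name = "giprct", desc = "inhibition of growth (%)", unit = NA)
)

# Combine data and metadata: rename the columns and keep the
# description and unit of each one in a "types" attribute
attach_metadata <- function(dat, descriptions) {
  types <- list()
  for (old in names(descriptions)) {
    d <- descriptions[[old]]
    names(dat)[names(dat) == old] <- d$name
    types[[d$name]] <- c(d$desc, as.character(d$unit))
  }
  attr(dat, "types") <- types
  dat
}

assay <- attach_metadata(read.csv(text = csv_text), descriptions)
names(assay)
attr(assay, "types")$conc
```

  Keeping the descriptions as an attribute means the table itself stays a plain data.frame that can be passed directly to modeling functions.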

Example Usage


  Getting structural data
    > library(rpubchem)
    > library(rcdk)
    > structs <- get.cid(sample(1:10000, 10))
    > mols <- lapply(structs[,2], parse.smiles)
    > view.molecule.2d(mols)

        We retrieve the SMILES along with some of the precomputed
        values
        Heavy atom count, formal charge, H count, H-bond donors
        and acceptors

Example Usage



  Getting bioassay data
    > aids <- find.assay.id("lung+cancer+carcinoma")
    > sapply(aids, function(x) get.assay.desc(x)$assay.desc)
    > adata <- get.assay(175)


        Search for assays by keywords
        Get short descriptions
        Get the actual assay data

Example Usage

  What do we get?
   Field Name                      Meaning
   PUBCHEM.SID                     Substance ID
   PUBCHEM.CID                     Compound ID
   PUBCHEM.ACTIVITY.OUTCOME        Activity outcome: one of active, inactive,
                                   inconclusive, unspecified
   PUBCHEM.ACTIVITY.SCORE          PubChem activity score; higher is more active
   PUBCHEM.ASSAYDATA.COMMENT       Test-result-specific comment


        A set of fixed fields
        An arbitrary number of assay-specific fields
        The assay-specific fields have to be looked up in a description file
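
  The split between fixed and assay-specific fields can be sketched in base R. The table below is mock data with made-up values, and the rule that fixed fields are the ones with the "PUBCHEM." prefix is an assumption based on the field names listed above.

```r
# Mock assay table using the field names from the slide (values are made up)
adata <- data.frame(
  PUBCHEM.SID               = c(1L, 2L),
  PUBCHEM.CID               = c(10L, 20L),
  PUBCHEM.ACTIVITY.OUTCOME  = c("active", "inactive"),
  PUBCHEM.ACTIVITY.SCORE    = c(80, 5),
  PUBCHEM.ASSAYDATA.COMMENT = c("", ""),
  conc                      = c(10, 10),
  giprct                    = c(85, 12)
)

# Assumption: fixed fields share the "PUBCHEM." prefix,
# everything else is specific to the assay
is_fixed     <- grepl("^PUBCHEM\\.", names(adata))
fixed_fields <- names(adata)[is_fixed]
assay_fields <- names(adata)[!is_fixed]

fixed_fields
assay_fields
```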

Example Usage
  What do we get?
  > names(adata)
  [1] "PUBCHEM.SID"                 "PUBCHEM.CID"
  [3] "PUBCHEM.ACTIVITY.OUTCOME"    "PUBCHEM.ACTIVITY.SCORE"
  [5] "PUBCHEM.ASSAYDATA.COMMENT"   "conc"
  [7] "giprct"                      "nbrtest"
  [9] "range"

  > attr(adata, "comments")
  [1] "These data are a subset of the data from the NCI yeast anticancer
  drug screen. Compounds are identified by the NCI NSC number. In the
  Seattle Project numbering system, the mlh1 rad18 strain is SPY number
  50858. Compounds with inhibition of growth(giprct) >= 70% were
  considered active. Activity score was based on increasing values of giprct."

  > attr(adata, 'types')
  $conc
  [1] "concentration tested, unit: micromolar"
  [2] "uM"

  $giprct
  [1] "Inhibition of growth compared to untreated controls (%)"
  [2] "NA"

  $nbrtest
  [1] "Number of tests averaged for this data point"
  [2] "NA"

  $range
  [1] "Range of values for this data point" "NA"
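
  A minimal sketch of how these attributes might be used once the data are in R. The data.frame is mock data with made-up values that mirrors two of the columns above, and `describe_field` is a hypothetical helper, not part of rpubchem. The activity rule (giprct >= 70) is the one quoted in the assay comments.

```r
# Mock data mirroring some of the columns above (values are made up)
adata <- data.frame(
  PUBCHEM.SID = 1:4,
  conc        = c(10, 10, 50, 50),
  giprct      = c(92.5, 35.0, 70.0, 69.9)
)
attr(adata, "types") <- list(
  conc   = c("concentration tested, unit: micromolar", "uM"),
  giprct = c("Inhibition of growth compared to untreated controls (%)", "NA")
)

# Hypothetical helper: build a label for a column from the "types"
# attribute, appending the unit only when one is recorded
describe_field <- function(dat, field) {
  info <- attr(dat, "types")[[field]]
  has_unit <- !is.na(info[2]) && info[2] != "NA"
  unit <- if (has_unit) paste0(" [", info[2], "]") else ""
  paste0(field, ": ", info[1], unit)
}

describe_field(adata, "conc")

# Apply the activity rule quoted in the assay comments: giprct >= 70
adata$active <- adata$giprct >= 70
table(adata$active)
```

  Labels built this way can be passed to plotting functions as axis titles, so the units travel with the data rather than being typed in by hand.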

Outline



  1   Introduction


  2   R & CDK


  3   R & PubChem


  4   Conclusions

Future Work




        Improve the coverage of the CDK functionality
        Improve descriptor generation and naming
        Include support for more chemical file formats
        2D Structure-data tables
        Avoid in-memory processing of large PubChem assay datasets

Conclusions




        Integrating the CDK with R enhances the utility of the
        statistical environment for cheminformatics applications
        The integration is still at the programmer level
               Applications can be built up from the available functionality
        rpubchem provides easy access to large amounts of biological
        and chemical data

Acknowledgements




        NIH
        CDK developers
        Simon Urbanek

RCDK PubChem Integrating R and CDK for Chemical Data Mining

  • 1. Introduction R & CDK R & PubChem Conclusions Integrating R and the CDK Enhanced Chemical Data Mining Rajarshi Guha School of Informatics Indiana University 21st May, 2007 Covington, KY
  • 2. Outline
        1 Introduction
        2 R & CDK
        3 R & PubChem
        4 Conclusions
  • 4. Cheminformatics & Modeling
        Chemical information is available from multiple sources
               Databases
               Algorithms
        The information must be used in some way
               Summarized in the form of a report
               Modeled (numerically / statistically)
        A variety of software is available for both cheminformatics and modeling
               Can we get something that allows us the best of both worlds?
  • 6. Modeling Environments
        Many cheminformatics applications include statistical functionality
        Useful, but has a number of drawbacks
               Their focus is on cheminformatics, not necessarily statistics
               Usually limited in the statistics offerings
        Makes sense to use software designed for statistics
               SAS, SPSS, Minitab, R ...
        However, these focus on statistics and not cheminformatics
        Combine a statistics package with a cheminformatics package
  • 8. Why Use R?
        Open source statistical programming environment
        Full-fledged programming language
        A wide variety of statistical methods: traditional, well-tested methods as well as state-of-the-art methods
        Extensive and active community
        Easy to integrate into other environments
               Workflow tools (Pipeline Pilot, Knime)
               Java programs (CDK)
  • 9. Why Use the CDK?
        Open source Java library for cheminformatics
        Covers basic cheminformatics functionality
        Also provides descriptors, 3D coordinate generation, rigid alignment, etc.
        Active development community
        Steinbeck, C. et al., Curr. Pharm. Des., 2007, 12, 2110–2120
  • 11. Goals of the Integration
        Allow one to use cheminformatics tools within R
        Ensure that cheminformatics functionality follows R idioms
        Provide access to PubChem data (compound and bioassay)
  • 12. Using the CDK in R
        Overview
               Based on the rJava package
               Access to basic cheminformatics
               View 3D structures via Jmol
               View and edit 2D structures via JChemPaint
               Single R package to install, with no extra downloads
  • 13. rcdk Functionality
        Function                   Description
        Editing and viewing
          draw.molecule            Draw 2D structures
          view.molecule            View a molecule in 3D
          view.molecule.2d         View a molecule in 2D
          view.molecule.table      View 3D molecules along with data
        IO
          load.molecules           Load arbitrary molecular file formats
          write.molecules          Write molecules to MDL SD formatted files
        Descriptors
          eval.desc                Evaluate a single descriptor
          get.desc.engine          Get the descriptor engine, which can be used to automatically generate all descriptors
          get.desc.classnames      Get a list of available descriptors
        Miscellaneous
          get.fingerprint          Generate binary fingerprints
          get.smiles               Obtain SMILES representations
          parse.smiles             Parse a SMILES representation into a CDK molecule object
          get.property             Retrieve arbitrary properties on molecules
          set.property             Set arbitrary properties on molecules
          remove.property          Remove a property from a molecule
          get.totalcharge          Get the total charge of a molecule
          get.total.hydrogen.count Get the count of (implicit and explicit) hydrogens
          remove.hydrogens         Remove hydrogens from a molecule
        Guha, R., J. Stat. Soft., 2007, 18 (6)
  • 14. Clustering Molecules
        Generating fingerprints
        > library(rcdk)
        > mols <- load.molecules('dhfr_3d.sd')
        > fp.list <- lapply(mols, get.fingerprint)
        The fingerprints are hashed, 1024-bit
        This gives us a list of numeric vectors indicating the bit positions that are on
        The next step is to evaluate a distance matrix
  • 15. Clustering Molecules
        Fingerprint manipulation
        > library(fingerprint)
        > fp.dist <- fp.sim.matrix(fp.list)
        > fp.dist <- as.dist(1-fp.dist)
        The fingerprint package for R allows one to easily handle fingerprint data
        Recognizes CDK, BCI and MOE fingerprint formats
        We can now perform the clustering
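The similarity-to-distance step on this slide can be sketched in base R. This is a minimal illustration, assuming the Tanimoto coefficient and fingerprints stored as vectors of on-bit positions; the helper names `tanimoto` and `sim.matrix` and the toy fingerprints are hypothetical and are not part of the fingerprint package.

```r
# Toy fingerprints: each vector lists the bit positions that are on
# (hypothetical data, not taken from the DHFR set)
fps <- list(a = c(1, 5, 9, 20),
            b = c(1, 5, 9, 33),
            c = c(2, 7, 40))

# Tanimoto similarity: size of the intersection over size of the union
tanimoto <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# Pairwise similarity matrix over a list of fingerprints
sim.matrix <- function(fp.list) {
  n <- length(fp.list)
  m <- matrix(1, nrow = n, ncol = n,
              dimnames = list(names(fp.list), names(fp.list)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      m[i, j] <- tanimoto(fp.list[[i]], fp.list[[j]])
    }
  }
  m
}

fp.sim <- sim.matrix(fps)

# Convert similarity to dissimilarity, mirroring as.dist(1 - fp.dist) on the slide
fp.dist <- as.dist(1 - fp.sim)
```

Storing only the on-bit positions keeps sparse 1024-bit fingerprints compact, and set operations are then enough to compute the coefficient.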
  • 16. Clustering Molecules
        Hierarchical clustering
        > clus.hier <- hclust(fp.dist, method='ward')
        [Figure, two panels: "Hierarchical Clustering of the Sutherland DHFR Dataset" (dendrogram; y-axis: Height) and "Average Dissimilarity per Cluster" (x-axis: Cluster Number, 1 to 4; y-axis: Dissimilarity)]
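A per-cluster summary like the one named in the figure can be sketched in base R. This is only an illustration on toy coordinates, not the DHFR data: the points, the choice of `k`, and the name `avg.diss` are assumptions. Current versions of R call the method that was formerly `"ward"` `"ward.D"`.

```r
# Toy data: two well-separated groups of three points (hypothetical values)
pts <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1),
             c(5, 5), c(5.1, 5), c(5, 5.1))
d <- dist(pts)

# "ward.D" is the current name for the method previously called "ward"
clus <- hclust(d, method = "ward.D")

# Cut the dendrogram into k clusters
k <- 2
groups <- cutree(clus, k = k)

# Average pairwise dissimilarity within each cluster
dm <- as.matrix(d)
avg.diss <- sapply(seq_len(k), function(g) {
  idx <- which(groups == g)
  if (length(idx) < 2) return(0)  # a singleton has no pairs
  block <- dm[idx, idx]
  mean(block[upper.tri(block)])
})
names(avg.diss) <- seq_len(k)
```

With real fingerprints, `d` would be the `fp.dist` object and `k` would be chosen by inspecting the dendrogram.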
  • 17. QSAR Modeling
  • 18. QSAR Modeling
        Representation
               Molecules must be represented in some machine-readable format
               SD files are commonly used
               The draw.molecule function can be used to get a 2D structure editor
               The resultant molecule can be accessed in the R workspace
               Currently can't generate 3D structures
  • 19. QSAR Modeling
        Calculating descriptors
        > mols <- load.molecules('bp.sdf')
        > desc.names <- c(
        +   'org.openscience.cdk.qsar.descriptors.molecular.BCUTDescriptor',
        +   'org.openscience.cdk.qsar.descriptors.molecular.CPSADescriptor',
        +   'org.openscience.cdk.qsar.descriptors.molecular.XLogPDescriptor',
        +   'org.openscience.cdk.qsar.descriptors.molecular.KappaShapeIndicesDescriptor')
        > # Column-name prefixes, one per descriptor (assumed; not defined on the slide)
        > prefix <- c('BCUT', 'CPSA', 'XLogP', 'Kappa')
        > data.names <- c()
        > desc.list <- list()
        > for (i in 1:length(mols)) {
        +   tmp <- c()
        +   for (j in 1:length(desc.names)) {
        +     values <- eval.desc(mols[[i]], desc.names[j])
        +     tmp <- c(tmp, values)
        +     # Build the column names once, from the first molecule
        +     if (i == 1) data.names <- c(data.names, paste(prefix[j], 1:length(values), sep='.'))
        +   }
        +   desc.list[[i]] <- tmp
        + }
        > desc.data <- as.data.frame(do.call('rbind', desc.list))
        > names(desc.data) <- data.names
        Currently have to specify descriptors and generate names
        This will be automated in the next release
        The result is a data matrix which can then be modeled using the various methods available in R
  • 20. QSAR Modeling
        The results for an OLS model
        > summary(model)
        Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
        (Intercept)       25.611     13.743   1.864   0.0639 .
        BCUT.6            26.155      1.503  17.406  < 2e-16 ***
        CPSA.12         2565.552    181.668  14.122  < 2e-16 ***
        WeightedPath.1     7.645      0.608  12.573  < 2e-16 ***
        MDE.11            91.355     17.916   5.099 8.05e-07 ***
        ---
        Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        Residual standard error: 28.88 on 195 degrees of freedom
        Multiple R-Squared: 0.8311, Adjusted R-squared: 0.8276
        F-statistic: 239.9 on 4 and 195 DF, p-value: < 2.2e-16
        [Figure: Observed Boiling Point (K) vs. Predicted Boiling Point (K), training set and prediction set]
        [Figure: Standardized Residual vs. Training Set Index]
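The OLS workflow behind this slide can be sketched in base R. This is a minimal illustration on synthetic data: the descriptor columns `d1` and `d2`, the response `bp`, and the 80/20 split are all assumptions standing in for the descriptor matrix and boiling points used in the talk.

```r
# Synthetic descriptor matrix and response (hypothetical; stands in for
# desc.data and the observed boiling points)
set.seed(42)
n <- 100
desc <- data.frame(d1 = rnorm(n), d2 = rnorm(n))
desc$bp <- 300 + 25 * desc$d1 - 10 * desc$d2 + rnorm(n, sd = 5)

# Hold out a prediction set
train.idx <- sample(seq_len(n), size = 80)
train <- desc[train.idx, ]
pred  <- desc[-train.idx, ]

# Ordinary least squares fit on the training set
model <- lm(bp ~ d1 + d2, data = train)

# Predictions for the held-out rows and their root mean squared error
bp.hat <- predict(model, newdata = pred)
rmse <- sqrt(mean((pred$bp - bp.hat)^2))

# Standardized residuals, the quantity plotted against the training set index
std.res <- rstandard(model)
```

Calling `summary(model)` on such a fit prints a coefficient table in the same layout as the one shown on the slide.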
  • 21. Molecular Visualization
  • 23. Accessing PubChem
        What can we get?
               2D structures for ∼10M compounds
               Some precomputed properties
               ∼500 bioassay datasets, ranging from 100 to 80,000 observations
               Searchable by keyword
        What's the data like?
               Structure data can be retrieved in the form of SD or XML files
               Structures are represented as 2D coordinates as well as SMILES
               Bioassay data consists of XML or CSV files
               Column descriptions etc. are located in a separate XML file
  • 24. Accessing PubChem
        What's the problem?
               Combining bioassay data & metadata by hand is painful
               The XML descriptions are readable but verbose
               Getting data can be a multi-step process, involving search, download and processing
        How does rpubchem help?
               Automated download of data & metadata files for compounds and bioassays
               Extracts data into a data.frame
               Assigns appropriate column names and adds extra information (such as units) as attributes
               Allows keyword searches for bioassays
        Guha, R., J. Stat. Soft., 2007, 18 (6)
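The idea of carrying column metadata as attributes on a data.frame can be sketched in base R. This is an illustration of the general mechanism only, not of how rpubchem is implemented: the values are made up, and the attribute names `types` and `comments` simply follow the ones that appear later in the slides.

```r
# Toy assay table (hypothetical values; column names follow the slides)
adata <- data.frame(PUBCHEM.SID = c(101, 102, 103),
                    conc = c(10, 10, 50),
                    giprct = c(82.5, 12.0, 71.3))

# Column metadata kept separately, as it would be in a description file:
# each entry holds a description and a unit
types <- list(
  conc = c("concentration tested", "uM"),
  giprct = c("Inhibition of growth compared to untreated controls (%)", NA)
)

# Attach the metadata as attributes so it travels with the data.frame
attr(adata, "types") <- types
attr(adata, "comments") <- "Toy example; not real PubChem data"

# Look up the unit recorded for a column
conc.unit <- attr(adata, "types")$conc[2]
```

Keeping units and descriptions as attributes leaves the data.frame itself purely tabular, so it can be passed directly to modeling functions.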
  • 25. Example Usage
        Getting structural data
        > library(rpubchem)
        > library(rcdk)
        > structs <- get.cid( sample(1:10000, 10) )
        > mols <- lapply(structs[,2], parse.smiles)
        > view.molecule.2d(mols)
        We retrieve the SMILES along with some of the precomputed values
        Heavy atom count, formal charge, H count, H-bond donors and acceptors
  • 26. Example Usage
        Getting bioassay data
        > aids <- find.assay.id("lung+cancer+carcinoma")
        > sapply(aids, function(x) get.assay.desc(x)$assay.desc)
        > adata <- get.assay(175)
        Search for assays by keywords
        Get short descriptions
        Get the actual assay data
  • 27. Example Usage
        What do we get?
        Field Name                   Meaning
        PUBCHEM.SID                  Substance ID
        PUBCHEM.CID                  Compound ID
        PUBCHEM.ACTIVITY.OUTCOME     Activity outcome, which can be one of active, inactive, inconclusive, unspecified
        PUBCHEM.ACTIVITY.SCORE       PubChem activity score, higher is more active
        PUBCHEM.ASSAYDATA.COMMENT    Test result specific comment
        A set of fixed fields
        Arbitrary number of assay specific fields
        Have to process a description file
  • 28. Example Usage
        What do we get?
        > names(adata)
        [1] "PUBCHEM.SID"               "PUBCHEM.CID"
        [3] "PUBCHEM.ACTIVITY.OUTCOME"  "PUBCHEM.ACTIVITY.SCORE"
        [5] "PUBCHEM.ASSAYDATA.COMMENT" "conc"
        [7] "giprct"                    "nbrtest"
        [9] "range"
        > attr(adata, "comments")
        [1] "These data are a subset of the data from the NCI yeast anticancer drug screen. Compounds are identified by the NCI NSC number. In the Seattle Project numbering system, the mlh1 rad18 strain is SPY number 50858. Compounds with inhibition of growth(giprct) >= 70% were considered active. Activity score was based on increasing values of giprct."
        > attr(adata, 'types')
        $conc
        [1] "concentration tested, unit: micromolar"
        [2] "uM"
        $giprct
        [1] "Inhibition of growth compared to untreated controls (%)"
        [2] "NA"
        $nbrtest
        [1] "Number of tests averaged for this data point"
        [2] "NA"
        $range
        [1] "Range of values for this data point" "NA"
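The activity rule quoted in the assay comments (inhibition of growth of at least 70% counts as active) can be applied to such a data.frame in base R. This is a sketch on made-up rows; only the column names `PUBCHEM.CID` and `giprct` come from the slides, and the `active` column is a hypothetical addition.

```r
# Toy assay rows (hypothetical values; not real PubChem data)
adata <- data.frame(PUBCHEM.CID = c(11, 12, 13, 14),
                    giprct = c(95, 40, 70, 69.9))

# Flag compounds whose growth inhibition is at least 70%
adata$active <- adata$giprct >= 70

# Keep only the active compounds
actives <- adata[adata$active, ]
```

Because the assay data arrives as an ordinary data.frame, this kind of filtering needs no PubChem-specific code.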
  • 30. Future Work
        Improve the coverage of the CDK functionality
        Improve descriptor generation and naming
        Include support for more chemical file formats
        2D structure-data tables
        Avoid in-memory processing of large PubChem assay datasets
  • 31. Conclusions
        Integrating the CDK with R enhances the utility of the statistical environment for cheminformatics applications
        The integration is still at the programmer level
        Applications can be built up from the available functionality
        rpubchem provides easy access to large amounts of biological and chemical data
  • 32. Acknowledgements
        NIH
        CDK developers
        Simon Urbanek