Open Source Cheminformatics

      Tools and Data

             Rajarshi Guha

  School of Informatics, Indiana University

             Bio IT World

            29th April, 2009

Outline: Open Source, Open Standards, Open Data

Open Source Cheminformatics

       Been around for some time, niche field
       OSS snippets/code based on closed source APIs versus
       fully open source tools

Why use OSS cheminformatics?
       Articulated nicely by DeLano
       Reverse also articulated nicely by Stahl

Goal
       Not argue for or against Open Source
       Show what’s there, how it fits in with other technologies


DeLano, W. L., Drug Discovery Today, 2005, 10, 213–217
Stahl, M. T., Drug Discovery Today, 2005, 10, 219–222

Cheminformatics Software

    The ecosystem is composed of developer- and
    user-oriented software
    Most applications will depend on lower level functionality
    Choice of toolkit influences
        robustness
        performance
        ease of distribution
        integration with other libraries
    Won’t be talking about user-oriented software

The Toolkit Ecosystem

[Figure: Timeline of cheminformatics toolkits*, from "1995 and earlier" to 2008, covering Daylight, DayPerl, DaySWIG, PyDaylight, frowns, Babel, OELib, OEChem, OpenBabel, Pybel, RDKit, cinfony, JOELib and CDK]
*(runs on Unix and supports SMILES and SMARTS)

Guidelines

OEChem and its sister libraries for molecular modeling are fast, flexible, powerful
and complete (except for fingerprints). It is designed for high-end users who know
the nuances of cheminformatics. Expensive. My choice for C++, Java and Python.

CDK is the toolkit to use if you are on the JDK and OEChem is too pricey. It has a
strong structure and structural biology component, close ties with 2D and 3D
display programs, and integration with Bioclipse, Taverna, and Knime.

RDKit is relatively new and with a small user community. The software
engineering skills are the best of the free projects. Includes 2D layout, 2D→3D,
QSAR, forcefield, shape and machine learning components. Worth a look!

OpenBabel is the most community driven. Its strength is file format conversion, for
both small molecules and biomolecules. It is expanding towards more modeling
support, including several forcefield implementations. Often used as a test-bed for
new algorithms. Code quality is variable, reflecting the diverse contributor base.

Do not use the Daylight toolkit for new code. It is expensive, there's very little new
development, and you can get nearly all of its functionality elsewhere.


Andrew Dalke’s EuroQSAR 2008 poster

What’s available?

    CDK (Java)
    OpenBabel (C++)
    RDKit (C++)

    Licensing varies
    A large degree of overlap

Toolkits - A Comparison

       Feature                     CDK        OpenBabel    RDKit
       License                     LGPL       GPL          new BSD
       Language                    Java       C++          C++ / Python
       SLOC                        188,554    194,358      173,219

       Features compared per toolkit [support marks not recoverable]:
       Fingerprints
         Hashed
         Substructure
       File format support
       Aromaticity models
       Stereochemistry
       Canonicalization
       Descriptors
       2D coordinate generation
       3D coordinate generation
       2D depictions
       Conformer generation
       Rigid alignment
       SMARTS searching
       Pharmacophore searching

CDK Overview

  Category         Functionality

  Input / Output   Support for various formats including SDF,
                   SMILES, CML, PDB, InChI, PubChem
                   XML formats, Canonical SMILES support,
                   Pharmacophore serialization
  Visualization    2D coordinate generation and depiction
  Properties       Fingerprinting, Gasteiger-Marsili and
                   MMFF94 partial charges, Atom, bond and
                   molecular descriptors, NMR prediction via
                   HOSE codes, Aromaticity perception
  Graph            Isomorphism and Sub-graph isomorphism
                   detection, SMARTS support, Ring
                   perception, pharmacophore searching. A
                   variety of graph theoretical algorithms
                   (including traversal, shortest paths,
                   distance matrix)

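As a rough illustration of the graph algorithms listed for the CDK (traversal, shortest paths, distance matrix), the Python sketch below computes a topological distance matrix for a small molecular graph using breadth-first search. It does not use the CDK; the adjacency-list input format and the function name `distance_matrix` are assumptions made for this example.

```python
from collections import deque


def distance_matrix(adjacency):
    """Return the matrix of shortest bond-count distances between atoms.

    `adjacency` maps each atom index (0..n-1) to the indices of its
    bonded neighbours. Unreachable pairs are reported as -1.
    """
    n = len(adjacency)
    matrix = [[-1] * n for _ in range(n)]
    for start in range(n):
        matrix[start][start] = 0
        queue = deque([start])
        # Breadth-first traversal: the first visit to an atom is the shortest path.
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if matrix[start][neighbour] == -1:
                    matrix[start][neighbour] = matrix[start][atom] + 1
                    queue.append(neighbour)
    return matrix


# Heavy-atom graph of isobutane, CC(C)C: atom 1 is the central carbon.
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
for row in distance_matrix(isobutane):
    print(row)
```

Because every bond counts as one step, a single breadth-first pass per atom is enough; weighted graphs would need a different shortest-path method.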
Data Visualization

     Lots of OSS molecular visualization tools available
     Needs to be combined with data analysis tools
     R is great for analytics, has powerful graphics
     Not cheminformatics aware, not user-friendly

Possibilities
     Rattle
     GGobi
     Processing - developer oriented, good for ad-hoc,
     multiple data type visualizations
     Bioclipse

Data Visualization - Bioclipse

Open Source Cheminformatics Workflows

Requirements
    Core cheminformatics
    Analytics
    Database backends
    Integration

Can it be done?
    Yes, in various ways
    For the non-expert user, pipeline tools provide a nice
    platform for integrating all the above
    For expert users, it’s useful to go lower level
    Integration between R and the CDK provides a
    cheminformatics enhanced modeling platform

CDK and R

       R is oriented towards statistical modeling and
       computations
       Cheminformatics agnostic
       rcdk integrates the CDK into the R environment
       Read and process molecular structure information
               Descriptors
               Fingerprints
               General molecule manipulation
       Provides access to CDK functionality in idiomatic R


http://cran.r-project.org/web/packages/rcdk/index.html

Accessing Chemical Information from R

       rcdk is good for processing and manipulating molecules
       in R
       Also useful to be able to access chemical information
       directly from databases
       rpubchem provides access to PubChem compound,
       substance and bioassay collections
               By compound, substance, assay IDs
               By keyword searches
               Packages assay information into a data.frame and
               includes associated metadata
       Supplements the rcdk package


http://cran.r-project.org/web/packages/rpubchem/index.html

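rpubchem is an R package; as a language-neutral illustration of the same idea (requesting records by ID and packaging them as table rows), the Python sketch below builds a request URL for PubChem's PUG REST interface and flattens a property response into rows. PUG REST is not mentioned in the slides and is an assumption here, the helper names are hypothetical, and the sample payload is hand-written in the response shape rather than fetched.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids, properties):
    """Build a PUG REST URL requesting `properties` for compound IDs `cids`."""
    cid_part = ",".join(str(cid) for cid in cids)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{','.join(properties)}/JSON"


def rows_from_payload(payload, properties):
    """Flatten a PropertyTable payload into a list of row tuples (CID first)."""
    records = payload["PropertyTable"]["Properties"]
    return [(rec["CID"],) + tuple(rec.get(name) for name in properties)
            for rec in records]


def fetch_properties(cids, properties):
    """Fetch and flatten properties; requires network access (not called below)."""
    with urlopen(property_url(cids, properties)) as response:
        return rows_from_payload(json.load(response), properties)


# Hand-written payload in the PropertyTable shape, so the example runs offline.
sample = {"PropertyTable": {"Properties": [
    {"CID": 2244, "MolecularFormula": "C9H8O4"},
]}}
print(property_url([2244], ["MolecularFormula"]))
print(rows_from_payload(sample, ["MolecularFormula"]))
```

The row tuples play the role that the data.frame plays in rpubchem: one record per compound, one column per requested field.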
Standards for Cheminformatics?

    Open standards/specifications help everybody
    Most refer to file formats
        CML, JCAMP-DX
        InChI, AniML
    Who sets them? How are they constructed?
    Are there usage restrictions?

Standards for Cheminformatics

Open definition
    Public participation in defining the standard
    Mailing lists, wikis for transparency
    Possibility of forking the standard
    FlexMol, OpenSMILES, JCAMP-DX

Open use
    No royalties for usage
    No patents, trademarks, copyrights etc
    SMILES, SDF, InChI, SLN

Standards for Cheminformatics

De facto standard
    In wide use, few or no variants
    Data exchange is easy and reliable
    SDF, SMILES, PDB

Formal standard
    Endorsed by some sort of recognized group, academic, or
    government body
    InChI, OpenSMILES, JCAMP-DX

The Blue Obelisk

        Umbrella for a variety of OSS projects
        Covers code, data, standards
        Open to everybody
        OpenSMILES is a recent project aiming to provide an
        explicit description of the SMILES grammar


http://blueobelisk.sourceforge.net/   http://www.opensmiles.org/

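To make concrete what an explicit grammar buys, here is a minimal Python sketch of a tokenizer for a subset of SMILES. The token classes follow the general shape of the OpenSMILES description (bracket atoms, organic-subset atoms, bonds, branches, ring closures), but the regular expressions and the `tokenize` name are assumptions for illustration, not the specification's grammar, and no chemical validation is attempted.

```python
import re

# Ordered alternatives: longer tokens (bracket atoms, Cl, Br, %nn) come first.
TOKEN_SPEC = [
    ("bracket_atom", r"\[[^\[\]]+\]"),            # e.g. [NH4+]
    ("atom", r"Cl|Br|[BCNOPSFI*]|[bcnops]"),      # organic subset, aromatic, wildcard
    ("ring_closure", r"%\d\d|\d"),                # ring-bond numbers: 1 or %12
    ("bond", r"[-=#$:/\\]"),                      # bond symbols
    ("branch", r"[()]"),                          # branch open / close
    ("dot", r"\."),                               # disconnected components
]
TOKEN = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(smiles):
    """Split a SMILES string into (kind, text) tokens; raise on unknown input."""
    tokens = []
    position = 0
    while position < len(smiles):
        match = TOKEN.match(smiles, position)
        if match is None:
            raise ValueError(
                f"unexpected character {smiles[position]!r} at index {position}")
        tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens


for kind, text in tokenize("CC(=O)Cl"):
    print(kind, text)
```

With the token classes written down explicitly, two toolkits can at least agree on where one symbol ends and the next begins, which is the kind of ambiguity a shared specification is meant to remove.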
The Pistoia Alliance

. . . established to streamline non-competitive
elements of the pharmaceutical
drug discovery workflow by the specification
of common business terms, relationships and
processes . . .


        An opportunity for the Open Source cheminformatics
        community to link with industrial users
                ontology developments
                web service interfaces
                database schema


http://pistoiaalliance.sourceforge.net/

The Distributed Future

    Web services, cloud computing, . . .
    The OSS cheminformatics
    ecosystem integrates with these
    scenarios very easily
    Cost and licenses are one aspect
    Redundancy is a big benefit
    Data / functionality mashups can lead to innovative
    solutions

Cheminformatics web services
    CDK based services (hosted at various places)
    Daylight web services
    NCI, ChemSpider

There’s Data in Them Thar Internets

    Many significant public resources of chemical
    information
        PubChem
        ChemSpider
        NMRShiftDB
    Use anything to access them
    Does OSS have a role to play here?
    Open Access is likely more important in this case

Data Access

    Good to have access to data in open fashion
    What about adding value to the data?
    Could replicate databases
        Easier if the data source is built on an OSS stack
        Raw data dumps obviate this need
    But open, well defined APIs are preferable
        Avoiding hosting/update hassles
        Easier to mash multiple data sources
    Made easier when data sources support standards

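A small sketch of why shared standards make mashups easier: when two sources key their records on the same standard identifier (InChI is assumed here), combining them reduces to a join on that key. The `mash` helper and the two in-memory "sources" are hypothetical stand-ins for real services.

```python
def mash(source_a, source_b):
    """Inner-join two {identifier: record} mappings on their shared identifiers."""
    return {key: {**source_a[key], **source_b[key]}
            for key in source_a.keys() & source_b.keys()}


# Two hypothetical sources, both keyed on InChI strings.
names = {
    "InChI=1S/CH4/h1H4": {"name": "methane"},
    "InChI=1S/H2O/h1H2": {"name": "water"},
}
formulas = {
    "InChI=1S/H2O/h1H2": {"formula": "H2O"},
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": {"formula": "C2H6O"},
}

merged = mash(names, formulas)
print(merged)
```

Without a common identifier, the same join would first need structure matching between the two sources, which is where most of the effort in a mashup tends to go.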
Benchmark Datasets

        Benchmarking is vital
        Some sub-fields have collections of benchmark datasets
                Docking (DUD)
                Virtual screening (MUV)
        No general datasets or attempts for benchmarking core
        cheminformatics operations
        Initial attempt at cheminfbenchmark on GitHub
                Restricted to Java libraries at this point (CDK, MX)
                Uses datasets taken from PubChem
                Fingerprinting, SD parsing, SMARTS parsing,
                substructure searching


Rohrer, S. G. et al., J. Chem. Inf. Model., 2009, 49, 169–184

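A minimal sketch of what timing one such core operation might look like, here splitting SD-file text into records on the `$$$$` delimiter. This is not the cheminfbenchmark code; the `split_sd_records` and `time_operation` helpers and the synthetic input are assumptions for illustration, and a real benchmark would use PubChem-derived files and full parsers.

```python
import time


def split_sd_records(sd_text):
    """Split SD-file text into records, each terminated by a '$$$$' line."""
    records, current = [], []
    for line in sd_text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def time_operation(operation, argument, repeats=5):
    """Return (best wall-clock seconds over `repeats` runs, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = operation(argument)
        best = min(best, time.perf_counter() - start)
    return best, result


# Synthetic stand-in for a real SD file: 1000 placeholder records.
# The bodies are not valid molfiles; only the record delimiter matters here.
synthetic = "".join(f"record {i}\n  placeholder body\n$$$$\n" for i in range(1000))
seconds, records = time_operation(split_sd_records, synthetic)
print(f"{len(records)} records, best of 5 runs: {seconds:.6f} s")
```

Reporting the best of several runs reduces noise from other processes; comparing toolkits fairly would also require identical inputs and warm-up handling for JIT-compiled libraries.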
Open Source  Open Notebook Science

    ONS is a paradigm whereby some or all experimental
    results are published in an open form with little or no lag
    time
    Championed by Jean-Claude Bradley, Cameron Neylon,
    Raf Aerts and others
    Closed source versus open source cheminformatics
    doesn’t necessarily hinder ONS practice
    But open source cheminformatics makes life easier

ONS Solubility Challenge

    Led by Jean-Claude Bradley (Drexel U.)
    Solubility measurements in various non-aqueous solvents
    Part of a larger project to identify anti-malarial
    compounds
    Very distributed
         Multiple groups generating and modeling data
         Data hosted on wikis and Google spreadsheets
         Multiple views, enhanced via cheminformatics web
         services

ONS Solubility Challenge

[Diagram: Data Generation, Data Storage, Data Modeling, Data Views, Web Services]

What’s Holding OSS Cheminformatics Back?

    Niche field
    Comprehensiveness, polish
    Funding

Conclusions

    The ecosystem is alive with activity
    Distributed systems are important - OSS
    cheminformatics fits in nicely
    OSS projects should coordinate with users
        industrial and academic
    Quality and effectiveness will be the final arbiter

Performance Evaluation of Open Source Data Mining Tools
 
Roberto Santoro Apollon
Roberto Santoro ApollonRoberto Santoro Apollon
Roberto Santoro Apollon
 
FAIR Computational Workflows
FAIR Computational WorkflowsFAIR Computational Workflows
FAIR Computational Workflows
 
OSNF - Open Sensor Network Framework
OSNF - Open Sensor Network FrameworkOSNF - Open Sensor Network Framework
OSNF - Open Sensor Network Framework
 

Mais de Rajarshi Guha

Pharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark GenomePharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark GenomeRajarshi Guha
 
Pharos: Putting targets in context
Pharos: Putting targets in contextPharos: Putting targets in context
Pharos: Putting targets in contextRajarshi Guha
 
Pharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark GenomePharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark GenomeRajarshi Guha
 
Pharos - Face of the KMC
Pharos - Face of the KMCPharos - Face of the KMC
Pharos - Face of the KMCRajarshi Guha
 
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS PlatformEnhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS PlatformRajarshi Guha
 
What can your library do for you?
What can your library do for you?What can your library do for you?
What can your library do for you?Rajarshi Guha
 
So I have an SD File … What do I do next?
So I have an SD File … What do I do next?So I have an SD File … What do I do next?
So I have an SD File … What do I do next?Rajarshi Guha
 
Characterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network ModelsCharacterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network ModelsRajarshi Guha
 
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action: Bridging Chemistry and Biology with Informatics at NCATSFrom Data to Action: Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATSRajarshi Guha
 
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & RRajarshi Guha
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical StructuresRajarshi Guha
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...Rajarshi Guha
 
When the whole is better than the parts
When the whole is better than the partsWhen the whole is better than the parts
When the whole is better than the partsRajarshi Guha
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Rajarshi Guha
 
Pushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesPushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesRajarshi Guha
 
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Rajarshi Guha
 
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research DatabaseRajarshi Guha
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsRajarshi Guha
 
Chemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleChemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleRajarshi Guha
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Rajarshi Guha
 

Mais de Rajarshi Guha (20)

Pharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark GenomePharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark Genome
 
Pharos: Putting targets in context
Pharos: Putting targets in contextPharos: Putting targets in context
Pharos: Putting targets in context
 
Pharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark GenomePharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark Genome
 
Pharos - Face of the KMC
Pharos - Face of the KMCPharos - Face of the KMC
Pharos - Face of the KMC
 
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS PlatformEnhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
 
What can your library do for you?
What can your library do for you?What can your library do for you?
What can your library do for you?
 
So I have an SD File … What do I do next?
So I have an SD File … What do I do next?So I have an SD File … What do I do next?
So I have an SD File … What do I do next?
 
Characterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network ModelsCharacterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network Models
 
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action: Bridging Chemistry and Biology with Informatics at NCATSFrom Data to Action: Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
 
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & R
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical Structures
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
 
When the whole is better than the parts
When the whole is better than the partsWhen the whole is better than the parts
When the whole is better than the parts
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
 
Pushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesPushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the Pipes
 
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...
 
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research Database
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of Cheminformatics
 
Chemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleChemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & Reproducible
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?
 

Último

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Último (20)

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Open Source Cheminformatics

  • 1. Open Source Cheminformatics: Tools and Data. Rajarshi Guha, School of Informatics, Indiana University. Bio IT World, 29th April, 2009. Sections: Open Source, Open Standards, Open Data.
  • 2. Open Source Cheminformatics. Been around for some time; a niche field. OSS snippets/code based on closed source APIs versus fully open source tools. Why use OSS cheminformatics? Articulated nicely by Delano; the reverse is also articulated nicely by Stahl. Goal: not to argue for or against Open Source, but to show what’s there and how it fits in with other technologies. References: Delano, W. L., Drug Discovery Today, 2005, 10, 213–217; Stahl, M. T., Drug Discovery Today, 2005, 10, 219–222.
  • 4. Cheminformatics Software. The ecosystem is composed of developer- and user-oriented software. Most applications will depend on lower level functionality. Choice of toolkit influences robustness, performance, ease of distribution, and integration with other libraries. Won’t be talking about user-oriented software.
  • 5. The Toolkit Ecosystem. [Figure: timeline of cheminformatics toolkits that run on Unix and support SMILES and SMARTS, from 1995 and earlier through 2008, taken from Andrew Dalke’s EuroQSAR 2008 poster. Toolkits shown: Daylight (C and Fortran), DayPerl, DaySWIG (Tcl, Python and more), PyDaylight (higher-level Python API), frowns (Python; API based on PyDaylight), Babel (not a library), OELib (C++), OEChem (C++, +Python, +Java; +Ogham & Lexichem), JOELib (for Java; API based on OELib), OpenBabel (C++; +Python, Perl; +Java, Ruby), Pybel (higher-level Python API), RDKit (C++/Python; internal library, then public release on Sourceforge), cinfony (abstraction API), CDK (Java; “Part of JChemDraw”). Legend: is a wrapper; developer moved between projects; third-party package; accessible from the C version of Python; accessible from the Java version of Python (Jython).]
    Guidelines from the poster:
      OEChem and its sister libraries for molecular modeling are fast, flexible, powerful and complete (except for fingerprints). It is designed for high-end users who know the nuances of cheminformatics. Expensive. My choice for C++, Java and Python.
      CDK is the toolkit to use if you are on the JDK and OEChem is too pricey. It has a strong structure and structural biology component, close ties with 2D and 3D display programs, and integration with Bioclipse, Taverna, and Knime.
      RDKit is relatively new and with a small user community. The software engineering skills are the best of the free projects. Includes 2D layout, 2D→3D, QSAR, forcefield, shape and machine learning components. Worth a look!
      OpenBabel is the most community driven. Its strength is file format conversion, for both small molecules and biomolecules. It is expanding towards more modeling support, including several forcefield implementations. Often used as a test-bed for new algorithms. Code quality is variable, reflecting the diverse contributor base.
      Do not use the Daylight toolkit for new code. It is expensive, there's very little new development, and you can get nearly all of its functionality elsewhere.
  • 6. What’s available? CDK (Java), OpenBabel (C++), RDKit (C++). Licensing varies. A large degree of overlap.
  • 7. Toolkits - A Comparison
      Feature    CDK       OpenBabel   RDKit
      License    LGPL      GPL         new BSD
      Language   Java      C++         C++ / Python
      SLOC       188,554   194,358     173,219
    Further features compared (the per-toolkit entries are not present in the transcript): fingerprints (hashed, substructure), file format support, aromaticity models, stereochemistry, canonicalization, descriptors, 2D coordinate generation, 3D coordinate generation, 2D depictions, conformer generation, rigid alignment, SMARTS searching, pharmacophore searching.
  • 8. CDK Overview (category: functionality)
      Input / Output: support for various formats including SDF, SMILES, CML, PDB, InChI and PubChem XML formats; canonical SMILES support; pharmacophore serialization.
      Visualization: 2D coordinate generation and depiction.
      Properties: fingerprinting; Gasteiger-Marsili and MMFF94 partial charges; atom, bond and molecular descriptors; NMR prediction via HOSE codes; aromaticity perception.
      Graph: isomorphism and sub-graph isomorphism detection; SMARTS support; ring perception; pharmacophore searching; a variety of graph theoretical algorithms (including traversal, shortest paths, distance matrix).
  • 9. Data Visualization. Lots of OSS molecular visualization tools available. Needs to be combined with data analysis tools. R is great for analytics and has powerful graphics, but is not cheminformatics aware and not user-friendly. Possibilities: Rattle; GGobi; Processing (developer oriented, good for ad-hoc, multiple data type visualizations); Bioclipse.
  • 10. Data Visualization - Bioclipse
  • 11. Open Source Cheminformatics Workflows. Requirements: core cheminformatics, analytics, database backends, integration. Can it be done? Yes, in various ways. For the non-expert user, pipeline tools provide a nice platform for integrating all the above. For expert users, it’s useful to go lower level. Integration between R and the CDK provides a cheminformatics enhanced modeling platform.
  • 12. CDK and R. R is oriented towards statistical modeling and computations; it is cheminformatics agnostic. rcdk integrates the CDK into the R environment: read and process molecular structure information, descriptors, fingerprints, general molecule manipulation. Provides access to CDK functionality in idiomatic R. http://cran.r-project.org/web/packages/rcdk/index.html
  • 13. Accessing Chemical Information from R. rcdk is good for processing and manipulating molecules in R. Also useful to be able to access chemical information directly from databases. rpubchem provides access to PubChem compound, substance and bioassay collections: by compound, substance or assay IDs; by keyword searches. Packages assay information into a data.frame and includes associated metadata. Supplements the rcdk package. http://cran.r-project.org/web/packages/rpubchem/index.html
  • 14. Standards for Cheminformatics? Open standards/specifications help everybody. Most refer to file formats: CML, JCAMP-DX, InChI, AniML. Who sets them? How are they constructed? Are there usage restrictions?
  • 15. Standards for Cheminformatics. Open definition: public participation in defining the standard; mailing lists and wikis for transparency; possibility of forking the standard. Examples: FlexMol, OpenSMILES, JCAMP-DX. Open use: no royalties for usage; no patents, trademarks, copyrights etc. Examples: SMILES, SDF, InChI, SLN.
  • 16. Standards for Cheminformatics. De facto standard: in wide use, few or no variants; data exchange is easy and reliable. Examples: SDF, SMILES, PDB. Formal standard: endorsed by some sort of recognized group, academic, or government body. Examples: InChI, OpenSMILES, JCAMP-DX.
  • 17. The Blue Obelisk. Umbrella for a variety of OSS projects. Covers code, data, standards. Open to everybody. OpenSMILES is a recent project aiming to provide an explicit description of the SMILES grammar. http://blueobelisk.sourceforge.net/ http://www.opensmiles.org/
  • 18. The Pistoia Alliance. “. . . established to streamline non-competitive elements of the pharmaceutical drug discovery workflow by the specification of common business terms, relationships and processes . . .” An opportunity for the Open Source cheminformatics community to link with industrial users: ontology developments, web service interfaces, database schema. http://pistoiaalliance.sourceforge.net/
  • 19. The Distributed Future
  • 20. The Distributed Future. Web services, cloud computing, . . . The OSS cheminformatics ecosystem integrates with these scenarios very easily. Cost and licenses are one aspect. Redundancy is a big benefit. Data / functionality mashups can lead to innovative solutions. Cheminformatics web services: CDK based services (hosted at various places); Daylight web services; NCI, ChemSpider.
  • 21. There’s Data in Them Thar Internets. Many significant public resources of chemical information: PubChem, ChemSpider, NMRShiftDB. Use anything to access them. Does OSS have a role to play here? Open Access is likely more important in this case.
  • 22. Data Access. Good to have access to data in open fashion. What about adding value to the data? Could replicate databases: easier if the data source is built on an OSS stack; raw data dumps obviate this need. But open, well defined APIs are preferable: avoiding hosting/update hassles; easier to mash multiple data sources; made easier when data sources support standards.
  • 23. Benchmark Datasets. Benchmarking is vital. Some sub-fields have collections of benchmark datasets: docking (DUD), virtual screening (MUV). No general datasets or attempts for benchmarking core cheminformatics operations. Initial attempt at cheminfbenchmark on GitHub: restricted to Java libraries at this point (CDK, MX); uses datasets taken from PubChem; covers fingerprinting, SD parsing, SMARTS parsing, substructure searching. Reference: Rohrer, S. G. et al., J. Chem. Inf. Model., 2009, 49, 169–184.
  • 24. Open Source Open Notebook Science. ONS is a paradigm whereby some or all experimental results are published in an open form with little or no lag time. Championed by Jean-Claude Bradley, Cameron Neylon, Raf Aerts and others. Closed source versus open source cheminformatics doesn’t necessarily hinder ONS practice. But open source cheminformatics makes life easier.
  • 25. ONS Solubility Challenge. Led by Jean-Claude Bradley (Drexel U.). Solubility measurements in various non-aqueous solvents. Part of a larger project to identify anti-malarial compounds. Very distributed: multiple groups generating and modeling data; data hosted on wikis and Google spreadsheets; multiple views, enhanced via cheminformatics web services.
  • 26. ONS Solubility Challenge. [Diagram with components: Data Generation, Data Storage, Data Views, Web Services, Data Modeling]
  • 27. What’s Holding OSS Cheminformatics Back? Niche field. Comprehensiveness, polish. Funding.
  • 28. Conclusions. The ecosystem is alive with activity. Distributed systems are important, and OSS cheminformatics fits in nicely. OSS projects should coordinate with users, both industrial and academic. Quality and effectiveness will be the final arbiter.
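
The kind of programmatic PubChem access described on slide 13 (lookup by compound ID, results packaged as tabular records) can also be sketched outside R. The following is a minimal Python sketch, not the rpubchem implementation. It assumes PubChem's PUG REST interface, which the slides do not mention, and the function names (`compound_property_url`, `rows_from_property_table`, `fetch_properties`) are hypothetical names chosen for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: PubChem's PUG REST base URL; the slides do not name an endpoint.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(cid, properties):
    """Build a URL requesting the given properties for one compound ID (CID)."""
    cid = int(cid)
    if cid <= 0:
        raise ValueError("CID must be a positive integer")
    if not properties:
        raise ValueError("at least one property name is required")
    # Property names are comma-separated in the URL path.
    names = ",".join(quote(name, safe="") for name in properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid}/property/{names}/JSON"


def rows_from_property_table(payload):
    """Flatten a decoded property response into a list of per-compound dicts.

    Each dict plays the role of one row of the data.frame that the slide
    describes; missing keys yield an empty list rather than an error.
    """
    return list(payload.get("PropertyTable", {}).get("Properties", []))


def fetch_properties(cid, properties, timeout=30):
    """Query PubChem over the network and return one dict per compound."""
    url = compound_property_url(cid, properties)
    with urlopen(url, timeout=timeout) as resp:
        return rows_from_property_table(json.load(resp))
```

Calling `fetch_properties(2244, ["MolecularFormula", "MolecularWeight"])` requires network access. Separating URL construction and response flattening from the network call keeps the first two testable offline.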