Cheminformatics Software Development: Case Studies
Direct Observations and Informed Opinions

Jeremy J Yang
PhD Student, IU Cheminformatics
Mgr, Systems & Programming, UNM Biocomputing

Indiana University School of Informatics and Computing - I571, Intro to Cheminformatics - Fall 2011
My experience
1)  Daylight (1989 - 2002): Support Coordinator, Software
Engineer
  –  support, user education, software engineering,
  meetings, application science, web apps, QA,
  databases, methodology research, etc.
2)  OpenEye (2002 - 2007): Director/VP of Support, Sr.
Software Engineer
  –  support, management, software engineering, QA, web
  apps, methodology research, etc.
3)  UNM (2002 - ): Mgr., Systems & Programming
  –  software engineering, management, support,
  computational methodology, bioassay screening
  informatics, biomedical informatics research, etc.
Concept of this talk...
Describe direct observations from experiences in
cheminformatics over the last 22 years, relevant today
for understanding and navigating the complex landscape of
cheminformatics software roles and choices.
While avoiding excessive idle reminiscing, include some
of the colorful personalities and curious events.
Suggest some lessons learned and trends
observed, in the opinion of the author.
Outline
Case studies included:
1)  Daylight
2)  OpenEye
3)  Symyx a.k.a. MDL
4)  Accelrys
5)  OpenBabel

(Chosen mostly based on my familiarity.)

Some interesting others also mentioned briefly.
Perspectives on scientific software
•  Developer, programmer
•  Computational/informatics scientist, scholar
•  Support, educator, maintainer
•  Software as scientific publishing
•  Open source collaboration
•  Consumer: toolkit user (programmer)
•  Consumer: app user, scientist
•  Licensing, intellectual property, legal
•  Business (commercial or non-commercial)
Case Study #1

Daylight Chemical Information Systems, Inc.

http://www.daylight.com
Daylight Chemical Information Systems, Inc.

l Founded 1987 by Dave Weininger, Art
   Weininger, Yosi Taitz, in Claremont, CA
l Ancestry: Pomona College MedChem program
   (Corwin Hansch, Al Leo)
l Innovations: SMILES, SMARTS, SMIRKS,
   fingerprints, rigorous syntax/grammar/semantics
l Products: ClogP, Thor, Merlin, C toolkits, Oracle
   Chemistry Cartridge
l Fortran → C ~1990, oop-ish API (Scofields).
l DEC-VAX/VMS → Unix → Linux, Windows
Daylight “MUG” meetings known for scientific impact

[Photo: Daylight MUG ‘98, Santa Fe, NM]
Daylight Chemical Information Systems, Inc.

l Success via appeal to examplars who became
   advocates at various pharmas.
l Published, supported APIs and open
   transparent system appealed to “hackers, nurds
   and geeks”.
l From research computing to enterprise (IT).
l Many other software developers learned and
   borrowed from Daylight.
l Focused on software, to enable science.
l Max ~12 employees, $5M annual revenue.
Some idle reminiscing...
[Photo: Daylight Krewe @ Zulu Ball, New Orleans 1992. Pictured: Dave Weininger, Art Weininger, Craig James, Jeremy Yang]
Daylight Chemical Information Systems, Inc.

l Scientific mission: To effectively store, retrieve,
   process all existing chemical information.
l This mission often guided what to do, and also
   what to not do.
l  Often that conflicted with profit motives.
l  Avoided: consulting, macromolecules, biology,
  Windows, investors, sales, marketing
If I have seen further it is only by standing on the shoulders of
giants. - Issac Newton

I can't see far because there are giants on my shoulders! - Dave
Weininger
Daylight Thor, Merlin database apps (~ 1995)
Daylight toolkit API (C, Perl)
Example research project using Daylight tools




    http://www.daylight.com/meetings/mug01/Yang/gadd/
Example research project using Daylight tools




Case Study #2

OpenEye Scientific Software, Inc.




          http://www.eyesopen.com
OpenEye Scientific Software, Inc.

l Founded 1997 by Anthony Nicholls in Santa Fe,
   NM.
l Ancestry: Honig lab, biophysics, Columbia.
l Innovations: molecular shape and continuum
   dielectric Poisson-Boltzmann electrostatics via
   atom centered Gaussians, 3D speed, rigor,
   accuracy and validation, comprehensive APIs
l Products: Rocs, Fred, OEChem, Vida
l APIs: C++, Python, Java, many OS's
OpenEye apps

[Screenshots: Rocs shape overlay; Fred docking result; Szmap hydrophilic regions; Brood bioisosteric fragments]
OpenEye toolkit API
OpenEye Scientific Software, Inc.


l Focused on software and science, sometimes a
   difficult balance.
l High-throughput 3D virtual screening (esp.
   Rocs) has become standard practice
   (enterprise-like).
l Spectrum of functionality between 2D
   cheminformatics and 3D high performance
   computing.
l Max 34 employees (current).
OpenEye Example Code (2D & 3D)


$ pdb2lig.py
Usage:
pdb2lig.py [options] [<infile] [>outfile]
    --i=<INFILE>
    --outlig=<OUTLIGFILE>
    --outpro=<OUTPROFILE> ... output protein or other macromol
    --multiligfiles ... one output file per ligand
    --minatoms=<N> ... minimum atomcount cutoff for ligand [7]
    --maxatoms=<N> ... maximum atomcount cutoff for ligand [100]
    --metal ... disconnected metal ions stay with protein
    --f      ... force processing of non-PDB file
    --v      ... verbose
    --vv     ... very verbose
    --h      ... help
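The usage text above implies that ligands are told apart from the macromolecule by atom-count cutoffs. As a rough illustration of that idea only (this is not the pdb2lig.py source, which presumably relies on the OpenEye toolkit), here is a plain-Python sketch. It assumes ligands appear as HETATM records grouped by residue name, chain and residue number; the function name `split_pdb` is hypothetical, and the defaults 7 and 100 are taken from the `--minatoms` and `--maxatoms` lines above.

```python
from collections import defaultdict


def split_pdb(pdb_text, minatoms=7, maxatoms=100):
    """Split PDB text into (ligands, protein_lines).

    ligands maps (resname, chain, resseq) -> list of HETATM lines.
    HETATM groups outside [minatoms, maxatoms] (waters, lone metal
    ions, very large groups) are kept with the protein lines.
    Assumption: one ligand == one HETATM residue; a real tool would
    more likely use bond connectivity.
    """
    hetgroups = defaultdict(list)
    protein_lines = []
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            # Fixed-width PDB columns: resName 18-20, chainID 22, resSeq 23-26.
            key = (line[17:20].strip(), line[21:22], line[22:26].strip())
            hetgroups[key].append(line)
        elif line.startswith("ATOM"):
            protein_lines.append(line)

    ligands = {}
    for key, lines in hetgroups.items():
        if minatoms <= len(lines) <= maxatoms:
            ligands[key] = lines
        else:
            protein_lines.extend(lines)
    return ligands, protein_lines
```

With the default cutoffs, single-atom groups such as waters and metal ions fall below `minatoms` and therefore stay with the protein output, which mirrors the intent of the `--metal` option only approximately.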
ROCS web app (Python, source code available)
ROCS web app (Python, source code available)
FRED web app (Python, source code available)
FRED web app (Python, source code available)
Example research project using OpenEye tools
Example research project using OpenEye tools
Case Study #3

Symyx a.k.a. MDL




        http://www.symyx.com
(since 2010, http://www.accelrys.com)
Symyx a.k.a. MDL
l  Founded 1978 by Stuart Marson and Todd
  Wipke (ancestry: Harvard, Princeton, Stanford).
l  “Molecular Design Ltd”, reflected initial ab initio
  design goals, renamed “MDL Information
  Systems” ('93)
l  Products: REACCS ('82), MACCS ('84),
  MACCS-3D ('88), ISIS ('91), Isentris ('04), chem
  +bio database systems
l  Invented molfile/SD format (proprietary till
  1991), and extensions (query, R-group).
l  Based Chime on Rasmol source code.
l  Purchased by Reed Elsevier ('97), then Symyx
  ('07), then Accelrys ('10).
Symyx a.k.a. MDL

l Max ~300 employees. (My guess.)
l By 1990, all pharma research companies
   used MDL software for their compound
   database.
l Then the MBAs, lawyers and admen took over!
l And innovation ceased.




Case Study #4

   Accelrys




  http://www.accelrys.com
Accelrys

A tale of mergers and acquisitions (~1990 - 2011):
              –  MSI (Molecular Simulations)
                 • Biodesign
                 • Cambridge Molecular Design
                 • Polygen
                 • Biocad
                 • Biosym Technologies
              –  Synopsys Scientific Systems
              –  Oxford Molecular
              –  Genetics Computer Group
              –  Synomics
              –  SciTegic
              –  Symyx
              –  Contur Software AB
Accelrys

Example of AEI SQL: corporate merging reflected in schema, technology

    SELECT
       hts_ap_archive.runset_number as "RUN",
       hts_ap_archive.ap_alias as "APlateName",
       hts_plate.plate_id,
       hts_plate.alternate_id as "IPlateName",
       hts_well.well_no,
       hts_sample.alternate_id,
       hts_result_detail.value_char AS Target,
       hts_result_type.type_desc AS Result_Type,
       hts_assay_result.concentration || hts_conc_unit.unit_value AS CONC,
       hts_assay_result.dilution,
       hts_assay_result.result_value AS Value
    FROM hts_well,
         hts_plate,
         hts_sample,
         hts_conc_unit,
         ddi_container_master,
         hts_ap_archive,
         hts_assay_result,
         hts_result_type,
         hts_result_detail
    WHERE hts_ap_archive.ap_alias='213_20110712_135454-1'
    AND hts_assay_result.sample_id=hts_well.sample_id
    AND hts_assay_result.sample_id=hts_sample.sample_id
    AND hts_well.plate_id=hts_plate.plate_id
    AND hts_assay_result.plate_id=hts_ap_archive.ap_number
    AND hts_plate.alternate_id=ddi_container_master.container_name
    AND hts_well.plate_id=hts_plate.plate_id
    AND hts_well.sample_id=hts_sample.sample_id
    AND hts_assay_result.sample_plate=ddi_container_master.container_id
    AND hts_result_type.result_type=hts_assay_result.result_type
    AND hts_assay_result.result_id=hts_result_detail.result_id
    AND hts_assay_result.conc_unit=hts_conc_unit.unit_id
    ORDER BY hts_well.well_no,Target,Result_Type
Accelrys

Merging reflected in technology
•  Chemical cartridges: AEI (v6 & v7), Symyx, new SciTegic cartridge
•  Cheminformatics? E.g., canonicalize a SMILES w/ Accelrys, SciTegic or MDL code.
•  The Accord Enterprise Informatics (AEI 6.2) UNM purchased in 2009 is now a “legacy product”.
•  Re-branding and re-packaging
•  May be great technically, but challenging for customers and Accelrys alike.
Accelrys

Key product: Pipeline Pilot
(SciTegic founded 1999, acquired by Accelrys 2004 for $21.5M.)
Opinionated Observation:

There is a sometimes subtle difference
between (1) a software product with a support
contract, and (2) a service contract which
involves customizing and installing and
configuring a software product, requiring or
likely to require ongoing, additional service
contracts. Accelrys and Symyx (“solutions
providers”) have generated much of their
revenue using the latter. In contrast: tools
providers.
Accelrys
    (My opinions -- feel free to disagree.)

1)  M&A's reflected US & global business
trends, but perpetual re-org is stressful and
there are huge costs to employees, customers,
and technical progress.
2)  No coherent scientific mission, only
business growth.
3)  Lots of excellent software, science and
people. But all challenged by the chaos.
         "If in your science you only look for business, then you risk
         finding neither knowledge nor business."

         — Haldor Topsøe
Case Study #5:

 OpenBabel
OpenBabel

l Open-source C++ community project based on
   OELib by Matt Stahl, OpenEye.
l  Founded 2001, by Geoff Hutchinson et al. Now
  ~100 credited contributors.
l  C++ w/ wrappers via SWIG: Python, Perl, Java,
  Ruby, C#.
l  2011 paper in Journal of Cheminformatics*
l >160,000 downloads, >400 citations, used by
   >40 projects

*”Open Babel: An open chemical toolbox”, Noel M. O'Boyle, Michael Banck, Craig A. James, Chris Morley, Tim
Vandermeersch and Geoffrey R. Hutchison, Journal of Cheminformatics 2011, 3:33doi:10.1186/1758-2946-3-33.
OpenBabel Example Code

$ obscreen

obscreen - screen molecules based on calculated properties

OpenBabel v2.2.3 (Dec 4 2010)
syntax: obscreen [options]
  options:
   -i <infile>
   -o <outfile>
   -otable <output table>
   -badsmarts <badfile> ... bad smarts file [default:builti
   -minmwt <MINWT> ... minimum molweight [200]
   -maxmwt <MAXWT> ... maximum molweight [600]
   -minhbd <MINHBD> ... min Hbond donors [0]
   -maxhbd <MAXHBD> ... max Hbond donors [5]
   -minhba <MINHBA> ... min Hbond acceptors [0]
   -maxhba <MAXHBA> ... max Hbond acceptors [10]
   -minrot <MINROT> ... min rotors [0]
   -maxrot <MAXROT> ... max rotors [12]
   -minchiral <MINCHIRAL> ... min chiral atoms [0]
   -maxchiral <MAXCHIRAL> ... max chiral atoms [5]
   -minlogp <MINLOGP> ... min logP [-5.0]
   -maxlogp <MAXLOGP> ... max logP [5.0]
   -hbasmarts <HBASMARTS> ... default='[#6,#7;R0]=[#8]'
   -hbdsmarts <HBDSMARTS> ... default='[!H0;#7,#8,#9]'
   -n_inmax <N> ... input limit
   -n_outmax <N> ... output limit
   -v ... verbose
   -vv ... very verbose
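The options above describe a range filter: a molecule passes only if every calculated property lies within its min/max window. A minimal plain-Python sketch of that logic is given here, assuming the properties have already been calculated elsewhere (the real tool would compute them with Open Babel). The function names, the property keys and the example molecules are hypothetical; only the default ranges are taken from the usage text.

```python
# Default property ranges copied from the obscreen usage text above.
DEFAULT_RANGES = {
    "mwt": (200, 600),      # molecular weight
    "hbd": (0, 5),          # H-bond donors
    "hba": (0, 10),         # H-bond acceptors
    "rot": (0, 12),         # rotors
    "chiral": (0, 5),       # chiral atoms
    "logp": (-5.0, 5.0),    # logP
}


def failed_properties(props, ranges=DEFAULT_RANGES):
    """Return names of properties that are missing or outside [min, max]."""
    failed = []
    for name, (lo, hi) in ranges.items():
        value = props.get(name)
        if value is None or not (lo <= value <= hi):
            failed.append(name)
    return failed


def screen(molecules, ranges=DEFAULT_RANGES):
    """Yield (name, props) for molecules that pass every range check."""
    for name, props in molecules:
        if not failed_properties(props, ranges):
            yield name, props


if __name__ == "__main__":
    # Made-up property values, for illustration only.
    mols = [
        ("mol_a", dict(mwt=350.4, hbd=2, hba=5, rot=4, chiral=1, logp=2.1)),
        ("mol_b", dict(mwt=150.0, hbd=1, hba=2, rot=1, chiral=0, logp=0.5)),
    ]
    for name, _ in screen(mols):
        print(name)
```

Returning the list of failed properties, rather than a bare pass/fail flag, is a design choice made here so that a per-molecule table of reasons (in the spirit of `-otable`) could be produced from the same check.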
OpenBabel: active discussion lists

Indicators of community health
OpenBabel

Used by eMolecules, quite successfully

Craig James, CTO (formerly of Daylight, Accelrys)
OpenBabel
      (My opinions -- feel free to disagree.)

l OB is very good but not yet close to Daylight,
   OEChem or JChem in comprehensiveness,
   quality, and other measures.
l  SMARTS accuracy not state of the art.
l  OB continually improving thanks to capable and
  active community of developers and users.
l  The value reflects the total quality developer
  years invested. So there is no reason OB
  cannot catch up, if the community continues to
  grow.
Some interesting and successful others:
1)  ChemAxon - Budapest, Marvin, Java, JChem
2)  Tripos (now Certara) - St. Louis, Sybyl, SGI
3)  Chem Comp Group - Montreal, MOE, SVL
4)  PyMOL - was DeLano, now Schrödinger (OSS)
5)  Schrödinger - quantum, etc.
6)  UCSF School of Pharmacy - DOCK, Kuntz & al.
7)  NIH NCTT - OSS, Java, Tripod, bioassay analysis
8)  Scitouch - Russia, Indigo, Dingo, Bingo
9)  CDK - FOSS, Java
10)  RDKit - FOSS, incl. machine learning
11)  Silicos NV - OSS & commercial
12)  Bioreason, VC-funded, bought by Simulations Plus
13)  Mesa Analytics & Computing
14)  NextMove Software
15)  UNM Division of Biocomputing
16)  IU SOIC CCRG
Some General Conclusions
l  Cheminformatics software landscape is very complex
l  Economics & Freakonomics are drivers but also
personalities
l  FOSS can compete with $$$ software.
l  Landscape includes lots of hidden costs.
l  “Software engineering” has been an ideal, far from
reality overall.
l  Software can enable or hinder science
l  Diverse business models, diverse everything
l  Plenty of technological change
l  Some human change too
The End


Feel free to contact me directly with questions or
ideas!


Jeremy J Yang
jejyang@indiana.edu

Cheminformatics Software Development: Case Studies

  • 1. Cheminformatics Software Development: Case Studies Direct Observations and Informed Opinions Jeremy J Yang PhD Student, IU Cheminformatics Mgr, Systems & Programming, UNM Biocomputing Indiana University School of Informatics and Computing - I571, Intro to Cheminformatics - Fall 2011
  • 2. My experience 1)  Daylight (1989 - 2002): Support Coordinator, Software Engineer –  support, user education, software engineering, meetings, application science, web apps, QA, databases, methodology research, etc. 2)  OpenEye (2002 - 2007): Director/VP of Support, Sr. Software Engineer –  support, management, software engineering, QA, web apps, methodology research, etc. 3)  UNM (2002 - ): Mgr., Systems & Programming –  software engineering, management, support, computational methodolgy, bioassay screening informatics, biomedical informatics research, etc.
  • 3. Concept of this talk... Describe direct observations from experiences in cheminformatics over last 22 years, relevant today to understand and navigate complex landscape of cheminformatics software roles and choices. Avoiding excessive idle reminiscing, include some of the colorful personalities and curious events. Suggest some lessons learned and trends observed, in the opinion of the author.
  • 4. Outline Case studies included: 1)  Daylight 2)  OpenEye 3)  Symyx a.k.a. MDL Some interesting others also mentioned 4)  Accelrys briefly. 5)  OpenBabel (Chosen mostly based on my familiarity.)
  • 5. Perspectives on scientific software •  Developer, programmer •  Computational/informatics scientist, scholar •  Support, educator, maintainer •  Software as scientific publishing •  Open source collaboration •  Consumer: toolkit user (programmer) •  Consumer: app user, scientist •  Licensing, intellectual property, legal •  Business (commercial or non-commercial)
  • 6. Case Study #1 Daylight Chemical Information Systems, Inc. http://www.daylight.com
  • 7. Daylight Chemical Information Systems, Inc. l Founded 1987 by Dave Weininger, Art Weininger, Yosi Taitz, in Claremont, CA l Ancestry: Pomona College MedChem program (Corwin Hansch, Al Leo) l Innovations: SMILES, SMARTS, SMIRKS, fingerprints, rigorous syntax/grammar/semantics l Products: ClogP, Thor, Merlin, C toolkits, Oracle Chemistry Cartridge l Fortran → C ~1990, oop-ish API (Scofields). l DEC-VAX/VMS → Unix → Linux, Windows
  • 8. Daylight “MUG” meetings known for scientific impact Daylight MUG ‘98, Santa Fe, NM
  • 9. Daylight Chemical Information Systems, Inc. l Success via appeal to examplars who became advocates at various pharmas. l Published, supported APIs and open transparent system appealed to “hackers, nurds and geeks”. l From research computing to enterprise (IT). l Many other software developers learned and borrowed from Daylight. l Focused on software, to enable science. l Max ~12 employees, $5M annual revenue.
  • 10. Some idle reminiscing... Daylight Krewe @ Zulu Ball, New Orleans 1992 Dave Weininger Art Craig James Weininger Jeremy Yang
  • 11. Daylight Chemical Information Systems, Inc. l Scientific mission: To effectively store, retrieve, process all existing chemical information. l This mission often guided what to do, and also what to not do. l  Often that conflicted with profit motives. l  Avoided: consulting, macromolecules, biology, Windows, investors, sales, marketing If I have seen further it is only by standing on the shoulders of giants. - Issac Newton I can't see far because there are giants on my shoulders! - Dave Weininger
  • 12. Daylight Thor, Merlin database apps ~ 1995
  • 13. Daylight toolkit API (C, Perl)
  • 14. Example research project using Daylight tools http://www.daylight.com/meetings/mug01/Yang/gadd/
  • 15. Example research project using Daylight tools Slide 15
  • 16. Case Study #2 OpenEye Scientific Software, Inc. http://www.eyesopen.com
  • 17. OpenEye Scientific Software, Inc. l Founded 1997 by Anthony Nicholls in Santa Fe, NM. l Ancestry: Honig lab, biophysics, Columbia. l Innovations: molecular shape and continuum dielectric Poisson-Boltzmann electrostatics via atom centered Gaussians, 3D speed, rigor, accuracy and validation, comprehensive APIs l Products: Rocs, Fred, OEChem, Vida l APIs: C++, Python, Java, many OS's
  • 18. OpenEye apps Rocs shape overlay Fred docking result Szmap hydrophillic regions Brood bioisosteric fragments
  • 20. OpenEye Scientific Software, Inc. l Focused on software and science, sometimes a difficult balance. l High-throughput 3D virtual screening (esp. Rocs) has become standard practice (enterprise-like). l Spectrum of functionality between 2D cheminformatics and 3D high performance computing. l Max 34 employees (current).
  • 21. OpenEye Example Code (2D & 3D) $ pdb2lig.py Usage: pdb2lig.py [options] [<infile] [>outfile] --i=<INFILE> --outlig=<OUTLIGFILE> --outpro=<OUTPROFILE> ... output protein or other macromol --multiligfiles ... one output file per ligand --minatoms=<N> ... minimum atomcount cutoff for ligand [7] --maxatoms=<N> ... maximum atomcount cutoff for ligand [100] --metal ... disconnected metal ions stay with protein --f ... force processing of non-PDB file --v ... verbose --vv ... very verbose --h ... help
  • 22. ROCS web app (Python, source code available)
  • 23. ROCS web app (Python, source code available)
  • 24. FRED web app (Python, source code available)
  • 25. FRED web app (Python, source code available)
  • 26. Example research project using OpenEye tools
  • 27. Example research project using OpenEye tools
  • 28. Case Study #3 Symyx a.k.a. MDL http://www.symyx.com (since 2010, http://www.accelrys.com)
  • 29. Symyx a.k.a. MDL l  Founded 1978 by Stuart Marson and Todd Wipke (ancestry: Harvard, Princeton, Stanford). l  “Molecular Design Ltd”, reflected initial ab initio design goals, renamed “MDL Information Systems” ('93) l  Products: REACCS ('82), MACCS ('84), MACCS-3D ('88), ISIS ('91), Isentris ('04), chem +bio database systems l  Invented molfile/SD format (proprietary till 1991), and extensions (query, R-group). l  Based Chime on Rasmol source code. l  Purchased by Reed Elsevier ('97), then Symyx ('07), then Accelrys ('10).
  • 30. Symyx a.k.a. MDL l Max ~300 employees. (My guess.) l By 1990, all pharma research companies used MDL software for their compound database. l Then the MBAs, lawyers and admen took over! l And innovation ceased. Slide 30
  • 31. Case Study #4 Accelrys http://www.accelrys.com
  • 32. Accelrys A tale of mergers and acquisitions (~1990 - 2011): –  MSI (Molecular Simulations) • Biodesign • Cambridge Molecular Design • Polygen • Biocad • Biosym Technologies –  Synopsys Scientific Systems –  Oxford Molecular –  Genetics Computer Group –  Synomyx –  SciTegic –  Symyx –  Contur Software AB
  • 33. Accelrys SELECT hts_ap_archive.runset_number as "RUN", hts_ap_archive.ap_alias as "APlateName", hts_plate.plate_id, hts_plate.alternate_id as "IPlateName", example hts_well.well_no, hts_sample.alternate_id, of hts_result_detail.value_char AS Target, hts_result_type.type_desc AS Result_Type, hts_assay_result.concentration || hts_conc_unit.unit_value AS CONC, AEI hts_assay_result.dilution, hts_assay_result.result_value AS Value SQL: FROM hts_well, hts_plate, hts_sample, hts_conc_unit, ddi_container_master, corporate hts_ap_archive, hts_assay_result, merging hts_result_type, hts_result_detail WHERE hts_ap_archive.ap_alias='213_20110712_135454-1' reflected in AND hts_assay_result.sample_id=hts_well.sample_id AND hts_assay_result.sample_id=hts_sample.sample_id schema, AND hts_well.plate_id=hts_plate.plate_id AND hts_assay_result.plate_id=hts_ap_archive.ap_number technology AND hts_plate.alternate_id=ddi_container_master.container_name AND hts_well.plate_id=hts_plate.plate_id AND hts_well.sample_id=hts_sample.sample_id AND hts_assay_result.sample_plate=ddi_container_master.container_id AND hts_result_type.result_type=hts_assay_result.result_type AND hts_assay_result.result_id=hts_result_detail.result_id AND hts_assay_result.conc_unit=hts_conc_unit.unit_id ORDER BY hts_well.well_no,Target,Result_Type
  • 34. Accelrys: merging reflected in technology
      –  Chemical cartridges: AEI (v6 & v7), Symyx, new SciTegic cartridge.
      –  Cheminformatics? E.g., canonicalize a SMILES with Accelrys, SciTegic or MDL code.
      –  The Accord Enterprise Informatics (AEI 6.2) that UNM purchased in 2009 is now a “legacy product”.
      –  Re-branding and re-packaging.
      –  May be great technically, but challenging for customers and Accelrys alike.
  • 35. Accelrys key product: Pipeline Pilot (SciTegic founded 1999, acquired by Accelrys in 2004 for $21.5M.)
  • 36. Opinionated Observation: There is a sometimes subtle difference between (1) a software product with a support contract, and (2) a service contract that involves customizing, installing and configuring a software product, and that requires or is likely to require ongoing, additional service contracts. Accelrys and Symyx (“solutions providers”) have generated much of their revenue using the latter. In contrast: tools providers.
  • 38. Accelrys (My opinions -- feel free to disagree.)
      1)  M&A's reflected US & global business trends, but perpetual re-org is stressful and there are huge costs to employees, customers, and technical progress.
      2)  No coherent scientific mission, only business growth.
      3)  Lots of excellent software, science and people. But all challenged by the chaos.
      "If in your science you only look for business, then you risk finding neither knowledge nor business." -- Haldor Topsøe
  • 39. Case Study #5: OpenBabel
  • 40. OpenBabel l Open-source C++ community project based on OELib by Matt Stahl, OpenEye. l  Founded 2001, by Geoff Hutchinson et al. Now ~100 credited contributors. l  C++ w/ wrappers via SWIG: Python, Perl, Java, Ruby, C#. l  2011 paper in Journal of Cheminformatics* l >160,000 downloads, >400 citations, used by >40 projects *”Open Babel: An open chemical toolbox”, Noel M. O'Boyle, Michael Banck, Craig A. James, Chris Morley, Tim Vandermeersch and Geoffrey R. Hutchison, Journal of Cheminformatics 2011, 3:33doi:10.1186/1758-2946-3-33.
  • 41. OpenBabel Example Code
      $ obscreen
      obscreen - screen molecules based on calculated properties
      OpenBabel v2.2.3 (Dec 4 2010)
      syntax: obscreen [options]
      options:
        -i <infile>
        -o <outfile>
        -otable <output table>
        -badsmarts <badfile> ... bad smarts file [default:builti
        -minmwt <MINWT> ... minimum molweight [200]
        -maxmwt <MAXWT> ... maximum molweight [600]
        -minhbd <MINHBD> ... min Hbond donors [0]
        -maxhbd <MAXHBD> ... max Hbond donors [5]
        -minhba <MINHBA> ... min Hbond acceptors [0]
        -maxhba <MAXHBA> ... max Hbond acceptors [10]
        -minrot <MINROT> ... min rotors [0]
        -maxrot <MAXROT> ... max rotors [12]
        -minchiral <MINCHIRAL> ... min chiral atoms [0]
        -maxchiral <MAXCHIRAL> ... max chiral atoms [5]
        -minlogp <MINLOGP> ... min logP [-5.0]
        -maxlogp <MAXLOGP> ... max logP [5.0]
        -hbasmarts <HBASMARTS> ... default='[#6,#7;R0]=[#8]'
        -hbdsmarts <HBDSMARTS> ... default='[!H0;#7,#8,#9]'
        -n_inmax <N> ... input limit
        -n_outmax <N> ... output limit
        -v ... verbose
        -vv ... very verbose
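The range options in the obscreen help text suggest a simple min/max filter over calculated properties. The slide does not show how obscreen is implemented, so the following is only a minimal Python sketch of that idea, assuming the properties have already been computed by some toolkit. The names `DEFAULT_LIMITS`, `failed_limits` and `screen`, the property keys, and the example values are all hypothetical; only the default ranges are taken from the help text. SMARTS-based filtering and the input/output limits are not covered.

```python
# Default ranges copied from the obscreen help text: (min, max), inclusive.
# The dictionary keys are hypothetical labels, not obscreen identifiers.
DEFAULT_LIMITS = {
    "mwt": (200.0, 600.0),   # molecular weight
    "hbd": (0, 5),           # H-bond donors
    "hba": (0, 10),          # H-bond acceptors
    "rot": (0, 12),          # rotors
    "chiral": (0, 5),        # chiral atoms
    "logp": (-5.0, 5.0),     # logP
}


def failed_limits(props, limits=DEFAULT_LIMITS):
    """Return the names of properties that are missing or outside their range."""
    failed = []
    for name, (low, high) in limits.items():
        value = props.get(name)
        if value is None or not (low <= value <= high):
            failed.append(name)
    return failed


def screen(molecules, limits=DEFAULT_LIMITS):
    """Yield (mol_id, props) for molecules that pass every limit."""
    for mol_id, props in molecules:
        if not failed_limits(props, limits):
            yield mol_id, props


# Illustrative, made-up property values (not computed from real structures).
candidates = [
    ("mol_small", {"mwt": 150.0, "hbd": 1, "hba": 2, "rot": 1, "chiral": 0, "logp": 1.2}),
    ("mol_ok", {"mwt": 350.0, "hbd": 2, "hba": 5, "rot": 4, "chiral": 1, "logp": 2.5}),
    ("mol_greasy", {"mwt": 450.0, "hbd": 0, "hba": 3, "rot": 14, "chiral": 0, "logp": 6.1}),
]

if __name__ == "__main__":
    for mol_id, props in candidates:
        print(mol_id, failed_limits(props) or "pass")
```

Reporting which limits failed, rather than only pass/fail, loosely mirrors what an output table might be used for when deciding whether a threshold is too strict.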
  • 42. OpenBabel: active discussion lists are indicators of community health
  • 43. OpenBabel: used by eMolecules, quite successfully (Craig James, CTO; formerly of Daylight and Accelrys)
  • 44. OpenBabel (My opinions -- feel free to disagree.)
      –  OB is very good but not yet close to Daylight, OEChem or JChem in comprehensiveness, quality, and other measures.
      –  SMARTS accuracy not state of the art.
      –  OB is continually improving thanks to a capable and active community of developers and users.
      –  The value reflects the total quality developer-years invested, so there is no reason OB cannot catch up if the community continues to grow.
  • 45. Some interesting and successful others:
      1)  ChemAxon - Budapest, Marvin, Java, JChem
      2)  Tripos (now Certara) - St. Louis, Sybyl, SGI
      3)  Chem Comp Group - Montreal, MOE, SVL
      4)  PyMOL - was DeLano, now Schrödinger (OSS)
      5)  Schrödinger - quantum, etc.
      6)  UCSF School of Pharmacy - DOCK, Kuntz et al.
      7)  NIH NCTT - OSS, Java, Tripod, bioassay analysis
      8)  Scitouch - Russia, Indigo, Dingo, Bingo
      9)  CDK - FOSS, Java
      10)  RDKit - FOSS, incl. machine learning
      11)  Silicos NV - OSS & commercial
      12)  Bioreason - VC-funded, bought by Simulations Plus
      13)  Mesa Analytics & Computing
      14)  NextMove Software
      15)  UNM Division of Biocomputing
      16)  IU SOIC CCRG
  • 46. Some General Conclusions
      –  Cheminformatics software landscape is very complex.
      –  Economics & Freakonomics are drivers, but so are personalities.
      –  FOSS can compete with $$$ software.
      –  Landscape includes lots of hidden costs.
      –  “Software engineering” has been an ideal, far from reality overall.
      –  Software can enable or hinder science.
      –  Diverse business models, diverse everything.
      –  Plenty of technological change.
      –  Some human change too.
  • 47. The End Feel free to contact me directly with questions or ideas! Jeremy J Yang jejyang@indiana.edu