SlideShare uma empresa Scribd logo
1 de 32
Baixar para ler offline
minutes
Python for Chemistry in 21 days



                Dr. Noel O'Boyle

   Dr. John Mitchell and Prof. Peter Murray-Rust

                 UCC Talk, Sept 2005
 Available at http://www-mitchell.ch.cam.ac.uk/noel/
Introduction
●   This talk will cover
     –   what Python is
     –   why you should be interested in it
     –   how you can use Python in chemistry
Introduction
●   This talk will cover
     –   what Python is
     –   why you should be interested in it
     –   how you can use Python in chemistry
●   This talk will not cover
     –   how to program in Python
●   See references at end of talk



                       Word of warning!!
“My mental eye could now distinguish larger structures, of
manifold conformation; long rows, sometimes more closely
fitted together; all twining and twisting in snakelike motion.
But look! What was that? One of the snakes had seized
hold of its own tail, and the form whirled mockingly before
my eyes. As if by the flash of lightning I awoke...Let us learn
to dream, gentlemen”
                          Friedrich August Kekulé (1829-1896)
What is Python?
●   For a computer scientist...
    –   a high-level programming language
    –   interpreted (byte-compiled)
    –   dynamic typing
    –   object-oriented
What is Python?
●   For a computer scientist...
    –   a high-level programming language
    –   interpreted (byte-compiled)
    –   dynamic typing
    –   object-oriented
●   For everyone else...
    –   a scripting language (like Perl or Ruby) released by
        Guido von Rossum in 1991
    –   easy to learn
    –   easy to read (!)
What is Python?
●   For a computer scientist...
    –   a high-level programming language
    –   interpreted (byte-compiled)
    –   dynamic typing
    –   object-oriented
●   For everyone else...
    –   a scripting language (like Perl or Ruby) released by
        Guido von Rossum in 1991
    –   easy to learn
    –   easy to read (!)
    –   named after Cambridge comedians
The Great Debate

Sir Lancelot:
We were in the nick of time. You were in great Perl.
Sir Galahad:
I don't think I was.
Sir Lancelot:
You were, Sir Galahad. You were in terrible Perl.
Sir Galahad:
Look, let me go back in there and face the Perl.
Sir Lancelot:
No, it's too perilous.

                      (adapted from Monty Python and the Holy Grail)
Why you should be interested (1)

●   Python has been adopted by the cheminformatics community
●   For example, AstraZeneca has moved some of its codebase
    from 'the other scripting language' to Python


Job Description – Research Software Developer/Informatician
[section deleted]
Required Skills:

At least one object oriented programming language, e.g., Python, C++, Java.
Web-based application development (design/construction/maintenance)
UNIX, UNIX scripting & Linux OS

                            Position in AstraZeneca R&D, 02-09-05
Why you should be interested (2)

●   Scientific computing: scipy/pylab (like Matlab)
●   Molecular dynamics: MMTK
●   Statistics: scipy, rpy (R), pychem
●   3D-visualisation: VTK (mayavi)
●   2D-visualisation: scipy, pylab, rpy
●   coming soon, a wrapper around OpenBabel
●   cheminformatics: OEChem, frowns, PyDaylight, pychem
●   bioinformatics: BioPython
●   structural biology: PyMOL
●   computational chemistry: GaussSum
●   you can still use Java libraries...like the CDK
>>> from scipy import *
         >>> from pylab import *
         >>> a = arange(0,4)       > a <- seq(0,3)
         >>> a                     >a
scipy/                                               R
         [0,1,2,3]                 [1] 0 1 2 3
pylab    >>> mean(a)               > mean(a)
         1.5                       [1] 1.5
         >>> a**2                  > a**2
         [0,1,4,9]                 [1] 0 1 4 9
         >>> plot(a,a**2)          > plot(a,a**2)
         >>> show()                > lines(a,a**2)
Scipy
–   cluster: information theory functions (currently, vq and kmeans)
–   weave: compilation of numeric expressions to C++ for fast execution
–   cow: parallel programming via a Cluster Of Workstations
–   fftpack: fast Fourier transform module based on fftpack and fftw when
    available
–   ga: genetic algorithms
–   io: reading and writing numeric arrays, MATLAB .mat, and Matrix
    Market .mtx files
–   integrate: numeric integration for bounded and unbounded ranges. ODE
    solvers.
–   interpolate: interpolation of values from a sample data set.
–   optimize: constrained and unconstrained optimization methods and
    root-finding algorithms
–   signal: signal processing (1-D and 2-D filtering, filter design, LTI
    systems, etc.)
–   special: special function types (bessel, gamma, airy, etc.)
–   stats: statistical functions (stdev, var, mean, etc.)
–   linalg: linear algebra and BLAS routines based on the ATLAS
    implementation of LAPACK
–   sparse: Some sparse matrix support. LU factorization and solving
    sparse linear systems
Scipy statistical functions

●   descriptive statistics: variance, standard deviation, standard error, mean,
    mode, median
●   correlation: Pearson r, Spearman r, Kendall tau
●   statistical tests: chi-squared, t-tests, binomial, Wilcoxon, Kruskal,
    Kolmogorov-Smirnov, Anderson, etc.

●   linear regression

●   analysis of variance (ANOVA)

●   (and more)
pylab
pychem: Using scipy for chemoinformatics

●   Many multivariate analysis techniques are based on matrix
    algebra
●   scipy has wrappers around well-known C and Fortran
    numerical libraries (ATLAS, LAPACK)
●   pychem can do:
     –   principal component analysis, partial least squares
         regression, Fisher's discriminant analysis
     –   clustering: k-means, hierarchical (used Open Source
         clustering library)
     –   feature selection: genetic algorithms with PLS
     –   additional methods can be added
Python and R
  ●   Advantages of R:
       –   a large number of statistical libraries are available
  ●   Disadvantages of R:
       –   difficult to write algorithms
       –   slow (most R libraries are written in C)
       –   chokes on large datasets (use scan instead of read.table)

           Reading in data                   Principal component analysis
Method         300K    600K    1.6M        Method         300K    600K     1.6M
Python             6.8    13.9      41     Python             2.2      3.6      42
R (read.table)      42     105             R (read.table)       5       10
R (scan)             9      20      56     R (scan)             3        5      29




                                http://www.redbrick.dcu.ie/~noel/RversusPython.html
Python and R
●   rpy module allows Python programs to interface with R
●   have the best of both worlds
     –   access to the statistical functions of R
     –   access to the numerous modules available for Python
     –   can program in Python, instead of in R!!

             Python                                    R
>>> from rpy import r
>>> x = [5.05, 6.75, 3.21, 2.66]         > x <- c(5.05, 6.75, 3.21, 2.66)
>>> y = [1.65, 26.5, -5.93, 7.96]        > y <- c(1.65, 26.5, -5.93, 7.96)
>>> print r.lsfit(x,y)['coefficients']   > lsfit(x, y)$coefficients
{'X': 5.3935773611970212,                 Intercept        X
 'Intercept': -16.281127993087839}       -16.281128 5.393577
Python and R
                                             > hc$merge
Problem: Analyse a hierarchical clustering         [,1] [,2]
                                               [1,] -32 -33
Solution: Use R to cluster, and Python to      [2,] -39 -71
analyse the merge object of the cluster        [3,] -43 -47
                                               [4,] -10 -55
                                               [5,] -19 -36
                                               [6,] -5 -24
                                               [7,] -62 -63
                                               [8,] -74 -75
                                               [9,] -35 -76
                                              [10,] -1 -84
                                              [11,] -41 -42
                                              [12,] -83 -96
                                              [13,] -2 -29
                                              [14,] -7 -21
                                              [15,] -61 4
Problem 1
Graphically show the distribution of molecular weights of
molecules in an SD file. The molecular weight is stored in
a field of the SD file.

1,2-Diaminoethane
  MOE2004           3D

  6 3 0 0 0 0 0 0 0 0999 V2000
   -0.6900   -0.6620  0.0000 C 0   0   0   0   0   0   0   0   0   0   0   0
    0.5850   -1.9590  0.8240 H 0   0   0   0   0   0   0   0   0   0   0   0
    0.5350    1.5040  0.0000 N 0   0   0   0   0   0   0   0   0   0   0   0
    0.6060    2.0500  0.8460 H 0   0   0   0   0   0   0   0   0   0   0   0
    1.3040    0.8520 -0.0460 H 0   0   0   0   0   0   0   0   0   0   0   0
  1 2 1 0 0 0 0
  1 3 1 0 0 0 0
  1 4 1 0 0 0 0
M END
> <chem.name>
1,2-Diaminoethane

> <molecular.weight>
60.0995

$$$$
Object-oriented approach

 object                        SD file



 object           Molecule 1      Molecule 2      Molecule n


attributes                               fields
               name        atoms

  The code for creating these objects is stored in sdparser.py,
  which can be imported and used by any scripts that need to
  parse SD files.
Solution

from sdparser import SDFile
from rpy import *

inputfile = "mddr_complete.sd"

allmolweights = []

for molecule in SDFile(inputfile):
   molweight = molecule.fields['molecular.weight']
   allmolweights.append(float(molweight))

r.png(file="molwt_r.png")
r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red")
r.dev_off()
Solution

from sdparser import SDFile
from rpy import *

inputfile = "mddr_complete.sd"

allmolweights = []

for molecule in SDFile(inputfile):
   molweight = molecule.fields['molecular.weight']
   allmolweights.append(float(molweight))

r.png(file="molwt_r.png")
r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red")
r.dev_off()
Solution

from sdparser import SDFile
from rpy import *

inputfile = "mddr_complete.sd"

allmolweights = []

for molecule in SDFile(inputfile):
   molweight = molecule.fields['molecular.weight']
   allmolweights.append(float(molweight))

r.png(file="molwt_r.png")
r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red")
r.dev_off()
Solution

from sdparser import SDFile
from rpy import *

inputfile = "mddr_complete.sd"

allmolweights = []

for molecule in SDFile(inputfile):
   molweight = molecule.fields['molecular.weight']
   allmolweights.append(float(molweight))

r.png(file="molwt_r.png")
r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red")
r.dev_off()
Problem 2
Every molecule in an SD file is missing the name. To be
compatible with proprietary program X, we need to set the
name equal to the value of the field “chem.name”.

(MISSING NAME!)
  MOE2004          3D

  6 3 0 0 0 0 0 0 0 0999 V2000
   -0.6900   -0.6620  0.0000 C 0   0   0   0   0   0   0   0   0   0   0   0
    0.5850   -1.9590  0.8240 H 0   0   0   0   0   0   0   0   0   0   0   0
    0.5350    1.5040  0.0000 N 0   0   0   0   0   0   0   0   0   0   0   0
    0.6060    2.0500  0.8460 H 0   0   0   0   0   0   0   0   0   0   0   0
    1.3040    0.8520 -0.0460 H 0   0   0   0   0   0   0   0   0   0   0   0
  1 2 1 0 0 0 0
  1 3 1 0 0 0 0
  1 4 1 0 0 0 0
M END
> <chem.name>
1,2-Diaminoethane

> <molecular.weight>
60.0995

$$$$
Solution

from sdparser import SDFile

inputfile = "mddr_complete.sd"
outputfile = "mddr_withnames.sd"

for molecule in SDFile(inputfile):
   molecule.name = molecule.fields['chemical.name']
   outputfile.write(molecule)

inputfile.close()
outputfile.close()

print "We are the knights who say....SD!!!"
Solution

from sdparser import SDFile

inputfile = "mddr_complete.sd"
outputfile = "mddr_withnames.sd"

for molecule in SDFile(inputfile):
   molecule.name = molecule.fields['chemical.name']
   outputfile.write(molecule)

inputfile.close()
outputfile.close()

print "We are the knights who say....SD!!!"
Python and Java
 ●  It's easy to use Java libraries from Python
      – using either Jython or JPype
      – see http://www.redbrick.dcu.ie/~noel/CDKJython.html

Example: using the CDK to calculate the number of rings in a molecule
(given a string variable containing CML)
from jpype import *
startJVM("jdk1.5.0_03/jre/lib/i386/server/libjvm.so")
cdk = JPackage("org").openscience.cdk
SSSRFinder = cdk.ringsearch.SSSRFinder
CMLReader = cdk.io.CMLReader

def getNumRings(molecule):
  # Convert to a CDK molecule
  reader = CMLReader(java.io.StringReader(molXmlValue))
  chemFile = reader.read(cdk.ChemFile())
  cdkMol = chemFile.getChemSequence(0).getChemModel(0).getSetOfMolecules().getMolecule(0)
  # Calculate the number of rings
  sssrFinder = SSSRFinder(cdkMol)
  sssr = sssrFinder.findSSSR().size()
  return sssr
3D visualisation
●   VTK (Visualisation Toolkit) from Kitware
    –   open source, freely available
    –   scalar, tensor, vector and volumetric methods
    –   advanced modeling techniques such as implicit modelling,
        polygon reduction, mesh smoothing, cutting, contouring, and
        Delaunay triangulation
●   MayaVi
    –   easy to use GUI interface to VTK, written in Python
    –   can create input files and visualise them using Python scripts

                                 Demo
Python Resources
●   http://www.python.org
●   Guido's Tutorial
     –   http://www.python.org/doc/current/tut/tut.html
●   O'Reilly's “Learning Python” or Visual Quickstart Guide
    to Python
     –   Make sure it's Python 2.3 or 2.4 though
●   For Windows, consider the Enthought edition
     –   http://www.enthought.com/
Python for Chemistry

Mais conteúdo relacionado

Mais procurados

Introduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlowIntroduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlow
Sri Ambati
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Models
petitegeek
 

Mais procurados (20)

Survey of Attention mechanism
Survey of Attention mechanismSurvey of Attention mechanism
Survey of Attention mechanism
 
Meta learning tutorial
Meta learning tutorialMeta learning tutorial
Meta learning tutorial
 
Introduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlowIntroduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlow
 
Loss functions (DLAI D4L2 2017 UPC Deep Learning for Artificial Intelligence)
Loss functions (DLAI D4L2 2017 UPC Deep Learning for Artificial Intelligence)Loss functions (DLAI D4L2 2017 UPC Deep Learning for Artificial Intelligence)
Loss functions (DLAI D4L2 2017 UPC Deep Learning for Artificial Intelligence)
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Models
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
 
Introduction to Keras
Introduction to KerasIntroduction to Keras
Introduction to Keras
 
An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms
 
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reduction
 
Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2
 
Keras vs Tensorflow vs PyTorch | Deep Learning Frameworks Comparison | Edureka
Keras vs Tensorflow vs PyTorch | Deep Learning Frameworks Comparison | EdurekaKeras vs Tensorflow vs PyTorch | Deep Learning Frameworks Comparison | Edureka
Keras vs Tensorflow vs PyTorch | Deep Learning Frameworks Comparison | Edureka
 
Approaching (almost) Any NLP Problem
Approaching (almost) Any NLP ProblemApproaching (almost) Any NLP Problem
Approaching (almost) Any NLP Problem
 
Overview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep LearningOverview on Optimization algorithms in Deep Learning
Overview on Optimization algorithms in Deep Learning
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep Learning
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
 
Meta-Learning Presentation
Meta-Learning PresentationMeta-Learning Presentation
Meta-Learning Presentation
 
TP1 Traitement d'images Génie Logiciel avec Matlab
TP1 Traitement d'images Génie Logiciel avec MatlabTP1 Traitement d'images Génie Logiciel avec Matlab
TP1 Traitement d'images Génie Logiciel avec Matlab
 
Paper presentation on LLM compression
Paper presentation on LLM compression Paper presentation on LLM compression
Paper presentation on LLM compression
 
Introduction to Few shot learning
Introduction to Few shot learningIntroduction to Few shot learning
Introduction to Few shot learning
 
Introduction to AIoT & TinyML - with Arduino
Introduction to AIoT & TinyML - with ArduinoIntroduction to AIoT & TinyML - with Arduino
Introduction to AIoT & TinyML - with Arduino
 

Semelhante a Python for Chemistry

Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009
A Jorge Garcia
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to python
ActiveState
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
g3_nittala
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer
tirlukachaitanya
 
DRUG - RDSTK Talk
DRUG - RDSTK TalkDRUG - RDSTK Talk
DRUG - RDSTK Talk
rtelmore
 

Semelhante a Python for Chemistry (20)

Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009Seven waystouseturtle pycon2009
Seven waystouseturtle pycon2009
 
Python Orientation
Python OrientationPython Orientation
Python Orientation
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Migrating from matlab to python
Migrating from matlab to pythonMigrating from matlab to python
Migrating from matlab to python
 
IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
 
Using R in remote computer clusters
Using R in remote computer clustersUsing R in remote computer clusters
Using R in remote computer clusters
 
RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)RR & Docker @ MuensteR Meetup (Sep 2017)
RR & Docker @ MuensteR Meetup (Sep 2017)
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKit
 
Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in R
 
DS LAB MANUAL.pdf
DS LAB MANUAL.pdfDS LAB MANUAL.pdf
DS LAB MANUAL.pdf
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
 
2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer2015-10-23_wim_davis_r_slides.pptx on consumer
2015-10-23_wim_davis_r_slides.pptx on consumer
 
Python For Scientists
Python For ScientistsPython For Scientists
Python For Scientists
 
Regex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadRegex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language Instead
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
 
DRUG - RDSTK Talk
DRUG - RDSTK TalkDRUG - RDSTK Talk
DRUG - RDSTK Talk
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
R and RMarkdown crash course
R and RMarkdown crash courseR and RMarkdown crash course
R and RMarkdown crash course
 

Mais de baoilleach

Universal Smiles: Finally a canonical SMILES string
Universal Smiles: Finally a canonical SMILES stringUniversal Smiles: Finally a canonical SMILES string
Universal Smiles: Finally a canonical SMILES string
baoilleach
 
What's New and Cooking in Open Babel 2.3.2
What's New and Cooking in Open Babel 2.3.2What's New and Cooking in Open Babel 2.3.2
What's New and Cooking in Open Babel 2.3.2
baoilleach
 
Large-scale computational design and selection of polymers for solar cells
Large-scale computational design and selection of polymers for solar cellsLarge-scale computational design and selection of polymers for solar cells
Large-scale computational design and selection of polymers for solar cells
baoilleach
 
Improving the quality of chemical databases with community-developed tools (a...
Improving the quality of chemical databases with community-developed tools (a...Improving the quality of chemical databases with community-developed tools (a...
Improving the quality of chemical databases with community-developed tools (a...
baoilleach
 
Why multiple scoring functions can improve docking performance - Testing hypo...
Why multiple scoring functions can improve docking performance - Testing hypo...Why multiple scoring functions can improve docking performance - Testing hypo...
Why multiple scoring functions can improve docking performance - Testing hypo...
baoilleach
 

Mais de baoilleach (20)

We need to talk about Kekulization, Aromaticity and SMILES
We need to talk about Kekulization, Aromaticity and SMILESWe need to talk about Kekulization, Aromaticity and SMILES
We need to talk about Kekulization, Aromaticity and SMILES
 
Open Babel project overview
Open Babel project overviewOpen Babel project overview
Open Babel project overview
 
So I have an SD File... What do I do next?
So I have an SD File... What do I do next?So I have an SD File... What do I do next?
So I have an SD File... What do I do next?
 
Chemistrify the Web
Chemistrify the WebChemistrify the Web
Chemistrify the Web
 
Universal Smiles: Finally a canonical SMILES string
Universal Smiles: Finally a canonical SMILES stringUniversal Smiles: Finally a canonical SMILES string
Universal Smiles: Finally a canonical SMILES string
 
What's New and Cooking in Open Babel 2.3.2
What's New and Cooking in Open Babel 2.3.2What's New and Cooking in Open Babel 2.3.2
What's New and Cooking in Open Babel 2.3.2
 
Intro to Open Babel
Intro to Open BabelIntro to Open Babel
Intro to Open Babel
 
Protein-ligand docking
Protein-ligand dockingProtein-ligand docking
Protein-ligand docking
 
Cheminformatics
CheminformaticsCheminformatics
Cheminformatics
 
Making the most of a QM calculation
Making the most of a QM calculationMaking the most of a QM calculation
Making the most of a QM calculation
 
Data Analysis in QSAR
Data Analysis in QSARData Analysis in QSAR
Data Analysis in QSAR
 
Large-scale computational design and selection of polymers for solar cells
Large-scale computational design and selection of polymers for solar cellsLarge-scale computational design and selection of polymers for solar cells
Large-scale computational design and selection of polymers for solar cells
 
My Open Access papers
My Open Access papersMy Open Access papers
My Open Access papers
 
Improving the quality of chemical databases with community-developed tools (a...
Improving the quality of chemical databases with community-developed tools (a...Improving the quality of chemical databases with community-developed tools (a...
Improving the quality of chemical databases with community-developed tools (a...
 
De novo design of molecular wires with optimal properties for solar energy co...
De novo design of molecular wires with optimal properties for solar energy co...De novo design of molecular wires with optimal properties for solar energy co...
De novo design of molecular wires with optimal properties for solar energy co...
 
Cinfony - Bring cheminformatics toolkits into tune
Cinfony - Bring cheminformatics toolkits into tuneCinfony - Bring cheminformatics toolkits into tune
Cinfony - Bring cheminformatics toolkits into tune
 
Density functional theory calculations on Ruthenium polypyridyl complexes inc...
Density functional theory calculations on Ruthenium polypyridyl complexes inc...Density functional theory calculations on Ruthenium polypyridyl complexes inc...
Density functional theory calculations on Ruthenium polypyridyl complexes inc...
 
Application of Density Functional Theory to Scanning Tunneling Microscopy
Application of Density Functional Theory to Scanning Tunneling MicroscopyApplication of Density Functional Theory to Scanning Tunneling Microscopy
Application of Density Functional Theory to Scanning Tunneling Microscopy
 
Towards Practical Molecular Devices
Towards Practical Molecular DevicesTowards Practical Molecular Devices
Towards Practical Molecular Devices
 
Why multiple scoring functions can improve docking performance - Testing hypo...
Why multiple scoring functions can improve docking performance - Testing hypo...Why multiple scoring functions can improve docking performance - Testing hypo...
Why multiple scoring functions can improve docking performance - Testing hypo...
 

Último

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 

Último (20)

Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 

Python for Chemistry

  • 1. minutes Python for Chemistry in 21 days Dr. Noel O'Boyle Dr. John Mitchell and Prof. Peter Murray-Rust UCC Talk, Sept 2005 Available at http://www-mitchell.ch.cam.ac.uk/noel/
  • 2. Introduction ● This talk will cover – what Python is – why you should be interested in it – how you can use Python in chemistry
  • 3. Introduction ● This talk will cover – what Python is – why you should be interested in it – how you can use Python in chemistry ● This talk will not cover – how to program in Python ● See references at end of talk Word of warning!!
  • 4. “My mental eye could now distinguish larger structures, of manifold conformation; long rows, sometimes more closely fitted together; all twining and twisting in snakelike motion. But look! What was that? One of the snakes had seized hold of its own tail, and the form whirled mockingly before my eyes. As if by the flash of lightning I awoke...Let us learn to dream, gentlemen” Friedrich August Kekulé (1829-1896)
  • 5. What is Python? ● For a computer scientist... – a high-level programming language – interpreted (byte-compiled) – dynamic typing – object-oriented
  • 6. What is Python? ● For a computer scientist... – a high-level programming language – interpreted (byte-compiled) – dynamic typing – object-oriented ● For everyone else... – a scripting language (like Perl or Ruby) released by Guido von Rossum in 1991 – easy to learn – easy to read (!)
  • 7. What is Python? ● For a computer scientist... – a high-level programming language – interpreted (byte-compiled) – dynamic typing – object-oriented ● For everyone else... – a scripting language (like Perl or Ruby) released by Guido von Rossum in 1991 – easy to learn – easy to read (!) – named after Cambridge comedians
  • 8. The Great Debate Sir Lancelot: We were in the nick of time. You were in great Perl. Sir Galahad: I don't think I was. Sir Lancelot: You were, Sir Galahad. You were in terrible Perl. Sir Galahad: Look, let me go back in there and face the Perl. Sir Lancelot: No, it's too perilous. (adapted from Monty Python and the Holy Grail)
  • 9. Why you should be interested (1) ● Python has been adopted by the cheminformatics community ● For example, AstraZeneca has moved some of its codebase from 'the other scripting language' to Python Job Description – Research Software Developer/Informatician [section deleted] Required Skills: At least one object oriented programming language, e.g., Python, C++, Java. Web-based application development (design/construction/maintenance) UNIX, UNIX scripting & Linux OS Position in AstraZeneca R&D, 02-09-05
  • 10. Why you should be interested (2) ● Scientific computing: scipy/pylab (like Matlab) ● Molecular dynamics: MMTK ● Statistics: scipy, rpy (R), pychem ● 3D-visualisation: VTK (mayavi) ● 2D-visualisation: scipy, pylab, rpy ● coming soon, a wrapper around OpenBabel ● cheminformatics: OEChem, frowns, PyDaylight, pychem ● bioinformatics: BioPython ● structural biology: PyMOL ● computational chemistry: GaussSum ● you can still use Java libraries...like the CDK
  • 11. >>> from scipy import * >>> from pylab import * >>> a = arange(0,4) > a <- seq(0,3) >>> a >a scipy/ R [0,1,2,3] [1] 0 1 2 3 pylab >>> mean(a) > mean(a) 1.5 [1] 1.5 >>> a**2 > a**2 [0,1,4,9] [1] 0 1 4 9 >>> plot(a,a**2) > plot(a,a**2) >>> show() > lines(a,a**2)
  • 12. Scipy – cluster: information theory functions (currently, vq and kmeans) – weave: compilation of numeric expressions to C++ for fast execution – cow: parallel programming via a Cluster Of Workstations – fftpack: fast Fourier transform module based on fftpack and fftw when available – ga: genetic algorithms – io: reading and writing numeric arrays, MATLAB .mat, and Matrix Market .mtx files – integrate: numeric integration for bounded and unbounded ranges. ODE solvers. – interpolate: interpolation of values from a sample data set. – optimize: constrained and unconstrained optimization methods and root-finding algorithms – signal: signal processing (1-D and 2-D filtering, filter design, LTI systems, etc.) – special: special function types (bessel, gamma, airy, etc.) – stats: statistical functions (stdev, var, mean, etc.) – linalg: linear algebra and BLAS routines based on the ATLAS implementation of LAPACK – sparse: Some sparse matrix support. LU factorization and solving sparse linear systems
  • 13. Scipy statistical functions ● descriptive statistics: variance, standard deviation, standard error, mean, mode, median ● correlation: Pearson r, Spearman r, Kendall tau ● statistical tests: chi-squared, t-tests, binomial, Wilcoxon, Kruskal, Kolmogorov-Smirnov, Anderson, etc. ● linear regression ● analysis of variance (ANOVA) ● (and more)
  • 14. pylab
  • 15. pychem: Using scipy for chemoinformatics ● Many multivariate analysis techniques are based on matrix algebra ● scipy has wrappers around well-known C and Fortran numerical libraries (ATLAS, LAPACK) ● pychem can do: – principal component analysis, partial least squares regression, Fisher's discriminant analysis – clustering: k-means, hierarchical (used Open Source clustering library) – feature selection: genetic algorithms with PLS – additional methods can be added
  • 16. Python and R ● Advantages of R: – a large number of statistical libraries are available ● Disadvantages of R: – difficult to write algorithms – slow (most R libraries are written in C) – chokes on large datasets (use scan instead of read.table) Reading in data Principal component analysis Method 300K 600K 1.6M Method 300K 600K 1.6M Python 6.8 13.9 41 Python 2.2 3.6 42 R (read.table) 42 105 R (read.table) 5 10 R (scan) 9 20 56 R (scan) 3 5 29 http://www.redbrick.dcu.ie/~noel/RversusPython.html
  • 17. Python and R ● rpy module allows Python programs to interface with R ● have the best of both worlds – access to the statistical functions of R – access to the numerous modules available for Python – can program in Python, instead of in R!! Python R >>> from rpy import r >>> x = [5.05, 6.75, 3.21, 2.66] > x <- c(5.05, 6.75, 3.21, 2.66) >>> y = [1.65, 26.5, -5.93, 7.96] > y <- c(1.65, 26.5, -5.93, 7.96) >>> print r.lsfit(x,y)['coefficients'] > lsfit(x, y)$coefficients {'X': 5.3935773611970212, Intercept X 'Intercept': -16.281127993087839} -16.281128 5.393577
  • 18. Python and R > hc$merge Problem: Analyse a hierarchical clustering [,1] [,2] [1,] -32 -33 Solution: Use R to cluster, and Python to [2,] -39 -71 analyse the merge object of the cluster [3,] -43 -47 [4,] -10 -55 [5,] -19 -36 [6,] -5 -24 [7,] -62 -63 [8,] -74 -75 [9,] -35 -76 [10,] -1 -84 [11,] -41 -42 [12,] -83 -96 [13,] -2 -29 [14,] -7 -21 [15,] -61 4
  • 19. Problem 1 Graphically show the distribution of molecular weights of molecules in an SD file. The molecular weight is stored in a field of the SD file. 1,2-Diaminoethane MOE2004 3D 6 3 0 0 0 0 0 0 0 0999 V2000 -0.6900 -0.6620 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.5850 -1.9590 0.8240 H 0 0 0 0 0 0 0 0 0 0 0 0 0.5350 1.5040 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 0.6060 2.0500 0.8460 H 0 0 0 0 0 0 0 0 0 0 0 0 1.3040 0.8520 -0.0460 H 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 1 3 1 0 0 0 0 1 4 1 0 0 0 0 M END > <chem.name> 1,2-Diaminoethane > <molecular.weight> 60.0995 $$$$
  • 20. Object-oriented approach object SD file object Molecule 1 Molecule 2 Molecule n attributes fields name atoms The code for creating these objects is stored in sdparser.py, which can be imported and used by any scripts that need to parse SD files.
  • 21. Solution from sdparser import SDFile from rpy import * inputfile = "mddr_complete.sd" allmolweights = [] for molecule in SDFile(inputfile): molweight = molecule.fields['molecular.weight'] allmolweights.append(float(molweight)) r.png(file="molwt_r.png") r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red") r.dev_off()
  • 22. Solution from sdparser import SDFile from rpy import * inputfile = "mddr_complete.sd" allmolweights = [] for molecule in SDFile(inputfile): molweight = molecule.fields['molecular.weight'] allmolweights.append(float(molweight)) r.png(file="molwt_r.png") r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red") r.dev_off()
  • 23. Solution from sdparser import SDFile from rpy import * inputfile = "mddr_complete.sd" allmolweights = [] for molecule in SDFile(inputfile): molweight = molecule.fields['molecular.weight'] allmolweights.append(float(molweight)) r.png(file="molwt_r.png") r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red") r.dev_off()
  • 24. Solution from sdparser import SDFile from rpy import * inputfile = "mddr_complete.sd" allmolweights = [] for molecule in SDFile(inputfile): molweight = molecule.fields['molecular.weight'] allmolweights.append(float(molweight)) r.png(file="molwt_r.png") r.hist(allmolweights,xlab="Mol. weights",main="MDDR", col="red") r.dev_off()
  • 25.
  • 26. Problem 2 Every molecule in an SD file is missing the name. To be compatible with proprietary program X, we need to set the name equal to the value of the field “chem.name”. (MISSING NAME!) MOE2004 3D 6 3 0 0 0 0 0 0 0 0999 V2000 -0.6900 -0.6620 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 0.5850 -1.9590 0.8240 H 0 0 0 0 0 0 0 0 0 0 0 0 0.5350 1.5040 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 0.6060 2.0500 0.8460 H 0 0 0 0 0 0 0 0 0 0 0 0 1.3040 0.8520 -0.0460 H 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 1 3 1 0 0 0 0 1 4 1 0 0 0 0 M END > <chem.name> 1,2-Diaminoethane > <molecular.weight> 60.0995 $$$$
  • 27. Solution from sdparser import SDFile inputfile = "mddr_complete.sd" outputfile = "mddr_withnames.sd" for molecule in SDFile(inputfile): molecule.name = molecule.fields['chemical.name'] outputfile.write(molecule) inputfile.close() outputfile.close() print "We are the knights who say....SD!!!"
  • 28. Solution from sdparser import SDFile inputfile = "mddr_complete.sd" outputfile = "mddr_withnames.sd" for molecule in SDFile(inputfile): molecule.name = molecule.fields['chemical.name'] outputfile.write(molecule) inputfile.close() outputfile.close() print "We are the knights who say....SD!!!"
  • 29. Python and Java ● It's easy to use Java libraries from Python – using either Jython or JPype – see http://www.redbrick.dcu.ie/~noel/CDKJython.html Example: using the CDK to calculate the number of rings in a molecule (given a string variable containing CML) from jpype import * startJVM("jdk1.5.0_03/jre/lib/i386/server/libjvm.so") cdk = JPackage("org").openscience.cdk SSSRFinder = cdk.ringsearch.SSSRFinder CMLReader = cdk.io.CMLReader def getNumRings(molecule): # Convert to a CDK molecule reader = CMLReader(java.io.StringReader(molXmlValue)) chemFile = reader.read(cdk.ChemFile()) cdkMol = chemFile.getChemSequence(0).getChemModel(0).getSetOfMolecules().getMolecule(0) # Calculate the number of rings sssrFinder = SSSRFinder(cdkMol) sssr = sssrFinder.findSSSR().size() return sssr
  • 30. 3D visualisation ● VTK (Visualisation Toolkit) from Kitware – open source, freely available – scalar, tensor, vector and volumetric methods – advanced modeling techniques such as implicit modelling, polygon reduction, mesh smoothing, cutting, contouring, and Delaunay triangulation ● MayaVi – easy to use GUI interface to VTK, written in Python – can create input files and visualise them using Python scripts Demo
  • 31. Python Resources ● http://www.python.org ● Guido's Tutorial – http://www.python.org/doc/current/tut/tut.html ● O'Reilly's “Learning Python” or Visual Quickstart Guide to Python – Make sure it's Python 2.3 or 2.4 though ● For Windows, consider the Enthought edition – http://www.enthought.com/