This document provides an overview of molecular modelling techniques used for in silico drug discovery. It discusses using computational approaches to model small molecule and protein interactions to assess drug safety and efficacy. The key techniques covered include obtaining protein structures from databases like PDB, simulating molecular interactions through docking and screening, and considering factors like binding affinity, pharmacokinetics and toxicity during the drug design process. Computational protein structure prediction is also discussed as an important technique when experimental structures are unavailable.
1. Molecular modelling for In silico drug
discovery:
modelling small molecules and proteins
Dr Lee Larcombe
leelarcombe@gmail.com
2. Lecture Aim
This lecture aims to provide a basic understanding of
protein and molecular in silico engineering/design as part
of the drug development process: introducing theory and
approaches, drivers, databases and software, with a focus
on safety and efficacy.
3. This Lecture Covers
• Drivers for use of computational approaches
• Getting protein structures
• Simulation of molecular interactions
• Considering safety during design
• We will also highlight key software or data sources along
the way
5. Business
[Figure: drug development pipeline with the supporting approaches and relative cost (£ to ££) at each stage]
Stages: Target identification, Lead selection, Lead refinement, Pre-Clinical phases
Approaches: Genomics, Proteomics/Metabolomics, Interaction Networks, Molecular modelling, Protein modelling, Chemoinformatics, Data modelling, Systems Biology, In vitro, In vivo
6. Ethics Drivers
• Use of animals in research
• 3Rs – Refine, Reduce, Replace
• Relevance of animal data for human use
• Extrapolation across species
• Improvement of safety for subsequent trials
• Regulatory requirements and change
7. Extrapolation of data across
species
How relevant is animal physiology to human physiology?
Models not available for all diseases
Choice of species can be important
• 30% attrition due to no efficacy in man
• 10% attrition due to toxicity
For biologics, even more difficult to predict
9. Safety and Efficacy of Small Molecule
Drugs
• Safety: safety issues primarily focus on the potential of
the small molecule to have off-target effects,
metabolite/breakdown product toxicity, or
build-up/non-clearance
• Efficacy: efficacy issues focus on bioavailability and good
binding kinetics to the right target protein – including
variations of that protein (SNPs/mutants)
10. First we need a source of molecules:
Chemical Repositories
• Databases with safety information (GRS, CAS)
• Databases with structure and vendor/price – individual
chemical supply companies, ZINC
• Databases with multiple information types – ChEMBLdb,
PubChem, KEGG
11. ChEMBLdb
“The ChEMBL database (ChEMBLdb) contains medicinal chemistry bioassay data,
integrated from a wide variety of sources (the literature, deposited data sets, other
bioassay databases). Subsets of ChEMBLdb, relating to particular target classes, or
disease areas, are exported to smaller databases, These separate data sets, and the
entire ChEMBLdb, are available either via ftp downloads, or via bespoke query interfaces,
tailored to the requirements of the scientific communities with a specific interest in these
research areas”
• Targets: 10,579
• Compound records: 1,638,394
• Distinct compounds: 1,411,786
• Activities: 12,843,338
• Publications: 57,156
(release 19)
13. Basic Requirements for modelling
1. Representation of atomic coordinates
2. Scoring
3. Searching
14. Structure Representation
• How much information do you want to include?
• atoms present
• connections between atoms
• bond types
• stereochemical configuration
• charges
• isotopes
• 3D-coordinates for atoms
C8H9NO3
15. Structure Representation
• 3D-coordinates for atoms
• connections between atoms
• bond types
[Figure: 2D structural drawing of the example molecule]
16. Structure Representation - InChI
http://en.wikipedia.org/wiki/International_Chemical_Identifier
Morphine
InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3/t10-,11+,13-,16-,17-/m0/s1
The condensed, 27-character standard InChIKey is a hashed version of the full
standard InChI (using the SHA-256 algorithm), designed to allow for easy web
searches of chemical compounds.
BQJCRHHNABKAKU-KBQPJGBKSA-N
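The layered layout of the InChI string above can be made concrete with a few lines of Python. This is a minimal sketch, not an official InChI parser: `split_inchi` is a hypothetical helper name, and it assumes a standard InChI where layers are separated by "/" and each layer after the formula starts with a one-letter prefix (c = connections, h = hydrogens, t/m/s = stereochemistry).

```python
# Morphine identifiers copied from the slide.
MORPHINE_INCHI = (
    "InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9"
    "(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3"
    "/t10-,11+,13-,16-,17-/m0/s1"
)
MORPHINE_INCHIKEY = "BQJCRHHNABKAKU-KBQPJGBKSA-N"


def split_inchi(inchi):
    """Split an InChI string into a dict of layers (hypothetical helper)."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    parts = body.split("/")
    # The first two fields are the version and the molecular formula.
    layers = {"version": parts[0], "formula": parts[1]}
    # Every later layer is keyed by its one-letter prefix.
    for part in parts[2:]:
        layers[part[0]] = part[1:]
    return layers


layers = split_inchi(MORPHINE_INCHI)
print(layers["formula"])        # molecular formula layer
print(len(MORPHINE_INCHIKEY))   # length of the hashed key
```

Splitting by layer shows how much information is included at each level: formula only, then connectivity, then hydrogens, then stereochemistry.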
17. Scoring (Energy Functions & Force Fields)
Energy can be broken into a sum of potential energy terms
$$E = E_{bonds} + E_{angles} + E_{torsions} + E_{vdW} + E_{electrostatic}$$
[Figure: schematic of the individual energy terms]
E_str: stretch
E_bend: bend
E_tors: torsion
E_vdW: van der Waals
E_el: electrostatic
E_pol: polarization
Figure labels: repulsion, attraction, θ, φ
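To show how such a sum of terms is evaluated, here is a minimal sketch in Python. It is an illustration under stated assumptions, not any particular force field: it uses a harmonic bond term, a Lennard-Jones 12-6 term for van der Waals and a Coulomb term for electrostatics, leaves out the angle and torsion terms for brevity, and uses arbitrary units and made-up parameters.

```python
def bond_energy(r, r0, k):
    """Harmonic bond stretch: zero at the reference length r0."""
    return 0.5 * k * (r - r0) ** 2


def lennard_jones(r, epsilon, sigma):
    """12-6 van der Waals term: repulsive at short range, attractive beyond."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2, ke=1.0):
    """Electrostatic term; ke is a unit-dependent constant (assumed 1 here)."""
    return ke * q1 * q2 / r


def total_energy(bonds, nonbonded):
    """Sum the terms over all bonded and non-bonded pairs.

    bonds:     iterable of (r, r0, k)
    nonbonded: iterable of (r, epsilon, sigma, q1, q2)
    """
    energy = sum(bond_energy(r, r0, k) for r, r0, k in bonds)
    for r, epsilon, sigma, q1, q2 in nonbonded:
        energy += lennard_jones(r, epsilon, sigma) + coulomb(r, q1, q2)
    return energy


r_min = 2 ** (1 / 6)  # position of the Lennard-Jones minimum for sigma = 1
print(lennard_jones(r_min, epsilon=1.0, sigma=1.0))  # close to -1.0
```

Opposite charges give a negative (attractive) Coulomb term and like charges a positive (repulsive) one, which matches the repulsion/attraction picture on the slide.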
18. Searching
Molecular mechanics (static) – minimisation
Molecular dynamics (dynamic) – laws of motion
MD is a bit more complicated … we need to know about:
• Classical mechanics
• Classical equations of motion (EOM)
• e.g. Newton’s equations of motion
If we know these equations we *could* try to search ALL possible
structures of a protein and how it folds, e.g. protein folding
19. Energy Minimisation Theory
• Treat molecule as a set of balls (with mass) connected by rigid
rods and springs
• Rods and springs have empirically determined force constants
• Allows one to treat atomic-scale motions in proteins as classical
physics problems (OK approximation)
20. Energy Minimisation
Local minimum vs global minimum
Many local minima; only ONE global minimum
Methods: Steepest descent, Conjugate gradient, others…
• Efficient way of “polishing and shining” your model
• Removes atomic overlaps and unnatural strains in the structure
• Stabilizes or reinforces strong hydrogen bonds, breaks weak
ones
• Brings the molecule to a nearby low-energy state (a local minimum, not necessarily the global one)
22. Steepest Descent Minimisation
[Figure: energy landscape with high-energy and low-energy regions]
Makes small locally steep moves down gradient
Sufficient if starting point already close to optimal solution (e.g.
refinement of experimental structure)
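A minimal sketch of steepest descent in Python, assuming a made-up one-dimensional double-well "energy" with one local and one global minimum. The function names, step size and tolerance are illustrative choices, not values from any modelling package.

```python
def steepest_descent(grad, x0, step=0.01, tol=1e-6, max_iter=10000):
    """Repeatedly take a small step down the local gradient."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if max(abs(gi) for gi in g) < tol:
            break  # gradient is (nearly) zero: we are at a minimum
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def energy(x):
    """Toy double well: minima near x = -1.04 (global) and x = 0.96 (local)."""
    return (x[0] ** 2 - 1) ** 2 + 0.3 * x[0]


def grad(x):
    """Analytical derivative of the toy energy."""
    return [4 * x[0] * (x[0] ** 2 - 1) + 0.3]


from_right = steepest_descent(grad, [1.5])
from_left = steepest_descent(grad, [-1.5])
print(from_right, energy(from_right))
print(from_left, energy(from_left))
```

Starting on the right, the search stops in the nearer, shallower well; starting on the left, it finds the deeper one. This is why the method is sufficient only when the starting point is already close to the optimal solution.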
23. Molecular Mechanics vs Dynamics
(I have no idea where this image came from, but it is a very nice illustration of the comparison. If anyone
knows where it is from please let me know!)
MM calculates just minimum energy state.
MM ignores kinetic energy, does only potential energy
Molecules, especially proteins, are not static.
• Dynamics can be important to function
• Trajectories, not just minimum energy state.
MD takes same force model, but calculates F=ma and calculates
velocities of all atoms (as well as positions)
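The F = ma step can be sketched with the velocity Verlet integrator, one common choice in MD codes. The example below is a toy under stated assumptions: a single particle on a harmonic spring, unit mass and force constant, arbitrary units. It is not the integrator of any specific package.

```python
def spring_force(x, k=1.0):
    """Force from a harmonic potential E = 0.5 * k * x**2."""
    return -k * x


def velocity_verlet(x, v, mass, dt, n_steps, force_fn):
    """Propagate position and velocity; returns the trajectory as (x, v) pairs."""
    trajectory = [(x, v)]
    a = force_fn(x) / mass  # a = F / m
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # update position
        a_new = force_fn(x) / mass           # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # update velocity
        a = a_new
        trajectory.append((x, v))
    return trajectory


trajectory = velocity_verlet(x=1.0, v=0.0, mass=1.0, dt=0.01,
                             n_steps=1000, force_fn=spring_force)
# Total energy = kinetic + potential; it should stay close to the initial 0.5.
energies = [0.5 * v * v + 0.5 * x * x for x, v in trajectory]
print(min(energies), max(energies))
```

Unlike minimisation, the output is a trajectory: the particle keeps moving, exchanging potential and kinetic energy, rather than settling at the minimum.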
24. Why simulate the dynamics of (molecular) systems?
• Molecular systems are not static
• Molecules are in dynamic equilibrium
• Properties are averages over dynamic behaviour of
molecules
• Molecular processes are not instantaneous
• Time course (kinetics) of events is important
• Time dependence essential to understand development
and regulation
25. What can we do with chemical
models?
We can investigate structure and similarities of structure
between molecules
We can map structural characteristics to properties (SARs)
We can study molecular interactions – particularly with
proteins
26. Interactions – Docking & Screening
• Computation to assess binding affinity
• Looks for conformational and electrostatic "fit" between
proteins and other molecules
• Optimization: Does position and orientation of the two
molecules minimise the total energy? (Computationally
intensive)
• Docking small ligands to proteins is a way to find potential
drugs. Industrially important!
27. Virtual Screening
• Docking small ligands to proteins is a way to find potential
drugs. Industrially important
• A small region of interest (pharmacophore) can be identified,
reducing computation
• Empirical scoring functions are not universal
• Various search methods:
• Rigid: provides a score for the whole ligand (accurate)
• Flexible: breaks ligands into pieces and docks them
individually
28. So – we need protein (target)
structures
http://www.rcsb.org/
29. The PDB
The PDB was established in 1971 at Brookhaven National
Laboratory and originally contained 7 structures. In 1998,
the Research Collaboratory for Structural Bioinformatics
(RCSB) became responsible for the management of the
PDB.
Last year (2013), 9597 structures were deposited by
scientists all over the world; this year (2014) so far, 8391.
The total now stands at 105,839 structures (as of yesterday).
31. What if there is no structure available?
Can we predict structures?
Tertiary structure is dependent on ‘folding’ of the protein.
Recognition, characterisation, and assignment of domains
and folds is a major area of structural bioinformatics.
Predicting structure from sequence is one of the biggest
challenges...
32. Historical perspective?
Basic secondary structure prediction
Basic methods of secondary structure prediction rely on
statistical applications of ‘propensity’
The propensity/inclination/tendency of an amino acid to
be in a particular structure based on observation of
known datasets
33. Propensity
$$P = \frac{n[I][s] / n[I]}{n[s] / n}$$
P = propensity
I = residue of interest
n[I] = number of residues [I] in the database
n = total number of residues in the database
n[I][s] = number of residues [I] in the state of interest, e.g. helices
n[s] = number of all residues in the database in the state of interest.
34. Example
$$P[A] = \frac{124 / 1640}{1246 / 10136} \approx 0.62$$
So, the helical propensity for alanine, where:
• the number of alanines in the database is 1640,
• the total number of residues in the database is 10136,
• the number of alanines found in helices is 124,
• and the total number of residues found in helices is 1246,
would be about 0.62 (0.615 to three decimal places)
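The same calculation as a minimal Python sketch. The function and argument names are illustrative; the counts are the ones used in the alanine example.

```python
def propensity(n_res_in_state, n_res, n_state, n_total):
    """Fraction of residue I found in the state, relative to the
    fraction of all residues found in that state."""
    return (n_res_in_state / n_res) / (n_state / n_total)


# Alanine in helices, using the counts from the example.
p_ala_helix = propensity(n_res_in_state=124, n_res=1640,
                         n_state=1246, n_total=10136)
print(round(p_ala_helix, 3))
```

A value below 1 means the residue is found in that state less often than an average residue in this dataset; a value above 1 means it is over-represented.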
35. Sliding windows
Propensity values are often assigned using sliding
window methods
Sequence: A G T W Y K M C Q N P V
window 1: A G T W Y K M average applied to W
window 2: G T W Y K M C average applied to Y
window 3: T W Y K M C Q average applied to K
Theory that neighboring residues affect local structure
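A minimal sketch of the sliding-window idea in Python, assuming an odd window width so that each average can be assigned to the central residue. The per-residue scores are placeholders; in practice they would come from a propensity or hydrophobicity scale.

```python
def sliding_window_average(values, window=7):
    """Average each full window and assign the result to its centre position.

    Positions too close to either end to sit at the centre of a full
    window are left as None.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it has a central residue")
    half = window // 2
    averaged = [None] * len(values)
    for centre in range(half, len(values) - half):
        chunk = values[centre - half:centre + half + 1]
        averaged[centre] = sum(chunk) / window
    return averaged


sequence = "AGTWYKMCQNPV"
scores = [float(i) for i in range(len(sequence))]  # placeholder scores
averaged = sliding_window_average(scores, window=7)
for residue, value in zip(sequence, averaged):
    print(residue, value)
```

With a window of 7, the first window (A G T W Y K M) is assigned to W, the fourth residue, exactly as in the slide; the first and last three residues receive no value.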
36. GOR
Method by Garnier, Osguthorpe & Robson (1978).
Uses propensity values for Helix, Sheet, Coil, Turn for each residue from
experimentally-determined structures
Analysis done for each state, most probable state is assigned
Sequence   EVSAEEIKKHEEKWNKYYGVNAFNLPKELFSKVDEKDRQKYPYNTIGNVFVKGQTSATGV
GOR Sheet  ---------------------------------------------SSSSSSSS---SSSS
GOR Helix  --------HHHHHHHHH----HHHHHHHHHHHHHHHH-----------------------
GOR Coil   --CCCC---------------------------------CCCC-----------------
GOR Turn   ------------------TT----------------------------------------
37. Hydrophobicity
Method by Kyte & Doolittle (1982)
Uses values representing hydrophobicity of residues rather
than structural propensity
Applied with a sliding window method
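As an illustration, the Kyte & Doolittle hydropathy values can be averaged over a window in a few lines of Python. This is a sketch: the window width of 7 is an assumption for the example (the choice of width depends on what you are looking for), and `hydropathy_profile` is a hypothetical helper name.

```python
# Kyte & Doolittle (1982) hydropathy index; positive = hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Return (centre index, mean hydropathy) for every full window."""
    half = window // 2
    profile = []
    for centre in range(half, len(sequence) - half):
        segment = sequence[centre - half:centre + half + 1]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        profile.append((centre, round(mean, 2)))
    return profile


for centre, value in hydropathy_profile("AGTWYKMCQNPV"):
    print(centre, value)
```

Stretches where the windowed value stays strongly positive are candidate buried or membrane-spanning regions; strongly negative stretches are more likely to be solvent exposed.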
38. Hydrophobicity
Often helices tend to be more hydrophobic
Internalised regions of a protein are more hydrophobic
Transmembrane domains are hydrophobic
40. One more - Hydrophilicity
A method by Hopp & Woods (1980)
Experimentally derived values representing residue
hydrophilicity
Attempts to determine surface/solvent accessibility -
Antigenicity?
41. Problems
Many of these tools are old - and rely on statistical values
from small datasets
They generally cannot achieve better than 60% accuracy
(depends on how you measure it!)
60% right is still 40% wrong!!!
!! However – they are still in common use !!
(e.g. EMBOSS tools: garnier & antigenic)
43. Application example: Stability
There can be some benefit to using these scales in
combination (similar to antigenic) – here using scales for
order/disorder, aggregation potential and hydrophobicity to
look at protein stability in the absence of structural
information
44. Folding is Complex:
Is a truly random approach possible?
Levinthal’s paradox (1969)
100 residues = 99 peptide bonds
therefore 198 different phi and psi bond angles
3 stable conformations per bond angle = $3^{198}$ possible conformations
At a nano/picosecond sampling rate proteins would not find the correct
structure for a long time (longer than the age of the Universe!)
[Figure: peptide backbone showing the phi and psi angles]
Proteins fold on a milli/micro second timescale – this is the paradox...
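The size of Levinthal's number is easy to check with Python's arbitrary-precision integers. The sketch below assumes one conformation sampled per picosecond and an age of the Universe of roughly 13.8 billion years; both figures are assumptions for the illustration.

```python
N_ANGLES = 198          # phi and psi angles for 100 residues
STATES_PER_ANGLE = 3    # stable conformations per angle

conformations = STATES_PER_ANGLE ** N_ANGLES

SAMPLES_PER_SECOND = 10 ** 12            # assumed: one sample per picosecond
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10          # assumed approximate value

years_to_sample_all = conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR

print(len(str(conformations)))           # number of decimal digits
print(years_to_sample_all / AGE_OF_UNIVERSE_YEARS)
```

Even at this sampling rate, an exhaustive search would take many orders of magnitude longer than the age of the Universe, which is the heart of the paradox.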
45. How does it work at all?
1. proteins do NOT fold from random conformations,
which was an assumption of Levinthal's calculation
2. instead, they fold from denatured states that retain
substantial secondary (2°), and possibly tertiary (3°), structure
Why are folding simulations so difficult?
• Simulations are computationally expensive
• Gross approximations in simulations
• Nature uses tricks such as
• Posttranslational processing
• Chaperones
• Environment change
46. Complexity & Diversity –
potential vs reality
If the average protein contains about 300 amino acids, then
there could be a possible $20^{300}$ different proteins
(Apparently) this is more than the atoms in the universe!
Yet a human (a complex organism) has only about 30,000 proteins
All proteins so far appear to be represented by between
1000 - 5000 fold types
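As a rough check on the scale of sequence space, the number of possible 300-residue sequences can be converted to a power of ten. The figure of about $10^{80}$ atoms in the observable universe is a commonly quoted estimate and is an assumption here.

```latex
20^{300} = 10^{300 \log_{10} 20} \approx 10^{300 \times 1.301} \approx 10^{390} \gg 10^{80}
```

Set against a few thousand observed fold types, this shows how small a corner of the theoretical space real proteins occupy.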
47. Two reasons for limited fold space
Convergent evolution
Certain folds are biophysically favourable and may
have arisen in multiple cases
Divergent evolution
The number of folds seen is limited because they have
evolved from a limited number of common ancestor
proteins
Despite the evolutionary limitation of the number of existing folds (fold
space) it is still complex enough to make classification and
comprehension difficult
48. Why is Folding Difficult to do?
It's amazing that not only do proteins self-assemble -- fold -- but they do
so amazingly quickly: some as fast as a millionth of a second. While this
time is very fast on a person's timescale, it's remarkably long for
computers to simulate.
In fact, it takes about a day to simulate a nanosecond (1/1,000,000,000 of
a second) of dynamics for a reasonably sized protein (e.g. on an Intel Core i7
at 2.66 GHz).
Unfortunately, proteins fold on the tens of microsecond timescale (10,000
nanoseconds). Thus, it would take 10,000 CPU days to simulate folding
-- i.e. it would take 30 CPU years! That's a long time to wait for one
result!
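The arithmetic behind this estimate, as a minimal Python sketch. The rate of one CPU day per simulated nanosecond is the figure quoted above; the function name is illustrative.

```python
def cpu_years_to_fold(fold_time_ns, cpu_days_per_ns=1.0):
    """CPU time needed to simulate a folding event of the given length."""
    cpu_days = fold_time_ns * cpu_days_per_ns
    return cpu_days / 365.25


years = cpu_years_to_fold(fold_time_ns=10_000)
print(round(years, 1))
```

This comes out a little under the round figure of 30 CPU years; the point stands that a single folding event is out of reach for a single processor.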
50. Similar Project:
http://boinc.bakerlab.org/rosetta/
Ab initio protein tertiary structure prediction, based on the approach that
sequence-dependent local interactions limit or bias segments of the chain to
form only distinct sets of local structures, and that non-local interactions
select the lowest free-energy tertiary conformations compatible with the
local biases.
Different models are used to treat the local and non-local interactions.
Rather than attempting a physical model for local sequence-structure relationships,
the approach turns to the protein database to look at the distribution of local structures
adopted by short sequence segments (fewer than 10 residues in length) in known
three-dimensional structures.
BOINC: Berkeley Open Infrastructure for Network Computing
52. Some Rosetta@home results
A: Left, crystal structure of the MarA transcription
factor bound to DNA; right, our best submitted
model in CASP3. Despite many incorrect details, the
overall fold is predicted with sufficient accuracy to
allow insights into the mode of DNA binding.
B: Left, the crystal structure of bacteriocin AS-48;
middle, our best submitted model in CASP4; right,
a structurally and functionally related protein (NK-lysin)
identified using this model in a structure-based
search of the Protein Data Bank (PDB). The
structural and functional similarity is not
recognizable using sequence comparison methods
(the identity between the two sequences is only 5
percent).
C: Left, crystal structure of the second domain of
MutS; middle, our best submitted model for this
domain in CASP4; right, a structurally related
protein (RuvC) with a related function recognized
using the model in a structure-based search of the
PDB. The similarity was not recognized using
sequence comparison or fold recognition methods.
56. A compromise: Homology modelling
If there is no structure for your protein - perhaps there is
one for a similar protein.
Sequence alignment tools can be used to compare it to
your sequence of unknown structure
Homology searching and sequence alignment are now the
first step in protein structure prediction
If homologous proteins with structures are found, the unknown
sequence can be ‘overlaid’ on them and its structure inferred
57. Homology Modeling
Based on two assumptions:
1. The structure of a protein is determined by its amino acid
sequence alone
2. With evolution, the structure changes more slowly than
the sequence - similar sequences may adopt the same
structure
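How "similar" two sequences are is usually judged from an alignment. The sketch below computes percent identity over aligned columns in Python. Counting only columns where neither sequence has a gap is one common convention and is an assumption here; the two short aligned sequences are made up for the example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free aligned columns where the residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned fragments: target (unknown structure) vs template.
target = "ACDE-FG"
template = "ACDQWFG"
print(round(percent_identity(target, template), 1))
```

The higher the identity to a template with a known structure, the more reasonable it is to rely on the second assumption above.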
58. Sequence alignment
TEX19 – human protein without a
structure.
PDB 2AAM: Crystal structure of a
putative glycosidase (tm1410) from
Thermotoga maritima
63. Using the Models – Docking/Screening
• Choose and prepare target protein
• Identify binding pocket
• Fit ligand to pocket
• Score
• (for screening – repeat!)
64. Identify the Binding Pocket
• Could identify this by the location of an existing co-crystallised
ligand
• Or use surface sphere clusters
• Or identify it by clustering of solvent molecules (normally
water)
• Perhaps identify it by clustering of fragments (Surflex
Dock protomol)
65. Binding site based on existing
ligand
• Most methods allow you to
specify where the site is –
perhaps by identifying key
residues or based on an
existing ligand
• Could use the ‘hole’ left by the
ligand as a pocket, or use the
‘surface’ of the ligand as a
protomol
66. Surface Sphere generation
• Generate the surface of the target
– Connolly surface
• ‘Rolls’ a sphere the radius of
water across the van der Waals
surface of the target
• Each atom’s van der Waals centre acts as a sitepoint for the
generation of a sphere on the surface, whose centre lies on the line
perpendicular to the surface at the sitepoint.
• Spheres are then clustered – each cluster is a potential pocket
68. Prepare the ligand
• The ligand needs to be prepared too
• Drawn & minimised
• From a database - & minimised
• Extracted from another/the same binding site
• Hydrogens added etc
• Minimised/optimised – ready to dock
69. Docking
• Rigid docking: ligand is fixed conformationally
• Flexible docking: ligand is conformationally flexible
• Posable: ligand is rigid, but moved spatially
70. Rigid Ligand docking
• Centres of spheres representing the binding pocket act as ‘Site Points’
• The atoms of the ligand are matched to the site points
• Once an orientation is made, the interaction may be minimised:
receptor kept rigid and ligand flexible
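Matching ligand atoms to site points can be illustrated with a deliberately simple 2D toy in Python. Everything here is an assumption made for the sketch: the coordinates are made up, the ligand is rotated about its centroid placed on the centroid of the site points, and the "score" is just the summed squared distance from each ligand atom to its nearest site point. Real docking programs work in 3D with far richer scoring functions.

```python
import math


def centred(points):
    """Return the points shifted to a zero centroid, plus that centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points], (cx, cy)


def rotate(points, angle):
    """Rotate 2D points about the origin by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def pose_score(ligand, site_points):
    """Toy score: squared distance of each atom to its nearest site point."""
    return sum(min((lx - sx) ** 2 + (ly - sy) ** 2 for sx, sy in site_points)
               for lx, ly in ligand)


def dock_rigid(ligand, site_points, n_angles=360):
    """Try n_angles rigid rotations and keep the best-scoring pose."""
    ligand0, _ = centred(ligand)
    _, (cx, cy) = centred(site_points)
    best = None
    for i in range(n_angles):
        angle = 2 * math.pi * i / n_angles
        pose = [(x + cx, y + cy) for x, y in rotate(ligand0, angle)]
        score = pose_score(pose, site_points)
        if best is None or score < best[0]:
            best = (score, angle, pose)
    return best


site_points = [(5.0, 5.0), (7.0, 5.0), (5.0, 6.0)]   # made-up pocket
ligand = [(0.0, 0.0), (0.0, 2.0), (-1.0, 0.0)]       # same shape, rotated
score, angle, pose = dock_rigid(ligand, site_points)
print(round(math.degrees(angle)), score)
```

The ligand conformation never changes; only its orientation is searched, which is why a rigid search is fast but depends on the ligand already being in a sensible conformation.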
71. Alternatives
Flexible Docking
• Rings treated as flexible
• Other bonds treated as flexible/rotamers
• Rings treated as rigid – ligand fragmented
Posable Docking
• Rigid docking, but ligands posed conformationally
  • Rotated
  • Twisted
  • Flipped etc
• And repetitively docked to find best fit
73. Virtual Screening
• Docking – but repeated with many potential ligands
• Libraries can come from resources such as
PubChem/ChEMBLdb – vendors – or other in-house
sources
• From specialised databases holding structures suitable for
docking
• It is important to have a diverse library, especially for
rigid docking!
74. Considering safety & efficacy – “Drug-like”
Lipinski rule of 5 (or Pfizer rule)
‘Compounds which violate at least two of the following conditions have
a very low chance of being orally bioavailable’
• MW ≤ 500 Da
• log P (lipophilicity) ≤ 5
• number of H bond donors ≤ 5
• number of H bond acceptors ≤ 10
Works well once you have descriptions of small molecules – can be
search criteria in databases...
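Used as a search filter, the rule might look like the Python sketch below. It assumes the descriptors (molecular weight, log P, donor and acceptor counts) have already been calculated elsewhere, treats a value above each threshold as a violation, and uses two hypothetical compounds with made-up descriptor values.

```python
def lipinski_violations(mw, logp, h_donors, h_acceptors):
    """List which rule-of-five conditions a compound violates."""
    violations = []
    if mw > 500:
        violations.append("MW > 500 Da")
    if logp > 5:
        violations.append("log P > 5")
    if h_donors > 5:
        violations.append("H bond donors > 5")
    if h_acceptors > 10:
        violations.append("H bond acceptors > 10")
    return violations


def is_drug_like(mw, logp, h_donors, h_acceptors, max_violations=1):
    """Two or more violations flag a low chance of oral bioavailability."""
    return len(lipinski_violations(mw, logp, h_donors, h_acceptors)) <= max_violations


# Hypothetical library entries with made-up descriptor values.
library = {
    "compound_a": dict(mw=350.0, logp=2.1, h_donors=2, h_acceptors=5),
    "compound_b": dict(mw=720.0, logp=6.3, h_donors=4, h_acceptors=12),
}
passed = [name for name, desc in library.items() if is_drug_like(**desc)]
print(passed)
```

Filtering a library this way before docking removes compounds that are unlikely to be orally bioavailable, so screening effort is spent on more promising candidates.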
75. ADME / ADME-Tox
• Lipinski rule is really the 1st step in ADME (absorption,
distribution, metabolism, excretion) modelling
• Structure Activity Relationships (SARs) – similar
molecules will behave in similar ways, i.e. have similar
effects.
• Allows for knowledge-based comparative analysis – Tox
databases
82. Explore the following Software Tools
As well as resources mentioned in the slides!
Homology Modelling: Modeller, Phyre, SwissModel
Model Viewers: Pymol, Jmol, Rasmol
Molecular Simulation etc: Gromacs, Tinker, Amber, NAMD, Charmm
Docking/Screening: Surflex Dock, Dock, AutoDock, Vina
Graphical Tools/builders/interfaces: Chimera, Maestro, Ghemical, VMD, DeepView
Suites (companies): Tripos, Accelrys, OpenEye, ChemAxon, Schrodinger, MoE, Yasara
Some are free for academic use, but cost for commercial use. Take note and beware!
83. Workflow example – free vs paid
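[Figure: the same workflow run with a commercial suite (£££ $$$) versus free tools]
Workflow steps: get structures, preparation, minimisation, dynamics, docking, evaluation
Structure sources: ChEMBL, PDB (ligand and target)
Commercial suite: Discovery Studio
Free tools: Marvin Sketch, Chimera, Gromacs, Dock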