Reproducible research in molecular biophysics and structural biology

Reproducible research in molecular biophysics and
structural bioinformatics

Konrad HINSEN

Centre de Biophysique Moléculaire, Orléans, France
and
Synchrotron SOLEIL, Saint Aubin, France

13 December 2012

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural December 2012
13 bioinformatics 1 / 36

Reproducibility

One of the ideals of science

Scientific results should be verifiable.
Verification requires reproduction by other scientists.
Few results actually are reproduced, but it’s still important to
make this possible:
important for the credibility of science in society
(remember “Climategate”, cold fusion, ...)
important for the credibility of a specific study
The more detail you provide about what you did, the more your
peers are willing to believe that you did what you claim to have
done.
Reproducibility also matters for efficient collaboration inside a
team.


Reproducibility in practice

Near-perfect in mathematics
No journal publishes a theorem unless the author provides a proof.

Often best-effort in experimental science
Lab notebooks record all the details for in-house replication.
Published protocols are less detailed, but often clear enough for
an expert in the ﬁeld.
The main limitation is technical: lab equipment and concrete
samples cannot be reproduced identically.

Lousy in computational science
Papers give short method summaries and concentrate on results.
Few scientists can reproduce their own results after a few
months.

Reproducibility in computational science

We could do better than experimentalists
Results are deterministic, fully determined by input data and
algorithms.
If we published all input data and programs, anyone could
reproduce the results exactly.

But we don’t
Lots of technical difﬁculties.
Important additional effort.
Few incentives.

Goals of the Reproducible Research movement
Create more awareness of the problem.
Provide better tools.

Reproducible (computational) Research matters

“Climategate”
In 2009 a server at the Climatic Research Unit at the University of
East Anglia was hacked and many of its files became public. One of
them describes a scientist’s difficulties to reproduce his colleagues’
results. This has been used by climate change skeptics to discredit
climate research.

Protein structure retractions
In 2006, three important protein structures published in Science
were retracted following the discovery of a bug in the software used
for data processing.

For more examples and details:
Z. Merali, “...Error ... why scientific programming does not
compute”, Nature 467, 775 (2010)
Science Special Issue on Computational Biology, 13 April 2012

Today’s tools for Reproducible Research

Version control for source code and data
Mercurial, Git, Subversion, ...
Great for source code, usable for small data sets
Workflow management systems
VisTrails, Taverna, Kepler, ...
Preservation of computational procedures and provenance
tracking, great for scientists who like graphical workflows
Electronic lab notebooks
Mathematica, IPython notebook, ...
Preservation of computational procedures and provenance
tracking, great for scientists who like scripting
Publication tools for computations
Collage, IPOL, myExperiment, PyPedia, RunMyCode, SHARE, ...
Experimental and/or domain-specific


Missing pieces (1/2)

Integration of version control and provenance tracking
Record for reproduction that dataset X was obtained from version 5
of dataset Y using version 2.2 of program Z compiled with gcc
version 4.1.

Version control for big datasets
Version control tools are made for text-based formats.

Provenance tracking and workflows across machines
If you do part of your computations elsewhere (supercomputer, lab’s
cluster, ...), existing tools won’t work for you.

Tool-independent standard file formats
If Alice wants to reproduce Bob’s results, she needs to use Bob’s
tools for version control and workflows/notebooks.

Missing pieces (2/2)

Reproducible floating-point computations
Conflict between reproducibility and performance:
Reproducibility for IEEE float arithmetic requires an exact
sequence of load, store, and arithmetic operations.
Performance optimization requires shuffling around operations
depending on processor type, cache size, memory access speed,
number of processors, etc.


Reproducibility in biomolecular simulations (1/3)

Heavy resource requirements

Most important technique: Molecular Dynamics
Long runs (days to weeks) on big parallel machines
Produces large datasets (a few GB per trajectory)
Trajectory analysis can be as expensive as the simulation itself

Datasets are too big for version control.

Today’s provenance trackers can’t keep track of computations across
machines.



Complexity

Computational models contain thousands of parameters
No complete formal description of the models
(except for program source code)
Simulation algorithms contain many approximations made for
performance reasons
These approximations are usually undocumented

Publishing a complete description of a molecular simulation is so
difﬁcult that very few scientists could do it and equally few could
understand the resulting description.

There is no ﬁle format for storing a complete description of a
molecular simulation in machine-readable form.

Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics
13 December 2012 10 / 36

A split community
A small number of “developers” with backgrounds in physics and
chemistry
A large number of “users” with backgrounds in biology and
biochemistry
A negligibly small number of computing specialists
Users don’t care about the technical details of the simulations.
Users (have to) trust the developers blindly.
Developers understand the models but have no training in
software development or data management.

The field is dominated by a few large monolithic simulation packages
whose internal workings are complex and mostly undocumented.

Intermediate stages of processing in a simulation workflow are
impossible to access or available only in program-specific formats.
13 December 2012 11 / 36

Complexity: the Amber force ﬁeld (1/2)
The Amber force ﬁeld (http://ambermd.org/#ff) is one of the most
popular computational models for biomolecules. In a research paper,
it is typically described by its formula for the potential energy:

(0) 2
U = kij rij − rij
bonds ij

(0) 2
+ kijk φijk − φijk
angles ijk

+ kijkl cos (nijkl θijkl − δijkl )
dihedrals ijkl
12 6
σij σij
+ all pairs ij 4 ij r 12
− r6
qi qj
+ all pairs ij 4π 0 rij

13 December 2012 12 / 36

Complexity: the Amber force field (2/2)
In addition to that formula, we need
rules for summing over bonds, angles, etc.
rules for finding the values of all the parameters for a given
molecule
Most of these rules can be found on the Amber Web site (with some
effort), but not in any research paper.

One rule is, to the best of my knowledge, not written down anywhere.
It concerns the attribution of “improper dihedral” terms. I obtained it
from the Amber developers by asking a precise question by e-mail.

That rule refers to the indices of the atoms in the input file defining
the molecular structure. These indices are chosen arbitrarily. Making
an energy dependent on them is profoundly unphysical.

How many users of the Amber force field know about this unphysical
dependence on atom indices? The effect is very small.
13 December 2012 13 / 36

A real-life case of (non-)reproducibility (1/5)

Goal: ﬁnd the gas-phase structure of short peptide sequences by
molecular simulation

or

Earlier work on this topic1 ﬁnds that Ac A15 K + H+ forms a helix, but
Ac K A15 + H+ forms a globule.

My simulations predict that both sequences form globules.
What did I do differently?

1
M.F. Jarrold, Phys. Chem. Chem. Phys. 9, 1659 (2007)
13 December 2012 14 / 36


From the paper:

Molecular Dynamics (MD) simulations were performed to
help interpret the experimental results. The simulations
were done with the MACSIMUS suite of programs [31] using
the CHARMM21.3 parameter set. A dielectric constant of 1.0
was employed.

The URL in ref. 31 is broken, but Google helps me ﬁnd MACSIMUS
nevertheless. It’s free and comes with a manual! ¨

I download the “latest release” dated 2012-11-09.
But which one was used for that paper in 2007?

13 December 2012 15 / 36


MACSIMUS comes with a file charmm21.par that starts with

! Parameter File for CHARMM version 21.3 [June 24, 1991]
! Includes parameters for both polar and all hydrogen topology files
! Based on QUANTA Parameter Handbook [Polygen Corporation, 1990]
! Modified by JK using various sources

Did that paper use the “polar” or the “all hydrogen” topology files?

I can find only one set of topology files named charmm21, and that’s
with polar hydrogens only, so I guess “polar”.

Now I have the parameters, but I don’t know the rules of the
CHARMM force field, nor can I be sure that MACSIMUS uses the same
rules as the CHARMM software.

13 December 2012 16 / 36


From the paper:

A variety of starting structures were employed (such as
helix, sheet, and extended linear chain) and a number of
simulated annealing schedules were used in an effort to
escape high energy local minima. Often, hundreds of
simulations were performed to explore the energy
landscape of a particular peptide. In some cases, MD with
simulated annealing was unable to locate the lowest energy
conformation and more sophisticated methods were used
(see description of evolutionary based methods below).

I might as well give up here...

My point is not to criticize this particular paper.
The level of description is typical for the ﬁeld.
13 December 2012 17 / 36


What I would have liked to get:
a machine-readable file containing a full specification of the
simulated system:
chemical structure
all force field terms with their parameters
the initial atom positions
a script implementing the annealing protocol
links to all the software used in the simulation, with version
numbers

All that with persistent references (DOIs).

13 December 2012 18 / 36

Enough whining...

Let’s do something
to improve the situ-
tion!

13 December 2012 19 / 36

1
A data model for molecular
simulations

2
A data model for executable
papers

3
Teaching best practices to
computational scientists

13 December 2012 20 / 36

Data handling in molecular simulation

There is no standard ﬁle format for molecular simulations.

Best approximation: the PDB format
made for a different purpose (crystallography)
lacks precision for coordinates
lacks the possibility to store anything else but coordinates
limited to 10.000 atoms
few if any programs respect it completely

OK... so let’s deﬁne something better...
... which is a far from trivial task.

13 December 2012 21 / 36

What do we want to archive and exchange?

the system we are simulating
the chemical structure of the molecules
the number of each kind of molecule
the topology of the system (infinite, 3D periodic, 2D periodic, ...)
symmetries (crystals, ...)
configurations (positions of all atoms plus PBC parameters)
velocities, masses, charges, and other per-atom properties
force field parameters
force fields (the rules)
trajectories
experimental data used in simulations
... lots of other stuff ...

Don’t try to do everything at once: we want a modular set of
conventions.
13 December 2012 22 / 36

Which systems, which simulation techniques?

anything at the atomic and molecular scale: gases, liquids, solids
from atoms via small molecules to macromolecules
all-atom and coarse-grained models
from quantum models via classical mechanics to stochastic
dynamics

Application domains:

Chemical/statistical physics
Solid-state physics
Chemistry
Molecular/structural biology

13 December 2012 23 / 36

Mosaic: a data model for molecular simulation
MOlecular SimulAtion Interchange Conventions

http://bitbucket.org/molsim/mosaic/

Mosaic is a data model with (currently) two representations.

Data model
expresses molecular simulation data in terms of basic
data items and structures: numbers, strings, lists, trees,
...
Representation
deﬁnes how a data model is stored as a sequence of
bytes in a ﬁle. Conversion between representations is
exact.

This is work in progress.
In order to succeed, it must become a community project.
13 December 2012 24 / 36

Mosaic: current state
Data items
System topology, chemical structure
Conﬁgurations
Per-atom/site numerical properties (velocities, charges, masses,
. . .)
Per-atom/site labels (atom names, atom types, . . .)

Representations
HDF5 for big data sets, fast access, and easy interfacing to
C/C++/Fortran code
XML for processing with standard XML tools
Planned: JSON for interfacing with Web-based tools
Three in-memory representations for Python

Tools
Conversion between representations biophysics and structural bioinformatics
Konrad HINSEN (CBM/SOLEIL)
Reproducible research in molecular 13 December 2012 25 / 36

HDF5 (Hierarchical Data Format, version 5)

Platform-independent binary
root group
storage groups
Efﬁcient library for data handling
Metadata (provenance,
dependencies, . . . )
Well-established: used by big
organizations (NASA, . . .) for very
large data sets (meteorology . . .)
Ideal for exchanging and
archiving data datasets
(≈ arrays)
Suitable for very large data sets

13 December 2012 26 / 36

So here’s my model for the gas phase peptide...

1 <?xml version="1.0" encoding="utf-8"?>
2 <mosaic version="0.1">
3 <universe cell_shape="infinite" convention="MMTK" id="universe">
4 <molecules>
5 <molecule count="1">
6 <fragment label="" species="protein">
7 <fragments>
8 <fragment label="A" polymer_type="polypeptide" species="ACE,Ala,Ala,Ala,Ala,Ala,Ala,Al
9 <fragments>
0 <fragment label="ACE0" species="ace_beginning">
1 <atoms>
2 <atom label="CBond" name="C" type="element" />
3 <atom label="CH3" name="C" type="element" />
4 <atom label="HH31" name="H" type="element" />
7 <atom label="O" name="O" type="element" />
8 </ atoms>
9 <bonds>
0 <bond atoms="CH3 HH31" order="" />
3 <bond atoms="CBond CH3" order="" />
4 <bond atoms="CBond O" order="" />
5 </ bonds>
6 </ fragment>

. . . (686 lines in total)
13 December 2012 27 / 36

Biomolecular simulation . . . a few years from now

PDB import
Mosaic HDF5 file
Chain extraction PDB crystal universe + configuration
Protein in vacuum universe + configuration
Force field Force field for protein in vacuum
Solvated protein universe + configuration
Solvatation Force field for solvated protein
Trajectory for solvated protein
MD simulation
Small independent All the data for the
programs simulation study in one place
13 December 2012 28 / 36

Managing, publishing, and archiving reproducible
research
Reproducible research results in large sets of interrelated software,
data sets, and documentation.
Managing ﬁles, dependencies, and metadata (provenance, . . .)
becomes a nightmare when moving between multiple
computers.
Publishing everything in an easily usable form is difﬁcult due to a
lack of standards.
Archiving is almost pointless: which software will still work 50
years from now?
ActivePapers
(http://dirac.cnrs-orleans.fr/plone/software/activepapers)
provides a solution to these problems
is a ready-to-run prototype (needs a Java installation)
is rather useless in practice for now
13 December 2012 29 / 36

Preserving data for a long time

Formalized descriptions
data models
ontologies
well-deﬁned ﬁle formats

Use standards, even bad ones
Standards ensure the critical mass that will motivate future scientists
to maintain the knowledge and the tools required to interpret your
data.

Limited dependencies
software: operating systems, compilers, libraries, . . .
infrastructure: Internet, communication protocols, . . .
organizations: research labs, publishers, . . .

13 December 2012 30 / 36

Code is data

There is nothing special about software, it’s just a kind of
data!

Everything is data:
Scientiﬁc data is data.
Code is data.
Input parameters are data.
Documentation is data.

All we need is good data models for everything, plus a way to
describe dependencies.

The tricky part is ﬁnding a good data model for code.

13 December 2012 31 / 36

Representing code as data
“Data types” for code:

source code: defined by a language specification
there are lots of languages
programmers are attached to them
binary/executable: defined by a processor/hardware
specification
very complex specification
evolves rapidly with technological progress
bytecode: defined by a virtual machine specification (JVM, CLI,
. . .)
clear and simple specification
not directly visible to the programmer
today’s virtual machines were not designed for scientific
computing
→ sub-optimal performance

JVM bytecodes provide the best existing infrastructure.
There HINSEN (CBM/SOLEIL) scientific software for the JVMand structural bioinformatics is32 / 36
Konrad is not much Reproducible research in molecular biophysics platform, which
13 December 2012

Dependency handling

Dataﬂow graph
of program execution

dataset 1 dataset 2

input

program Directed source code
Acyclic
output
Graph compiler
of dependencies
dataset 3
dataset 2

dataset 1 program

dataset 3

13 December 2012 33 / 36

ActivePapers overview (1/2)

An ActivePaper . . .
is an HDF5 file containing any number of data sets
typically contains scientific data, code, and documentation
is identified by a DOI (Digital Object Identifier)
can refer to other ActivePapers through their DOIs
stores the dependency graph between data sets in HDF5
metadata

Executable code in an ActivePaper . . .
cannot access files outside of ActivePaper files
cannot modify data except inside its own ActivePaper
comes in three varieties: calclet (reads and writes data), viewlet
(reads data for interactive visualization), and library (called by
other code)
13 December 2012 34 / 36

ActivePapers overview (2/2)

Special cases
a raw dataset (usually with some documentation)
a software library (usually with some documentation)

External dependencies (must be maintained for 50 years . . .)
a Java Virtual Machine with the Java standard library
the HDF5 library (could be rewritten as a JVM library)
JHDF5, a Java interface to HDF5 (could be rewritten as a JVM library)

Problems solved
future-proof publishing and archiving of computational research
secure repeatability of computational studies
reusability of scientiﬁc data and software
integration of scientiﬁc data and software into bibliometry
13 December 2012 35 / 36

Teaching best practices to computational scientists
Software Carpentry
is an organization dedicated to teaching computing skills to
scientists, with support from the Alfred P. Sloan Foundation and the
Mozilla Foundation.

Activities:
short intensive workshops (“boot camps”)
online courses

You can help by
hosting a boot camp
being an instructor at boot-camps
working on teaching material

http://software-carpentry.org/
13 December 2012 36 / 36

Reproducible research in molecular biophysics and structural biology

Recomendados

Recomendados

Mais conteúdo relacionado

Destaque

Destaque (13)

Semelhante a Reproducible research in molecular biophysics and structural biology

Semelhante a Reproducible research in molecular biophysics and structural biology (20)

Último

Último (20)

Reproducible research in molecular biophysics and structural biology