Reproducible research in molecular biophysics and structural biology
1. Reproducible research in molecular biophysics and
structural bioinformatics
Konrad HINSEN
Centre de Biophysique Moléculaire, Orléans, France
and
Synchrotron SOLEIL, Saint Aubin, France
13 December 2012
Konrad HINSEN (CBM/SOLEIL), Reproducible research in molecular biophysics and structural bioinformatics, 13 December 2012, 1 / 36
2. Reproducibility
One of the ideals of science
Scientific results should be verifiable.
Verification requires reproduction by other scientists.
Few results actually are reproduced, but it’s still important to
make this possible:
important for the credibility of science in society
(remember “Climategate”, cold fusion, ...)
important for the credibility of a specific study
The more detail you provide about what you did, the more your
peers are willing to believe that you did what you claim to have
done.
Reproducibility also matters for efficient collaboration inside a
team.
3. Reproducibility in practice
Near-perfect in mathematics
No journal publishes a theorem unless the author provides a proof.
Often best-effort in experimental science
Lab notebooks record all the details for in-house replication.
Published protocols are less detailed, but often clear enough for
an expert in the field.
The main limitation is technical: lab equipment and concrete
samples cannot be reproduced identically.
Lousy in computational science
Papers give short method summaries and concentrate on results.
Few scientists can reproduce their own results after a few
months.
4. Reproducibility in computational science
We could do better than experimentalists
Results are deterministic, fully determined by input data and
algorithms.
If we published all input data and programs, anyone could
reproduce the results exactly.
But we don’t
Lots of technical difficulties.
Considerable additional effort.
Few incentives.
Goals of the Reproducible Research movement
Create more awareness of the problem.
Provide better tools.
5. Reproducible (computational) Research matters
“Climategate”
In 2009 a server at the Climatic Research Unit at the University of
East Anglia was hacked and many of its files became public. One of
them describes a scientist’s difficulties in reproducing his colleagues’
results. This has been used by climate change skeptics to discredit
climate research.
Protein structure retractions
In 2006, three important protein structures published in Science
were retracted following the discovery of a bug in the software used
for data processing.
For more examples and details:
Z. Merali, “...Error ... why scientific programming does not
compute”, Nature 467, 775 (2010)
Science Special Issue on Computational Biology, 13 April 2012
6. Today’s tools for Reproducible Research
Version control for source code and data
Mercurial, Git, Subversion, ...
Great for source code, usable for small data sets
Workflow management systems
VisTrails, Taverna, Kepler, ...
Preservation of computational procedures and provenance
tracking, great for scientists who like graphical workflows
Electronic lab notebooks
Mathematica, IPython notebook, ...
Preservation of computational procedures and provenance
tracking, great for scientists who like scripting
Publication tools for computations
Collage, IPOL, myExperiment, PyPedia, RunMyCode, SHARE, ...
Experimental and/or domain-specific
7. Missing pieces (1/2)
Integration of version control and provenance tracking
Record for reproduction that dataset X was obtained from version 5
of dataset Y using version 2.2 of program Z compiled with gcc
version 4.1.
Version control for big datasets
Version control tools are made for text-based formats.
Provenance tracking and workflows across machines
If you do part of your computations elsewhere (supercomputer, lab’s
cluster, ...), existing tools won’t work for you.
Tool-independent standard file formats
If Alice wants to reproduce Bob’s results, she needs to use Bob’s
tools for version control and workflows/notebooks.
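The provenance record described in the first item of this slide (dataset X obtained from version 5 of dataset Y using version 2.2 of program Z compiled with gcc version 4.1) could be stored as a small machine-readable document next to the dataset. The following Python sketch shows one possible shape for such a record. It is not the output of any existing tool: the helper names `content_digest` and `provenance_record`, the field names, and the use of a SHA-256 digest to identify the output are assumptions made for illustration.

```python
import hashlib
import json


def content_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the exact bytes."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(output_name, output_bytes, input_name, input_version,
                      program, program_version, compiler):
    """Build a hypothetical provenance record for one derived dataset."""
    return {
        "output": {"name": output_name,
                   "sha256": content_digest(output_bytes)},
        "input": {"name": input_name, "version": input_version},
        "program": {"name": program, "version": program_version},
        "compiler": compiler,
    }


# Placeholder bytes stand in for the real contents of dataset X.
record = provenance_record(
    output_name="X", output_bytes=b"example contents of dataset X",
    input_name="Y", input_version=5,
    program="Z", program_version="2.2",
    compiler="gcc 4.1",
)

# Sorted keys make the serialized record itself byte-for-byte reproducible.
record_json = json.dumps(record, sort_keys=True, indent=2)
print(record_json)
```

Such a record is only useful if version control and the provenance tracker agree on what "version 5 of dataset Y" means, which is the integration problem named above.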
8. Missing pieces (2/2)
Reproducible floating-point computations
Conflict between reproducibility and performance:
Reproducibility for IEEE float arithmetic requires an exact
sequence of load, store, and arithmetic operations.
Performance optimization requires shuffling around operations
depending on processor type, cache size, memory access speed,
number of processors, etc.
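The dependence on operation order can be seen with three numbers. The Python sketch below adds the same values in two different orders, as a compiler optimization or a parallel reduction might, and compares the results. `math.fsum` is shown as one order-independent alternative; it is slower than a plain sum, which illustrates the same conflict between reproducibility and performance.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, two evaluation orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# Floating-point addition is not associative, so the results differ.
print(left_to_right == right_to_left)   # False
print(left_to_right, right_to_left)

# math.fsum tracks the intermediate rounding errors and returns the
# correctly rounded sum, so its result does not depend on term order.
print(math.fsum(values) == math.fsum(reversed(values)))   # True
```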
9. Reproducibility in biomolecular simulations (1/3)
Heavy resource requirements
Most important technique: Molecular Dynamics
Long runs (days to weeks) on big parallel machines
Produces large datasets (a few GB per trajectory)
Trajectory analysis can be as expensive as the simulation itself
Datasets are too big for version control.
Today’s provenance trackers can’t keep track of computations across
machines.
10. Reproducibility in biomolecular simulations (2/3)
Complexity
Computational models contain thousands of parameters
No complete formal description of the models
(except for program source code)
Simulation algorithms contain many approximations made for
performance reasons
These approximations are usually undocumented
Publishing a complete description of a molecular simulation is so
difficult that very few scientists could do it and equally few could
understand the resulting description.
There is no file format for storing a complete description of a
molecular simulation in machine-readable form.
11. Reproducibility in biomolecular simulations (3/3)
A split community
A small number of “developers” with backgrounds in physics and
chemistry
A large number of “users” with backgrounds in biology and
biochemistry
A negligibly small number of computing specialists
Users don’t care about the technical details of the simulations.
Users (have to) trust the developers blindly.
Developers understand the models but have no training in
software development or data management.
The field is dominated by a few large monolithic simulation packages
whose internal workings are complex and mostly undocumented.
Intermediate stages of processing in a simulation workflow are
impossible to access or available only in program-specific formats.
12. Complexity: the Amber force field (1/2)
The Amber force field (http://ambermd.org/#ff) is one of the most
popular computational models for biomolecules. In a research paper,
it is typically described by its formula for the potential energy:
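\[
U = \sum_{\text{bonds } ij} k_{ij} \left( r_{ij} - r_{ij}^{(0)} \right)^2
  + \sum_{\text{angles } ijk} k_{ijk} \left( \phi_{ijk} - \phi_{ijk}^{(0)} \right)^2
  + \sum_{\text{dihedrals } ijkl} k_{ijkl} \cos\left( n_{ijkl} \theta_{ijkl} - \delta_{ijkl} \right)
  + \sum_{\text{all pairs } ij} 4 \epsilon_{ij} \left( \frac{\sigma_{ij}^{12}}{r_{ij}^{12}} - \frac{\sigma_{ij}^{6}}{r_{ij}^{6}} \right)
  + \sum_{\text{all pairs } ij} \frac{q_i q_j}{4 \pi \epsilon_0 r_{ij}}
\]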
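To make the structure of this formula concrete, the following Python sketch evaluates its five sums for explicitly supplied term lists. It is an illustration under stated assumptions, not Amber's implementation: the function name `potential_energy`, the tuple layouts, and the `coulomb_constant` argument (standing for 1/(4πε0) in the caller's unit system) are hypothetical, and the numbers in the usage lines are toy values, not force field parameters. The lists of bonds, angles, dihedrals, and pairs, as well as every parameter value, are inputs that the formula alone does not provide.

```python
import math


def distance(positions, i, j):
    """Euclidean distance between atoms i and j."""
    return math.dist(positions[i], positions[j])


def potential_energy(positions, bonds, angles, dihedrals, pairs,
                     coulomb_constant=1.0):
    """Sum the five terms of the energy formula.

    bonds:     (i, j, k_ij, r0_ij)
    angles:    (k_ijk, phi, phi0), with the angle phi precomputed
    dihedrals: (k_ijkl, n, theta, delta), with theta precomputed
    pairs:     (i, j, epsilon_ij, sigma_ij, q_i, q_j)
    """
    u = 0.0
    for i, j, k, r0 in bonds:
        u += k * (distance(positions, i, j) - r0) ** 2
    for k, phi, phi0 in angles:
        u += k * (phi - phi0) ** 2
    for k, n, theta, delta in dihedrals:
        u += k * math.cos(n * theta - delta)
    for i, j, eps, sigma, qi, qj in pairs:
        r = distance(positions, i, j)
        # Lennard-Jones term
        u += 4.0 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)
        # Coulomb term
        u += coulomb_constant * qi * qj / r
    return u


# Toy usage: two atoms one length unit apart.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
bond_only = potential_energy(positions, [(0, 1, 2.0, 0.5)], [], [], [])
pair_only = potential_energy(positions, [], [], [],
                             [(0, 1, 1.0, 1.0, 1.0, -1.0)])
print(bond_only, pair_only)
```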
13. Complexity: the Amber force field (2/2)
In addition to that formula, we need
rules for summing over bonds, angles, etc.
rules for finding the values of all the parameters for a given
molecule
Most of these rules can be found on the Amber Web site (with some
effort), but not in any research paper.
One rule is, to the best of my knowledge, not written down anywhere.
It concerns the attribution of “improper dihedral” terms. I obtained it
from the Amber developers by asking a precise question by e-mail.
That rule refers to the indices of the atoms in the input file defining
the molecular structure. These indices are chosen arbitrarily. Making
an energy dependent on them is profoundly unphysical.
How many users of the Amber force field know about this unphysical
dependence on atom indices? The effect is very small.
14. A real-life case of (non-)reproducibility (1/5)
Goal: find the gas-phase structure of short peptide sequences by
molecular simulation
Earlier work on this topic [1] finds that Ac-A15K + H+ forms a helix, but
Ac-KA15 + H+ forms a globule.
My simulations predict that both sequences form globules.
What did I do differently?
[1] M.F. Jarrold, Phys. Chem. Chem. Phys. 9, 1659 (2007)
15. A real-life case of (non-)reproducibility (2/5)
From the paper:
Molecular Dynamics (MD) simulations were performed to
help interpret the experimental results. The simulations
were done with the MACSIMUS suite of programs [31] using
the CHARMM21.3 parameter set. A dielectric constant of 1.0
was employed.
The URL in ref. 31 is broken, but Google helps me find MACSIMUS
nevertheless. It’s free and comes with a manual!
I download the “latest release” dated 2012-11-09.
But which one was used for that paper in 2007?
16. A real-life case of (non-)reproducibility (3/5)
MACSIMUS comes with a file charmm21.par that starts with
! Parameter File for CHARMM version 21.3 [June 24, 1991]
! Includes parameters for both polar and all hydrogen topology files
! Based on QUANTA Parameter Handbook [Polygen Corporation, 1990]
! Modified by JK using various sources
Did that paper use the “polar” or the “all hydrogen” topology files?
I can find only one set of topology files named charmm21, and that’s
with polar hydrogens only, so I guess “polar”.
Now I have the parameters, but I don’t know the rules of the
CHARMM force field, nor can I be sure that MACSIMUS uses the same
rules as the CHARMM software.
17. A real-life case of (non-)reproducibility (4/5)
From the paper:
A variety of starting structures were employed (such as
helix, sheet, and extended linear chain) and a number of
simulated annealing schedules were used in an effort to
escape high energy local minima. Often, hundreds of
simulations were performed to explore the energy
landscape of a particular peptide. In some cases, MD with
simulated annealing was unable to locate the lowest energy
conformation and more sophisticated methods were used
(see description of evolutionary based methods below).
I might as well give up here...
My point is not to criticize this particular paper.
The level of description is typical for the field.
18. A real-life case of (non-)reproducibility (5/5)
What I would have liked to get:
a machine-readable file containing a full specification of the
simulated system:
chemical structure
all force field terms with their parameters
the initial atom positions
a script implementing the annealing protocol
links to all the software used in the simulation, with version
numbers
All that with persistent references (DOIs).
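The last item on this wish list, software with version numbers, is the easiest to automate. The Python sketch below collects interpreter, platform, and package versions into a record that could accompany a simulation. The function name `software_environment`, the record layout, and the package names in the usage lines are hypothetical; only standard-library calls are used.

```python
import json
import platform
from importlib import metadata


def software_environment(packages):
    """Collect interpreter, OS, and package versions for a run record."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# The second name is deliberately one that should not be installed.
env = software_environment(["numpy", "a-package-that-does-not-exist"])
print(json.dumps(env, indent=2))
```

A record like this states which versions were used, but it does not make them retrievable later; that still requires archived releases with persistent references.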
19. Enough whining...
Let’s do something
to improve the
situation!
20.
  1. A data model for molecular simulations
  2. A data model for executable papers
  3. Teaching best practices to computational scientists
21. Data handling in molecular simulation
There is no standard file format for molecular simulations.
Best approximation: the PDB format
made for a different purpose (crystallography)
lacks precision for coordinates
lacks the possibility to store anything else but coordinates
limited to 99,999 atoms (five-digit atom serial numbers)
few if any programs respect it completely
OK... so let’s define something better...
... which is a far from trivial task.
22. What do we want to archive and exchange?
the system we are simulating
the chemical structure of the molecules
the number of each kind of molecule
the topology of the system (infinite, 3D periodic, 2D periodic, ...)
symmetries (crystals, ...)
configurations (positions of all atoms plus PBC parameters)
velocities, masses, charges, and other per-atom properties
force field parameters
force fields (the rules)
trajectories
experimental data used in simulations
... lots of other stuff ...
Don’t try to do everything at once: we want a modular set of
conventions.
23. Which systems, which simulation techniques?
anything at the atomic and molecular scale: gases, liquids, solids
from atoms via small molecules to macromolecules
all-atom and coarse-grained models
from quantum models via classical mechanics to stochastic
dynamics
Application domains:
Chemical/statistical physics
Solid-state physics
Chemistry
Molecular/structural biology
24. Mosaic: a data model for molecular simulation
MOlecular SimulAtion Interchange Conventions
http://bitbucket.org/molsim/mosaic/
Mosaic is a data model with (currently) two representations.
Data model
expresses molecular simulation data in terms of basic
data items and structures: numbers, strings, lists, trees,
...
Representation
defines how a data model is stored as a sequence of
bytes in a file. Conversion between representations is
exact.
This is work in progress.
In order to succeed, it must become a community project.
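The distinction between a data model and its representations can be illustrated with a small Python sketch. It does not use the Mosaic library or its actual file layouts: the `Configuration` class, the JSON layout, the XML element names, and the coordinates are hypothetical stand-ins chosen for illustration. The point is that one in-memory data item can be written in two byte-level forms and converted between them without loss.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    """Hypothetical data model item: atom names plus positions."""
    atom_names: tuple
    positions: tuple  # one (x, y, z) tuple per atom


def to_json(conf):
    return json.dumps({"atoms": list(conf.atom_names),
                       "positions": [list(p) for p in conf.positions]})


def from_json(text):
    data = json.loads(text)
    return Configuration(tuple(data["atoms"]),
                         tuple(tuple(p) for p in data["positions"]))


def to_xml(conf):
    root = ET.Element("configuration")
    for name, (x, y, z) in zip(conf.atom_names, conf.positions):
        # repr() of a Python float round-trips exactly through float().
        ET.SubElement(root, "atom", name=name,
                      x=repr(x), y=repr(y), z=repr(z))
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    atoms = ET.fromstring(text).findall("atom")
    return Configuration(
        tuple(a.get("name") for a in atoms),
        tuple((float(a.get("x")), float(a.get("y")), float(a.get("z")))
              for a in atoms))


# Illustrative coordinates only, not taken from any real structure.
example = Configuration(("O", "H1", "H2"),
                        ((0.0, 0.0, 0.1), (0.0, 0.75, -0.45),
                         (0.0, -0.75, -0.45)))

# JSON -> data model -> XML -> data model loses nothing.
round_trip = from_xml(to_xml(from_json(to_json(example))))
print(round_trip == example)
```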
25. Mosaic: current state
Data items
System topology, chemical structure
Configurations
Per-atom/site numerical properties (velocities, charges, masses,
. . .)
Per-atom/site labels (atom names, atom types, . . .)
Representations
HDF5 for big data sets, fast access, and easy interfacing to
C/C++/Fortran code
XML for processing with standard XML tools
Planned: JSON for interfacing with Web-based tools
Three in-memory representations for Python
Tools
Conversion between representations
26. HDF5 (Hierarchical Data Format, version 5)
Platform-independent binary storage
Efficient library for data handling
Metadata (provenance, dependencies, . . . )
Well-established: used by big organizations (NASA, . . .) for very
large data sets (meteorology, . . .)
Ideal for exchanging and archiving data
Suitable for very large data sets
[Figure: hierarchy of an HDF5 file: root group, groups, datasets (≈ arrays)]
28. Biomolecular simulation . . . a few years from now
Small independent programs:
PDB import
Chain extraction
Force field
Solvation
MD simulation
Mosaic HDF5 file (all the data for the simulation study in one place):
PDB crystal universe + configuration
Protein in vacuum universe + configuration
Force field for protein in vacuum
Solvated protein universe + configuration
Force field for solvated protein
Trajectory for solvated protein
29. Managing, publishing, and archiving reproducible
research
Reproducible research results in large sets of interrelated software,
data sets, and documentation.
Managing files, dependencies, and metadata (provenance, . . .)
becomes a nightmare when moving between multiple
computers.
Publishing everything in an easily usable form is difficult due to a
lack of standards.
Archiving is almost pointless: which software will still work 50
years from now?
ActivePapers
(http://dirac.cnrs-orleans.fr/plone/software/activepapers)
provides a solution to these problems
is a ready-to-run prototype (needs a Java installation)
is rather useless in practice for now
30. Preserving data for a long time
Formalized descriptions
data models
ontologies
well-defined file formats
Use standards, even bad ones
Standards ensure the critical mass that will motivate future scientists
to maintain the knowledge and the tools required to interpret your
data.
Limited dependencies
software: operating systems, compilers, libraries, . . .
infrastructure: Internet, communication protocols, . . .
organizations: research labs, publishers, . . .
31. Code is data
There is nothing special about software, it’s just a kind of
data!
Everything is data:
Scientific data is data.
Code is data.
Input parameters are data.
Documentation is data.
All we need is good data models for everything, plus a way to
describe dependencies.
The tricky part is finding a good data model for code.
32. Representing code as data
“Data types” for code:
source code: defined by a language specification
there are lots of languages
programmers are attached to them
binary/executable: defined by a processor/hardware
specification
very complex specification
evolves rapidly with technological progress
bytecode: defined by a virtual machine specification (JVM, CLI,
. . .)
clear and simple specification
not directly visible to the programmer
today’s virtual machines were not designed for scientific
computing
→ sub-optimal performance
JVM bytecodes provide the best existing infrastructure.
There is not much scientific software for the JVM platform, which is ...
33. Dependency handling
[Figure, left panel: dataflow graph of program execution. dataset 1 and dataset 2 are the input of a program whose output is dataset 3.]
[Figure, right panel: directed acyclic graph of dependencies, with nodes dataset 1, dataset 2, dataset 3, program, source code, and compiler.]
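A dependency graph of this kind can be handled with standard tools. The Python sketch below encodes one plausible reading of the diagram (dataset 3 depends on dataset 1, dataset 2, and the program; the program depends on its source code and the compiler) and derives two things from it: an order in which everything can be recomputed, and the set of items that become stale when one item changes. The edge list is an assumption based on the node labels, and `needs_update` is a hypothetical helper, not part of any existing tool.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each key depends on the items in its set.
dependencies = {
    "program": {"source code", "compiler"},
    "dataset 3": {"dataset 1", "dataset 2", "program"},
}

# A valid order in which to obtain or recompute every item.
order = list(TopologicalSorter(dependencies).static_order())
print(order)


def needs_update(changed, dependencies):
    """Return every item that directly or indirectly depends on `changed`."""
    stale = set()
    frontier = {changed}
    while frontier:
        # Items with at least one dependency in the current frontier.
        frontier = {item for item, deps in dependencies.items()
                    if deps & frontier and item not in stale}
        stale |= frontier
    return stale


print(needs_update("compiler", dependencies))
```

Because the graph is acyclic, a topological order always exists; `TopologicalSorter` raises `CycleError` if a cycle is introduced by mistake.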
34. ActivePapers overview (1/2)
An ActivePaper . . .
is an HDF5 file containing any number of data sets
typically contains scientific data, code, and documentation
is identified by a DOI (Digital Object Identifier)
can refer to other ActivePapers through their DOIs
stores the dependency graph between data sets in HDF5
metadata
Executable code in an ActivePaper . . .
cannot access files outside of ActivePaper files
cannot modify data except inside its own ActivePaper
comes in three varieties: calclet (reads and writes data), viewlet
(reads data for interactive visualization), and library (called by
other code)
35. ActivePapers overview (2/2)
Special cases
a raw dataset (usually with some documentation)
a software library (usually with some documentation)
External dependencies (must be maintained for 50 years . . .)
a Java Virtual Machine with the Java standard library
the HDF5 library (could be rewritten as a JVM library)
JHDF5, a Java interface to HDF5 (could be rewritten as a JVM library)
Problems solved
future-proof publishing and archiving of computational research
secure repeatability of computational studies
reusability of scientific data and software
integration of scientific data and software into bibliometry
36. Teaching best practices to computational scientists
Software Carpentry
is an organization dedicated to teaching computing skills to
scientists, with support from the Alfred P. Sloan Foundation and the
Mozilla Foundation.
Activities:
short intensive workshops (“boot camps”)
online courses
You can help by
hosting a boot camp
being an instructor at boot camps
working on teaching material
http://software-carpentry.org/