Reproducible research in molecular biophysics and
            structural bioinformatics

                                      Konrad HINSEN

                      Centre de Biophysique Moléculaire, Orléans, France
                                            and
                           Synchrotron SOLEIL, Saint Aubin, France


                                  13 December 2012




Konrad HINSEN (CBM/SOLEIL)    Reproducible research in molecular biophysics and structural bioinformatics    13 December 2012    1 / 36
Reproducibility

One of the ideals of science

     Scientific results should be verifiable.
     Verification requires reproduction by other scientists.
     Few results actually are reproduced, but it’s still important to
     make this possible:
            important for the credibility of science in society
            (remember “Climategate”, cold fusion, ...)
            important for the credibility of a specific study
            The more detail you provide about what you did, the more your
            peers are willing to believe that you did what you claim to have
            done.
     Reproducibility also matters for efficient collaboration inside a
     team.


Reproducibility in practice

Near-perfect in mathematics
No journal publishes a theorem unless the author provides a proof.

Often best-effort in experimental science
     Lab notebooks record all the details for in-house replication.
     Published protocols are less detailed, but often clear enough for
     an expert in the field.
     The main limitation is technical: lab equipment and concrete
     samples cannot be reproduced identically.

Lousy in computational science
     Papers give short method summaries and concentrate on results.
     Few scientists can reproduce their own results after a few
     months.
Reproducibility in computational science

We could do better than experimentalists
     Results are deterministic, fully determined by input data and
     algorithms.
     If we published all input data and programs, anyone could
     reproduce the results exactly.

But we don’t
     Lots of technical difficulties.
     Significant additional effort.
     Few incentives.

Goals of the Reproducible Research movement
     Create more awareness of the problem.
     Provide better tools.
Reproducible (computational) Research matters

“Climategate”
In 2009 a server at the Climatic Research Unit at the University of
East Anglia was hacked and many of its files became public. One of
them describes a scientist’s difficulties in reproducing his colleagues’
results. This has been used by climate change skeptics to discredit
climate research.

Protein structure retractions
In 2006, three important protein structures published in Science
were retracted following the discovery of a bug in the software used
for data processing.

For more examples and details:
    Z. Merali, “...Error ... why scientific programming does not
    compute”, Nature 467, 775 (2010)
    Science Special Issue on Computational Biology, 13 April 2012
Today’s tools for Reproducible Research

     Version control for source code and data
     Mercurial, Git, Subversion, ...
     Great for source code, usable for small data sets
     Workflow management systems
     VisTrails, Taverna, Kepler, ...
     Preservation of computational procedures and provenance
     tracking, great for scientists who like graphical workflows
     Electronic lab notebooks
     Mathematica, IPython notebook, ...
     Preservation of computational procedures and provenance
     tracking, great for scientists who like scripting
     Publication tools for computations
     Collage, IPOL, myExperiment, PyPedia, RunMyCode, SHARE, ...
     Experimental and/or domain-specific

Missing pieces                                                                                       (1/2)

Integration of version control and provenance tracking
Record for reproduction that dataset X was obtained from version 5
of dataset Y using version 2.2 of program Z compiled with gcc
version 4.1.

Version control for big datasets
Version control tools are made for text-based formats.

Provenance tracking and workflows across machines
If you do part of your computations elsewhere (supercomputer, lab’s
cluster, ...), existing tools won’t work for you.

Tool-independent standard file formats
If Alice wants to reproduce Bob’s results, she needs to use Bob’s
tools for version control and workflows/notebooks.
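
As a minimal sketch of the provenance record described under "Integration of version control and provenance tracking", the example of dataset X can be written down as a small data structure. The class and field names are illustrative assumptions, not part of any existing tool.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProvenanceRecord:
    """How one dataset was derived (all field names are hypothetical)."""
    output: str           # name of the derived dataset
    input: str            # name of the source dataset
    input_version: str    # version of the source dataset
    program: str          # program that performed the derivation
    program_version: str  # version of that program
    compiler: str         # compiler used to build the program


# The example from the slide: X comes from version 5 of Y,
# produced by version 2.2 of program Z compiled with gcc 4.1.
record = ProvenanceRecord(
    output="X",
    input="Y",
    input_version="5",
    program="Z",
    program_version="2.2",
    compiler="gcc 4.1",
)

# A plain-text serialization can itself be kept under version control.
serialized = json.dumps(asdict(record), indent=2, sort_keys=True)
print(serialized)
```

Because the serialized form is plain text, it can be read back without the tool that wrote it, which is the kind of tool independence the last item on this slide asks for.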
Missing pieces                                                                                       (2/2)

Reproducible floating-point computations
Conflict between reproducibility and performance:
     Reproducibility for IEEE float arithmetic requires an exact
     sequence of load, store, and arithmetic operations.
     Performance optimization requires shuffling around operations
     depending on processor type, cache size, memory access speed,
     number of processors, etc.
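
The dependence on the order of operations can be seen in a few lines of Python. This is only an illustration of the general effect: the helper `naive_sum` is a hypothetical name, and `math.fsum` is used here as one example of a summation whose result does not depend on operand order, at a cost in speed.

```python
import math


def naive_sum(values):
    """Add the values one by one, in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


a, b, c = 0.1, 0.2, 0.3

# IEEE addition is not associative: regrouping changes the last bits.
grouped_left = (a + b) + c
grouped_right = a + (b + c)
print(grouped_left == grouped_right)

# Reordering the same four numbers changes the naive result,
# as happens when a parallel reduction is split differently.
values = [1e16, 1.0, -1e16, 1.0]
print(naive_sum(values), naive_sum(sorted(values)))

# math.fsum returns the correctly rounded exact sum, so the order is irrelevant.
print(math.fsum(values), math.fsum(sorted(values)))
```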




Reproducibility in biomolecular simulations                                                          (1/3)



Heavy resource requirements

     Most important technique: Molecular Dynamics
     Long runs (days to weeks) on big parallel machines
     Produces large datasets (a few GB per trajectory)
     Trajectory analysis can be as expensive as the simulation itself


Datasets are too big for version control.

Today’s provenance trackers can’t keep track of computations across
machines.
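
One possible workaround, which is an assumption of this sketch and not something proposed on the slide, is to keep the large file outside version control and to record only a checksum that identifies its exact content. The function name `file_fingerprint` is hypothetical; the file is read in chunks so that a multi-gigabyte trajectory never has to fit in memory.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a trajectory file: about 2.5 MB of deterministic bytes,
# large enough to span several chunks.
payload = bytes(range(256)) * 10000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)

fingerprint = file_fingerprint(tmp.name)
os.remove(tmp.name)
print(fingerprint)
```

The 64-character fingerprint is small text, so it fits in an ordinary version control system. It identifies the data but cannot restore it.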




Reproducibility in biomolecular simulations                                                             (2/3)

Complexity

     Computational models contain thousands of parameters
     No complete formal description of the models
     (except for program source code)
     Simulation algorithms contain many approximations made for
     performance reasons
     These approximations are usually undocumented


Publishing a complete description of a molecular simulation is so
difficult that very few scientists could do it and equally few could
understand the resulting description.

There is no file format for storing a complete description of a
molecular simulation in machine-readable form.

Reproducibility in biomolecular simulations                                                             (3/3)
A split community
     A small number of “developers” with backgrounds in physics and
     chemistry
     A large number of “users” with backgrounds in biology and
     biochemistry
     A negligibly small number of computing specialists
     Users don’t care about the technical details of the simulations.
     Users (have to) trust the developers blindly.
     Developers understand the models but have no training in
     software development or data management.

The field is dominated by a few large monolithic simulation packages
whose internal workings are complex and mostly undocumented.

Intermediate stages of processing in a simulation workflow are
impossible to access or available only in program-specific formats.
Complexity: the Amber force field                                                                            (1/2)
The Amber force field (http://ambermd.org/#ff) is one of the most
popular computational models for biomolecules. In a research paper,
it is typically described by its formula for the potential energy:


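\begin{align*}
U ={}& \sum_{\text{bonds } ij} k_{ij} \left( r_{ij} - r_{ij}^{(0)} \right)^2 \\
     &+ \sum_{\text{angles } ijk} k_{ijk} \left( \phi_{ijk} - \phi_{ijk}^{(0)} \right)^2 \\
     &+ \sum_{\text{dihedrals } ijkl} k_{ijkl} \cos\left( n_{ijkl}\,\theta_{ijkl} - \delta_{ijkl} \right) \\
     &+ \sum_{\text{all pairs } ij} 4\epsilon_{ij} \left( \frac{\sigma_{ij}^{12}}{r_{ij}^{12}} - \frac{\sigma_{ij}^{6}}{r_{ij}^{6}} \right) \\
     &+ \sum_{\text{all pairs } ij} \frac{q_i q_j}{4\pi\epsilon_0\, r_{ij}}
\end{align*}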
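
To make the individual terms of this formula concrete, here is a minimal Python sketch that evaluates one term of each kind. The function names, the unit-free parameter values, and the `inv_4pi_eps0` argument (standing for 1/(4πε0) in whatever unit system is used) are assumptions of this sketch. It deliberately does not attempt the sums: which bonds, angles, dihedrals, and pairs to include, and which parameter values to use, is not specified by the formula.

```python
import math


def harmonic(k, x, x0):
    """Bond or angle term: k * (x - x0)^2."""
    return k * (x - x0) ** 2


def dihedral(k, n, theta, delta):
    """Dihedral term: k * cos(n * theta - delta)."""
    return k * math.cos(n * theta - delta)


def lennard_jones(epsilon, sigma, r):
    """Pair term: 4 * epsilon * (sigma^12 / r^12 - sigma^6 / r^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(qi, qj, r, inv_4pi_eps0=1.0):
    """Electrostatic pair term: qi * qj / (4 * pi * eps0 * r)."""
    return inv_4pi_eps0 * qi * qj / r


# The Lennard-Jones term vanishes at r = sigma and reaches its
# minimum value, -epsilon, at r = 2^(1/6) * sigma.
sigma, epsilon = 2.0, 0.5
r_min = 2.0 ** (1.0 / 6.0) * sigma
print(lennard_jones(epsilon, sigma, sigma), lennard_jones(epsilon, sigma, r_min))
```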


Complexity: the Amber force field                                                                        (2/2)
In addition to that formula, we need
     rules for summing over bonds, angles, etc.
     rules for finding the values of all the parameters for a given
     molecule
Most of these rules can be found on the Amber Web site (with some
effort), but not in any research paper.

One rule is, to the best of my knowledge, not written down anywhere.
It concerns the attribution of “improper dihedral” terms. I obtained it
from the Amber developers by asking a precise question by e-mail.

That rule refers to the indices of the atoms in the input file defining
the molecular structure. These indices are chosen arbitrarily. Making
an energy dependent on them is profoundly unphysical.

How many users of the Amber force field know about this unphysical
dependence on atom indices? The effect is very small.
A real-life case of (non-)reproducibility (1/5)

Goal: find the gas-phase structure of short peptide sequences by
molecular simulation




[Figure: two candidate structures, shown as alternatives separated by "or"]




Earlier work on this topic [1] finds that Ac A15 K + H+ forms a helix, but
Ac K A15 + H+ forms a globule.

My simulations predict that both sequences form globules.
What did I do differently?

[1] M.F. Jarrold, Phys. Chem. Chem. Phys. 9, 1659 (2007)
A real-life case of (non-)reproducibility (2/5)


From the paper:

     Molecular Dynamics (MD) simulations were performed to
     help interpret the experimental results. The simulations
     were done with the MACSIMUS suite of programs [31] using
     the CHARMM21.3 parameter set. A dielectric constant of 1.0
     was employed.


The URL in ref. 31 is broken, but Google helps me find MACSIMUS
nevertheless. It’s free and comes with a manual!

I download the “latest release” dated 2012-11-09.
But which one was used for that paper in 2007?



A real-life case of (non-)reproducibility (3/5)

MACSIMUS comes with a file charmm21.par that starts with

!   Parameter File for CHARMM version 21.3 [June 24, 1991]
!   Includes parameters for both polar and all hydrogen topology files
!   Based on QUANTA Parameter Handbook [Polygen Corporation, 1990]
!   Modified by JK using various sources



Did that paper use the “polar” or the “all hydrogen” topology files?

I can find only one set of topology files named charmm21, and that’s
with polar hydrogens only, so I guess “polar”.

Now I have the parameters, but I don’t know the rules of the
CHARMM force field, nor can I be sure that MACSIMUS uses the same
rules as the CHARMM software.


A real-life case of (non-)reproducibility (4/5)

From the paper:

      A variety of starting structures were employed (such as
      helix, sheet, and extended linear chain) and a number of
      simulated annealing schedules were used in an effort to
      escape high energy local minima. Often, hundreds of
      simulations were performed to explore the energy
      landscape of a particular peptide. In some cases, MD with
      simulated annealing was unable to locate the lowest energy
      conformation and more sophisticated methods were used
      (see description of evolutionary based methods below).


I might as well give up here...

My point is not to criticize this particular paper.
The level of description is typical for the field.
A real-life case of (non-)reproducibility (5/5)



What I would have liked to get:
     a machine-readable file containing a full specification of the
     simulated system:
            chemical structure
            all force field terms with their parameters
            the initial atom positions
     a script implementing the annealing protocol
     links to all the software used in the simulation, with version
     numbers

All that with persistent references (DOIs).
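
As a small illustration of the last item in the list, the versions of the software involved can be recorded automatically and shipped with the results. This sketch covers only a Python environment. The function name `environment_manifest` and the choice of `numpy` as an example package are assumptions. A package that is not installed is recorded as `None` rather than guessed.

```python
import json
import platform
from importlib import metadata


def environment_manifest(packages=()):
    """Collect interpreter, platform, and package versions as a JSON-ready dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


manifest = environment_manifest(packages=["numpy"])
print(json.dumps(manifest, indent=2))
```

Such a manifest names versions but does not make them retrievable. Persistent references to the software itself are still needed.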




Enough whining...

                   Let’s do something to improve the situation!




1. A data model for molecular simulations

2. A data model for executable papers

3. Teaching best practices to computational scientists

Data handling in molecular simulation


There is no standard file format for molecular simulations.

Best approximation: the PDB format
     made for a different purpose (crystallography)
     lacks precision for coordinates
     lacks the possibility to store anything else but coordinates
     limited to 99,999 atoms (five-digit atom serial numbers)
     few if any programs respect it completely


OK... so let’s define something better...
                                    ... which is a far from trivial task.



What do we want to archive and exchange?

     the system we are simulating
            the chemical structure of the molecules
            the number of each kind of molecule
            the topology of the system (infinite, 3D periodic, 2D periodic, ...)
            symmetries (crystals, ...)
     configurations (positions of all atoms plus PBC parameters)
     velocities, masses, charges, and other per-atom properties
     force field parameters
     force fields (the rules)
     trajectories
     experimental data used in simulations
     ... lots of other stuff ...

Don’t try to do everything at once: we want a modular set of
conventions.
Which systems, which simulation techniques?


     anything at the atomic and molecular scale: gases, liquids, solids
     from atoms via small molecules to macromolecules
     all-atom and coarse-grained models
     from quantum models via classical mechanics to stochastic
     dynamics

Application domains:

     Chemical/statistical physics
     Solid-state physics
     Chemistry
     Molecular/structural biology


Mosaic: a data model for molecular simulation
MOlecular SimulAtion Interchange Conventions

http://bitbucket.org/molsim/mosaic/

Mosaic is a data model with (currently) two representations.

Data model
                  expresses molecular simulation data in terms of basic
                  data items and structures: numbers, strings, lists, trees,
                  ...
Representation
            defines how a data model is stored as a sequence of
            bytes in a file. Conversion between representations is
            exact.
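
The distinction between a data model and its representations can be sketched in Python with a toy model. The classes `Atom` and `Fragment` and the JSON and XML layouts below are illustrative assumptions and do not reproduce the actual Mosaic schema. The point is only that both representations round-trip to the same in-memory data.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Atom:
    label: str
    name: str


@dataclass(frozen=True)
class Fragment:
    label: str
    species: str
    atoms: tuple  # tuple of Atom


# Representation 1: JSON text
def to_json(fragment):
    return json.dumps(asdict(fragment))


def from_json(text):
    d = json.loads(text)
    atoms = tuple(Atom(**a) for a in d["atoms"])
    return Fragment(d["label"], d["species"], atoms)


# Representation 2: XML text
def to_xml(fragment):
    root = ET.Element("fragment", label=fragment.label, species=fragment.species)
    for atom in fragment.atoms:
        ET.SubElement(root, "atom", label=atom.label, name=atom.name)
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    root = ET.fromstring(text)
    atoms = tuple(Atom(a.get("label"), a.get("name")) for a in root.findall("atom"))
    return Fragment(root.get("label"), root.get("species"), atoms)


fragment = Fragment("ACE0", "ace_beginning", (Atom("CH3", "C"), Atom("HH31", "H")))

# Conversion between representations goes through the data model and is exact.
via_xml = from_xml(to_xml(fragment))
via_both = from_json(to_json(via_xml))
print(via_both == fragment)
```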

     This is work in progress.
     In order to succeed, it must become a community project.
Mosaic: current state
Data items
     System topology, chemical structure
     Configurations
     Per-atom/site numerical properties (velocities, charges, masses,
     . . .)
     Per-atom/site labels (atom names, atom types, . . .)

Representations
     HDF5 for big data sets, fast access, and easy interfacing to
     C/C++/Fortran code
     XML for processing with standard XML tools
     Planned: JSON for interfacing with Web-based tools
     Three in-memory representations for Python

Tools
     Conversion between representations
HDF5 (Hierarchical Data Format, version 5)


  Platform-independent binary storage
  Efficient library for data handling
  Metadata (provenance, dependencies, . . . )
  Well-established: used by big organizations (NASA, . . .) for very
  large data sets (meteorology . . .)
  Ideal for exchanging and archiving data
  Suitable for very large data sets

  [Figure: HDF5 file hierarchy, with a root group containing groups,
  which in turn contain datasets (≈ arrays)]



So here’s my model for the gas phase peptide...

<?xml version="1.0" encoding="utf-8"?>
<mosaic version="0.1">
  <universe cell_shape="infinite" convention="MMTK" id="universe">
    <molecules>
      <molecule count="1">
        <fragment label="" species="protein">
          <fragments>
            <fragment label="A" polymer_type="polypeptide" species="ACE,Ala,Ala,Ala,Ala,Ala,Ala,Al
              <fragments>
                <fragment label="ACE0" species="ace_beginning">
                  <atoms>
                    <atom label="CBond" name="C" type="element" />
                    <atom label="CH3" name="C" type="element" />
                    <atom label="HH31" name="H" type="element" />
                    <atom label="HH32" name="H" type="element" />
                    <atom label="HH33" name="H" type="element" />
                    <atom label="O" name="O" type="element" />
                  </atoms>
                  <bonds>
                    <bond atoms="CH3 HH31" order="" />
                    <bond atoms="CH3 HH32" order="" />
                    <bond atoms="CH3 HH33" order="" />
                    <bond atoms="CBond CH3" order="" />
                    <bond atoms="CBond O" order="" />
                  </bonds>
                </fragment>


    . . . (686 lines in total)
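
The ACE0 fragment shown above can be read with standard XML tools. The following sketch uses Python's `xml.etree.ElementTree` on a copy of just that fragment, since the full 686-line file is not reproduced here. The variable names and the consistency check are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The ACE0 fragment from the slide, copied as a standalone document.
ace_fragment = """
<fragment label="ACE0" species="ace_beginning">
  <atoms>
    <atom label="CBond" name="C" type="element" />
    <atom label="CH3" name="C" type="element" />
    <atom label="HH31" name="H" type="element" />
    <atom label="HH32" name="H" type="element" />
    <atom label="HH33" name="H" type="element" />
    <atom label="O" name="O" type="element" />
  </atoms>
  <bonds>
    <bond atoms="CH3 HH31" order="" />
    <bond atoms="CH3 HH32" order="" />
    <bond atoms="CH3 HH33" order="" />
    <bond atoms="CBond CH3" order="" />
    <bond atoms="CBond O" order="" />
  </bonds>
</fragment>
""".strip()

root = ET.fromstring(ace_fragment)

# Map each atom label to its chemical element.
elements = {atom.get("label"): atom.get("name") for atom in root.iter("atom")}

# Each bond lists the labels of its two atoms, separated by a space.
bonds = [tuple(bond.get("atoms").split()) for bond in root.iter("bond")]

# Consistency check: every bonded atom must be declared in <atoms>.
undeclared = [label for bond in bonds for label in bond if label not in elements]
print(elements)
print(bonds)
print(undeclared)
```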
Biomolecular simulation . . . a few years from now



[Figure: small independent programs (left) and a single Mosaic HDF5 file (right)]

Small independent programs:
     PDB import
     Chain extraction
     Force field
     Solvation
     MD simulation

Mosaic HDF5 file (all the data for the simulation study in one place):
     PDB crystal universe + configuration
     Protein in vacuum universe + configuration
     Force field for protein in vacuum
     Solvated protein universe + configuration
     Force field for solvated protein
     Trajectory for solvated protein
Managing, publishing, and archiving reproducible
research
Reproducible research results in large sets of interrelated software,
data sets, and documentation.
     Managing files, dependencies, and metadata (provenance, . . .)
     becomes a nightmare when moving between multiple
     computers.
     Publishing everything in an easily usable form is difficult due to a
     lack of standards.
     Archiving is almost pointless: which software will still work 50
     years from now?
ActivePapers
(http://dirac.cnrs-orleans.fr/plone/software/activepapers)
     provides a solution to these problems
     is a ready-to-run prototype (needs a Java installation)
     is rather useless in practice for now
Preserving data for a long time

Formalized descriptions
     data models
     ontologies
     well-defined file formats

Use standards, even bad ones
Standards ensure the critical mass that will motivate future scientists
to maintain the knowledge and the tools required to interpret your
data.

Limited dependencies
     software: operating systems, compilers, libraries, . . .
     infrastructure: Internet, communication protocols, . . .
     organizations: research labs, publishers, . . .

Code is data

There is nothing special about software: it’s just a kind of
data!

Everything is data:
     Scientific data is data.
     Code is data.
     Input parameters are data.
     Documentation is data.

All we need is good data models for everything, plus a way to
describe dependencies.

The tricky part is finding a good data model for code.


Representing code as data
“Data types” for code:

      source code: defined by a language specification
            there are lots of languages
            programmers are attached to them
      binary/executable: defined by a processor/hardware
      specification
            very complex specification
            evolves rapidly with technological progress
      bytecode: defined by a virtual machine specification (JVM, CLI,
      . . .)
            clear and simple specification
            not directly visible to the programmer
            today’s virtual machines were not designed for scientific
            computing
            → sub-optimal performance

JVM bytecodes provide the best existing infrastructure.
There is not much scientific software for the JVM platform, which is ...
Dependency handling

[Figure, two panels.
 Left: dataflow graph of program execution. Dataset 1 and dataset 2 are
 the input of a program, whose output is dataset 3.
 Right: directed acyclic graph of dependencies, with the nodes source
 code, compiler, program, dataset 1, dataset 2, and dataset 3.]
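
A dependency graph of this kind determines the order in which items can be recomputed. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The edges are an assumed reading of the slide's figure: the program depends on its source code and the compiler, and dataset 3 depends on the program and on datasets 1 and 2.

```python
from graphlib import TopologicalSorter

# Each key depends on every item in its set (edges are an assumption).
dependencies = {
    "program": {"source code", "compiler"},
    "dataset 3": {"program", "dataset 1", "dataset 2"},
}

# A topological order lists every item after all of its dependencies.
# TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The exact order among independent items, such as dataset 1 and the compiler, is not fixed. Only the constraints imposed by the edges are guaranteed.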

ActivePapers overview (1/2)

An ActivePaper . . .
     is an HDF5 file containing any number of data sets
     typically contains scientific data, code, and documentation
     is identified by a DOI (Digital Object Identifier)
     can refer to other ActivePapers through their DOIs
     stores the dependency graph between data sets in HDF5
     metadata

Executable code in an ActivePaper . . .
     cannot access files outside of ActivePaper files
     cannot modify data except inside its own ActivePaper
     comes in three varieties: calclet (reads and writes data), viewlet
     (reads data for interactive visualization), and library (called by
     other code)
ActivePapers overview (2/2)

Special cases
     a raw dataset (usually with some documentation)
     a software library (usually with some documentation)

External dependencies (must be maintained for 50 years . . .)
     a Java Virtual Machine with the Java standard library
     the HDF5 library (could be rewritten as a JVM library)
     JHDF5, a Java interface to HDF5 (could be rewritten as a JVM library)

Problems solved
     future-proof publishing and archiving of computational research
     secure repeatability of computational studies
     reusability of scientific data and software
     integration of scientific data and software into bibliometry
Teaching best practices to computational scientists
Software Carpentry
is an organization dedicated to teaching computing skills to
scientists, with support from the Alfred P. Sloan Foundation and the
Mozilla Foundation.

Activities:
      short intensive workshops (“boot camps”)
      online courses


You can help by
      hosting a boot camp
      being an instructor at boot camps
      working on teaching material


http://software-carpentry.org/
