This document discusses the challenges of reproducibility in bioinformatics. It notes that for an analysis to be repeatable, the same data, code, and version information must be available. However, obtaining the exact same starting data can be difficult when data is large, hardware fails, or filtering steps are not documented. Pipelines help capture and automate analyses but are not a panacea, as quality control requires human judgment. The best approach may be to package and publish individual analyses with documentation of the full process.
1. Reproducibility: The Myths and Truths of "Push-Button" Bioinformatics
Simon Cockell
Bioinformatics Special Interest Group
19th July 2012
2. Repeatability and Reproducibility
• Main principle of the scientific method
• Repeatability is 'within lab'
• Reproducibility is 'between lab'
• Broader concept
• This should be easy in bioinformatics, right?
• Same data + same code = same results
• Not many analyses have stochasticity
http://xkcd.com/242/
3. Same data?
• Example:
• Data deposited in SRA
• Original data deleted by researchers
• .sra files are NOT .fastq
• All filtering/QC steps lost
• Starting point for subsequent analysis not the same – regardless of whether the same code is used
4. Same data?
• Data files are very large
• Hardware failures are surprisingly common
• Not all hardware failures are catastrophic
• Bit-flipping by faulty RAM
• Do you keep an md5sum of your data, to ensure it hasn't been corrupted by the transfer process?
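As a sketch of that checksum habit (the file name here is a hypothetical stand-in for real sequencing data), a checksum recorded next to the data can be re-verified after every transfer:

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for a large sequencing file (toy data, hypothetical name)
printf '@read1\nACGT\n+\nIIII\n' > sample.fastq

# Record the checksum alongside the data when it is first produced...
md5sum sample.fastq > sample.fastq.md5

# ...and verify it after any copy or transfer;
# a non-zero exit status flags corruption
md5sum -c sample.fastq.md5
```

Because `md5sum -c` exits non-zero on a mismatch, it can gate the next step of an analysis rather than letting a silently corrupted file flow downstream.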
5. Same code?
• Which version of a particular piece of software did you use?
• Is it still available?
• Did you write it yourself?
• Do you use version control?
• Did you tag a version?
• Is the software closed/proprietary?
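One low-tech answer to the version question (assuming GNU tools on the PATH) is simply to log the version of every tool used, next to the results:

```shell
set -e
cd "$(mktemp -d)"

# Log the exact tool versions alongside the analysis outputs,
# so a published result can later be matched to the software that made it
{
  bash --version   | head -n1
  grep --version   | head -n1
  md5sum --version | head -n1
} > versions.txt

cat versions.txt
```

Checking `versions.txt` in alongside the code and results turns "what version did you use?" into a lookup instead of guesswork.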
6. Version Control
• Good practice for software AND data
• DVCS means it doesn't have to be in a remote repository
• All local folders can be versioned
• Doesn't mean they have to be – it's a judgment call
• Check in regularly
• Tag important "releases"
https://twitter.com/sjcockell/status/202041359920676864
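A minimal sketch of those points with Git, a DVCS (the file name, commit message, and tag name are all hypothetical):

```shell
set -e
proj=$(mktemp -d)
cd "$proj"

# Any local folder can be a repository -- no remote server required
git init -q
git config user.email "analyst@example.com"
git config user.name  "Analyst"

# Check in regularly: version the analysis code (and small data files)
echo 'wc -l data.txt' > analysis.sh
git add analysis.sh
git commit -q -m "Initial analysis script"

# Tag the exact state that produced a published result,
# so that state can be recovered later
git tag -a v1.0 -m "Version used for the paper"
git tag
```

The tag is the key step for reproducibility: `git checkout v1.0` later restores exactly the code that produced the result, regardless of what has changed since.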
7. Pipelines
• Package your analysis
• Easily repeatable
• Also easy to distribute
• Start-to-finish task automation
• Process captured by the underlying pipeline architecture
http://bioinformatics.knowledgeblog.org/2011/06/21/using-standardized-bioinformatics-formats-in-taverna-workflows-for-integrating-biological-data/
8. Tools for pipelining analyses
• Huge numbers exist
• See: http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems
• Only a few widely used:
• Bash – old school
• Taverna – build workflows from public web services
• Galaxy – sequencing focus; tools provided in a 'toolshed'
• Microbase – distributed computing; build workflows from 'responders'
• e-Science Central – 'Science as a Service', cloud focus; not specifically a bioinformatics tool
9. Bash
• Single-machine (or cluster) command-line workflows
• No fancy GUIs
• Record provenance & process
• Rudimentary parallel processing
http://www.gnu.org/software/bash/
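A toy sketch of such a workflow (the sample names are made up): background jobs provide the "rudimentary parallel processing", and a log line records the provenance of the step:

```shell
set -e
d=$(mktemp -d)
cd "$d"

# Toy per-sample inputs standing in for real data files
printf 'ACGT\nGGCC\n' > sampleA.txt
printf 'TTTT\n'       > sampleB.txt

# Rudimentary parallelism: one background job per sample;
# 'wait' blocks until every job has finished
for f in sampleA.txt sampleB.txt; do
  ( wc -l < "$f" > "${f%.txt}.count" ) &
done
wait

# Record provenance: when the step ran, and on which inputs
echo "$(date -u) counted lines in sampleA.txt sampleB.txt" >> pipeline.log
cat sampleA.count sampleB.count
```

Plain shell scripts like this capture the full process in a form anyone can read and re-run, which is much of what the fancier workflow systems provide.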
10. (image-only slide)
11. Taverna
• Workflows from web services
• Lack of relevant services – relies on providers
• Gluing services together increasingly problematic
• Sharing workflows through myExperiment
• http://www.myexperiment.org/
http://www.taverna.org.uk/
12. Galaxy
• "Open, web-based platform for data intensive biomedical research"
• Install locally, or use the (limited) public server
• Can build workflows from tools in the 'toolshed'
• Command-line tools wrapped with a web interface
https://main.g2.bx.psu.edu/
14. Microbase
• Task management framework
• Workflows emerge from interacting 'responders'
• Notification system passes messages around
• 'Cloud-ready' system that scales easily
• Responders must be written for new tools
http://www.microbasecloud.com/
15. e-Science Central
• 'Blocks' can be combined into workflows
• Blocks need to be written by an expert
• Social networking features
• Good provenance recording
http://www.esciencecentral.co.uk/
16. The best approach?
• Good for an individual analysis: package & publish it
• All datasets are different – one size does not fit all
• Downstream processes often depend on the results of upstream ones
• Note the lack of QC in pipelines:
• Requires human interaction – impossible to pipeline
• Different every time
• Subjective – a major source of variation in results
• BUT important and necessary (GIGO – garbage in, garbage out)
17. More tools for reproducibility
• IPython notebook
• http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
• Build notebooks with code embedded
• Run code arbitrarily
• Example: https://pilgrims.ncl.ac.uk:9999/
• RunMyCode.org
• Allows researchers to create 'companion websites' for papers
• These websites allow readers to run the methodology described in the paper
• Example: http://www.runmycode.org/CompanionSite/site.do?siteId=92
18. The executable paper
• The ultimate in repeatable research
• Data and code embedded in the publication
• Figures can be generated, in situ, from the actual data
• http://ged.msu.edu/papers/2012-diginorm/
19. Summary
• For work to be repeatable:
• Data and code must be available
• Process must be documented (and preferably shared)
• Version information is important
• Pipelines are not the great panacea
• Though they may help for parts of the process
• Bash is as good as many 'fancier' tools (for tasks on a single machine or cluster)
20. Inspirations for this talk
• C. Titus Brown's blog posts on repeatability and the executable paper
• http://ivory.idyll.org/blog
• Michael Barton's blog posts about organising bioinformatics projects and pipelines
• http://bioinformaticszen.com/
Editor's Notes
Or any other Unix shell (maybe even Windows batch scripts)