This document discusses scientific workflow systems and their use for automating computational tasks, making use of computational infrastructure, abstracting away complexity, capturing provenance, and enabling reproducibility. It provides an overview of various workflow systems including Snakemake, Nextflow, Taverna, KNIME, Galaxy, and Common Workflow Language, and how containers like Docker can be used to package and distribute workflows.
1. Partners Funding
bioexcel.eu
Scientific Workflow Systems
Stian Soiland-Reyes
eScience Lab, The University of Manchester
2017-11-03, Aix-en-Provence
CESAB workshop: Reproducible Workflows
orcid.org/0000-0001-9842-9718 @soilandreyes
This work is licensed under a Creative Commons Attribution 4.0 International License.
2. bioexcel.eu
What is a Workflow?
Orchestrating computational tasks
Managing the control and data flow
Homogeneous or heterogeneous tasks:
– Local / remote
– Own / third party
– White, grey or black boxes
– Reliable / fragile
– Reserved / dynamic
– Various underpinning infrastructure
– Various access controls
BioExcel: Biomolecular recognition
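As a minimal sketch of what "managing the control and data flow" can mean, the Python snippet below runs a few tasks in dependency order and hands each task the outputs of the tasks it depends on. The task names (`fetch`, `clean`, `count`, `unique`) and the `run_workflow` helper are hypothetical, made up for illustration; real workflow systems add scheduling, remote execution and error handling on top of this idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_workflow(tasks, dependencies):
    """Run tasks in dependency order.

    tasks: maps a task name to a function that takes the outputs of its
           upstream tasks as positional arguments (in declared order).
    dependencies: maps a task name to the list of tasks it depends on.
    """
    # Control flow: upstream tasks always come before their dependents
    order = TopologicalSorter(dependencies).static_order()
    results = {}
    for name in order:
        # Data flow: collect the outputs this task consumes
        upstream = [results[dep] for dep in dependencies.get(name, [])]
        results[name] = tasks[name](*upstream)
    return results

# Hypothetical toy tasks standing in for real computational steps
tasks = {
    "fetch": lambda: ["b", "a", "", "a"],
    "clean": lambda rows: [r for r in rows if r],
    "count": lambda rows: len(rows),
    "unique": lambda rows: sorted(set(rows)),
}
dependencies = {
    "fetch": [],
    "clean": ["fetch"],
    "count": ["clean"],
    "unique": ["clean"],
}

results = run_workflow(tasks, dependencies)
print(results["count"], results["unique"])
```

Here the tasks are local, reliable Python functions; the same dependency structure would apply if some of them were remote or third-party services.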
3. bioexcel.eu
Not on the agenda: Business workflows
Control flow of who has responsibility for what
BPM
Business workflows + computational workflows
IBISBA
4. bioexcel.eu
Why use workflows?
Automation
– Automate computational aspects
– Repetitive pipelines, sweep campaigns
Scaling – compute cycles
– Make use of computational infrastructure & handle large data
Abstraction – people cycles
– Shield complexity and incompatibilities
– Report, re-use, evolve, share, compare
– Repeat – Tweak – Repeat
– First class commodities
Provenance – reporting
– Capture, report and utilize log and data lineage
– Auto-documentation
– Traceable evolution, audit, transparency
– Compare
Findable
Accessible
Interoperable
Reusable
(Reproducible)
Adapted from Bertram Ludäscher at WORKS 2015 https://www.slideshare.net/ludaesch/works-2015provenancemileage
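To illustrate the provenance point, here is a rough sketch of how a step could be wrapped so that every call leaves a lineage record behind. The names `record_provenance` and `PROVENANCE_LOG`, and the two toy steps, are assumptions for this example only; this is not the mechanism of any particular workflow system, and it does not follow a provenance standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from functools import wraps

PROVENANCE_LOG = []  # hypothetical in-memory log; a real system would persist it

def _digest(value):
    """Hash a JSON-serialisable value so data can be compared across runs."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def record_provenance(step):
    """Wrap a step so each call appends a lineage record to PROVENANCE_LOG."""
    @wraps(step)
    def wrapper(*args):
        started = datetime.now(timezone.utc).isoformat()
        output = step(*args)
        PROVENANCE_LOG.append({
            "step": step.__name__,
            "started": started,
            "inputs": [_digest(arg) for arg in args],
            "output": _digest(output),
        })
        return output
    return wrapper

@record_provenance
def normalise(values):
    total = sum(values)
    return [v / total for v in values]

@record_provenance
def top(values):
    return max(values)

result = top(normalise([1, 1, 2]))
for record in PROVENANCE_LOG:
    print(record["step"], record["output"][:12])
```

Because the output digest of `normalise` reappears as an input digest of `top`, the log can be used to trace which data flowed between which steps, and to compare two runs.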
6. bioexcel.eu
Laser Interferometer Gravitational-Wave Observatory
First detection of gravitational waves from colliding black holes
https://pegasus.isi.edu/2016/02/11/pegasus-powers-ligo-gravitational-waves-detection-analysis/
https://pegasus.isi.edu/
12. bioexcel.eu
Stop Press!
GUIs not essential!
GUI: Canvas, drag-drop blocks, arrows, run button, data visualization
Script: Textual, command line, view data externally. Script easily run from other apps.
Scripts can be workflows!
Workflow systems ⇆ Scripts
Scripts on ASAP meter:
Automation: ★ ★ ★ ★ ★
Scaling: ★ ★
Abstraction: ★
Provenance: ★ ★
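A small sketch of a script acting as a workflow: a plain Python program that chains two steps over a parameter sweep. The `simulate` formula and the parameter names are placeholders invented for this example.

```python
from itertools import product

def simulate(temperature, pressure):
    """Stand-in for a real computational task (hypothetical formula)."""
    return temperature * pressure

def sweep(temperatures, pressures):
    """Run simulate() for every combination of the two parameter lists."""
    runs = []
    for temperature, pressure in product(temperatures, pressures):
        value = simulate(temperature, pressure)
        runs.append({"temperature": temperature, "pressure": pressure, "value": value})
    return runs

def summarise(runs):
    """Pick the parameter combination with the highest simulated value."""
    return max(runs, key=lambda run: run["value"])

runs = sweep(temperatures=[280, 300, 320], pressures=[1, 2])
best = summarise(runs)
print(f"{len(runs)} runs, best: {best}")
```

The script automates the whole campaign, but it runs serially on one machine and records nothing about how each result was produced, which is one way to read the lower scaling and provenance scores above.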
17. bioexcel.eu
http://commonwl.org/
Workflow interoperability
Common workflow format
Community based standards effort
Designed for clusters & clouds
Use containers (e.g. Docker)
Textual YAML files
(GUIs available)
Workflow: Steps with data dependencies
Step: command line or inline scripts
Scatter/gather on steps
Rich annotations
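As a hedged illustration of a single CWL step, the sketch below builds a small `CommandLineTool` description that wraps `echo`. CWL documents can be written in YAML or JSON, so this example assembles the description as a Python dict and serialises it as JSON. The input name `message`, the output name `printed`, the file name `output.txt` and the container image `debian:stable-slim` are choices made for this example, not part of the original slides.

```python
import json

def echo_tool(image="debian:stable-slim"):
    """Return a CWL v1.0 CommandLineTool description as a Python dict."""
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        # Optional hint: run the command inside a Docker container
        "hints": {"DockerRequirement": {"dockerPull": image}},
        "inputs": {
            # The string is passed as the first command-line argument
            "message": {"type": "string", "inputBinding": {"position": 1}},
        },
        # Capture what echo prints as the tool's output file
        "stdout": "output.txt",
        "outputs": {"printed": {"type": "stdout"}},
    }

document = json.dumps(echo_tool(), indent=2)
print(document)
```

Saved to a file such as `echo-tool.cwl`, a description like this could be run with the reference runner along the lines of `cwltool echo-tool.cwl --message "Hello"`, assuming `cwltool` and Docker are installed.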
19. bioexcel.eu
Containers
Linux Container technology
A lightweight "virtual" virtual machine
A container is started from an image
Images downloaded from Docker Hub
Dockerfile: Layer-based recipe
Philosophy: One service, one image → microservices
Cloud's best friend: scalable, reproducible, customizable
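To make "a container is started from an image" concrete, the sketch below assembles a `docker run` command line for a workflow step without executing it. The helper name `docker_run_command`, the image `python:3.12-slim` and the paths are assumptions chosen for illustration.

```python
import shlex

def docker_run_command(image, command, volumes=None):
    """Build (but do not execute) a `docker run` invocation as an argument list."""
    args = ["docker", "run", "--rm"]  # --rm removes the container when it exits
    for host_path, container_path in (volumes or {}).items():
        # -v bind-mounts a host directory so the step can read and write data
        args += ["-v", f"{host_path}:{container_path}"]
    # The image is pulled from a registry such as Docker Hub if not present locally
    args.append(image)
    args += command
    return args

args = docker_run_command(
    image="python:3.12-slim",
    command=["python", "-c", "print('hello from a container')"],
    volumes={"/tmp/data": "/data"},
)
print(shlex.join(args))
```

On a machine with Docker installed, the list could be passed to `subprocess.run(args, check=True)` to start the container.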
S. Woodman, H. Hiden, P. Watson, P. Missier. Achieving Reproducibility by Combining Provenance with Service and Workflow Versioning. In: The 6th Workshop on Workflows in Support of Large-Scale Science, 2011, Seattle.