C-SCALE Tutorial: Snakemake

Sebastian Luna-Valero, Cloud Community Support Specialist at EGI Foundation
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101017529.
Copernicus - eoSC AnaLytics Engine
C-SCALE tutorial: Snakemake
Sebastian Luna-Valero, EGI Foundation
sebastian.luna.valero@egi.eu
C-SCALE tutorial: Snakemake | 29th November 2022 | Online
Outline
• Why workflows?
• Why Snakemake?
• Let’s build a workflow!
Why workflows?
Credits: https://github.com/c-scale-community/use-case-hisea
Goals:
● from raw data to figures
○ with “one click”
● re-run with new config
○ spatial scale
○ temporal scale
● re-run half-way through
○ recover from issues
● dependency management
○ between tasks
○ software packages
Why workflows?
When to build a workflow?
● Re-run the same analysis over and over again, with different input parameters
● Ability to re-run the work partially; recover from intermediate failures
● Combine heterogeneous tooling in the same analysis
○ Python, R, Julia, Docker, Bash, etc.
Why Snakemake?
• Mature workflow management system.
• Great community around it.
• Easy to learn? :)
• A Snakemake workflow scales without modification from single-core workstations and
multi-core servers to batch systems (e.g. Slurm).
• Snakemake integrates with the package manager Conda and the container engine
Singularity such that defining the software stack becomes part of the workflow itself.
• Further information: https://snakemake.readthedocs.io/
Let’s build a workflow!
• Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that
define how to create output files from input files.
• $ snakemake --cores 1
• The application of a rule to generate a set of output files is called a job.
Snakefile:

rule count_countries:
    input:
        "european-countries.txt"
    output:
        "number-of-countries.txt"
    shell:
        "wc --lines european-countries.txt > number-of-countries.txt"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria
Let’s build a workflow!
• Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that
define how to create output files from input files.
• $ snakemake --cores 1
• Snakemake only re-runs jobs if one of the input files is newer than one of the output files
or one of the input files will be updated by another job.
Snakefile:

rule count_countries:
    input:
        "european-countries.txt"
    output:
        "number-of-countries.txt"
    shell:
        "wc --lines european-countries.txt > number-of-countries.txt"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria
Belgium
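The re-run condition described on this slide can be sketched in plain Python. This is a simplified illustration based only on file modification times, not Snakemake's implementation; the function name needs_rerun is hypothetical, and Snakemake's real decision logic may take more than timestamps into account.

```python
import os
import tempfile


def needs_rerun(inputs, outputs):
    """Return True if an output is missing or older than the newest input."""
    if not all(os.path.exists(path) for path in outputs):
        return True
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return newest_input > oldest_output


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "european-countries.txt")
    result = os.path.join(workdir, "number-of-countries.txt")

    with open(source, "w") as handle:
        handle.write("Netherlands\nGreece\n")
    first_check = needs_rerun([source], [result])    # output does not exist yet

    with open(result, "w") as handle:
        handle.write("2\n")
    # Explicit timestamps keep the example independent of clock resolution.
    os.utime(source, (1000, 1000))
    os.utime(result, (2000, 2000))
    second_check = needs_rerun([source], [result])   # output is newer than input

    os.utime(source, (3000, 3000))                   # simulate adding Belgium
    third_check = needs_rerun([source], [result])    # input is newer again

print(first_check, second_check, third_check)
```

Editing european-countries.txt (for example, adding Belgium) makes the input newer than the output, which is the situation the third check simulates.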
Let’s build a workflow!
• Generalize the rule:
• $ snakemake --cores 1
• $ wc --lines european-countries.txt > number-of-countries.txt
Snakefile:

rule count_countries:
    input:
        "european-countries.txt"
    output:
        "number-of-countries.txt"
    shell:
        "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria
Let’s build a workflow!
• Adding more than one input file:
• $ snakemake --cores 1
• $ wc --lines european-countries.txt other-countries.txt > number-of-countries.txt

Snakefile:

rule count_countries:
    input:
        "european-countries.txt",
        "other-countries.txt"
    output:
        "number-of-countries.txt"
    shell:
        "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada
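How the {input} and {output} placeholders turn into the wc command shown on this slide can be illustrated with ordinary string formatting. This is a minimal sketch, assuming the file lists are joined with spaces; render_shell is a hypothetical helper, not part of Snakemake.

```python
def render_shell(template, inputs, outputs):
    """Fill {input} and {output} with space-separated file lists."""
    return template.format(input=" ".join(inputs), output=" ".join(outputs))


command = render_shell(
    "wc --lines {input} > {output}",
    inputs=["european-countries.txt", "other-countries.txt"],
    outputs=["number-of-countries.txt"],
)
print(command)
# wc --lines european-countries.txt other-countries.txt > number-of-countries.txt
```

The same template works unchanged for one input file or several, which is the point of generalizing the rule.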
Let’s build a workflow!
• It’s better to organize your working directory:
• $ snakemake --cores 1
Snakefile:

rule count_countries:
    input:
        "countries/european-countries.txt",
        "countries/other-countries.txt"
    output:
        "stats/number-of-countries.txt"
    shell:
        "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada
Let’s build a workflow!
• Connecting rules! Targets can be rule names or output files.
• $ snakemake --cores 1 <target>
Snakefile:

rule pre_processing:
    input:
        "stats/number-of-countries.txt"
    output:
        "pre-processing.done"
    shell:
        "touch pre-processing.done"

rule count_countries:
    input:
        "countries/european-countries.txt",
        "countries/other-countries.txt"
    output:
        "stats/number-of-countries.txt"
    shell:
        "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada
Let’s build a workflow!
• Updating intermediate files (however, see Snakemake issues #1978 and #2011)
• $ snakemake --cores 1 <target>
Snakefile:

rule pre_processing:
    input:
        "stats/number-of-countries.txt"
    output:
        "pre-processing.done"
    shell:
        "touch pre-processing.done"

rule count_countries:
    input:
        "countries/european-countries.txt",
        "countries/other-countries.txt"
    output:
        "stats/number-of-countries.txt"
    shell:
        "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria
Belgium

$ cat other-countries.txt
US
Canada
Let’s build a workflow!
• Dependencies between rules are determined automatically, creating a directed acyclic graph (DAG) of jobs
• $ snakemake --cores 1 --dag | dot -Tsvg > dag.svg
Snakefile:

rule pre_processing:
    input:
        "stats/number-of-countries.txt"
    output:
        "pre-processing.done"
    shell:
        "touch pre-processing.done"

rule count_countries:
    input:
        "countries/european-countries.txt",
        "countries/other-countries.txt"
    output:
        "stats/number-of-countries.txt"
    shell:
        "wc --lines {input} > {output}"
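The way a dependency graph follows from the input and output declarations of these two rules can be sketched with the standard library (graphlib needs Python 3.9 or later). This is an illustration of the idea only, not how Snakemake builds its DAG; the RULES dictionary and build_dag are hypothetical names.

```python
from graphlib import TopologicalSorter

# Each rule lists the files it reads and the files it writes.
RULES = {
    "pre_processing": {
        "input": ["stats/number-of-countries.txt"],
        "output": ["pre-processing.done"],
    },
    "count_countries": {
        "input": [
            "countries/european-countries.txt",
            "countries/other-countries.txt",
        ],
        "output": ["stats/number-of-countries.txt"],
    },
}


def build_dag(rules):
    """Map each rule to the set of rules that produce its input files."""
    producer = {
        path: name for name, rule in rules.items() for path in rule["output"]
    }
    return {
        name: {producer[path] for path in rule["input"] if path in producer}
        for name, rule in rules.items()
    }


dag = build_dag(RULES)
order = list(TopologicalSorter(dag).static_order())
print(order)
# ['count_countries', 'pre_processing']
```

The country lists have no producing rule, so they are treated as source files; count_countries must run before pre_processing because it writes the file that pre_processing reads.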
Let’s build a workflow!
• Python
• $ snakemake --cores 1 <target>
Snakefile:

# The script path comes first; the options after it are passed to myscript.py.
rule pre_processing:
    input:
        "stats/number-of-countries.txt"
    output:
        "pre-processing.done"
    shell:
        "python myscript.py --input {input}"

rule count_countries:
    input:
        "countries/european-countries.txt",
        "countries/other-countries.txt"
    output:
        "stats/number-of-countries.txt"
    shell:
        "wc --lines {input} > {output}"
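The slides do not show the contents of myscript.py. The following is one hypothetical shape it could take, assuming it reads the wc output passed via --input and creates the marker file that the rule declares as output. The --done option, parse_total and the demo data are assumptions made for illustration.

```python
import argparse
import tempfile
from pathlib import Path


def parse_total(text):
    """Return the line count reported by wc.

    With several input files wc prints one line per file plus a final
    'total' line; with a single file it prints just one line.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    for fields in rows:
        if fields[-1] == "total":
            return int(fields[0])
    return int(rows[0][0])


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical pre-processing step")
    parser.add_argument("--input", required=True, help="file written by count_countries")
    parser.add_argument("--done", default="pre-processing.done", help="marker file to create")
    args = parser.parse_args(argv)

    total = parse_total(Path(args.input).read_text())
    print(f"countries counted: {total}")
    Path(args.done).touch()  # create the output file the rule declares
    return total


# Demo with a temporary directory standing in for the workflow directory.
with tempfile.TemporaryDirectory() as workdir:
    counts = Path(workdir) / "number-of-countries.txt"
    counts.write_text(
        " 7 countries/european-countries.txt\n"
        " 2 countries/other-countries.txt\n"
        " 9 total\n"
    )
    marker = Path(workdir) / "pre-processing.done"
    total = main(["--input", str(counts), "--done", str(marker)])
    marker_created = marker.exists()
```

Used as a real script, the demo at the bottom would be replaced by a call to main() under an `if __name__ == "__main__":` guard, so that the options come from the command line written in the shell directive.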
Let’s build a workflow!
• Containers
• $ snakemake --cores 1 <target>
Snakefile:

rule pre_processing:
    input:
        "stats/number-of-countries.txt"
    output:
        "pre-processing.done"
    shell:
        "udocker run example"

rule count_countries:
    input:
        "countries/european-countries.txt",
        "countries/other-countries.txt"
    output:
        "stats/number-of-countries.txt"
    shell:
        "wc --lines {input} > {output}"
Let’s build a workflow!
• Built-in support for Singularity (see docs for more details)
Snakefile:

rule pre_processing:
    input:
        "stats/number-of-countries.txt"
    output:
        "pre-processing.done"
    container:
        "docker://repo/image"
    script:
        "scripts/plot.R"

rule count_countries:
    input:
        "countries/european-countries.txt",
        "countries/other-countries.txt"
    output:
        "stats/number-of-countries.txt"
    shell:
        "wc --lines {input} > {output}"
Let’s build a workflow!
• Configuration
• $ snakemake --cores 1
Snakefile:

configfile: "config.yaml"

rule count_countries:
    input:
        expand("{input}", input=config['european']),
        expand("{input}", input=config['other'])
    output:
        "stats/number-of-countries.txt"
    shell:
        "wc --lines {input} > {output}"

$ cat config.yaml
european: 'countries/european-countries.txt'
other: 'countries/other-countries.txt'

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada
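Inside the Snakefile, config behaves like a dictionary filled from the config file. The sketch below mimics that with the standard library only. It uses JSON text instead of YAML (Snakemake accepts both formats for config files), and expand_sketch is a simplified, hypothetical stand-in for Snakemake's expand(), written here only to show why the rule's input resolves to the two configured paths.

```python
import json
from itertools import product

# Same keys as config.yaml on the slide, written as JSON to avoid a YAML dependency.
CONFIG_TEXT = """
{
  "european": "countries/european-countries.txt",
  "other": "countries/other-countries.txt"
}
"""

config = json.loads(CONFIG_TEXT)


def expand_sketch(pattern, **wildcards):
    """Format the pattern with every combination of the given values."""
    names = list(wildcards)
    value_lists = [
        [value] if isinstance(value, str) else list(value)
        for value in wildcards.values()
    ]
    return [
        pattern.format(**dict(zip(names, combination)))
        for combination in product(*value_lists)
    ]


inputs = (
    expand_sketch("{input}", input=config["european"])
    + expand_sketch("{input}", input=config["other"])
)
print(inputs)
# ['countries/european-countries.txt', 'countries/other-countries.txt']
```

Pointing the workflow at different data then only requires editing the config file, not the rule.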
Let’s build a workflow!
• Logging
• $ snakemake --cores 1
Snakefile:

rule count_countries:
    input:
        "countries/european-countries.txt",
        "countries/other-countries.txt"
    output:
        "stats/number-of-countries.txt"
    log:
        "logs/count_countries.log"
    shell:
        "wc --lines {input} > {output} 2> {log}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada
Let’s build a workflow!
• Benchmarking
• $ snakemake --cores 1
Snakefile:

rule count_countries:
    input:
        "countries/european-countries.txt",
        "countries/other-countries.txt"
    output:
        "stats/number-of-countries.txt"
    benchmark:
        "benchmarks/count_countries.txt"
    shell:
        "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada
Let’s build a workflow!
• Modularization
rules/count_countries.smk:

rule count_countries:
    input:
        "countries/european-countries.txt",
        "countries/other-countries.txt"
    output:
        "stats/number-of-countries.txt"
    shell:
        "wc --lines {input} > {output}"

Snakefile:

include: "rules/count_countries.smk"

rule pre_processing:
    input:
        "stats/number-of-countries.txt"
    output:
        "pre-processing.done"
    shell:
        "touch pre-processing.done"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada
Let’s build a workflow!
• Integration with conda
• $ snakemake --cores 1 --use-conda --conda-frontend mamba
Snakefile:

rule count_countries:
    input:
        "countries/european-countries.txt",
        "countries/other-countries.txt"
    output:
        "stats/number-of-countries.txt"
    conda:
        "envs/count_countries.yaml"
    shell:
        "wc --lines {input} > {output}"

$ cat envs/count_countries.yaml
name: count_countries
channels:
  - conda-forge
  - defaults
dependencies:
  - coreutils

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada
Let’s build a workflow!
• Other examples
• https://github.com/c-scale-community/c-scale-tutorial-snakemake
• https://github.com/c-scale-community/use-case-hisea/pull/41/files
Let’s build a workflow!
• Advanced features
• Pre-built functionality for scatter-gather jobs
• Cluster execution: snakemake --cluster qsub (see SLURM docs)
• Self-contained HTML reports
• Accessing remote storage:
• Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage
• SFTP, HTTP, FTP, Dropbox, XRootD, WebDAV, GFAL, GridFTP, iRODS, etc.
• Best practices
• https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html
• FAQs: https://snakemake.readthedocs.io/en/stable/project_info/faq.html
Thank you for your attention.
contact@c-scale.eu
https://c-scale.eu
@C_SCALE_EU
Let’s build a workflow!
• Wildcards example:
• $ snakemake --cores 1 stats/number-of-european-countries.txt
• $ snakemake --cores 1 stats/number-of-other-countries.txt
Snakefile:

rule count_countries:
    input:
        "countries/{category}-countries.txt"
    output:
        "stats/number-of-{category}-countries.txt"
    shell:
        "wc --lines {input} > {output}"

$ cat list-of-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria
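How a requested file such as stats/number-of-european-countries.txt determines the value of {category}, and through it the input file, can be sketched with a regular expression. This is a simplified illustration, not Snakemake's matching code; the helper names are hypothetical, and the sketch assumes each wildcard appears only once in the output pattern.

```python
import re


def pattern_to_regex(pattern):
    """Turn an output pattern into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        # Even positions hold literal text, odd positions hold wildcard names.
        regex += re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+)"
    return re.compile(regex + "$")


def resolve_input(target, output_pattern, input_pattern):
    """Return the input path for a requested target, or None if it does not match."""
    match = pattern_to_regex(output_pattern).match(target)
    if match is None:
        return None
    return input_pattern.format(**match.groupdict())


OUTPUT_PATTERN = "stats/number-of-{category}-countries.txt"
INPUT_PATTERN = "countries/{category}-countries.txt"

european = resolve_input(
    "stats/number-of-european-countries.txt", OUTPUT_PATTERN, INPUT_PATTERN
)
other = resolve_input(
    "stats/number-of-other-countries.txt", OUTPUT_PATTERN, INPUT_PATTERN
)
print(european)
print(other)
```

A target that does not fit the output pattern yields None, which corresponds to the rule not being applicable to that file.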
Let’s build a workflow!
• Many to many with glob_wildcards:
• $ snakemake --cores 1
Snakefile:

CATEGORIES, = glob_wildcards("countries/{category}-countries.txt")
print(CATEGORIES)

rule all:
    input:
        expand("stats/number-of-{category}-countries.txt", category=CATEGORIES)

rule count_countries:
    input:
        "countries/{category}-countries.txt"
    output:
        "stats/number-of-{category}-countries.txt"
    shell:
        "wc --lines {input} > {output}"

$ cat list-of-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria
[Diagram: many-to-many mapping, with input-1, input-2, ..., input-n each producing output-1, output-2, ..., output-n]
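The many-to-many pattern can be imitated in plain Python: discover which category files exist, then derive one target per category. This is a sketch of the idea only; discover_categories is a hypothetical helper that plays the role of glob_wildcards for this one file pattern, and the list comprehension plays the role of expand.

```python
import re
import tempfile
from pathlib import Path


def discover_categories(directory):
    """Collect the category part of every '<category>-countries.txt' file."""
    pattern = re.compile(r"(?P<category>.+)-countries\.txt$")
    found = []
    for path in sorted(Path(directory).iterdir()):
        match = pattern.match(path.name)
        if match:
            found.append(match.group("category"))
    return found


with tempfile.TemporaryDirectory() as workdir:
    countries = Path(workdir) / "countries"
    countries.mkdir()
    for name in ("european", "other"):
        (countries / f"{name}-countries.txt").write_text("placeholder\n")
    (countries / "README.md").write_text("not a country list\n")  # ignored

    categories = discover_categories(countries)

targets = [f"stats/number-of-{category}-countries.txt" for category in categories]
print(categories)
print(targets)
```

Adding a new file that follows the naming pattern to the countries directory adds a new target without touching the rules.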
Let’s build a workflow!
• Dependencies between the rules are determined automatically, creating a DAG (directed
acyclic graph) of jobs that can be automatically parallelized.
• Snakemake only re-runs jobs if one of the input files is newer than one of the output files
or one of the input files will be updated by another job.
• https://github.com/snakemake/snakemake/issues/1978
• Snakemake works backwards from requested output, and not from available input.
• Targets
• rule names can be targets
• output files can be targets
• if no target is given at the command line, Snakemake will define the first rule of the
Snakefile as the target. Hence, it is best practice to have a rule all at the top of the
workflow which has all typically desired target files as input files.
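Working backwards from a target, with the first rule as the default, can be sketched as follows. This is a simplified illustration that ignores whether files are already up to date; the rule unused_report and the helper names are hypothetical.

```python
# Order matters: the first rule is the default target.
RULES = [
    {"name": "all", "input": ["pre-processing.done"], "output": []},
    {
        "name": "pre_processing",
        "input": ["stats/number-of-countries.txt"],
        "output": ["pre-processing.done"],
    },
    {
        "name": "count_countries",
        "input": [
            "countries/european-countries.txt",
            "countries/other-countries.txt",
        ],
        "output": ["stats/number-of-countries.txt"],
    },
    {
        "name": "unused_report",  # hypothetical rule nobody depends on
        "input": ["stats/number-of-countries.txt"],
        "output": ["report.html"],
    },
]


def find_rule(target, rules):
    """A target can be a rule name or one of a rule's output files."""
    for rule in rules:
        if target == rule["name"] or target in rule["output"]:
            return rule
    return None  # plain source file: nothing produces it


def plan(rules, target=None):
    """Return the rules needed for the target, in execution order."""
    start = rules[0] if target is None else find_rule(target, rules)
    if start is None:
        raise ValueError(f"no rule produces {target!r}")
    ordered = []

    def visit(rule):
        for path in rule["input"]:
            producer = find_rule(path, rules)
            if producer is not None:
                visit(producer)
        if rule["name"] not in ordered:
            ordered.append(rule["name"])

    visit(start)
    return ordered


default_plan = plan(RULES)
file_plan = plan(RULES, "stats/number-of-countries.txt")
print(default_plan)
print(file_plan)
```

The rule unused_report never appears in the default plan even though its input is available, because nothing requested its output.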
27
C-SCALE tutorial: Snakemake | 29th November 2022 | Online
1 de 27

Recomendados

Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas... por
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Embarcados
97 visualizações75 slides
Kubernetes - State of the Union (Q1-2016) por
Kubernetes - State of the Union (Q1-2016)Kubernetes - State of the Union (Q1-2016)
Kubernetes - State of the Union (Q1-2016)DoiT International
2K visualizações33 slides
Writing plugins for Nagios and Opsview - CAPSiDE Tech Talks por
Writing plugins for Nagios and Opsview - CAPSiDE Tech TalksWriting plugins for Nagios and Opsview - CAPSiDE Tech Talks
Writing plugins for Nagios and Opsview - CAPSiDE Tech TalksJose Luis Martínez
3K visualizações35 slides
generate IP CORES por
generate IP CORESgenerate IP CORES
generate IP CORESguest296013
4.4K visualizações19 slides
IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client) por
IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client)IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client)
IIT-RTC 2017 Qt WebRTC Tutorial (Qt Janus Client)Alexandre Gouaillard
5.4K visualizações62 slides
On the code of data science por
On the code of data scienceOn the code of data science
On the code of data scienceGael Varoquaux
4.7K visualizações73 slides

Mais conteúdo relacionado

Similar a C-SCALE Tutorial: Snakemake

InfluxDB Live Product Training por
InfluxDB Live Product TrainingInfluxDB Live Product Training
InfluxDB Live Product TrainingInfluxData
160 visualizações34 slides
GitHub Actions - using Free Oracle Cloud Infrastructure (OCI) por
GitHub Actions - using Free Oracle Cloud Infrastructure (OCI)GitHub Actions - using Free Oracle Cloud Infrastructure (OCI)
GitHub Actions - using Free Oracle Cloud Infrastructure (OCI)Phil Wilkins
2.2K visualizações39 slides
Scilab: Computing Tool For Engineers por
Scilab: Computing Tool For EngineersScilab: Computing Tool For Engineers
Scilab: Computing Tool For EngineersNaren P.R.
2K visualizações27 slides
Cape2013 scilab-workshop-19Oct13 por
Cape2013 scilab-workshop-19Oct13Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Naren P.R.
3.7K visualizações26 slides
Node-RED and Minecraft - CamJam September 2015 por
Node-RED and Minecraft - CamJam September 2015Node-RED and Minecraft - CamJam September 2015
Node-RED and Minecraft - CamJam September 2015Boris Adryan
2.9K visualizações9 slides
Practical virtual network functions with Snabb (SDN Barcelona VI) por
Practical virtual network functions with Snabb (SDN Barcelona VI)Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)Igalia
116 visualizações47 slides

Similar a C-SCALE Tutorial: Snakemake(20)

InfluxDB Live Product Training por InfluxData
InfluxDB Live Product TrainingInfluxDB Live Product Training
InfluxDB Live Product Training
InfluxData160 visualizações
GitHub Actions - using Free Oracle Cloud Infrastructure (OCI) por Phil Wilkins
GitHub Actions - using Free Oracle Cloud Infrastructure (OCI)GitHub Actions - using Free Oracle Cloud Infrastructure (OCI)
GitHub Actions - using Free Oracle Cloud Infrastructure (OCI)
Phil Wilkins2.2K visualizações
Scilab: Computing Tool For Engineers por Naren P.R.
Scilab: Computing Tool For EngineersScilab: Computing Tool For Engineers
Scilab: Computing Tool For Engineers
Naren P.R.2K visualizações
Cape2013 scilab-workshop-19Oct13 por Naren P.R.
Cape2013 scilab-workshop-19Oct13Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13
Naren P.R.3.7K visualizações
Node-RED and Minecraft - CamJam September 2015 por Boris Adryan
Node-RED and Minecraft - CamJam September 2015Node-RED and Minecraft - CamJam September 2015
Node-RED and Minecraft - CamJam September 2015
Boris Adryan2.9K visualizações
Practical virtual network functions with Snabb (SDN Barcelona VI) por Igalia
Practical virtual network functions with Snabb (SDN Barcelona VI)Practical virtual network functions with Snabb (SDN Barcelona VI)
Practical virtual network functions with Snabb (SDN Barcelona VI)
Igalia116 visualizações
Node-RED and getting started on the Internet of Things por Boris Adryan
Node-RED and getting started on the Internet of ThingsNode-RED and getting started on the Internet of Things
Node-RED and getting started on the Internet of Things
Boris Adryan6.3K visualizações
Creating basic workflows as Jupyter Notebooks to use Cytoscape programmatically. por Hakky St
Creating basic workflows as Jupyter Notebooks to use Cytoscape programmatically.Creating basic workflows as Jupyter Notebooks to use Cytoscape programmatically.
Creating basic workflows as Jupyter Notebooks to use Cytoscape programmatically.
Hakky St445 visualizações
Machinel Learning with spark por Ons Dridi
Machinel Learning with spark Machinel Learning with spark
Machinel Learning with spark
Ons Dridi807 visualizações
LAB 1 Report.docx por AhamedMusharaf1
LAB 1 Report.docxLAB 1 Report.docx
LAB 1 Report.docx
AhamedMusharaf175 visualizações
Optimizing Your CI Pipelines por Sebastian Witowski
Optimizing Your CI PipelinesOptimizing Your CI Pipelines
Optimizing Your CI Pipelines
Sebastian Witowski67 visualizações
An introduction to workflow-based programming with Node-RED por Boris Adryan
An introduction to workflow-based programming with Node-REDAn introduction to workflow-based programming with Node-RED
An introduction to workflow-based programming with Node-RED
Boris Adryan20.6K visualizações
Feature Detection in Ajax-enabled Web Applications por Nikolaos Tsantalis
Feature Detection in Ajax-enabled Web ApplicationsFeature Detection in Ajax-enabled Web Applications
Feature Detection in Ajax-enabled Web Applications
Nikolaos Tsantalis2.9K visualizações
Graphical packet generator por tusharjadhav2611
Graphical packet generatorGraphical packet generator
Graphical packet generator
tusharjadhav2611658 visualizações
Larson and toubro por anoopc1998
Larson and toubroLarson and toubro
Larson and toubro
anoopc199896 visualizações
Building TaxBrain: Numba-enabled Financial Computing on the Web por talumbau
Building TaxBrain: Numba-enabled Financial Computing on the WebBuilding TaxBrain: Numba-enabled Financial Computing on the Web
Building TaxBrain: Numba-enabled Financial Computing on the Web
talumbau1.8K visualizações
ESP8266 and IOT por dega1999
ESP8266 and IOTESP8266 and IOT
ESP8266 and IOT
dega199910.3K visualizações
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co... por Nane Kratzke
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
Nane Kratzke1.7K visualizações

Último

JioEngage_Presentation.pptx por
JioEngage_Presentation.pptxJioEngage_Presentation.pptx
JioEngage_Presentation.pptxadmin125455
8 visualizações4 slides
Top-5-production-devconMunich-2023-v2.pptx por
Top-5-production-devconMunich-2023-v2.pptxTop-5-production-devconMunich-2023-v2.pptx
Top-5-production-devconMunich-2023-v2.pptxTier1 app
6 visualizações42 slides
Benefits in Software Development por
Benefits in Software DevelopmentBenefits in Software Development
Benefits in Software DevelopmentJohn Valentino
5 visualizações15 slides
Agile 101 por
Agile 101Agile 101
Agile 101John Valentino
10 visualizações20 slides
Gen Apps on Google Cloud PaLM2 and Codey APIs in Action por
Gen Apps on Google Cloud PaLM2 and Codey APIs in ActionGen Apps on Google Cloud PaLM2 and Codey APIs in Action
Gen Apps on Google Cloud PaLM2 and Codey APIs in ActionMárton Kodok
16 visualizações55 slides
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium... por
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...Lisi Hocke
35 visualizações124 slides

Último(20)

JioEngage_Presentation.pptx por admin125455
JioEngage_Presentation.pptxJioEngage_Presentation.pptx
JioEngage_Presentation.pptx
admin1254558 visualizações
Top-5-production-devconMunich-2023-v2.pptx por Tier1 app
Top-5-production-devconMunich-2023-v2.pptxTop-5-production-devconMunich-2023-v2.pptx
Top-5-production-devconMunich-2023-v2.pptx
Tier1 app6 visualizações
Benefits in Software Development por John Valentino
Benefits in Software DevelopmentBenefits in Software Development
Benefits in Software Development
John Valentino5 visualizações
Agile 101 por John Valentino
Agile 101Agile 101
Agile 101
John Valentino10 visualizações
Gen Apps on Google Cloud PaLM2 and Codey APIs in Action por Márton Kodok
Gen Apps on Google Cloud PaLM2 and Codey APIs in ActionGen Apps on Google Cloud PaLM2 and Codey APIs in Action
Gen Apps on Google Cloud PaLM2 and Codey APIs in Action
Márton Kodok16 visualizações
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium... por Lisi Hocke
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Team Transformation Tactics for Holistic Testing and Quality (Japan Symposium...
Lisi Hocke35 visualizações
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P... por NimaTorabi2
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...
NimaTorabi216 visualizações
Bootstrapping vs Venture Capital.pptx por Zeljko Svedic
Bootstrapping vs Venture Capital.pptxBootstrapping vs Venture Capital.pptx
Bootstrapping vs Venture Capital.pptx
Zeljko Svedic15 visualizações
Quality Engineer: A Day in the Life por John Valentino
Quality Engineer: A Day in the LifeQuality Engineer: A Day in the Life
Quality Engineer: A Day in the Life
John Valentino7 visualizações
Ports-and-Adapters Architecture for Embedded HMI por Burkhard Stubert
Ports-and-Adapters Architecture for Embedded HMIPorts-and-Adapters Architecture for Embedded HMI
Ports-and-Adapters Architecture for Embedded HMI
Burkhard Stubert29 visualizações
Electronic AWB - Electronic Air Waybill por Freightoscope
Electronic AWB - Electronic Air Waybill Electronic AWB - Electronic Air Waybill
Electronic AWB - Electronic Air Waybill
Freightoscope 5 visualizações
Using Qt under LGPL-3.0 por Burkhard Stubert
Using Qt under LGPL-3.0Using Qt under LGPL-3.0
Using Qt under LGPL-3.0
Burkhard Stubert13 visualizações
The Era of Large Language Models.pptx por AbdulVahedShaik
The Era of Large Language Models.pptxThe Era of Large Language Models.pptx
The Era of Large Language Models.pptx
AbdulVahedShaik7 visualizações
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with... por sparkfabrik
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...
sparkfabrik8 visualizações
Keep por Geniusee
KeepKeep
Keep
Geniusee78 visualizações
Navigating container technology for enhanced security by Niklas Saari por Metosin Oy
Navigating container technology for enhanced security by Niklas SaariNavigating container technology for enhanced security by Niklas Saari
Navigating container technology for enhanced security by Niklas Saari
Metosin Oy14 visualizações
predicting-m3-devopsconMunich-2023.pptx por Tier1 app
predicting-m3-devopsconMunich-2023.pptxpredicting-m3-devopsconMunich-2023.pptx
predicting-m3-devopsconMunich-2023.pptx
Tier1 app8 visualizações
Sprint 226 por ManageIQ
Sprint 226Sprint 226
Sprint 226
ManageIQ11 visualizações
What is API por artembondar5
What is APIWhat is API
What is API
artembondar512 visualizações

C-SCALE Tutorial: Snakemake

  • 1. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101017529. Copernicus - eoSC AnaLytics Engine C-SCALE tutorial: Snakemake Sebastian Luna-Valero, EGI Foundation sebastian.luna.valero@egi.eu C-SCALE tutorial: Snakemake | 29th November 2022 | Online
  • 2. Outline • Why workflows? • Why snakemake? • Let’s build a workflow! 2 C-SCALE tutorial: Snakemake | 29th November 2022 | Online
  • 3. Why workflows? Credits: https://github.com/c-scale-community/use-case-hisea Goals: ● from raw data to figures ○ with “one click” ● re-run with new config ○ spatial scale ○ temporal scale ● re-run half-way through ○ recover from issues ● dependency management ○ between tasks ○ software packages 3 C-SCALE tutorial: Snakemake | 29th November 2022 | Online
  • 4. Why workflows? When to build a workflow? ● Re-run the same analysis over and over again, with different input parameters ● Ability to re-run the work partially; recover from intermediate failures ● Combine together heterogeneous tooling into the same analysis ○ Python, R, Julia, Docker, Bash, etc. 4 C-SCALE tutorial: Snakemake | 29th November 2022 | Online
  • 5. Why snakemake? • Mature workflow management system. • Great community around it. • Easy to learn? :) • A Snakemake workflow scales without modification from single core workstations and multi-core servers to batch systems (e.g. slurm) • Snakemake integrates with the package manager Conda and the container engine Singularity such that defining the software stack becomes part of the workflow itself. • Further information: https://snakemake.readthedocs.io/ 5 C-SCALE tutorial: Snakemake | 29th November 2022 | Online
  • 6. Let’s build a workflow! • Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that define how to create output files from input files. • $ snakemake --cores 1 • The application of a rule to generate a set of output files is called job. 6 C-SCALE tutorial: Snakemake | 29th November 2022 | Online rule count_countries: input: "european-countries.txt" output: "number-of-countries.txt" shell: "wc --lines european-countries.txt > number-of-countries.txt" $ cat european-countries.txt Netherlands Greece Spain Portugal Italy Poland Austria Snakefile
  • 7. Let’s build a workflow! • Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that define how to create output files from input files. • $ snakemake --cores 1 • Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job. 7 C-SCALE tutorial: Snakemake | 29th November 2022 | Online rule count_countries: input: "european-countries.txt" output: "number-of-countries.txt" shell: "wc --lines european-countries.txt > number-of-countries.txt" $ cat european-countries.txt Netherlands Greece Spain Portugal Italy Poland Austria Belgium Snakefile
  • 8. Let’s build a workflow! • Generalize the rule: • $ snakemake --cores 1 • $ wc --lines european-countries.txt > number-of-countries.txt 8 C-SCALE tutorial: Snakemake | 29th November 2022 | Online rule count_countries: input: "european-countries.txt" output: "number-of-countries.txt" shell: "wc --lines {input} > {output}" $ cat european-countries.txt Netherlands Greece Spain Portugal Italy Poland Austria Snakefile
  • 9. Let’s build a workflow! • Adding more than one input file: • $ snakemake --cores 1 • $ wc --lines european-countries.txt other-countries.txt > number-of-countries.txt 9 C-SCALE tutorial: Snakemake | 29th November 2022 | Online rule count_countries: input: "european-countries.txt", "other-countries.txt" output: "number-of-countries.txt" shell: "wc --lines {input} > {output}" $ cat european-countries.txt Netherlands Greece Spain Portugal Italy Poland Austria Snakefile $ cat other-countries.txt US Canada
  • 10. Let’s build a workflow! • It’s better to organize your working directory: • $ snakemake --cores 1 10 C-SCALE tutorial: Snakemake | 29th November 2022 | Online rule count_countries: input: "countries/european-countries.txt", "countries/other-countries.txt" output: "stats/number-of-countries.txt" shell: "wc --lines {input} > {output}" $ cat european-countries.txt Netherlands Greece Spain Portugal Italy Poland Austria Snakefile $ cat other-countries.txt US Canada
  • 11. Let’s build a workflow! • Connecting rules! Targets can be rules, output files. • $ snakemake --cores 1 <target> 11 C-SCALE tutorial: Snakemake | 29th November 2022 | Online rule pre_processing: input: "stats/number-of-countries.txt" output: "pre-processing.done" shell: "touch pre-processing.done" rule count_countries: input: "countries/european-countries.txt", "countries/other-countries.txt" output: "stats/number-of-countries.txt" shell: "wc --lines {input} > {output}" $ cat european-countries.txt Netherlands Greece Spain Portugal Italy Poland Austria Snakefile $ cat other-countries.txt US Canada
  • 12. Let’s build a workflow! • Updating intermediate files (however: #1978 and #2011) • $ snakemake --cores 1 <target> 12 C-SCALE tutorial: Snakemake | 29th November 2022 | Online rule pre_processing: input: "stats/number-of-countries.txt" output: "pre-processing.done" shell: "touch pre-processing.done" rule count_countries: input: "countries/european-countries.txt", "countries/other-countries.txt" output: "stats/number-of-countries.txt" shell: "wc --lines {input} > {output}" Snakefile $ cat european-countries.txt Netherlands Greece Spain Portugal Italy Poland Austria Belgium $ cat other-countries.txt US Canada
  • 13. Let’s build a workflow! • Dependencies between the rules are determined creating a Directed Acyclic Graph • $ snakemake --cores 1 --dag | dot -Tsvg > dag.svg 13 C-SCALE tutorial: Snakemake | 29th November 2022 | Online rule pre_processing: input: "stats/number-of-countries.txt" output: "pre-processing.done" shell: "touch pre-processing.done" rule count_countries: input: "countries/european-countries.txt", "countries/other-countries.txt" output: "stats/number-of-countries.txt" shell: "wc --lines {input} > {output}" Snakefile
  • 14. Let’s build a workflow! • Python • $ snakemake --cores 1 <target> 14 C-SCALE tutorial: Snakemake | 29th November 2022 | Online rule pre_processing: input: "stats/number-of-countries.txt" output: "pre-processing.done" shell: "python --input stats/number-of-countries.txt myscript.py" rule count_countries: input: "countries/european-countries.txt", "countries/other-countries.txt" output: "stats/number-of-countries.txt" shell: "wc --lines {input} > {output}" Snakefile
  • 15. Let’s build a workflow!
    • Containers
    • $ snakemake --cores 1 <target>

    Snakefile:

        rule pre_processing:
            input:
                "stats/number-of-countries.txt"
            output:
                "pre-processing.done"
            shell:
                "udocker run example"

        rule count_countries:
            input:
                "countries/european-countries.txt",
                "countries/other-countries.txt"
            output:
                "stats/number-of-countries.txt"
            shell:
                "wc --lines {input} > {output}"
  • 16. Let’s build a workflow!
    • Pre-built support for Singularity (see docs for more details)

    Snakefile:

        rule pre_processing:
            input:
                "stats/number-of-countries.txt"
            output:
                "pre-processing.done"
            container:
                "docker://repo/image"
            script:
                "scripts/plot.R"

        rule count_countries:
            input:
                "countries/european-countries.txt",
                "countries/other-countries.txt"
            output:
                "stats/number-of-countries.txt"
            shell:
                "wc --lines {input} > {output}"
  • 17. Let’s build a workflow!
    • Configuration
    • $ snakemake --cores 1

    Snakefile:

        configfile: "config.yaml"

        rule count_countries:
            input:
                expand("{input}", input=config['european']),
                expand("{input}", input=config['other'])
            output:
                "stats/number-of-countries.txt"
            shell:
                "wc --lines {input} > {output}"

    $ cat config.yaml
    european: 'countries/european-countries.txt'
    other: 'countries/other-countries.txt'

    $ cat european-countries.txt
    Netherlands
    Greece
    Spain
    Portugal
    Italy
    Poland
    Austria

    $ cat other-countries.txt
    US
    Canada
  • 18. Let’s build a workflow!
    • Logging
    • $ snakemake --cores 1

    Snakefile:

        rule count_countries:
            input:
                "countries/european-countries.txt",
                "countries/other-countries.txt"
            output:
                "stats/number-of-countries.txt"
            log:
                "logs/count_countries.log"
            shell:
                # Send error messages from wc to the declared log file.
                "wc --lines {input} > {output} 2> {log}"

    $ cat european-countries.txt
    Netherlands
    Greece
    Spain
    Portugal
    Italy
    Poland
    Austria

    $ cat other-countries.txt
    US
    Canada
  • 19. Let’s build a workflow!
    • Benchmarking
    • $ snakemake --cores 1

    Snakefile:

        rule count_countries:
            input:
                "countries/european-countries.txt",
                "countries/other-countries.txt"
            output:
                "stats/number-of-countries.txt"
            benchmark:
                "benchmarks/count_countries.txt"
            shell:
                "wc --lines {input} > {output}"

    $ cat european-countries.txt
    Netherlands
    Greece
    Spain
    Portugal
    Italy
    Poland
    Austria

    $ cat other-countries.txt
    US
    Canada
  • 20. Let’s build a workflow!
    • Modularization

    rules/count_countries.smk:

        rule count_countries:
            input:
                "countries/european-countries.txt",
                "countries/other-countries.txt"
            output:
                "stats/number-of-countries.txt"
            shell:
                "wc --lines {input} > {output}"

    Snakefile:

        include: "rules/count_countries.smk"

        rule pre_processing:
            input:
                "stats/number-of-countries.txt"
            output:
                "pre-processing.done"
            shell:
                "touch pre-processing.done"

    $ cat european-countries.txt
    Netherlands
    Greece
    Spain
    Portugal
    Italy
    Poland
    Austria

    $ cat other-countries.txt
    US
    Canada
  • 21. Let’s build a workflow!
    • Integration with conda
    • $ snakemake --cores 1 --use-conda --conda-frontend mamba

    Snakefile:

        rule count_countries:
            input:
                "countries/european-countries.txt",
                "countries/other-countries.txt"
            output:
                "stats/number-of-countries.txt"
            conda:
                "envs/count_countries.yaml"
            shell:
                "wc --lines {input} > {output}"

    $ cat envs/count_countries.yaml
    name: count_countries
    channels:
      - conda-forge
      - defaults
    dependencies:
      - coreutils

    $ cat european-countries.txt
    Netherlands
    Greece
    Spain
    Portugal
    Italy
    Poland
    Austria

    $ cat other-countries.txt
    US
    Canada
  • 22. Let’s build a workflow!
    • Other examples
        • https://github.com/c-scale-community/c-scale-tutorial-snakemake
        • https://github.com/c-scale-community/use-case-hisea/pull/41/files
  • 23. Let’s build a workflow!
    • Advanced features
        • Pre-built functionality for scatter-gather jobs
        • Cluster execution: snakemake --cluster qsub (see SLURM docs)
        • Self-contained HTML reports
        • Accessing remote storage:
            • Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage
            • SFTP, HTTP, FTP, Dropbox, XRootD, WebDAV, GFAL, GridFTP, iRODS, etc.
    • Best practices
        • https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html
    • FAQs: https://snakemake.readthedocs.io/en/stable/project_info/faq.html
  • 24. Thank you for your attention.
    This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101017529.
    Copernicus - eoSC AnaLytics Engine
    contact@c-scale.eu
    https://c-scale.eu
    @C_SCALE_EU
    Sebastian Luna-Valero, EGI Foundation
    sebastian.luna.valero@egi.eu
  • 25. Let’s build a workflow!
    • Wildcards example:
    • $ snakemake --cores 1 stats/number-of-european-countries.txt
    • $ snakemake --cores 1 stats/number-of-other-countries.txt

    Snakefile:

        rule count_countries:
            input:
                "countries/{category}-countries.txt"
            output:
                "stats/number-of-{category}-countries.txt"
            shell:
                "wc --lines {input} > {output}"

    $ cat list-of-countries.txt
    Netherlands
    Greece
    Spain
    Portugal
    Italy
    Poland
    Austria
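    How a requested file name fills in the {category} wildcard can be sketched with a regular expression. This is an illustration only: match_wildcards is a hypothetical helper, it treats every wildcard as "one or more characters", and it assumes each wildcard name appears once in the pattern.

```python
import re


def match_wildcards(pattern, path):
    """Return the wildcard values if `path` fits `pattern`, else None."""
    regex = ""
    position = 0
    for found in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text between wildcards must match exactly.
        regex += re.escape(pattern[position:found.start()])
        # Each wildcard becomes a named group matching one or more characters.
        regex += f"(?P<{found.group(1)}>.+)"
        position = found.end()
    regex += re.escape(pattern[position:])
    hit = re.fullmatch(regex, path)
    return hit.groupdict() if hit else None


OUTPUT = "stats/number-of-{category}-countries.txt"
INPUT = "countries/{category}-countries.txt"

wildcards = match_wildcards(OUTPUT, "stats/number-of-european-countries.txt")
required_input = INPUT.format(**wildcards)
print(wildcards)
print(required_input)
```

    The value found in the requested output is substituted into the input pattern, which is how one rule can serve both the european and the other target.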
  • 26. Let’s build a workflow!
    • Many to many with glob_wildcards:
    • $ snakemake --cores 1

    Snakefile:

        CATEGORIES, = glob_wildcards("countries/{category}-countries.txt")
        print(CATEGORIES)

        rule all:
            input:
                expand("stats/number-of-{category}-countries.txt", category=CATEGORIES)

        rule count_countries:
            input:
                "countries/{category}-countries.txt"
            output:
                "stats/number-of-{category}-countries.txt"
            shell:
                "wc --lines {input} > {output}"

    $ cat list-of-countries.txt
    Netherlands
    Greece
    Spain
    Portugal
    Italy
    Poland
    Austria

    (Diagram: inputs input-1, input-2, ..., input-n and outputs output-1, output-2, ..., output-n)
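    What glob_wildcards and expand contribute here can be imitated in plain Python: scan a directory for files that fit the pattern, collect the matching part, and build one target per value. The sketch below uses a temporary directory with made-up empty files and is not how Snakemake implements these functions.

```python
import re
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as workdir:
    countries = Path(workdir) / "countries"
    countries.mkdir()
    # Made-up input files for the demonstration.
    for name in ("european", "other"):
        (countries / f"{name}-countries.txt").write_text("")
    (countries / "README.md").write_text("")  # does not fit the pattern

    # Rough equivalent of glob_wildcards("countries/{category}-countries.txt").
    pattern = re.compile(r"(?P<category>.+)-countries\.txt")
    matches = (pattern.fullmatch(path.name) for path in countries.iterdir())
    categories = sorted(hit.group("category") for hit in matches if hit)

# Rough equivalent of expand(..., category=CATEGORIES).
targets = [f"stats/number-of-{category}-countries.txt" for category in categories]
print(categories)
print(targets)
```

    Adding a new file such as countries/asian-countries.txt would add a new category and a new target without any change to the rules.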
  • 27. Let’s build a workflow!
    • Dependencies between the rules are determined automatically, creating a DAG (directed acyclic graph) of jobs that can be automatically parallelized.
    • Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job.
        • https://github.com/snakemake/snakemake/issues/1978
    • Snakemake works backwards from requested output, and not from available input.
    • Targets
        • Rule names can be targets.
        • Output files can be targets.
        • If no target is given at the command line, Snakemake will define the first rule of the Snakefile as the target. Hence, it is best practice to have a rule all at the top of the workflow which has all typically desired target files as input files.
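    The default-target behaviour can be pictured in a few lines of Python. This is a toy model under stated assumptions: RULES_IN_FILE_ORDER is a hypothetical list standing in for the order of rules in a Snakefile, and choose_target is not a Snakemake function.

```python
# Hypothetical rule names in the order they appear in the Snakefile.
RULES_IN_FILE_ORDER = ["all", "count_countries"]


def choose_target(requested=None):
    """Return the requested target, or the first rule when none is given."""
    if requested:
        return requested
    return RULES_IN_FILE_ORDER[0]


default_target = choose_target()
explicit_target = choose_target("stats/number-of-other-countries.txt")
print(default_target)
print(explicit_target)
```

    Because the first rule wins when nothing is requested, placing rule all at the top makes a bare snakemake --cores 1 build every file listed as its input.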