A short introduction to reproducible research, to reproducibility with R, to Docker, and to combining them for reproducible research with R in Docker containers. Includes demos of Rocker and containerit.
1. Reproducible Research in R with Docker
Daniel Nüst | University of Münster | @nordholmen
MünsteR Meetup, Sep 2017
https://www.meetup.com/Munster-R-Users-Group/events/241108949/
3. Why should I care about reproducible research?
(an opinionated view)
Improve quality of your work today
Existence of your work tomorrow: journal requirements 2020+
Societal challenges… Who knows the Oxford Dictionaries Word of the Year 2016?
https://en.oxforddictionaries.com/word-of-the-year/word-of-the-year-2016
5. “Tradition” of notebooks in lab work, e.g. in chemistry (analog and digital)
Open Notebook Science (https://en.wikipedia.org/wiki/Open_notebook_science)
No comparable tradition or education in the younger, mostly digital fields of geostatistics, GIS, ...
https://twitter.com/wellcometrust/status/496323565239955456
Lab notebooks
https://www.google.de/search?q=chemistry+lab+notebook&safe=off&tbm=isch
https://en.wikipedia.org/wiki/File:Studies_of_the_Arm_showing_the_Movements_made_by_the_Biceps.jpg
6. R Markdown a.k.a. .Rmd
http://rmarkdown.rstudio.com/
Based on Markdown
https://daringfireball.net/projects/markdown/syntax
7.
#1: reproducibility helps to avoid disaster
#2: reproducibility makes it easier to write papers
#3: reproducibility helps reviewers see it your way
#4: reproducibility enables continuity of your work
#5: reproducibility helps to build your reputation
14. Docker for Data Science
(all the Docker advantages… write once, biz ops, cloud, etc.)
Reproducibility through controlled working environment
Project separation + don’t clutter dev machine
Environment (re)creation, documentation
Adopt good practices on the way
Easy collaboration
Easy transition from testing to production
15. https://hub.docker.com/r/rocker/rstudio/
Base containers (r-base, r-devel, r-ver, ..)
Use case containers (r-devel-ubsan-clang, ..)
Stacks (tidyverse, geospatial, ..)
docker run -it -p 8787:8787 rocker/rstudio
http://localhost:8787/ (rstudio/rstudio)
Rocker: https://github.com/rocker-org
16. rocker/r-ver and other base images
https://github.com/rocker-org/rocker#base-docker-containers
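The versioned base images matter for reproducibility because the image tag fixes the R version. A minimal sketch, assuming the `rocker/r-ver:3.4.1` tag that appears in the build log later in these slides; the `R_VERSION` and `DRY_RUN` variables are illustrative helpers, not part of Rocker. By default the script only prints the command, so it can be read and checked without a running Docker daemon.

```shell
#!/bin/sh
# Sketch: pin the image tag instead of relying on the implicit "latest".
# R_VERSION and DRY_RUN are hypothetical helper variables for this example.
R_VERSION="${R_VERSION:-3.4.1}"
IMAGE="rocker/r-ver:${R_VERSION}"

# Ask the R installation inside the pinned image for its version.
set -- docker run --rm "$IMAGE" R --version
PINNED_CMD="$*"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Dry run (default): show the command without executing it.
  echo "$PINNED_CMD"
else
  "$@"
fi
```

Run with `DRY_RUN=0` to actually start the container; changing `R_VERSION` selects a different tag, provided that tag exists on Docker Hub.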
17. rocker/geospatial and other use cases
https://github.com/rocker-org/rocker#versioned-stack-builds-on-r-ver
19. https://hub.docker.com/r/rocker/rstudio/
docker run --rm -it -p 8787:8787 rocker/rstudio
http://localhost:8787/ (rstudio/rstudio)
Great example: https://github.com/benmarwick/1989-excavation-report-Madjebebe
docker run --rm -it -p 8787:8787 benmarwick/mjb1989excavationpaper
http://localhost:8787/ (rstudio/rstudio)
20. RStudio Desktop vs. rocker/rstudio
No functional difference; the “Desktop” version is just a lightweight browser wrapper
(https://rpubs.com/jmcphers/rstudio-architecture)
$ docker run -d -p 8787:8787 rocker/rstudio
$ docker ps
25. Running the container
> write(dockerfile_object)
INFO [2017-07-06 10:10:05] Writing dockerfile to
/home/daniel/Documents/2017_useR/Dockerfile
$ docker build -t user2017demo .
Sending build context to Docker daemon 6.054MB
Step 1/7 : FROM rocker/r-ver:3.4.1
3.4.1: Pulling from rocker/r-ver
c75480ad9aaf: Pull complete
[...]
The following additional packages will be installed:
[...]
* installing *source* package ‘foreign’ ...
[...]
Successfully built e30936ac8687
Successfully tagged user2017demo:latest
$ docker run -it user2017demo
R version 3.4.1 (2017-06-30) -- "Single Candle"
Copyright (C) 2017 The R Foundation for Statistical
Computing
Platform: x86_64-pc-linux-gnu (64-bit)
[..]
> library(rgdal); require(maptools)
Loading required package: sp
> nc <- rgdal::readOGR(system.file("shapes/",
package="maptools"), "sids", verbose = FALSE)
[...]
> summary(nc)
Object of class SpatialPolygonsDataFrame
Coordinates:
min max
x -84.32385 -75.45698
y 33.88199 36.58965
Is projected: FALSE
[...]
26. Running container with data in plain R with harbor
https://github.com/nuest/containerit/blob/79f8832975e00c84cdcc665df0c2846d834e27c5/demo/fullstack.R
> write.csv(file = "dataset.csv", x = cars)
> dataset <- read.csv("dataset.csv")
> model <- lm(log(dist) ~ log(speed),
data = dataset)
> summary(model)
> cmd <- CMD_Rscript("script.R")
> df <- containerit::dockerfile(from = workspace,
cmd = cmd,
r_version = "3.3.3",
copy = "script_dir")
> write(df)
/tmp/Rtmpeachap/
├── dataset.csv
├── Dockerfile
└── script.R
> harbor::docker_cmd(harbor::localhost, "build",
arg = workspace,
docker_opts = c("-t", "fullstack-r-demo"),
capture_text = TRUE
)
> harbor::docker_run(image = "fullstack-r-demo")
R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
[...]
> dataset <- read.csv("dataset.csv")
> model <- lm(log(dist) ~ log(speed), data = dataset)
> summary(model)
Call:
lm(formula = log(dist) ~ log(speed), data = dataset)
Residuals:
Min 1Q Median 3Q Max
-1.00215 -0.24578 -0.02898 0.20717 0.88289
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7297 0.3758 -1.941 0.0581 .
28. More
Labels for metadata
devtools session information (install from git under dev.)
Custom base images
Docker vs. R
http://bit.ly/docker-r
Boettiger, Carl. 2015. “An Introduction to Docker for Reproducible Research, with Examples from the R Environment.” ACM SIGOPS Operating Systems Review 49 (January): 71–79. doi:10.1145/2723872.2723882
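To illustrate the "labels for metadata" point with the plain Docker CLI (not with containerit's own label support, which these slides do not show), here is a hedged sketch. The image name, label key, and label value are hypothetical placeholders. The script prints the commands by default and only runs them when `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch: attach a metadata label at build time and read it back.
# IMAGE_NAME, LABEL_KEY and LABEL_VALUE are hypothetical placeholders.
IMAGE_NAME="labelled-demo"
LABEL_KEY="org.example.purpose"
LABEL_VALUE="demo"

# docker build --label adds key=value metadata to the resulting image.
set -- docker build --label "$LABEL_KEY=$LABEL_VALUE" -t "$IMAGE_NAME" .
BUILD_CMD="$*"

# Go template that extracts a single label from the image configuration.
INSPECT_FORMAT="{{ index .Config.Labels \"$LABEL_KEY\" }}"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Dry run (default): show both commands without executing them.
  echo "$BUILD_CMD"
  echo "docker inspect --format '$INSPECT_FORMAT' $IMAGE_NAME"
else
  "$@" && docker inspect --format "$INSPECT_FORMAT" "$IMAGE_NAME"
fi
```

The same metadata could instead be written as a `LABEL` instruction in the Dockerfile; the command-line flag is used here only to keep the example in one script.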
29. Limitations
No shell, no fun
Windows :-(
Image size
Versioning
How to access files/plots from a container?
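One common answer to the last question is a bind mount, which maps a host directory into the container so that files and plots written there remain on the host. A minimal sketch, assuming the `rocker/rstudio` image and its `rstudio` user from the earlier slides; the `work` directory name and the `DRY_RUN` switch are illustrative choices. Newer `rocker/rstudio` images may additionally expect a password via `-e PASSWORD=...`, which these slides do not cover.

```shell
#!/bin/sh
# Sketch: share a host directory with an RStudio container via a bind mount.
# HOST_DIR, CONTAINER_DIR and DRY_RUN are illustrative assumptions.
HOST_DIR="${HOST_DIR:-$(pwd)/work}"
CONTAINER_DIR="/home/rstudio/work"
IMAGE="rocker/rstudio"

# Make sure the host side of the mount exists.
mkdir -p "$HOST_DIR"

# -v HOST:CONTAINER makes everything written below CONTAINER_DIR inside
# the container show up in HOST_DIR on the host, and vice versa.
set -- docker run --rm -it -p 8787:8787 -v "$HOST_DIR:$CONTAINER_DIR" "$IMAGE"
RUN_CMD="$*"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Dry run (default): show the command without executing it.
  echo "$RUN_CMD"
else
  "$@"
fi
```

With the container running, a plot saved to the `work` folder in the RStudio session is then available in `HOST_DIR` after the container exits.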
30. Summary
Docker is a great tool for data science, reproducible research, consulting, …
Be “tidy” outside of your R Markdown
containerit makes Docker easier
(DRY, less copy&paste, best practices, automatic system dependencies)
Benefits from Rocker (MRAN by default, …), harbor, ...
Alternatives / potential for combination:
package management locally (packrat, pkgsnap, switchr/GRANBase) or remotely (MRAN timemachine/checkpoint), or install specific versions from
> True @wellcomelibrary? MT @Libroantiguo Marie Curie's experimental notebook - after almost 100yrs, still radioactive.
http://www.openculture.com/2015/07/marie-curies-research-papers-are-still-radioactive-100-years-later.html
Markdown is a lightweight markup language with plain text formatting syntax. It is designed so that it can be converted to HTML and many other formats using a tool by the same name. Markdown is often used to format readme files, for writing messages in online discussion forums, and to create rich text using a plain text editor.
Who is a researcher?
The ERC provides a well-structured container for both the needs of journals (ERC as the item under review), archives (suitable metadata and packaging formats), and researchers (literally everything needed to re-do an analysis is there). It relies on Docker to define and store the runtime environment. ERCs should be simple enough to be created manually and absorb best practices for organizing digital workspaces.
“Bundle”
Nested containers (BagIt, Docker)
Librarian-ready
Reproducibility range of 5 to 10 years (still worth integrating, target users are not science historians)
Desktop-size data and algorithms - closed and complete
“Geo-stuff” and R for the “last 10 %”
Remain understandable for scientists
house vs. apartment
Images including views (protmetcore, etc.)
Dockerizing R
Dockerizing Research and Development Environments
Running Tests
Dockerizing Documents
Control Docker Containers from R
R and Docker for Complex Web Applications
YES, you could do this manually, but once other container solutions are supported, things become more interesting!
Also, repetitive tasks