Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP348) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Distributed, Incremental Dataflow
Processing on AWS with GRAIL’s Reflow
Marius Eriksen
Software Engineer
GRAIL Inc.
CMP348

GRAIL
• GRAIL’s purpose: Detect cancer early, when it can be cured
• How? Analyses of cell-free DNA (cfDNA) shed by tumors into the bloodstream
• Necessitates large scale:
• Sequencing data: Up to a terabyte per sample
• Studies: 100,000s of subjects
• Large scale computing and storage problems

Data processing at GRAIL
• Bioinformatics
• Machine learning
• Sample analysis
• Classifier evaluation
• Ad-hoc queries

Example: Bioinformatics
* Courtesy of Illumina, Inc.

Example: Machine learning
* Courtesy of Illumina, Inc.

… and more

Data workflow systems
• Similar to software build systems, but:
• Time: Days or hours, not minutes
• Volume: Terabytes, not gigabytes
• Flexibility: Dynamic, not static
• “File-grained” parallelism. Not record level.
• Ubiquitous in ETL (extract, transform, load) workloads as well as
bioinformatics

Workflow systems landscape
• Crowded landscape; mostly thin frontends
• Backends: Kubernetes, Celery, Hadoop, AWS Batch,
… or combinations of these
• Little coherency. Usually EDSLs (or worse) to describe dependency
graphs.

Observations
• The state of the art is too mechanical
• Systems are constrained by lack of data model
… instead they are principally task execution frameworks
• Workflows are just programs, and benefit from programmatic
abstraction

Let’s start from scratch

Reflow
Functional
Statically typed, modular
Composes external tools
Referentially transparent
Incremental
Parallel
Cluster computing

Hello, (bioinformatics) world!
val reference = file("s3://.../g1k_v37.fa")
func align(r1, r2 file) =
    exec(image := "biocontainers/bwa") (out file) {"
        bwa mem -M -t 16 {{reference}} {{r1}} {{r2}} > {{out}}
    "}
val Main = {
    r1 := file("s3://.../SRR062640_1.filt.fastq.gz")
    r2 := file("s3://.../SRR062640_2.filt.fastq.gz")
    align(r1, r2)
}

Why
• Workflows are programs by another name
• Strong data model with a lot of leverage
• Simplicity in use and operations
• Data deserves an API

Goals
• Give data engineering the tools of modern software development
• Seamless cluster computing: Reflow should just work
• Safety
• Strong data model, incremental computation
• Minimal dependency footprint; self-contained
• No infrastructure

Why a new system?
• These properties are fundamental; can’t be bolted on
• Co-design of language and runtime leads to simplicity
• Giving APIs to data provides a lot of leverage for other infrastructure, in
powerful and surprising ways

Demo

Reference slides

Design principles
• Workflows as ordinary programs:
• Abstraction: Functions, modules
• Typing: Simple, structural type system; inferred
• Seamless integration of third-party tools
• Powerful enough, but maybe not Turing complete
• Simplicity of use and implementation:
• Co-design of language and runtime to simplify
• Minimize external dependencies

Language basics
val x = 123 // value binding: types are inferred
val y = {name: "reflow", age: 2*year} // records
val z = (1, "two", 3.0) // tuples
val xs = [1, 2, 3] // lists
val map = ["one": 1, "two": 2] // maps
val f = file("s3://test-bucket/test-path") // files
func foo(x, y int, z string) = // return type inferred
    strings.FromInt(x*y)+z
val {name, x: age} = y // destructuring records
val (first, _, _) = z // destructuring tuples

Language basics
// Blocks are delimited by {}s, and contain
// a set of bindings followed by a naked
// “return” expression.
val Main string = {
    val name = getName()
    val greeting = getGreeting()
    greeting+", "+name
}
// Conditionals are expressions, too.
// Block syntax is mandatory.
val merged =
    if len(aligned) == 1 {
        aligned[0]
    } else {
        bam.Merge(aligned)
    }

Execs: The magic ingredient
val input file = …
// Run a command inside of the “ubuntu” Docker image,
// reserving 20MB of RAM and 2 CPUs. Place the output
// in a file. (The type of this expression is ‘file’.)
exec(image := "ubuntu", mem := 20*MiB, cpu := 2) (out file) {"
    cat {{input}} > {{out}}
"}
// Execs can have multiple outputs and inputs.
// Outputs may be file or dir typed; inputs are
// interpolated.
val level = 80
exec(…) (out dir, log file) {"
    decompress -l {{level}} -i {{input}} -o {{out}} >{{log}} 2>&1
"}

Clarity-preserving brevity
val record = {a, b, c} // {a: a, b: b, c: c}
val {a, b, c} = record // val {a: a, b: b, c: c} = …
// exec(image := image, cpu := cpu, …)
val result = exec(image, cpu, mem, disk) (out file) {"…
val mod = make("./aligner.rf",
    bandwidth, rounds, clippingPenalty := 10)
func Foo(x, y string, z int) = …
exec(…) (aligned, index file, diagnostic dir)

Comprehensions
func align(r1, r2 file) (out, stats file) = …
val samples [{name: string, files: [(file, file)]}] = …
val inputs [(file, file)] = …
val aligned = [align(r1, r2) | (r1, r2) <- inputs]
val aligned = [
    align(r1, r2) | (r1, r2) <- files, {files} <- samples
]
val aligned = [
    align(r1, r2) | (r1, r2) <- files,
    if name != "BADSAMPLE"
    {name, files} <- samples,
]

Modules
// hello.rf:
param greeting string
func Hello(who string) = greeting+", "+who
// main.rf:
param (
    greeting = "hello"
    subject string
)
val hello = make("./hello.rf", greeting)
val Main = hello.Hello(subject)
# command:
$ reflow run main.rf -subject=world
hello, world
$ reflow run main.rf -greeting=hi -subject=there
hi, there

Modules
// hello.rf:
param (
    // Salutation indicates how to greet a person.
    salutation string
    // Greeting indicates which greeting to use.
    greeting = "hello"
)
// Greet returns the greeting for a subject.
func Greet(who string) = greeting+", "+salutation+" "+who
$ reflow doc hello.rf
Parameters
val salutation string (required)
Salutation indicates how to greet a person.
val greeting string = "hello"
Greeting indicates which greeting to use.
Declarations
val Greet func(who string) string
Greet returns the greeting for a subject.

Modules give data an API
// pipeline.rf:
param sample string
val inputs = make("inputs.rf", sample)
// Aligned is the alignment of sample.
val Aligned = align.Align(inputs.R1, inputs.R2)
// Index is the index of the aligned sample.
val Index = bam.Index(Aligned)

Modules are data APIs
• A module provides:
• A type
• Documentation
• Compositionality
• Introspection capabilities
• Moreover, values are names: They provide a stable reference to a
particular result/computation

Evaluation in Reflow
• Evaluation is lazy
• Evaluation follows data flow
• Expressions are memoized
• In combination, these lead to incremental semantics

Lazy evaluation
val hello (string, file) = {
    val out = exec(image, cpu) (out file) {"
        echo hello >{{out}}
    "}
    ("hello", out)
}
// The exec is never executed.
val Main = {
    val (str, _) = hello
    str
}

Lazy evaluation
val index = exec(image, cpu, mem) (ref file) {"
    bwa index {{reference}}
"}
func align(r1, r2, index file) (out file) =
    if len(r1) > threshold {
        indexedAlign(r1, r2, index)
    } else {
        unindexedAlign(r1, r2)
    }
// The index is computed only if the length of the
// reads exceeds some threshold.
val Main = align(r1, r2, index)

Lazy evaluation
• An expression is evaluated only if it is needed by a data dependency
• The user need not think about avoiding unnecessary computation
• Improves modularity
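
The lazy-evaluation behaviour described above can be imitated outside Reflow. The following is a minimal Python sketch, not Reflow's implementation: the `Lazy` class, `run_exec`, and the `executed` log are hypothetical names introduced only for illustration. It mirrors the `hello` tuple from the earlier slide, where the exec is never run because nothing depends on its result.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Lazy(Generic[T]):
    """A value that is computed at most once, and only when forced."""

    def __init__(self, compute: Callable[[], T]) -> None:
        self._compute = compute
        self._done = False
        self._value = None

    def force(self) -> T:
        # Run the deferred computation on first use, then reuse the result.
        if not self._done:
            self._value = self._compute()
            self._done = True
        return self._value


executed = []  # records every (pretend) exec that actually runs


def run_exec() -> str:
    # Stand-in for an expensive exec; here it only logs that it ran.
    executed.append("echo hello")
    return "out-file"


# Mirrors the `hello` tuple: a plain string plus a deferred exec.
hello = ("hello", Lazy(run_exec))


def main() -> str:
    s, _ = hello  # the second component is never forced
    return s


result = main()
```

After `main()` returns, `executed` is still empty; the exec only runs if some later expression forces the second component.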

Dataflow evaluation
func align(r1, r2 file) file = …
func stats(files [file]) file = …
val pairs [(file, file)] = …
// Here, the runtime parallelizes each invocation of
// align since there is no data dependency between them.
val aligned [file] = [align(r1, r2) | (r1, r2) <- pairs]
// Alignment and stats computation run in parallel
// since there is no data dependency between them.
val Main = {
    merged: merge(aligned),
    stats: stats(flatten([[r1, r2] | (r1, r2) <- pairs])),
}

Dataflow evaluation
• Any computation that can be parallelized is parallelized
• Users needn’t (can’t) express parallelism directly
• Only data dependencies imply sequencing
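
For contrast, here is roughly what a user would have to write by hand to get the same schedule in a general-purpose language. This is a Python sketch under stated assumptions: `align`, `merge`, and `stats` are hypothetical stand-ins that only build strings, and the thread pool stands in for a cluster. In Reflow none of this scheduling code is written by the user; the runtime derives it from data dependencies.

```python
from concurrent.futures import ThreadPoolExecutor


def align(r1: str, r2: str) -> str:
    return f"aligned({r1},{r2})"


def merge(files: list) -> str:
    return "merged(" + "+".join(files) + ")"


def stats(files: list) -> str:
    return f"stats({len(files)} files)"


pairs = [("a_1", "a_2"), ("b_1", "b_2")]

with ThreadPoolExecutor() as pool:
    # No data dependency between alignments: submit them all at once.
    aligned_futures = [pool.submit(align, r1, r2) for r1, r2 in pairs]
    # stats depends only on the raw inputs, so it runs alongside alignment.
    stats_future = pool.submit(stats, [f for pair in pairs for f in pair])
    # merge depends on every alignment, so it must wait for all of them.
    merged = merge([f.result() for f in aligned_futures])
    main = {"merged": merged, "stats": stats_future.result()}
```

The only sequencing in the sketch is the wait before `merge`, which corresponds to the one real data dependency.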

Referential transparency, memoization
val index = computeIndex(reference)
func align(r1, r2 file) file = exec(…) (out file) {"
    bwa mem … {{index}} … {{r1}} {{r2}} > {{out}}
"}
// align(r1, r2) computed only once.
(align(r1, r2), align(r1, r2))
// align(r1b, r2b) computed, but index is reused.
align(r1b, r2b)

Automatic memoization
• Expensive computations are reused, needn’t be considered separately
• Indices, models, references, and so on can be expressed directly in code,
no need to stage computation
• Means that computation is precise: For example, if the input reference
changes, an index may be recomputed
• Makes evaluation incremental
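
One way to picture this kind of memoization is a cache keyed by a digest of the command and its input contents. The Python sketch below is an illustration under assumptions, not Reflow's actual key derivation: `digest`, `memo_exec`, `cache`, and `runs` are hypothetical names, and the "result" is a fake digest rather than a real output file.

```python
import hashlib

cache = {}  # key digest -> result digest
runs = []   # commands that were actually executed


def digest(*parts: str) -> str:
    """Hash a sequence of strings, length-prefixed to avoid ambiguity."""
    h = hashlib.sha256()
    for p in parts:
        data = p.encode()
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def memo_exec(command: str, *input_digests: str) -> str:
    """Run `command` only if this (command, inputs) pair is new."""
    key = digest(command, *input_digests)
    if key not in cache:
        runs.append(command)
        # Stand-in for running the command and hashing its output.
        cache[key] = digest("result", key)
    return cache[key]


reference_v1 = digest("reference contents v1")
index_v1 = memo_exec("bwa index", reference_v1)
index_again = memo_exec("bwa index", reference_v1)  # cache hit, no run

reference_v2 = digest("reference contents v2")
index_v2 = memo_exec("bwa index", reference_v2)     # input changed: rerun
```

Because the key covers the input's contents, a changed reference yields a new key and the index is recomputed, while an unchanged one is reused.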

Incremental evaluation
// Any change to either code or input data
// results in the smallest (re-)computation
// required to “catch up” what was previously
// computed.
func extract(sample file) file = …
func train(samples [file]) file = …
func evaluate(model file, samples [file]) file = …
val training [file] = …
val testing [file] = …
val model = train([extract(x) | x <- training])
val results = evaluate(model, testing)
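
The "catch up" behaviour can be sketched with the same memoization idea. In this hypothetical Python model, `step` is an assumed helper that caches by step name and inputs, and sample names stand in for content digests. It only models changes to input data, not changes to code.

```python
import hashlib

cache = {}
log = []  # names of steps that were actually (re)computed


def step(name, *inputs):
    """Memoized pipeline step keyed by its name and its inputs."""
    key = hashlib.sha256(repr((name, inputs)).encode()).hexdigest()
    if key not in cache:
        log.append(name)
        cache[key] = key[:12]  # stand-in for the result's digest
    return cache[key]


def pipeline(training, testing):
    features = tuple(step("extract", x) for x in training)
    model = step("train", features)
    return step("evaluate", model, tuple(testing))


first = pipeline(["s1", "s2"], ["t1"])

log.clear()
same = pipeline(["s1", "s2"], ["t1"])         # nothing changed
rerun_log = list(log)

log.clear()
grown = pipeline(["s1", "s2", "s3"], ["t1"])  # one new training sample
grown_log = list(log)
```

Re-running with unchanged inputs computes nothing. Adding one training sample recomputes only the new extraction, then training and evaluation, which depend on it.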

Implementation notes

Implementation goals
• Language semantics and data model accommodate implementation
simplicity
• Maintain minimal API surface to external components
• Small set of core abstractions around which the system is built
• … in service of minimizing systems complexity

The big picture
• evaluator (e.g., a laptop)
• assoc (e.g., Amazon DynamoDB)
• repository (cache, e.g., Amazon S3)
• cluster (e.g., Amazon EC2):
• alloc: dockerd, repository (local, file)
• alloc: dockerd, repository (local, file)
• …

Evaluation
• Two evaluators:
• Direct evaluation of AST into a flow graph
• Flow graph evaluation, separately
• The flow graph is a dependency graph between executions:
• Exec nodes encode a run-to-completion execution (intern, Docker,
extern)
• Continuation nodes call into the evaluator to produce a new
(dynamic) subgraph
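
The split between exec nodes and continuation nodes can be pictured with a small model. This Python sketch is only an illustration under assumptions: the `Exec` and `Continue` classes and the `evaluate` function are hypothetical and far simpler than Reflow's flow graph. The point it shows is that the subgraph under a continuation node is not known until its dependency has produced a value.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Exec:
    """A run-to-completion step; `run` stands in for the real execution."""
    name: str
    run: Callable[[], object]


@dataclass
class Continue:
    """Once `dep` has a value, `expand` produces a new list of nodes."""
    dep: Union["Exec", "Continue"]
    expand: Callable[[object], list]


def evaluate(node):
    if isinstance(node, Exec):
        return node.run()
    value = evaluate(node.dep)
    # The subgraph is created dynamically from the dependency's value.
    return [evaluate(n) for n in node.expand(value)]


# Pretend directory listing; the file names are made up for the example.
listing = Exec("intern", lambda: ["a.jpg", "b.jpg"])

# One resize exec per image; how many is unknown before the listing runs.
dogs = Continue(
    listing,
    lambda imgs: [
        Exec("resize", lambda img=img: f"resized:{img}") for img in imgs
    ],
)

result = evaluate(dogs)
```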

Evaluation strategy
• Top-down, then bottom-up:
• Compute key for every node for cache lookups
• Continuation nodes compute keys based on dependencies + partially
evaluated AST
• Derive both static (AST+static params) and dynamic (based on concrete
inputs) keys

Evaluation example
val Dogs = [
    resize(img) |
    (_, img) <-
    list(dir(dogs))
]
[Diagram: flow graph with nodes Root, Continue, and Intern s3://…/dogs/; legend: Todo, Done]

Evaluation example
val Dogs = [
    resize(img) |
    (_, img) <-
    list(dir(dogs))
]
[Diagram: flow graph with nodes Root, Continue, and Exec 1, Exec 2, Exec 3, Exec 4, … Exec N; legend: Todo, Done]

Evaluation example
montage(dogs.Dogs)
[Diagram: flow graph with nodes Root, Continue, and Exec; legend: Todo, Done]

Taking it further: dataspaces.

Observations
• With referential transparency, an expression is a stable name for the
value of that expression
• Reflow can derive a key from any expression
• We have the makings of a data namespace, a dataspace

Bundles
• Reflow’s bundle mechanism produces a single, self-contained artifact
for a module
• $ reflow bundle module.rf
• Bundles are modules:
• They can be run
• They can be introspected (for example, doc)
• They can be instantiated by other modules

Dataspaces
A dataspace is a mapping of symbolic names to bundles, for example:
$ reflow mount wgs.rfx
marius@grailbio.com/exp/wgs:v1
$ reflow doc marius@grailbio.com/exp/wgs:v1
$ reflow run marius@grailbio.com/exp/wgs:v1

Dataspaces are data APIs
• Makes data less ad-hoc:
• Typing
• Modules
• Documentation
• Stable names
• Dataspaces enforce subtyping rules:
• Can’t unintentionally break users
• Still fully incremental
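
The subtyping rule can be illustrated with a toy structural check. This Python sketch is an assumption, not Reflow's type checker: module types are modelled as plain dictionaries from field name to type, `is_subtype` is a hypothetical helper, and the `Stats` field is invented for the example. The idea is that a new version may replace an old one only if it still offers everything the old one did.

```python
def is_subtype(new, old) -> bool:
    """True if `new` can be used wherever `old` was expected."""
    if isinstance(old, dict):
        # Record-like types: every old field must exist with a compatible type.
        return isinstance(new, dict) and all(
            name in new and is_subtype(new[name], t)
            for name, t in old.items()
        )
    # Base types must match exactly in this simplified model.
    return new == old


v1 = {"Aligned": "file", "Index": "file"}
v2 = {"Aligned": "file", "Index": "file", "Stats": "file"}  # adds a field
v3 = {"Aligned": "file"}                                    # drops a field

can_publish_v2 = is_subtype(v2, v1)
can_publish_v3 = is_subtype(v3, v1)
```

Adding a field keeps existing users working, so `v2` is accepted; dropping `Index` would break them, so `v3` is rejected.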

Status
Open source: https://github.com/grailbio/reflow
Reflow is in heavy use at GRAIL, CZI, others
Bring your credentials, Reflow takes care of the rest!

Thank you!

Please complete the session
survey in the mobile app.
