Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP348) - AWS re:Invent 2018
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Distributed, Incremental Dataflow
Processing on AWS with GRAIL’s Reflow
Marius Eriksen
Software Engineer
GRAIL Inc.
CMP348
GRAIL
• GRAIL’s purpose: Detect cancer early, when it can be cured
• How? Analyses of cell-free DNA (cfDNA) shed by tumors into the bloodstream
• Necessitates large scale:
• Sequencing data: Up to a terabyte per sample
• Studies: 100,000s of subjects
• Large scale computing and storage problems
Data processing at GRAIL
• Bioinformatics
• Machine learning
• Sample analysis
• Classifier evaluation
• Ad-hoc queries
Example: Bioinformatics
* Courtesy of Illumina, Inc.
Example: Machine learning
* Courtesy of Illumina, Inc.
… and more
Data workflow systems
• Similar to software build systems, but:
• Time: Days or hours, not minutes
• Volume: Terabytes, not gigabytes
• Flexibility: Dynamic, not static
• “File-grained” parallelism, not record-level
• Ubiquitous in ETL (extract, transform, load) workloads as well as
bioinformatics
Workflow systems landscape
• Crowded landscape; mostly thin frontends
• Backends: Kubernetes, Celery, Hadoop, AWS Batch,
… or combinations of these
• Little coherency. Usually EDSLs (or worse) to describe dependency
graphs.
Observations
• The state of the art is too mechanical
• Systems are constrained by lack of data model
… instead they are principally task execution frameworks
• Workflows are just programs, and benefit from programmatic
abstraction
Let’s start from scratch
Reflow
Functional
Statically typed, modular
Composes external tools
Referentially transparent
Incremental
Parallel
Cluster computing
Hello, (bioinformatics) world!
val reference = file("s3://.../g1k_v37.fa")
func align(r1, r2 file) =
    exec(image := "biocontainers/bwa") (out file) {"
        bwa mem -M -t 16 {{reference}} {{r1}} {{r2}} > {{out}}
    "}
val Main = {
    r1 := file("s3://.../SRR062640_1.filt.fastq.gz")
    r2 := file("s3://.../SRR062640_2.filt.fastq.gz")
    align(r1, r2)
}
Why
• Workflows are programs by another name
• Strong data model with a lot of leverage
• Simplicity in use and operations
• Data deserves an API
Goals
• Give data engineering the tools of modern software development
• Seamless cluster computing: Reflow should just work
• Safety
• Strong data model, incremental computation
• Minimal dependency footprint; self-contained
• No infrastructure
Why a new system?
• These properties are fundamental; can’t be bolted on
• Co-design of language and runtime leads to simplicity
• Giving APIs to data provides a lot of leverage for other infrastructure, in
powerful and surprising ways
Demo
Reference slides
Design principles
• Workflows as ordinary programs:
• Abstraction: Functions, modules
• Typing: Simple, structural type system; inferred
• Seamless integration of third-party tools
• Powerful enough, but maybe not Turing complete
• Simplicity of use and implementation:
• Co-design of language and runtime to simplify
• Minimize external dependencies
Language basics
val x = 123 // value binding: types are inferred
val y = {name: "reflow", age: 2*year} // records
val z = (1, "two", 3.0) // tuples
val xs = [1, 2, 3] // lists
val map = ["one": 1, "two": 2] // maps
val f = file("s3://test-bucket/test-path") // files
func foo(x, y int, z string) = // return type inferred
strings.FromInt(x*y)+z
val {name, x: age} = y // destructuring records
val (first, _, _) = z // destructuring tuples
Language basics
// Blocks are delimited by {}s, and contain
// a set of bindings followed by a naked
// “return” expression.
val Main string = {
    val name = getName()
    val greeting = getGreeting()
    greeting+", "+name
}
// Conditionals are expressions, too.
// Block syntax is mandatory.
val merged =
    if len(aligned) == 1 {
        aligned[0]
    } else {
        bam.Merge(aligned)
    }
Execs: The magic ingredient
val input file = …
// Run a command inside of the “ubuntu” Docker image,
// reserving 20 MiB of RAM and 2 CPUs. Place the output
// in a file. (The type of this expression is ‘file’.)
exec(image := "ubuntu", mem := 20*MiB, cpu := 2) (out file) {"
    cat {{input}} > {{out}}
"}
// Execs can have multiple outputs and inputs.
// Outputs may be file or dir typed; inputs are
// interpolated.
val level = 80
exec(…) (out dir, log file) {"
    decompress -l {{level}} -i {{input}} -o {{out}} >{{log}} 2>&1
"}
Clarity-preserving brevity
val record = {a, b, c} // {a: a, b: b, c: c}
val {a, b, c} = record // val {a: a, b: b, c: c} = …
// exec(image := image, cpu := cpu, …)
val result = exec(image, cpu, mem, disk) (out file) {"…
val mod = make("./aligner.rf",
    bandwidth, rounds, clippingPenalty := 10)
func Foo(x, y string, z int) = …
exec(…) (aligned, index file, diagnostic dir)
Comprehensions
func align(r1, r2 file) (out, stats file) = …
val samples [{name: string, files: [(file, file)]}] = …
val inputs [(file, file)] = …
val aligned = [align(r1, r2) | (r1, r2) <- inputs]
val aligned = [
    align(r1, r2) | {files} <- samples, (r1, r2) <- files
]
val aligned = [
    align(r1, r2) |
        {name, files} <- samples,
        if name != "BADSAMPLE",
        (r1, r2) <- files
]
Modules
// hello.rf:
param greeting string
func Hello(who string) = greeting+", "+who
// main.rf:
param (
    greeting = "hello"
    subject string
)
val hello = make("./hello.rf", greeting)
val Main = hello.Hello(subject)
# command:
$ reflow run main.rf -subject=world
hello, world
$ reflow run main.rf -greeting=hi -subject=there
hi, there
Modules
// hello.rf:
param (
    // Salutation indicates how to greet a person.
    salutation string
    // Greeting indicates which greeting to use.
    greeting = "hello"
)
// Greet returns the greeting for a subject.
func Greet(who string) = greeting+", "+salutation+" "+who
$ reflow doc hello.rf
Parameters
    val salutation string (required)
        Salutation indicates how to greet a person.
    val greeting string = "hello"
        Greeting indicates which greeting to use.
Declarations
    val Greet func(who string) string
        Greet returns the greeting for a subject.
Modules give data an API
// pipeline.rf:
param sample string
val inputs = make("inputs.rf", sample)
// Aligned is the alignment of sample.
val Aligned = align.Align(inputs.R1, inputs.R2)
// Index is the index of the aligned sample.
val Index = bam.Index(Aligned)
Modules are data APIs
• A module provides:
• A type
• Documentation
• Compositionality
• Introspection capabilities
• Moreover, values are names: They provide a stable reference to a
particular result/computation
Evaluation in Reflow
• Evaluation is lazy
• Evaluation follows data flow
• Expressions are memoized
• In combination, these lead to incremental semantics
Lazy evaluation
val hello (string, file) = {
    val out = exec(image, cpu) (out file) {"
        echo hello >{{out}}
    "}
    ("hello", out)
}
// The exec is never executed.
val Main = {
    val (str, _) = hello
    str
}
Lazy evaluation
val index = exec(image, cpu, mem) (ref file) {"
    bwa index {{reference}}
"}
func align(r1, r2, index file) (out file) =
    if len(r1) > threshold {
        indexedAlign(r1, r2, index)
    } else {
        unindexedAlign(r1, r2)
    }
// The index is computed only if the length of the
// reads exceeds the threshold.
val Main = align(r1, r2, index)
Lazy evaluation
• An expression is evaluated only if it is needed by a data dependency
• The user need not think about avoiding unnecessary computation
• Improves modularity
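The laziness described on these slides can be modeled outside Reflow. The following is a minimal Python sketch, not Reflow's implementation: the `Lazy` class and the `run_exec` stand-in are hypothetical names introduced here for illustration. It mirrors the `hello` tuple example, where the exec is never demanded and therefore never runs.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class Lazy(Generic[T]):
    """A thunk: wraps a computation and runs it at most once, on first demand."""

    _UNSET = object()

    def __init__(self, compute: Callable[[], T]) -> None:
        self._compute = compute
        self._value = Lazy._UNSET

    def force(self) -> T:
        if self._value is Lazy._UNSET:
            self._value = self._compute()
        return self._value


executed: List[str] = []  # records which "execs" actually ran


def run_exec(name: str) -> str:
    # Hypothetical stand-in for an expensive exec.
    executed.append(name)
    return name + ".out"


# Mirrors the slide's `hello` tuple: a cheap string paired with a lazy exec.
hello = (Lazy(lambda: "hello"), Lazy(lambda: run_exec("echo hello")))


def main() -> str:
    text, _ = hello  # the second component is never demanded
    return text.force()


result = main()
print(result, executed)  # hello []
```

Because demand drives execution, the caller does not need to know which parts of `hello` are expensive.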
Dataflow evaluation
func align(r1, r2 file) file = …
func stats(files [file]) file = …
val pairs [(file, file)] = …
// Here, the runtime parallelizes each invocation of
// align since there is no data dependency between them.
val aligned [file] = [align(r1, r2) | (r1, r2) <- pairs]
// Alignment and stats computation run in parallel
// since there is no data dependency between them.
val Main = {
    merged: merge(aligned),
    stats: stats(flatten([[r1, r2] | (r1, r2) <- pairs])),
}
Dataflow evaluation
• Any computation that can be parallelized is
• Users needn’t (can’t) express parallelism directly
• Only data dependencies imply sequencing
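To make "only data dependencies imply sequencing" concrete, here is a small Python sketch of a scheduler that starts every node as soon as its inputs are available. It is an illustration under assumptions, not Reflow's scheduler: the `Graph` shape, the `evaluate` function, and the node names are all hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Dict, List, Tuple

# Each node maps to (names of its dependencies, function of the dependency results).
Graph = Dict[str, Tuple[List[str], Callable[..., object]]]


def evaluate(graph: Graph, max_workers: int = 4) -> Dict[str, object]:
    """Run every node as soon as all of its dependencies have finished."""
    results: Dict[str, object] = {}
    pending = dict(graph)
    running = {}  # future -> node name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # Submit every node whose dependencies are all available.
            ready = [name for name, (deps, _) in pending.items()
                     if all(d in results for d in deps)]
            for name in ready:
                deps, fn = pending.pop(name)
                future = pool.submit(fn, *[results[d] for d in deps])
                running[future] = name
            if not running:
                raise ValueError("cycle or missing dependency in graph")
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
    return results


graph: Graph = {
    "align_a": ([], lambda: "a.bam"),
    "align_b": ([], lambda: "b.bam"),
    "merged": (["align_a", "align_b"], lambda a, b: "merge(" + a + "," + b + ")"),
    "stats": ([], lambda: "stats.txt"),
}
results = evaluate(graph)
print(results["merged"])  # merge(a.bam,b.bam)
```

The graph never states what runs in parallel; `align_a`, `align_b`, and `stats` are eligible together simply because none of them depends on another.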
Referential transparency, memoization
val index = computeIndex(reference)
func align(r1, r2 file) file = exec(…) (out file) {"
    bwa mem … {{index}} … {{r1}} {{r2}} > {{out}}
"}
// align(r1, r2) computed only once.
(align(r1, r2), align(r1, r2))
// align(r1b, r2b) computed, but index is reused.
align(r1b, r2b)
Automatic memoization
• Expensive computations are reused, needn’t be considered separately
• Indices, models, references, and so on can be expressed directly in code,
no need to stage computation
• Means that computation is precise: for example, if the input reference
changes, the index is recomputed
• Makes evaluation incremental
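One way to picture automatic memoization is a cache keyed by the command and the content of its inputs. The Python sketch below assumes that model; `memo_exec`, `index`, and `align` are hypothetical helpers written for this illustration, and the hashing scheme is not claimed to match Reflow's.

```python
import hashlib
from typing import Callable, Dict, List

cache: Dict[str, bytes] = {}  # key -> result; stands in for a shared cache
runs: List[str] = []          # commands that actually executed


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def memo_exec(command: str, inputs: List[bytes],
              run: Callable[[List[bytes]], bytes]) -> bytes:
    """Run `command` on `inputs` unless an identical invocation is cached."""
    # The key covers the command text and the content of every input.
    key = digest("\n".join([command] + [digest(i) for i in inputs]).encode())
    if key not in cache:
        runs.append(command)
        cache[key] = run(inputs)
    return cache[key]


def index(reference: bytes) -> bytes:
    return memo_exec("index", [reference], lambda ins: b"idx:" + ins[0])


def align(reference: bytes, r1: bytes, r2: bytes) -> bytes:
    # The index is written inline; memoization decides whether it runs.
    return memo_exec("align", [index(reference), r1, r2],
                     lambda ins: b"|".join(ins))


ref = b"ACGT"
first = align(ref, b"r1", b"r2")
again = align(ref, b"r1", b"r2")    # fully cached
other = align(ref, b"r1b", b"r2b")  # new alignment, index reused
print(runs)  # ['index', 'align', 'align']
```

Changing `ref` changes the index key, so both the index and every alignment that consumes it are recomputed, while unrelated cached results stay valid.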
Incremental evaluation
// Any change to either code or input data
// results in the smallest (re-)computation
// required to “catch up” what was previously
// computed.
func extract(sample file) file = …
func train(samples [file]) file = …
func evaluate(model file, samples [file]) file = …
val training [file] = …
val testing [file] = …
val model = train([extract(x) | x <- training])
val results = evaluate(model, testing)
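The extract/train/evaluate pipeline above can be imitated in Python to show what "smallest re-computation" means in practice. This is a sketch under assumptions: the `step` wrapper, its `version` argument (standing in for "the code changed"), and the string-valued samples are hypothetical, chosen only so that recomputation can be counted.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

cache: Dict[str, str] = {}
log: List[str] = []  # steps that were actually (re)computed


def step(name: str, version: str, fn: Callable[..., str]) -> Callable[..., str]:
    """Wrap fn so results are cached under (name, code version, arguments)."""
    def wrapped(*args: object) -> str:
        key = hashlib.sha256(repr((name, version, args)).encode()).hexdigest()
        if key not in cache:
            log.append(name)
            cache[key] = fn(*args)
        return cache[key]
    return wrapped


extract = step("extract", "v1", lambda sample: "features(" + sample + ")")
train = step("train", "v1", lambda feats: "model(" + ",".join(feats) + ")")
evaluate = step("evaluate", "v1",
                lambda model, tests: "score(" + model + ";" + ",".join(tests) + ")")


def pipeline(training: Tuple[str, ...], testing: Tuple[str, ...]) -> str:
    model = train(tuple(extract(x) for x in training))
    return evaluate(model, testing)


pipeline(("s1", "s2", "s3"), ("t1",))
first_run = list(log)
log.clear()
pipeline(("s1", "s2", "s3"), ("t1",))  # nothing changed
second_run = list(log)
log.clear()
pipeline(("s1", "s2", "s4"), ("t1",))  # one training sample replaced
third_run = list(log)
print(first_run, second_run, third_run)
```

In the third run only the new sample is extracted; `train` and `evaluate` rerun because their inputs changed, and the two untouched extractions are served from the cache.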
Implementation notes
Implementation goals
• Language semantics and data model accommodate implementation
simplicity
• Maintain minimal API surface to external components
• Small set of core abstractions around which the system is built
• … in service of minimizing systems complexity
The big picture
[Diagram: the evaluator (e.g., a laptop), the assoc (e.g., Amazon DynamoDB), the cache repository (e.g., Amazon S3), and a cluster (e.g., Amazon EC2) of allocs, each alloc with its own dockerd and local file repository]
Evaluation
• Two evaluators:
• Direct evaluation of AST into a flow graph
• Flow graph evaluation, separately
• The flow graph is a dependency graph between executions:
• Exec nodes encode a run-to-completion execution (intern, Docker,
extern)
• Continuation nodes call into the evaluator to produce a new
(dynamic) subgraph
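The two node kinds can be sketched in a few lines of Python. This is an illustrative model only: the `Exec` and `Continuation` classes, their fields, and the recursive `evaluate` are assumptions made for the example, not Reflow's internal types. The point is that a continuation cannot be expanded until the value it depends on exists, so part of the graph is only known at run time.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Union


@dataclass
class Exec:
    """Run-to-completion node: applies `run` to the values of its deps."""
    deps: List["Node"]
    run: Callable[..., Any]


@dataclass
class Continuation:
    """Once `dep` has a value, `expand` builds a new subgraph from it."""
    dep: "Node"
    expand: Callable[[Any], "Node"]


Node = Union[Exec, Continuation]


def evaluate(node: Node) -> Any:
    if isinstance(node, Exec):
        return node.run(*[evaluate(d) for d in node.deps])
    # Continuation: the shape of the subgraph depends on a computed value.
    subgraph = node.expand(evaluate(node.dep))
    return evaluate(subgraph)


# Stands in for interning a directory listing.
listing = Exec([], lambda: ["a.jpg", "b.jpg", "c.jpg"])


def resize_all(names: Any) -> Node:
    # One Exec per listed image; how many is unknown until the listing exists.
    resized = [Exec([], lambda n=n: "small-" + n) for n in names]
    return Exec(resized, lambda *outs: list(outs))


dogs = Continuation(listing, resize_all)
result = evaluate(dogs)
print(result)  # ['small-a.jpg', 'small-b.jpg', 'small-c.jpg']
```

This sketch evaluates sequentially for brevity; the independent `Exec` nodes produced by the continuation are the ones a dataflow scheduler could run in parallel.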
Evaluation strategy
• Top-down, then bottom-up:
• Compute key for every node for cache lookups
• Continuation nodes compute keys based on dependencies + partially
evaluated AST
• Derive both static (AST+static params) and dynamic (based on concrete
inputs) keys
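A rough Python model of "top-down, then bottom-up" follows. It assumes a key derived from a node's operation and its dependencies' keys, which corresponds loosely to the static keys mentioned above; dynamic keys based on concrete inputs are not modeled. The `Node` class, the `op` strings, and the in-memory `cache` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    op: str  # identifies the computation
    run: Callable[..., str]
    deps: List["Node"] = field(default_factory=list)

    def key(self) -> str:
        # Static key: derived from the op and the keys of the dependencies,
        # without evaluating anything.
        parts = [self.op] + [d.key() for d in self.deps]
        return hashlib.sha256("\0".join(parts).encode()).hexdigest()


cache: Dict[str, str] = {}
visited: List[str] = []


def evaluate(node: Node) -> str:
    visited.append(node.op)
    k = node.key()
    if k in cache:  # top-down: stop at the first cache hit
        return cache[k]
    args = [evaluate(d) for d in node.deps]  # descend only on a miss
    cache[k] = node.run(*args)  # bottom-up: compute, then store
    return cache[k]


fetch = Node("intern:reads", lambda: "reads")
align = Node("align", lambda r: "aligned(" + r + ")", [fetch])
report = Node("report", lambda a: "report(" + a + ")", [align])

first = evaluate(report)
first_visits = list(visited)
visited.clear()
second = evaluate(report)
second_visits = list(visited)
print(first_visits, second_visits)
# ['report', 'align', 'intern:reads'] ['report']
```

On the second evaluation the root's key hits the cache, so nothing below it is visited at all.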
Evaluation example
[Flow graph diagram (legend: Todo, Done): Root, Intern s3://…/dogs/, Continue]
val Dogs = [
    resize(img) |
    (_, img) <-
    list(dir(dogs))
]
Evaluation example
[Flow graph diagram (legend: Todo, Done): Root, Continue, Exec 1, Exec 2, Exec 3, Exec 4, Exec N]
val Dogs = [
    resize(img) |
    (_, img) <-
    list(dir(dogs))
]
Evaluation example
[Flow graph diagram (legend: Todo, Done): Root, Continue, Exec]
montage(dogs.Dogs)
Taking it further: dataspaces.
Observations
• With referential transparency, an expression is a stable name for the
value of that expression
• Reflow can derive a key from any expression
• We have the makings of a data namespace, a dataspace
Bundles
• Reflow’s bundle mechanism produces a single, self-contained artifact
for a module
• $ reflow bundle module.rf
• Bundles are modules:
• They can be run
• They can be introspected (for example, doc)
• They can be instantiated by other modules
Dataspaces
A dataspace is a mapping of symbolic names to bundles, for example:
$ reflow mount wgs.rfx
marius@grailbio.com/exp/wgs:v1
$ reflow doc marius@grailbio.com/exp/wgs:v1
$ reflow run marius@grailbio.com/exp/wgs:v1
Dataspaces are data APIs
• Makes data less ad-hoc:
• Typing
• Modules
• Documentation
• Stable names
• Dataspaces enforce subtyping rules:
• Can’t unintentionally break users
• Still fully incremental
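The slides do not spell out the subtyping rules. One common reading for structural record types is width subtyping: a new version of a module may add fields but may not remove or retype the ones existing users rely on. The Python sketch below illustrates that reading only; the `Type` encoding, `is_subtype`, and the field names are assumptions, not Reflow's type checker.

```python
from typing import Dict, Union

# A type is either a primitive name ("string", "file", ...) or a record of fields.
Type = Union[str, Dict[str, "Type"]]


def is_subtype(new: Type, old: Type) -> bool:
    """True if a value of type `new` can be used wherever `old` is expected."""
    if isinstance(old, str) or isinstance(new, str):
        return new == old
    # Records: `new` must provide every field of `old`, each at a subtype.
    return all(name in new and is_subtype(new[name], t)
               for name, t in old.items())


v1 = {"Aligned": "file", "Index": "file"}
v2 = {"Aligned": "file", "Index": "file", "Stats": "file"}  # adds a field
v3 = {"Aligned": "file"}                                    # drops a field

print(is_subtype(v2, v1), is_subtype(v3, v1))  # True False
```

Under this rule, publishing `v2` in place of `v1` is safe for existing users, while `v3` would be rejected.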
Status
Open source: https://github.com/grailbio/reflow
Reflow is in heavy use at GRAIL, CZI, and others
Bring your credentials; Reflow takes care of the rest!
Please complete the session survey in the mobile app.