© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Distributed, Incremental Dataflow
Processing on AWS with GRAIL’s Reflow
Marius Eriksen
Software Engineer
GRAIL Inc.
CMP348
GRAIL
• GRAIL’s purpose: Detect cancer early, when it can be cured
• How? Analyses of cell-free DNA (cfDNA) shed by tumors into the bloodstream
• Necessitates large scale:
• Sequencing data: Up to a terabyte per sample
• Studies: 100,000s of subjects
• Large scale computing and storage problems
Data processing at GRAIL
• Bioinformatics
• Machine learning
• Sample analysis
• Classifier evaluation
• Ad-hoc queries
Example: Bioinformatics
* Courtesy of Illumina, Inc.
Example: Machine learning
* Courtesy of Illumina, Inc.
… and more
Data workflow systems
• Similar to software build systems, but:
• Time: Days or hours, not minutes
• Volume: Terabytes, not gigabytes
• Flexibility: Dynamic, not static
• “File-grained” parallelism. Not record level.
• Ubiquitous in ETL (extract, transform, load) workloads as well as bioinformatics
Workflow systems landscape
• Crowded landscape; mostly thin frontends
• Backends: Kubernetes, Celery, Hadoop, AWS Batch,
… or combinations of these
• Little coherency. Usually EDSLs (or worse) to describe dependency
graphs.
Observations
• The state of the art is too mechanical
• Systems are constrained by lack of data model
… instead they are principally task execution frameworks
• Workflows are just programs, and benefit from programmatic
abstraction
Let’s start from scratch
Reflow
Functional
Statically typed, modular
Composes external tools
Referentially transparent
Incremental
Parallel
Cluster computing
Hello, (bioinformatics) world!
val reference = file("s3://.../g1k_v37.fa")

func align(r1, r2 file) =
    exec(image := "biocontainers/bwa") (out file) {"
        bwa mem -M -t 16 {{reference}} {{r1}} {{r2}} > {{out}}
    "}

val Main = {
    r1 := file("s3://.../SRR062640_1.filt.fastq.gz")
    r2 := file("s3://.../SRR062640_2.filt.fastq.gz")
    align(r1, r2)
}
Why
• Workflows are programs by another name
• Strong data model with a lot of leverage
• Simplicity in use and operations
• Data deserves an API
Goals
• Give data engineering the tools of modern software development
• Seamless cluster computing: Reflow should just work
• Safety
• Strong data model, incremental computation
• Minimal dependency footprint; self-contained
• No infrastructure
Why a new system?
• These properties are fundamental; can’t be bolted on
• Co-design of language and runtime leads to simplicity
• Giving APIs to data provides a lot of leverage for other infrastructure, in
powerful and surprising ways
Demo
Reference slides
Design principles
• Workflows as ordinary programs:
• Abstraction: Functions, modules
• Typing: Simple, structural type system; inferred
• Seamless integration of third-party tools
• Powerful enough, but maybe not Turing complete
• Simplicity of use and implementation:
• Co-design of language and runtime to simplify
• Minimize external dependencies
Language basics
val x = 123 // value binding: types are inferred
val y = {name: "reflow", age: 2*year} // records
val z = (1, "two", 3.0) // tuples
val xs = [1, 2, 3] // lists
val map = ["one": 1, "two": 2] // maps
val f = file("s3://test-bucket/test-path") // files
func foo(x, y int, z string) = // return type inferred
    strings.FromInt(x*y)+z
val {name, x: age} = y // destructuring records
val (first, _, _) = z // destructuring tuples
Language basics
// Blocks are delimited by {}s, and contain
// a set of bindings followed by a naked
// “return” expression.
val Main string = {
    val name = getName()
    val greeting = getGreeting()
    greeting+", "+name
}
// Conditionals are expressions, too.
// Block syntax is mandatory.
val merged =
    if len(aligned) == 1 {
        aligned[0]
    } else {
        bam.Merge(aligned)
    }
Execs: The magic ingredient
val input file = …
// Run a command inside of the "ubuntu" Docker image,
// reserving 20 MiB of RAM and 2 CPUs. Place the output
// in a file. (The type of this expression is 'file'.)
exec(image := "ubuntu", mem := 20*MiB, cpu := 2) (out file) {"
    cat {{input}} > {{out}}
"}
// Execs can have multiple outputs and inputs.
// Outputs may be file or dir typed; inputs are
// interpolated.
val level = 80
exec(…) (out dir, log file) {"
    decompress -l {{level}} -i {{input}} -o {{out}} >{{log}} 2>&1
"}
Clarity-preserving brevity
val record = {a, b, c} // {a: a, b: b, c: c}
val {a, b, c} = record // val {a: a, b: b, c: c} = …
// exec(image := image, cpu := cpu, …)
val result = exec(image, cpu, mem, disk) (out file) {"…
val mod = make("./aligner.rf",
    bandwidth, rounds, clippingPenalty := 10)
func Foo(x, y string, z int) = …
exec(…) (aligned, index file, diagnostic dir)
Comprehensions
func align(r1, r2 file) (out, stats file) = …
val samples [{name: string, files: [(file, file)]}] = …
val inputs [(file, file)] = …
val aligned = [align(r1, r2) | (r1, r2) <- inputs]
val aligned = [
    align(r1, r2) | (r1, r2) <- files, {files} <- samples
]
val aligned = [
    align(r1, r2) | (r1, r2) <- files,
                    {name, files} <- samples,
                    if name != "BADSAMPLE"
]
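For readers who know Python better than Reflow, the filtered, nested comprehension can be read roughly as the sketch below. It is an analogy only, not Reflow code: `align` is a hypothetical stand-in that returns a string instead of running an aligner, the sample records are made-up data, and the clause order follows Python's rules rather than Reflow's.

```python
# Hypothetical stand-in for the Reflow align function: it only
# records which pair of reads would be aligned.
def align(r1, r2):
    return f"aligned({r1},{r2})"

# Made-up sample records mirroring {name: string, files: [(file, file)]}.
samples = [
    {"name": "A", "files": [("a_1.fq", "a_2.fq")]},
    {"name": "BADSAMPLE", "files": [("bad_1.fq", "bad_2.fq")]},
    {"name": "B", "files": [("b_1.fq", "b_2.fq"), ("b_3.fq", "b_4.fq")]},
]

# One result per read pair, across every sample except the excluded one.
aligned = [
    align(r1, r2)
    for sample in samples
    if sample["name"] != "BADSAMPLE"
    for (r1, r2) in sample["files"]
]

print(aligned)
```

The result is a flat list with one entry per read pair, which is the shape the comprehensions on this slide produce.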
Modules
// hello.rf:
param greeting string
func Hello(who string) = greeting+", "+who
// main.rf:
param (
    greeting = "hello"
    subject string
)
val hello = make("./hello.rf", greeting)
val Main = hello.Hello(subject)
# command:
$ reflow run main.rf -subject=world
hello, world
$ reflow run main.rf -greeting=hi -subject=there
hi, there
Modules
// hello.rf:
param (
    // Salutation indicates how to greet a person.
    salutation string
    // Greeting indicates which greeting to use.
    greeting = "hello"
)
// Greet returns the greeting for a subject.
func Greet(who string) = greeting+", "+salutation+" "+who
$ reflow doc hello.rf
Parameters
    val salutation string (required)
        Salutation indicates how to greet a person.
    val greeting string = "hello"
        Greeting indicates which greeting to use.
Declarations
    val Greet func(who string) string
        Greet returns the greeting for a subject.
Modules give data an API
// pipeline.rf:
param sample string
val inputs = make("inputs.rf", sample)
// Aligned is the alignment of sample.
val Aligned = align.Align(inputs.R1, inputs.R2)
// Index is the index of the aligned sample.
val Index = bam.Index(Aligned)
Modules are data APIs
• A module provides:
• A type
• Documentation
• Compositionality
• Introspection capabilities
• Moreover, values are names: They provide a stable reference to a
particular result/computation
Evaluation in Reflow
• Evaluation is lazy
• Evaluation follows data flow
• Expressions are memoized
• In combination, these lead to incremental semantics
Lazy evaluation
val hello (string, file) = {
    val out = exec(image, cpu) (out file) {"
        echo hello >{{out}}
    "}
    ("hello", out)
}
// The exec is never executed.
val Main = {
    val (str, _) = hello
    str
}
Lazy evaluation
val index = exec(image, cpu, mem) (ref file) {"
    bwa index {{reference}}
"}
func align(r1, r2, index file) (out file) =
    if len(r1) > threshold {
        indexedAlign(r1, r2, index)
    } else {
        unindexedAlign(r1, r2)
    }
// The index is computed only if the length of the
// reads exceeds some threshold.
val Main = align(r1, r2, index)
Lazy evaluation
• An expression is evaluated only if it is needed by a data dependency
• The user need not think about avoiding unnecessary computation
• Improves modularity
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Dataflow evaluation
func align(r1, r2 file) file = …
func stats(files [file]) file = …
val pairs [(file, file)] = …
// Here, the runtime parallelizes each invocation of
// align since there is no data dependency between them.
val aligned [file] = [align(r1, r2) | (r1, r2) <- pairs]
// Alignment and stats computation run in parallel
// since there is no data dependency between them.
val Main = {
    merged: merge(aligned),
    stats: stats(flatten([[r1, r2] | (r1, r2) <- pairs])),
}
Dataflow evaluation
• Any computation that can be parallelized is
• Users needn’t (can’t) express parallelism directly
• Only data dependencies imply sequencing
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Referential transparency, memoization
val index = computeIndex(reference)
func align(r1, r2 file) file = exec(…) (out file) {"
    bwa mem … {{index}} … {{r1}} {{r2}} > {{out}}
"}
// align(r1, r2) computed only once.
(align(r1, r2), align(r1, r2))
// align(r1b, r2b) computed, but index is reused.
align(r1b, r2b)
Automatic memoization
• Expensive computations are reused, needn’t be considered separately
• Indices, models, references, and so on can be expressed directly in code,
no need to stage computation
• Means that computation is precise: For example, if the input reference
changes, an index may be recomputed
• Makes evaluation incremental
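The effect of automatic memoization can be illustrated in Python with a cache keyed by a digest of the computation name and its inputs. This is a sketch under assumptions, not Reflow's cache: `memo_exec`, the in-memory `cache` dictionary, and the string "files" are hypothetical stand-ins, and the slides do not specify how Reflow forms its keys.

```python
import hashlib

cache = {}  # digest -> result; stands in for a persistent cache
runs = []   # names of computations that actually executed


def digest(*parts):
    h = hashlib.sha256()
    for p in parts:
        data = p.encode()
        h.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguity
        h.update(data)
    return h.hexdigest()


def memo_exec(name, *inputs):
    """Run the (fake) computation `name` on `inputs` unless already cached."""
    key = digest(name, *inputs)
    if key not in cache:
        runs.append(name)
        cache[key] = f"{name}({', '.join(inputs)})"  # fake output contents
    return cache[key]


def align(reference, r1, r2):
    # The index is written inline; no separate staging step is needed.
    index = memo_exec("index", reference)
    return memo_exec("align", index, r1, r2)


first = align("ref-v1", "a_1.fq", "a_2.fq")
again = align("ref-v1", "a_1.fq", "a_2.fq")    # fully cached
other = align("ref-v1", "b_1.fq", "b_2.fq")    # index reused, new alignment
changed = align("ref-v2", "a_1.fq", "a_2.fq")  # new reference: both rerun
print(runs)
```

Because the key covers the inputs, a changed reference invalidates the index and everything computed from it, while unrelated results stay cached.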
Incremental evaluation
// Any change to either code or input data
// results in the smallest (re-)computation
// required to “catch up” what was previously
// computed.
func extract(sample file) file = …
func train(samples [file]) file = …
func evaluate(model file, samples [file]) file = …
val training [file] = …
val testing [file] = …
val model = train([extract(x) | x <- training])
val results = evaluate(model, testing)
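To make "smallest re-computation" concrete, here is a hedged Python analogy using `functools.lru_cache` as an in-process stand-in for memoization. The function bodies and sample names are made up, and the cache lives only for the life of the process, whereas the slides describe reuse across runs.

```python
from functools import lru_cache

calls = {"extract": 0, "train": 0}


@lru_cache(maxsize=None)
def extract(sample):
    calls["extract"] += 1
    return f"features({sample})"


@lru_cache(maxsize=None)
def train(features):  # a tuple, so that it can serve as a cache key
    calls["train"] += 1
    return f"model({len(features)} samples)"


def pipeline(training):
    return train(tuple(extract(s) for s in training))


model_v1 = pipeline(["s1", "s2", "s3"])
after_first = dict(calls)

# Adding one training sample extracts only the new sample, then retrains.
model_v2 = pipeline(["s1", "s2", "s3", "s4"])
after_second = dict(calls)

print(after_first, after_second)
```

Extraction for the three existing samples is reused; only the new sample is extracted, and `train` reruns because its input changed.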
Implementation notes
Implementation goals
• Language semantics and data model accommodate implementation
simplicity
• Maintain minimal API surface to external components
• Small set of core abstractions around which the system is built
• … in service of minimizing systems complexity
The big picture
[Architecture diagram. Components:]
• evaluator (e.g., a laptop)
• assoc (e.g., Amazon DynamoDB)
• repository (cache, e.g., Amazon S3)
• cluster (e.g., Amazon EC2): alloc, alloc, …, each shown with dockerd and a repository (local, file)
Evaluation
• Two evaluators:
• Direct evaluation of AST into a flow graph
• Flow graph evaluation, separately
• The flow graph is a dependency graph between executions:
• Exec nodes encode a run-to-completion execution (intern, Docker, extern)
• Continuation nodes call into the evaluator to produce a new
(dynamic) subgraph
Evaluation strategy
• Top-down, then bottom-up:
• Compute key for every node for cache lookups
• Continuation nodes compute keys based on dependencies + partially
evaluated AST
• Derive both static (AST+static params) and dynamic (based on concrete
inputs) keys
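One way to picture per-node cache keys is a Merkle-style scheme in which a node's key is a hash of its operation and the keys of its dependencies. The Python sketch below is an assumption-laden illustration, not Reflow's key format: `node_key`, `graph_keys`, and the example graph are hypothetical, the graph is assumed to be acyclic, and the static versus dynamic key distinction from the slide is not modelled.

```python
import hashlib


def node_key(op, dep_keys):
    """Key for a node: hash of its operation and its dependencies' keys."""
    h = hashlib.sha256()
    h.update(op.encode())
    for k in dep_keys:
        h.update(b"\x00")  # separator between fields
        h.update(k.encode())
    return h.hexdigest()


def graph_keys(graph):
    """graph: name -> (op, [dependency names]). Returns name -> key."""
    keys = {}

    def visit(name):
        if name not in keys:
            op, deps = graph[name]
            keys[name] = node_key(op, [visit(d) for d in deps])
        return keys[name]

    for name in graph:
        visit(name)
    return keys


# A made-up flow graph: two inputs, one resize each, then a montage.
graph = {
    "img1": ("input img1.jpg", []),
    "img2": ("input img2.jpg", []),
    "resize1": ("resize", ["img1"]),
    "resize2": ("resize", ["img2"]),
    "montage": ("montage", ["resize1", "resize2"]),
}

before = graph_keys(graph)
graph["img2"] = ("input img2-edited.jpg", [])  # one input changes
after = graph_keys(graph)

changed = sorted(n for n in graph if before[n] != after[n])
print(changed)
```

Only the changed input and the nodes downstream of it get new keys, so a cache lookup by key still hits for `img1` and `resize1`.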
Evaluation example
[Flow graph diagram. Nodes: Root, Intern s3://…/dogs/, Continue. Legend: Todo, Done.]
val Dogs = [
    resize(img) | (_, img) <- list(dir(dogs))
]
Evaluation example
[Flow graph diagram. Nodes: Root, Continue, Exec 1, Exec 2, Exec 3, Exec 4, Exec N. Legend: Todo, Done.]
val Dogs = [
    resize(img) | (_, img) <- list(dir(dogs))
]
Evaluation example
[Flow graph diagram. Nodes: Root, Continue, Exec. Legend: Todo, Done.]
montage(dogs.Dogs)
Taking it further: dataspaces.
Observations
• With referential transparency, an expression is a stable name for the
value of that expression
• Reflow can derive a key from any expression
• We have the makings of a data namespace, a dataspace
Bundles
• Reflow’s bundle mechanism produces a single, self-contained artifact for a module
• $ reflow bundle module.rf
• Bundles are modules:
• They can be run
• They can be introspected (for example, doc)
• They can be instantiated by other modules
Dataspaces
A dataspace is a mapping of symbolic names to bundles, for example:
$ reflow mount wgs.rfx marius@grailbio.com/exp/wgs:v1
$ reflow doc marius@grailbio.com/exp/wgs:v1
$ reflow run marius@grailbio.com/exp/wgs:v1
Dataspaces are data APIs
• Makes data less ad-hoc:
• Typing
• Modules
• Documentation
• Stable names
• Dataspaces enforce subtyping rules:
• Can’t unintentionally break users
• Still fully incremental
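The slides do not spell out the subtyping rules, so the following Python sketch only illustrates the general idea of a structural compatibility check: a new version of a module may add fields but must not remove or retype the ones existing users depend on. The type encoding (strings for base types, dicts for records), the function `is_subtype`, and the example module types are all assumptions.

```python
def is_subtype(new, old):
    """True if a value of type `new` can be used where `old` is expected.

    Types are modelled as either a string naming a base type ("file",
    "string", ...) or a dict mapping field names to types (a record).
    """
    if isinstance(old, dict):
        if not isinstance(new, dict):
            return False
        # Every old field must still exist with a compatible type;
        # extra fields in the new type are allowed.
        return all(f in new and is_subtype(new[f], t) for f, t in old.items())
    return new == old


# Hypothetical module types for successive versions of a pipeline.
v1 = {"Aligned": "file", "Index": "file"}
v2_adds_field = {"Aligned": "file", "Index": "file", "Stats": "file"}
v2_drops_field = {"Aligned": "file"}
v2_retypes_field = {"Aligned": "dir", "Index": "file"}

print(is_subtype(v2_adds_field, v1))
print(is_subtype(v2_drops_field, v1))
print(is_subtype(v2_retypes_field, v1))
```

A check of this shape, applied when a name is rebound to a new bundle, would reject the second and third versions before any user of `v1` could break.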
Status
Open source: https://github.com/grailbio/reflow
Reflow is in heavy use at GRAIL, CZI, and others
Bring your credentials, Reflow takes care of the rest!
Thank you!
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP348) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Distributed, Incremental Dataflow Processing on AWS with GRAIL’s Reflow Marius Eriksen Software Engineer GRAIL Inc. C M P 3 4 8
  • 3. GRAIL
      • GRAIL’s purpose: Detect cancer early, when it can be cured
      • How? Analyses of cell-free DNA (cfDNA) shed by tumors into the bloodstream
      • Necessitates large scale:
      • Sequencing data: Up to a terabyte per sample
      • Studies: 100,000s of subjects
      • Large scale computing and storage problems
  • 4. Data processing at GRAIL
      • Bioinformatics
      • Machine learning
      • Sample analysis
      • Classifier evaluation
      • Ad-hoc queries
  • 5. Example: Bioinformatics [Figure, courtesy of Illumina, Inc.]
  • 6. Example: Machine learning [Figures, courtesy of Illumina, Inc.]
  • 7. … and more
  • 8. Data workflow systems
      • Similar to software build systems, but:
      • Time: Days or hours, not minutes
      • Volume: Terabytes, not gigabytes
      • Flexibility: Dynamic, not static
      • “File-grained” parallelism, not record-level
      • Ubiquitous in ETL (extract, transform, load) workloads as well as bioinformatics
  • 9. Workflow systems landscape
      • Crowded landscape; mostly thin frontends
      • Backends: Kubernetes, Celery, Hadoop, AWS Batch, … or combinations of these
      • Little coherency. Usually EDSLs (or worse) to describe dependency graphs.
  • 10. Observations
      • The state of the art is too mechanical
      • Systems are constrained by the lack of a data model … instead they are principally task execution frameworks
      • Workflows are just programs, and benefit from programmatic abstraction
  • 11. Let’s start from scratch
  • 12. Reflow
      • Functional
      • Statically typed, modular
      • Composes external tools
      • Referentially transparent
      • Incremental
      • Parallel
      • Cluster computing
  • 13. Hello, (bioinformatics) world!
      val reference = file("s3://.../g1k_v37.fa")

      func align(r1, r2 file) =
          exec(image := "biocontainers/bwa") (out file) {"
              bwa mem -M -t 16 {{reference}} {{r1}} {{r2}} > {{out}}
          "}

      val Main = {
          r1 := file("s3://.../SRR062640_1.filt.fastq.gz")
          r2 := file("s3://.../SRR062640_2.filt.fastq.gz")
          align(r1, r2)
      }
  • 14. Why
      • Workflows are programs by another name
      • Strong data model with a lot of leverage
      • Simplicity in use and operations
      • Data deserves an API
  • 15. Goals
      • Give data engineering the tools of modern software development
      • Seamless cluster computing: Reflow should just work
      • Safety
      • Strong data model, incremental computation
      • Minimal dependency footprint; self-contained
      • No infrastructure
  • 16. Why a new system?
      • These properties are fundamental; can’t be bolted on
      • Co-design of language and runtime leads to simplicity
      • Giving APIs to data provides a lot of leverage for other infrastructure, in powerful and surprising ways
  • 17. Demo
  • 18. Reference slides
  • 19. Design principles
      • Workflows as ordinary programs:
      • Abstraction: Functions, modules
      • Typing: Simple, structural type system; inferred
      • Seamless integration of third-party tools
      • Powerful enough, but maybe not Turing complete
      • Simplicity of use and implementation:
      • Co-design of language and runtime to simplify
      • Minimize external dependencies
  • 20. Language basics
      val x = 123                                 // value binding: types are inferred
      val y = {name: "reflow", age: 2*year}       // records
      val z = (1, "two", 3.0)                     // tuples
      val xs = [1, 2, 3]                          // lists
      val map = ["one": 1, "two": 2]              // maps
      val f = file("s3://test-bucket/test-path")  // files
      func foo(x, y int, z string) =              // return type inferred
          strings.FromInt(x*y)+z
      val {name, x: age} = y                      // destructuring records
      val (first, _, _) = z                       // destructuring tuples
  • 21. Language basics
      // Blocks are delimited by {}s, and contain
      // a set of bindings followed by a naked
      // "return" expression.
      val Main string = {
          val name = getName()
          val greeting = getGreeting()
          greeting+", "+name
      }

      // Conditionals are expressions, too.
      // Block syntax is mandatory.
      val merged = if len(aligned) == 1 {
          aligned[0]
      } else {
          bam.Merge(aligned)
      }
  • 22. Execs: The magic ingredient
      val input file = …

      // Run a command inside of the "ubuntu" Docker image,
      // reserving 20MB of RAM and 2 CPUs. Place the output
      // in a file. (The type of this expression is 'file'.)
      exec(image := "ubuntu", mem := 20*MiB, cpu := 2) (out file) {"
          cat {{input}} > {{out}}
      "}

      // Execs can have multiple outputs and inputs.
      // Outputs may be file or dir typed; inputs are
      // interpolated.
      var level = 80
      exec(…) (out dir, log file) {"
          decompress -l {{level}} -i {{input}} -o {{out}} >{{log}} 2>&1
      "}
  • 23. Clarity-preserving brevity
      val record = {a, b, c}    // {a: a, b: b, c: c}
      val {a, b, c} = record    // val {a: a, b: b, c: c} = …

      // exec(image := image, cpu := cpu, …)
      val result = exec(image, cpu, mem, disk) (out file) {"…

      val mod = make("./aligner.rf", bandwidth, rounds, clippingPenalty := 10)

      func Foo(x, y string, z int) = …

      exec(…) (aligned, index file, diagnostic dir)
  • 24. Comprehensions
      func align(r1, r2 file) (out, stats file) = …
      val samples [{name: string, files: [(file, file)]}] = …
      val inputs [(r1, r2)] = …

      val aligned = [align(r1, r2) | (r1, r2) <- inputs]

      val aligned = [
          align(r1, r2) |
          (r1, r2) <- files,
          {files} <- samples
      ]

      val aligned = [
          align(r1, r2) |
          (r1, r2) <- files, if name != "BADSAMPLE"
          {name, files} <- samples,
      ]
  • 25. Modules
      // hello.rf:
      param greeting string
      func Hello(who string) = greeting+", "+who

      // main.rf:
      param (
          greeting = "hello"
          subject string
      )
      val hello = make("./hello.rf", greeting)
      val Main = hello.Hello("world")

      # command:
      $ reflow run main.rf -subject=world
      hello, world
      $ reflow run main.rf -greeting=hi -subject=there
      hi, there
  • 26. Modules
      // hello.rf:
      param (
          // Salutation indicates how to greet a person.
          salutation string
          // Greeting indicates which greeting to use.
          greeting = "hello"
      )
      // Greet returns the greeting for a subject.
      func Greet(who string) = greeting+", "+salutation+" "+who

      $ reflow doc hello.rf
      Parameters
          val salutation string (required)
              Salutation indicates how to greet a person.
          val greeting string = "hello"
              Greeting indicates which greeting to use.
      Declarations
          val Greet func(who string) string
              Greet returns the greeting for a subject.
  • 27. Modules give data an API
      // pipeline.rf:
      param sample string
      val inputs = make("inputs.rf", sample)

      // Aligned is the alignment of sample.
      val Aligned = align.Align(sample.R1, sample.R2)

      // Index is the index of the aligned sample.
      val Index = bam.Index(Aligned)
  • 28. Modules are data APIs
      • A module provides:
      • A type
      • Documentation
      • Compositionality
      • Introspection capabilities
      • Moreover, values are names: They provide a stable reference to a particular result/computation
  • 29. Evaluation in Reflow
      • Evaluation is lazy
      • Evaluation follows data flow
      • Expressions are memoized
      • In combination, these lead to incremental semantics
  • 30. Lazy evaluation
      val hello (string, file) = {
          val out = exec(image, cpu) (out file) {"
              echo hello >{{out}}
          "}
          ("hello", out)
      }

      // The exec is never executed.
      val Main = {
          val (str, _) = hello
          str
      }
  • 31. Lazy evaluation
      val index = exec(image, cpu, mem) (ref file) {"
          bwa index {{reference}}
      "}

      func align(r1, r2, index file) (out file) =
          if len(r1) > threshold {
              indexedAlign(r1, r2, index)
          } else {
              unindexedAlign(r1, r2)
          }

      // The index is computed only if the length of the
      // reads matches some threshold.
      val Main = align(r1, r2, index)
  • 32. Lazy evaluation
      • An expression is evaluated only if it is needed by a data dependency
      • The user need not think about avoiding unnecessary computation
      • Improves modularity
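
The lazy-evaluation slides can be modeled outside Reflow with explicit thunks. The following is a minimal Python sketch, not Reflow's implementation: the `Thunk` class, the `run_exec` stand-in for an exec, and the `executed` log are all hypothetical names introduced for illustration. It mirrors the `hello` example from slide 30, where the second tuple element is never demanded.

```python
class Thunk:
    """A deferred computation that runs at most once, and only when forced."""

    def __init__(self, fn):
        self._fn = fn
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._fn()
            self._done = True
        return self._value


executed = []  # records every (hypothetical) exec that actually ran


def run_exec():
    # Stand-in for an expensive exec such as `echo hello > out`.
    executed.append("echo hello")
    return "<file: hello>"


# Analogue of: val hello (string, file) = { ...; ("hello", out) }
hello = (Thunk(lambda: "hello"), Thunk(run_exec))


def main():
    text, _ = hello      # the file component is bound but never forced
    return text.force()


result = main()
```

Because only the string component is forced, `executed` stays empty: the work is skipped without the caller having to arrange for that, which is the modularity point made on slide 32.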
  • 33. Dataflow evaluation
      func align(r1, r2 file) file = …
      func stats(files [file]) file = …
      val pairs [(file, file)] = …

      // Here, the runtime parallelizes each invocation of
      // align since there is no data dependency between them.
      val aligned [file] = [align(r1, r2) | (r1, r2) <- pairs]

      // Alignment and stats computation run in parallel
      // since there is no data dependency between them.
      val Main = {
          merged: merge(aligned),
          stats: stats(flatten([[r1, r2] | (r1, r2) <- pairs])),
      }
  • 34. Dataflow evaluation
      • Any computation that can be parallelized is
      • Users needn’t (can’t) express parallelism directly
      • Only data dependencies imply sequencing
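
To make "only data dependencies imply sequencing" concrete, here is a small Python sketch using the standard `concurrent.futures` module. It is an analogy rather than Reflow code: in Reflow the runtime derives this schedule from the program, whereas here the submission order is written by hand. The `align`, `merge`, and `stats` functions and the file names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def align(r1, r2):
    # Placeholder for an alignment exec.
    return f"aligned({r1},{r2})"


def merge(files):
    return "merged[" + ";".join(files) + "]"


def stats(files):
    return f"stats over {len(files)} files"


pairs = [("a_1.fq", "a_2.fq"), ("b_1.fq", "b_2.fq")]

with ThreadPoolExecutor() as pool:
    # No data dependency between alignments: submit them all at once.
    aligned_futures = [pool.submit(align, r1, r2) for r1, r2 in pairs]
    # stats depends only on the raw inputs, so it can run alongside them.
    stats_future = pool.submit(stats, [f for pair in pairs for f in pair])
    # merge consumes the alignments, so it must wait for their results.
    merged = merge([f.result() for f in aligned_futures])
    main = {"merged": merged, "stats": stats_future.result()}
```

The only place the sketch blocks is where a value is actually consumed (`f.result()`), which is the same ordering a dataflow evaluator would impose.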
  • 35. Referential transparency, memoization
      val index = computeIndex(reference)

      func align(r1, r2 file) file =
          exec(..) (out file) {"
              bwa mem … {{index}} … {{r1}} {{r2}} > {{out}}
          "}

      // align(r1, r2) computed only once.
      (align(r1, r2), align(r1, r2))

      // align(r1b, r2b) computed, but index is reused.
      align(r1b, r2b)
  • 36. Automatic memoization
      • Expensive computations are reused, needn’t be considered separately
      • Indices, models, references, and so on can be expressed directly in code, no need to stage computation
      • Means that computation is precise: For example, if the input reference changes, an index may be recomputed
      • Makes evaluation incremental
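
The memoization behavior described above can be sketched as a cache keyed by a digest of the command and its inputs. This Python model is an assumption-laden illustration, not Reflow's actual key derivation: `memo_exec`, `digest`, the in-memory `cache`, and the `runs` log are hypothetical, and plain strings stand in for file contents.

```python
import hashlib

cache = {}  # key digest -> output digest
runs = []   # commands that actually executed


def digest(*parts):
    h = hashlib.sha256()
    for p in parts:
        h.update(p.encode())
        h.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def memo_exec(command, *inputs):
    """Run `command` on `inputs` unless an identical invocation is cached."""
    key = digest(command, *inputs)
    if key not in cache:
        runs.append(command)
        cache[key] = digest("output", key)  # stand-in for the real output
    return cache[key]


def compute_index(reference):
    return memo_exec("bwa index", reference)


def align(reference, r1, r2):
    return memo_exec("bwa mem", compute_index(reference), r1, r2)


first = align("ref-v1", "r1", "r2")
second = align("ref-v1", "r1", "r2")    # identical expression: nothing runs
third = align("ref-v1", "r1b", "r2b")   # new reads: align runs, index reused
runs_before_change = list(runs)
changed = align("ref-v2", "r1", "r2")   # new reference: index and align rerun
```

After the first three calls, `runs_before_change` holds one `bwa index` and two `bwa mem` entries; changing the reference adds one more of each, matching the "computation is precise" bullet.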
  • 37. Incremental evaluation
      // Any change to either code or input data
      // results in the smallest (re-)computation
      // required to "catch up" what was previously
      // computed.
      func extract(sample file) file = …
      func train(samples [file]) file = …
      func evaluate(model file, samples [file]) file = …

      val training [file] = …
      val testing [file] = …
      val model = train([extract(x) | x <- training])
      val results = evaluate(model, testing)
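
The extract/train/evaluate pipeline above can be used to show what "smallest re-computation" means in practice. The sketch below is a hypothetical Python model: each step is keyed by its name, a code version string, and its inputs. The `step` helper, the version strings, and the sample names are assumptions made for illustration, and the code-change case is simulated by bumping `train_version`.

```python
import hashlib
import json

cache = {}
log = []  # names of steps that actually ran in the current invocation


def step(name, version, *inputs):
    """Memoized step keyed by (name, code version, inputs)."""
    key = hashlib.sha256(
        json.dumps([name, version, list(inputs)]).encode()
    ).hexdigest()
    if key not in cache:
        log.append(name)
        cache[key] = key[:12]  # stand-in for the step's output
    return cache[key]


def pipeline(training, testing, train_version="v1"):
    features = [step("extract", "v1", x) for x in training]
    model = step("train", train_version, *features)
    return step("evaluate", "v1", model, *testing)


pipeline(["s1", "s2"], ["t1"])
cold = list(log)

log.clear()
pipeline(["s1", "s2"], ["t1"])
unchanged = list(log)

log.clear()
pipeline(["s1", "s2", "s3"], ["t1"])
new_sample = list(log)

log.clear()
pipeline(["s1", "s2", "s3"], ["t1"], train_version="v2")
new_code = list(log)
```

Adding one training sample reruns a single `extract` plus the downstream `train` and `evaluate`; changing only the training code leaves every `extract` result untouched.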
  • 38. Implementation notes
  • 39. Implementation goals
      • Language semantics and data model accommodate implementation simplicity
      • Maintain minimal API surface to external components
      • Small set of core abstractions around which the system is built
      • … in service of minimizing systems complexity
  • 40. The big picture [Architecture diagram. Labeled components: assoc (e.g., Amazon DynamoDB); repository (cache, e.g., Amazon S3); evaluator (e.g., a laptop); cluster (e.g., Amazon EC2); alloc, alloc, …; dockerd; repository (local, file)]
  • 41. Evaluation
      • Two evaluators:
      • Direct evaluation of AST into a flow graph
      • Flow graph evaluation, separately
      • The flow graph is a dependency graph between executions:
      • Exec nodes encode a run-to-completion execution (intern, Docker, extern)
      • Continuation nodes call into the evaluator to produce a new (dynamic) subgraph
  • 42. Evaluation strategy
      • Top-down, then bottom-up:
      • Compute key for every node for cache lookups
      • Continuation nodes compute keys based on dependencies + partially evaluated AST
      • Derive both static (AST+static params) and dynamic (based on concrete inputs) keys
  • 43. Evaluation example [Flow graph diagram. Nodes: Root, Intern s3://…/dogs/, Continue. Legend: Todo, Done]
      val Dogs = [ resize(img) | (_, img) <- list(dir(dogs)) ]
  • 44. Evaluation example [Flow graph diagram. Nodes: Root, Exec 1, Exec 2, Exec 3, Exec 4, …, Exec N, Continue. Legend: Todo, Done]
      val Dogs = [ resize(img) | (_, img) <- list(dir(dogs)) ]
  • 45. Evaluation example [Flow graph diagram. Nodes: Root, Exec, Continue. Legend: Todo, Done]
      montage(dogs.Dogs)
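
The distinction between exec nodes and continuation nodes can be illustrated with a toy graph evaluator. This Python sketch is a loose model under stated assumptions, not Reflow's evaluator: the `Exec` and `Continue` classes, the recursive `evaluate` function, and the image names are hypothetical, and everything runs sequentially in-process. It mimics the `Dogs` example, where the number of resize steps is unknown until the directory listing has been computed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Exec:
    """A run-to-completion step: a callable over its dependency values."""
    run: Callable[..., object]
    deps: List[object] = field(default_factory=list)


@dataclass
class Continue:
    """Given its dependency values, returns a new subgraph to evaluate."""
    expand: Callable[..., object]
    deps: List[object] = field(default_factory=list)


def evaluate(node):
    values = [evaluate(dep) for dep in node.deps]
    if isinstance(node, Exec):
        return node.run(*values)
    # Continuation: build the dynamic subgraph, then evaluate it.
    return evaluate(node.expand(*values))


# Stand-in for interning a directory: the listing is only known at run time.
listing = Exec(lambda: ["rex.jpg", "fido.jpg"])


def resize_all(images):
    # One exec per image, discovered dynamically from the listing.
    resized = [Exec(lambda img=img: f"small-{img}") for img in images]
    return Exec(lambda *outs: list(outs), deps=resized)


dogs = Continue(resize_all, deps=[listing])
result = evaluate(dogs)
```

The graph initially contains just the listing and one continuation; the per-image exec nodes appear only after the continuation is expanded, which is the progression slides 43 and 44 depict.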
  • 46. Taking it further: dataspaces.
  • 47. Observations
      • With referential transparency, an expression is a stable name for the value of that expression
      • Reflow can derive a key from any expression
      • We have the makings of a data namespace, a dataspace
  • 48. Bundles
      • Reflow’s bundle mechanisms produce a single, self-contained artifact for a module
      • $ reflow bundle module.rf
      • Bundles are modules:
      • They can be run
      • They can be introspected (for example, doc)
      • They can be instantiated by other modules
  • 49. Dataspaces
      A dataspace is a mapping of symbolic names to bundles, for example:
      $ reflow mount wgs.rfx marius@grailbio.com/exp/wgs:v1
      $ reflow doc marius@grailbio.com/exp/wgs:v1
      $ reflow run marius@grailbio.com/exp/wgs:v1
  • 50. Dataspaces are data APIs
      • Makes data less ad-hoc:
      • Typing
      • Modules
      • Documentation
      • Stable names
      • Dataspaces enforce subtyping rules:
      • Can’t unintentionally break users
      • Still fully incremental
  • 51. Status
      • Open source: https://github.com/grailbio/reflow
      • Reflow is in heavy use at GRAIL, CZI, others
      • Bring your credentials, Reflow takes care of the rest!
  • 52. Thank you!
  • 53. Please complete the session survey in the mobile app.