Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member
Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.
15. The Beam Vision (for users)
Sum Per Key

Java:
input.apply(Sum.integersPerKey())

Python:
input | Sum.PerKey()

Runs on: Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating), ...
16. What your (Java) Code Looks Like
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://..."));
p.run();
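The word-count pipeline above can be traced step by step in plain Python. This is a sketch of the data flow only, not the Beam API: the GCS read and write are replaced by an in-memory list and a print, and the example lines are made up for illustration.

```python
import re
from collections import Counter

# Stand-in for TextIO.Read: a couple of hypothetical input lines.
lines = ["To be, or not to be", "that is the question"]

# FlatMapElements: split each line into words on non-letter characters.
words = [w for line in lines for w in re.split(r"[^a-zA-Z']+", line)]

# Filter: drop the empty strings the split can produce.
words = [w for w in words if w]

# Count.perElement: count occurrences of each distinct word.
counts = Counter(words)

# MapElements: format each (word, count) pair as "word: n".
formatted = [f"{word}: {n}" for word, n in sorted(counts.items())]

# Stand-in for TextIO.Write.
print(formatted)
```

Each list comprehension mirrors one `.apply(...)` stage of the Java pipeline.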
17. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
18. The Beam Model: Asking the Right Questions
What are you computing? (Aggregations, transformations, ...)
Where in event time?
When in processing time are results produced?
How do refinements relate?
20. The Beam Model: What are you computing?
Sum Per Key

Java:
input.apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...));

Python:
input | Sum.PerKey()
      | Write(BigQuerySink(...))

http://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
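The semantics of `Sum.integersPerKey()` / `Sum.PerKey()` can be sketched in plain Python (this is a model of the behavior, not the Beam API): group (key, value) pairs by key and sum the values.

```python
from collections import defaultdict

def sum_per_key(pairs):
    """Sum the values of (key, value) pairs, grouped by key."""
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)

pairs = [("a", 3), ("b", 4), ("a", 7)]
print(sum_per_key(pairs))  # {'a': 10, 'b': 4}
```

In Beam the same grouping and combining happens in parallel across workers, but the per-key result is the same.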
21. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time? (Event-time windowing)
When in processing time are results produced?
How do refinements relate?
30. Event Time Windows (implementing processing time windows)
(Figure: elements plotted by processing time vs. event time)
Just throw away your data's timestamps and replace them with "now()".
31. The Beam Model: Where in Event Time?
Window Into, then Sum Per Key

Java:
input.apply(
        Window.into(
            FixedWindows.of(
                Duration.standardHours(1))))
     .apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...));

Python:
input | WindowInto(FixedWindows(3600))
      | Sum.PerKey()
      | Write(BigQuerySink(...))
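Fixed event-time windowing before the per-key sum can be sketched in plain Python (a model, not the Beam API): each element is assigned to the one-hour window containing its event timestamp, and sums are kept per (key, window). The timestamps below are hypothetical Unix seconds.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # fixed one-hour windows, as in FixedWindows(3600)

def sum_per_key_and_window(records):
    """records: (key, value, event_timestamp_seconds) triples."""
    sums = defaultdict(int)
    for key, value, ts in records:
        # Assign to the fixed window containing this event timestamp.
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        sums[(key, window_start)] += value
    return dict(sums)

records = [
    ("a", 3, 100),    # window starting at t=0
    ("a", 7, 3599),   # same window
    ("a", 4, 3600),   # next window, starting at t=3600
]
print(sum_per_key_and_window(records))
# {('a', 0): 10, ('a', 3600): 4}
```

Note the window is chosen by the element's event timestamp, not by when the element happens to arrive.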
33. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced? (Watermarks & triggers)
How do refinements relate?
42. The Beam Model: When in Processing Time?
Window Into, then Sum Per Key; trigger after end of window

Java:
input
    .apply(Window.into(FixedWindows.of(...))
        .triggering(AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...));

Python:
input | WindowInto(FixedWindows(3600),
                   trigger=AfterWatermark())
      | Sum.PerKey()
      | Write(BigQuerySink(...))
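The effect of AfterWatermark.pastEndOfWindow() can be pictured with a toy simulation in plain Python (not the Beam API; a perfect watermark is assumed, and all data is assumed to arrive before the watermark advances): each window's sum is emitted exactly once, when the watermark first passes the end of that window.

```python
from collections import defaultdict

WINDOW = 3600  # one-hour fixed windows

def run(events, watermark_advances):
    """events: (key, value, event_ts) triples.
    watermark_advances: successive watermark positions (seconds)."""
    # Sum per (key, window_start), as in the pipeline above.
    sums = defaultdict(int)
    for key, value, ts in events:
        sums[(key, ts // WINDOW * WINDOW)] += value

    # Emit a window's result once the watermark passes the window's end.
    emitted, done = [], set()
    for wm in watermark_advances:
        for (key, start), total in sorted(sums.items()):
            if start + WINDOW <= wm and (key, start) not in done:
                emitted.append((key, start, total))
                done.add((key, start))
    return emitted

events = [("a", 3, 100), ("a", 7, 200), ("a", 4, 4000)]
print(run(events, watermark_advances=[1800, 3600, 7200]))
# [('a', 0, 10), ('a', 3600, 4)]
```

At watermark 1800 nothing fires; at 3600 the first window's sum is released; at 7200 the second window's.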
53. Build a finely tuned trigger for your use case
AfterWatermark.pastEndOfWindow()
    .withEarlyFirings(
        AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
    .withLateFirings(AfterPane.elementCountAtLeast(1))

Bill at end of month (the on-time firing at the end of the window)
Near real-time estimates (early firings)
Immediate corrections (late firings)
60. Trigger Catalogue

Basic Triggers:
AfterEndOfWindow()
AfterCount(n)
AfterProcessingTimeDelay(Δ)

Composite Triggers:
AfterEndOfWindow()
    .withEarlyFirings(A)
    .withLateFirings(B)
AfterAny(A, B)
AfterAll(A, B)
Repeat(A)
Sequence(A, B)
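One way to think about composing triggers from this catalogue: a trigger is a predicate over the state of the current pane, and AfterAny/AfterAll combine sub-triggers with or/and. A toy model in plain Python (not the Beam API; the pane-state fields are invented for illustration):

```python
def after_count(n):
    # Fires once the pane has seen at least n elements.
    return lambda pane: pane["element_count"] >= n

def after_end_of_window():
    # Fires once the watermark has passed the end of the window.
    return lambda pane: pane["watermark_past_end"]

def after_any(*triggers):
    # Fires when any sub-trigger would fire.
    return lambda pane: any(t(pane) for t in triggers)

def after_all(*triggers):
    # Fires only when every sub-trigger would fire.
    return lambda pane: all(t(pane) for t in triggers)

pane = {"element_count": 3, "watermark_past_end": False}
trigger = after_any(after_count(5), after_end_of_window())
print(trigger(pane))  # False: neither condition holds yet

pane["watermark_past_end"] = True
print(trigger(pane))  # True: the watermark condition now holds
```

Real Beam triggers also carry state across firings (e.g. Repeat, Sequence), which this stateless sketch leaves out.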
61. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate? (Accumulation mode)
62. The Beam Model: How do refinements relate?

input
    .apply(Window.into(...).triggering(...).discardingFiredPanes())
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...))
vs. .accumulatingFiredPanes()

Example: a window's trigger fires after 3 and 7 arrive, then fires again after 1 and 4 arrive.
Discarding mode emits 10, then 5; accumulating mode emits 10, then 15.
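The two accumulation modes can be sketched in plain Python (a model, not the Beam API). The pane contents here, [3, 7] then [1, 4], reproduce the firings shown on this slide: 10 then 5 in discarding mode vs. 10 then 15 in accumulating mode.

```python
def fire_panes(panes, accumulating):
    """Emit one sum per trigger firing; panes is a list of the values
    that arrived between firings."""
    outputs, acc = [], 0
    for pane in panes:
        acc += sum(pane)
        outputs.append(acc)
        if not accumulating:
            acc = 0  # discarding mode forgets prior panes after each firing
    return outputs

panes = [[3, 7], [1, 4]]
print(fire_panes(panes, accumulating=False))  # [10, 5]
print(fire_panes(panes, accumulating=True))   # [10, 15]
```

Discarding panes suit downstream consumers that add deltas; accumulating panes suit consumers that overwrite the previous result.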
63. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
65. The Beam Vision
1. End users: who want to write pipelines in a language that's familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to run Beam pipelines.

SDKs: Beam Java, Beam Python, other languages
Beam Fn API: invoke user-definable functions
Beam Runner API: build and submit a pipeline
Runners (execution): Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating)
66. Project Setup (vision meets code)
GoogleCloudPlatform/DataflowJavaSDK, cloudera/spark-dataflow, and dataArtisans/flink-dataflow now share one repository: apache/incubator-beam

Runners: Direct (on your laptop), Google Cloud Dataflow, Flink, Spark
  In pull request: Apex, Gearpump
SDKs, Examples, Integration tests
I/O Connectors: HDFS, Kafka, BigQuery, Google Cloud Storage, Pubsub, Bigtable, Datastore
  In pull request: JMS, Cassandra
  Proposed: Sqoop, Parquet, JDBC, SocketStream, ...
67. Committers from Google, Data Artisans, Cloudera, Talend, Paypal
● ~40 commits/week
● Rigorous code review for every commit
Contributors [with GitHub badges] from:
Spotify, Intel, Twitter, Capital One, DataTorrent, …, <your name here>
● Improvements to existing I/O connectors
● Improvements to Spark runner
● Utility classes for users
● Documentation fixes
● Bug diagnoses
● New I/O connectors
● Gearpump runner PoC
● Apex runner PoC!
… and it has been awesome
apache/incubator-beam
68. Java SDK: Transition from Dataflow
(Timeline, Feb 2016 to late 2016: Dataflow Java 1.x, then Apache Beam Java 0.x (we are here), then Apache Beam Java 2.x; legend marks bug fixes, features, and breaking changes)
70. Why Apache Beam?
Unified - One model handles batch and streaming use cases.
Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.
Extensible - Supports user- and community-driven SDKs, runners, transformation libraries, and I/O connectors.
71. Why Apache Beam?
"We firmly believe that the Beam model is the correct programming model for streaming and batch data processing."
- Kostas Tzoumas (Data Artisans)
http://data-artisans.com/why-apache-beam/

"We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in."
- Tyler Akidau (Google)
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
72. Creating an Apache Beam Community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors.
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem.
We love contributions. Join us!
73. Learn More!
Apache Beam: http://beam.incubator.apache.org/
Why Apache Beam? (from Data Artisans)
Why Apache Beam? (from Google)
Programming model overviews: Streaming 101, Streaming 102, The Dataflow Beam Model
Join the community!
User discussions: user-subscribe@beam.incubator.apache.org
Development discussions: dev-subscribe@beam.incubator.apache.org
Follow @ApacheBeam on Twitter