Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member
Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.
15. The Beam Vision (for users)
Sum Per Key

Java:
input.apply(Sum.integersPerKey())

Python:
input | Sum.PerKey()

Runs on: Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating), ...
16. What your (Java) Code Looks Like
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://..."));
p.run();
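The word-count pipeline above can be traced step by step in plain Python. This is a sketch of the data flow only, not the Beam API: the GCS read and write are replaced by an in-memory list and a print, and the example lines are made up for illustration.

```python
import re
from collections import Counter

# Stand-in for TextIO.Read: a couple of hypothetical input lines.
lines = ["To be, or not to be", "that is the question"]

# FlatMapElements: split each line into words on non-letter characters.
words = [w for line in lines for w in re.split(r"[^a-zA-Z']+", line)]

# Filter: drop the empty strings the split can produce.
words = [w for w in words if w]

# Count.perElement: count occurrences of each distinct word.
counts = Counter(words)

# MapElements: format each (word, count) pair as "word: n".
formatted = [f"{word}: {n}" for word, n in sorted(counts.items())]

# Stand-in for TextIO.Write.
print(formatted)
```

Each list comprehension mirrors one `.apply(...)` stage of the Java pipeline.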
17. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
18. The Beam Model: Asking the Right Questions
What are you computing? (Aggregations, transformations, ...)
Where in event time?
When in processing time are results produced?
How do refinements relate?
20. The Beam Model: What are you computing?
Sum Per Key

Java:
input.apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...));

Python:
input | Sum.PerKey()
      | Write(BigQuerySink(...))

http://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
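The semantics of `Sum.integersPerKey()` / `Sum.PerKey()` can be sketched in plain Python (this is a model of the behavior, not the Beam API): group (key, value) pairs by key and sum the values.

```python
from collections import defaultdict

def sum_per_key(pairs):
    """Sum the values of (key, value) pairs, grouped by key."""
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)

pairs = [("a", 3), ("b", 4), ("a", 7)]
print(sum_per_key(pairs))  # {'a': 10, 'b': 4}
```

In Beam the same grouping and combining happens in parallel across workers, but the per-key result is the same.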
21. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time? (Event-time windowing)
When in processing time are results produced?
How do refinements relate?
30. Event Time Windows (implementing processing time windows)
(Figure: elements plotted by processing time vs. event time)
Just throw away your data's timestamps and replace them with "now()".
31. The Beam Model: Where in Event Time?
Window Into, then Sum Per Key

Java:
input.apply(
        Window.into(
            FixedWindows.of(
                Duration.standardHours(1))))
     .apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...));

Python:
input | WindowInto(FixedWindows(3600))
      | Sum.PerKey()
      | Write(BigQuerySink(...))
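Fixed event-time windowing before the per-key sum can be sketched in plain Python (a model, not the Beam API): each element is assigned to the one-hour window containing its event timestamp, and sums are kept per (key, window). The timestamps below are hypothetical Unix seconds.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # fixed one-hour windows, as in FixedWindows(3600)

def sum_per_key_and_window(records):
    """records: (key, value, event_timestamp_seconds) triples."""
    sums = defaultdict(int)
    for key, value, ts in records:
        # Assign to the fixed window containing this event timestamp.
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        sums[(key, window_start)] += value
    return dict(sums)

records = [
    ("a", 3, 100),    # window starting at t=0
    ("a", 7, 3599),   # same window
    ("a", 4, 3600),   # next window, starting at t=3600
]
print(sum_per_key_and_window(records))
# {('a', 0): 10, ('a', 3600): 4}
```

Note the window is chosen by the element's event timestamp, not by when the element happens to arrive.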
33. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced? (Watermarks & triggers)
How do refinements relate?
42. The Beam Model: When in Processing Time?
Window Into, then Sum Per Key; trigger after end of window

Java:
input
    .apply(Window.into(FixedWindows.of(...))
        .triggering(AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...));

Python:
input | WindowInto(FixedWindows(3600),
                   trigger=AfterWatermark())
      | Sum.PerKey()
      | Write(BigQuerySink(...))
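The effect of AfterWatermark.pastEndOfWindow() can be pictured with a toy simulation in plain Python (not the Beam API; a perfect watermark is assumed, and all data is assumed to arrive before the watermark advances): each window's sum is emitted exactly once, when the watermark first passes the end of that window.

```python
from collections import defaultdict

WINDOW = 3600  # one-hour fixed windows

def run(events, watermark_advances):
    """events: (key, value, event_ts) triples.
    watermark_advances: successive watermark positions (seconds)."""
    # Sum per (key, window_start), as in the pipeline above.
    sums = defaultdict(int)
    for key, value, ts in events:
        sums[(key, ts // WINDOW * WINDOW)] += value

    # Emit a window's result once the watermark passes the window's end.
    emitted, done = [], set()
    for wm in watermark_advances:
        for (key, start), total in sorted(sums.items()):
            if start + WINDOW <= wm and (key, start) not in done:
                emitted.append((key, start, total))
                done.add((key, start))
    return emitted

events = [("a", 3, 100), ("a", 7, 200), ("a", 4, 4000)]
print(run(events, watermark_advances=[1800, 3600, 7200]))
# [('a', 0, 10), ('a', 3600, 4)]
```

At watermark 1800 nothing fires; at 3600 the first window's sum is released; at 7200 the second window's.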
53. Build a finely tuned trigger for your use case
AfterWatermark.pastEndOfWindow()
    .withEarlyFirings(
        AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
    .withLateFirings(AfterPane.elementCountAtLeast(1))

Bill at end of month (the on-time firing at the end of the window)
Near real-time estimates (early firings)
Immediate corrections (late firings)
60. Trigger Catalogue

Basic Triggers:
AfterEndOfWindow()
AfterCount(n)
AfterProcessingTimeDelay(Δ)

Composite Triggers:
AfterEndOfWindow()
    .withEarlyFirings(A)
    .withLateFirings(B)
AfterAny(A, B)
AfterAll(A, B)
Repeat(A)
Sequence(A, B)
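One way to think about composing triggers from this catalogue: a trigger is a predicate over the state of the current pane, and AfterAny/AfterAll combine sub-triggers with or/and. A toy model in plain Python (not the Beam API; the pane-state fields are invented for illustration):

```python
def after_count(n):
    # Fires once the pane has seen at least n elements.
    return lambda pane: pane["element_count"] >= n

def after_end_of_window():
    # Fires once the watermark has passed the end of the window.
    return lambda pane: pane["watermark_past_end"]

def after_any(*triggers):
    # Fires when any sub-trigger would fire.
    return lambda pane: any(t(pane) for t in triggers)

def after_all(*triggers):
    # Fires only when every sub-trigger would fire.
    return lambda pane: all(t(pane) for t in triggers)

pane = {"element_count": 3, "watermark_past_end": False}
trigger = after_any(after_count(5), after_end_of_window())
print(trigger(pane))  # False: neither condition holds yet

pane["watermark_past_end"] = True
print(trigger(pane))  # True: the watermark condition now holds
```

Real Beam triggers also carry state across firings (e.g. Repeat, Sequence), which this stateless sketch leaves out.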
61. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate? (Accumulation mode)
62. The Beam Model: How do refinements relate?

input
    .apply(Window.into(...).triggering(...).discardingFiredPanes())
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...))
vs. .accumulatingFiredPanes()

Example: a window's trigger fires after 3 and 7 arrive, then fires again after 1 and 4 arrive.
Discarding mode emits 10, then 5; accumulating mode emits 10, then 15.
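The two accumulation modes can be sketched in plain Python (a model, not the Beam API). The pane contents here, [3, 7] then [1, 4], reproduce the firings shown on this slide: 10 then 5 in discarding mode vs. 10 then 15 in accumulating mode.

```python
def fire_panes(panes, accumulating):
    """Emit one sum per trigger firing; panes is a list of the values
    that arrived between firings."""
    outputs, acc = [], 0
    for pane in panes:
        acc += sum(pane)
        outputs.append(acc)
        if not accumulating:
            acc = 0  # discarding mode forgets prior panes after each firing
    return outputs

panes = [[3, 7], [1, 4]]
print(fire_panes(panes, accumulating=False))  # [10, 5]
print(fire_panes(panes, accumulating=True))   # [10, 15]
```

Discarding panes suit downstream consumers that add deltas; accumulating panes suit consumers that overwrite the previous result.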
63. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
65. The Beam Vision
1. End users: who want to write pipelines in a language that's familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to run Beam pipelines.

SDKs: Beam Java, Beam Python, other languages
Beam Fn API: invoke user-definable functions
Beam Runner API: build and submit a pipeline
Runners (execution): Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating)
66. Project Setup (vision meets code)
GoogleCloudPlatform/DataflowJavaSDK, cloudera/spark-dataflow, and dataArtisans/flink-dataflow now share one repository: apache/incubator-beam

Runners: Direct (on your laptop), Google Cloud Dataflow, Flink, Spark
  In pull request: Apex, Gearpump
SDKs, Examples, Integration tests
I/O Connectors: HDFS, Kafka, BigQuery, Google Cloud Storage, Pubsub, Bigtable, Datastore
  In pull request: JMS, Cassandra
  Proposed: Sqoop, Parquet, JDBC, SocketStream, ...
67. Committers from Google, Data Artisans, Cloudera, Talend, Paypal
● ~40 commits/week
● Rigorous code review for every commit
Contributors [with GitHub badges] from:
Spotify, Intel, Twitter, Capital One, DataTorrent, …, <your name here>
● Improvements to existing I/O connectors
● Improvements to Spark runner
● Utility classes for users
● Documentation fixes
● Bug diagnoses
● New I/O connectors
● Gearpump runner PoC
● Apex runner PoC!
… and it has been awesome
apache/incubator-beam
68. Java SDK: Transition from Dataflow
(Timeline, Feb 2016 to late 2016: Dataflow Java 1.x, then Apache Beam Java 0.x (we are here), then Apache Beam Java 2.x; legend marks bug fixes, features, and breaking changes)
70. Why Apache Beam?
Unified - One model handles batch and streaming use cases.
Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.
Extensible - Supports user- and community-driven SDKs, runners, transformation libraries, and I/O connectors.
71. Why Apache Beam?
"We firmly believe that the Beam model is the correct programming model for streaming and batch data processing."
- Kostas Tzoumas (Data Artisans)
http://data-artisans.com/why-apache-beam/

"We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in."
- Tyler Akidau (Google)
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
72. Creating an Apache Beam Community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors.
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem.
We love contributions. Join us!
73. Learn More!
Apache Beam: http://beam.incubator.apache.org/
Why Apache Beam? (from Data Artisans)
Why Apache Beam? (from Google)
Programming model overviews: Streaming 101, Streaming 102, The Dataflow Beam Model
Join the community!
User discussions: user-subscribe@beam.incubator.apache.org
Development discussions: dev-subscribe@beam.incubator.apache.org
Follow @ApacheBeam on Twitter