1. Google Cloud Dataflow
the next generation of managed big data service
based on the Apache Beam programming model
Szabolcs Feczak, Cloud Solutions Engineer
Google
9th Cloud & Data Center World 2016 - IDG Korea
2. Goals
1. You leave here understanding the fundamentals of the Apache Beam model and the Google Cloud Dataflow managed service
2. We have some fun.
11. Apache Beam (incubating)
http://incubator.apache.org/projects/beam.html
The Dataflow submission to the Apache Incubator was accepted on February 1, 2016, and the resulting project is now called Apache Beam.
Software Development Kits:
❯ Java: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
❯ Python (ALPHA)
❯ Scala: /darkjh/scalaflow, /jhlch/scala-dataflow-dsl
Runners:
❯ Spark runner @ /cloudera/spark-dataflow
❯ Flink runner @ /dataArtisans/flink-dataflow
12. Where might you use Apache Beam?
ETL: Movement, Filtering, Enrichment, Shaping, Reduction
Analysis: Batch computation, Continuous computation
Orchestration: Composition, External orchestration, Simulation
14. Cloud Dataflow Managed Service advantages (GA since August 2015)
Diagram: User Code & SDK → GCP Managed Service (Work Manager, Job Manager, Deploy & Schedule, Monitoring UI, Progress & Logs)
15. Worker Lifecycle Management
Cloud Dataflow Service: Deploy → Schedule & Monitor → Tear Down
16. Challenge: cost optimization
❯ Time & life never stop
❯ Data rates & schemas are not static
❯ Scaling models are not static
❯ Non-elastic compute is wasteful and can create lag
21. The Apache Beam Logical Model
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
22. What are you computing?
● A Pipeline represents a graph
● Nodes are data processing transformations
● Edges are data sets flowing through the pipeline
● Optimized and executed as a unit for efficiency
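The graph idea can be sketched in plain Python. This is an analogy for the model only, not the Beam SDK: the chain of functions plays the role of the transform nodes, and each intermediate value is an edge (a data set) flowing between them.

```python
# Toy analogy for the Beam model (not the Beam SDK): a pipeline is a
# graph of transforms; here, a simple linear chain of functions whose
# intermediate values are the data sets on the edges.
def run_pipeline(source, *transforms):
    data = source
    for transform in transforms:  # each node consumes one data set, produces another
        data = transform(data)
    return data

lines = ["see spot run", "run spot run"]
result = run_pipeline(
    lines,
    lambda rows: [w for row in rows for w in row.split()],  # element-wise flat map
    lambda words: {w: words.count(w) for w in set(words)},  # aggregation
)
# result: {"see": 1, "spot": 2, "run": 3}
```

A real Beam pipeline is a full DAG rather than a linear chain, and the runner may fuse and reorder nodes before executing them as a unit.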
23. What are you computing? PCollections
● A PCollection is a collection of homogeneous data of the same type
● May be bounded or unbounded in size
● Each element has an implicit timestamp
● Initially created from backing data stores
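The implicit timestamp is what makes event-time windowing possible. A minimal plain-Python sketch of a bounded collection of timestamped elements (illustrative only; `Element` and `window_key` are not Beam names):

```python
from collections import namedtuple

# Stand-in for a bounded PCollection: each element carries an
# event-time timestamp (implicit in Beam, explicit here).
Element = namedtuple("Element", ["value", "timestamp"])

pcollection = [
    Element("click", timestamp=12.0),
    Element("view", timestamp=12.5),
    Element("click", timestamp=14.1),
]

def window_key(elem, size=2.0):
    # Assign each element to a fixed window of `size` time units.
    return int(elem.timestamp // size)

windows = {}
for e in pcollection:
    windows.setdefault(window_key(e), []).append(e.value)
# windows: {6: ["click", "view"], 7: ["click"]}
```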
25. What are you computing? PTransforms
PTransforms transform PCollections into other PCollections.
● Element-wise (Map + Reduce = ParDo)
● Aggregating (Combine, Join, Group)
● Composite
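The two basic families can be sketched in plain Python (again an analogy, not Beam code): an element-wise, ParDo-style transform that may emit any number of outputs per input, and a GroupByKey-style aggregation.

```python
# Element-wise: fn may emit zero, one, or many outputs per input element.
def par_do(elements, fn):
    return [out for e in elements for out in fn(e)]

# Aggregating: collect all values that share a key.
def group_by_key(pairs):
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped

pairs = par_do(["a b", "b c"], lambda line: [(w, 1) for w in line.split()])
grouped = group_by_key(pairs)
# grouped: {"a": [1], "b": [1, 1], "c": [1]}
```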
26. Composite PTransforms
Count = Pair With Ones → GroupByKey → Sum Values
❯ Define new PTransforms by building up subgraphs of existing transforms
❯ Some utilities are included in the Apache Beam SDK: Count, RemoveDuplicates, Join, Min, Max, Sum, ...
❯ You can define your own: DoSomething, DoSomethingElse, etc.
❯ Why bother? Code reuse and a better monitoring experience
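The Count subgraph from the slide can be sketched as one plain-Python function built from the three smaller steps. The helper names mirror the slide, not a real API; in Beam the composite would be a PTransform so the monitoring UI can show it as a single named node.

```python
# Composite "Count" = Pair With Ones -> GroupByKey -> Sum Values
# (plain-Python sketch, not the Beam SDK).
def count(elements):
    paired = [(e, 1) for e in elements]             # Pair With Ones
    grouped = {}
    for key, one in paired:                         # GroupByKey
        grouped.setdefault(key, []).append(one)
    return {k: sum(v) for k, v in grouped.items()}  # Sum Values

counts = count(["cat", "dog", "cat"])
# counts: {"cat": 2, "dog": 1}
```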
31. When in Processing Time?
● Triggers control when results are emitted.
● Triggers are often relative to the watermark.
Diagram: processing time vs. event time; the watermark trails the ideal line, and the gap between them is the skew.
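The watermark bookkeeping behind the diagram can be sketched in a few lines (illustrative only; real watermarks are estimated by the runner, not set by hand):

```python
# The watermark is the system's estimate of event-time progress: input
# with timestamps before it is believed to be complete. Skew is how far
# the watermark lags behind processing time.
def is_late(element_event_time, watermark):
    # Elements behind the watermark arrive "late" and may fire a late trigger.
    return element_event_time < watermark

processing_time = 100.0
watermark = 95.0                       # event time believed complete
skew = processing_time - watermark     # 5.0 units of watermark lag

late = is_late(element_event_time=93.0, watermark=watermark)  # True
```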
33. Example: Triggering for Speculative & Late Data
34. How do Refinements Relate?
● How should multiple outputs per window accumulate?
● Appropriate choice depends on the consumer.

Firing         Elements  Discarding  Accumulating  Acc. & Retracting
Speculative    3         3           3             3
Watermark      5, 1      6           9             9, -3
Late           2         2           11            11, -9
Total Observed 11        11          23            11
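The table's arithmetic can be reproduced in plain Python: the same three firings run through each accumulation mode (a sketch of the semantics, not Beam code):

```python
# Three firings for one window: speculative [3], watermark [5, 1], late [2].
def refinement_panes(firings, mode):
    # mode: "discarding", "accumulating", or "retracting"
    panes, running, prev = [], 0, None
    for firing in firings:
        if mode == "discarding":
            panes.append(sum(firing))   # each pane stands alone
            continue
        running += sum(firing)          # later panes build on earlier ones
        if mode == "accumulating":
            panes.append(running)
        else:                           # accumulating & retracting
            panes.append([running] + ([-prev] if prev is not None else []))
            prev = running
    return panes

firings = [[3], [5, 1], [2]]
discarding = refinement_panes(firings, "discarding")      # [3, 6, 2]  -> total 11
accumulating = refinement_panes(firings, "accumulating")  # [3, 9, 11] -> total 23
retracting = refinement_panes(firings, "retracting")      # [[3], [9, -3], [11, -9]]
```

A consumer that sums everything it sees over-counts under plain accumulating (23), which is why retractions exist: the retracted values cancel the earlier panes and the net total comes back to 11.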
36. Customizing What Where When How
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
38. Optimizing Your Time To Answer
Typical data processing: Programming, Resource provisioning, Performance tuning, Monitoring, Reliability, Deployment & configuration, Handling growing scale, Utilization improvements
Data processing with Cloud Dataflow: Programming, plus more time to dig into your data
39. How much more time?
You do not just save on processing, but on code complexity and size as well!
Source: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
40. What do customers have to say about Google Cloud Dataflow?
"We are utilizing Cloud Dataflow to overcome elasticity
challenges with our current Hadoop cluster. Starting with
some basic ETL workflow for BigQuery ingestion, we
transitioned into full blown clickstream processing and
analysis. This has helped us significantly improve
performance of our overall system and reduce cost."
Sudhir Hasbe, Director of Software Engineering, Zulily.com
“The current iteration of Qubit’s real-time data supply chain
was heavily inspired by the ground-breaking stream
processing concepts described in Google’s MillWheel paper.
Today we are happy to come full circle and build streaming
pipelines on top of Cloud Dataflow - which has delivered
on the promise of a highly-available and fault-tolerant
data processing system with an incredibly powerful and
expressive API.”
Jibran Saithi, Lead Architect, Qubit
"We are very excited about the productivity benefits offered by
Cloud Dataflow and Cloud Pub/Sub. It took half a day to
rewrite something that had previously taken over six
months to build using Spark"
Paul Clarke, Director of Technology, Ocado
“Boosting performance isn’t the only thing we want to get
from the new system. Our bet is that by using cloud-managed
products we will have a much lower operational overhead.
That in turn means we will have much more time to make
Spotify’s products better.”
Igor Maravić, Software Engineer working at Spotify
42. Let’s build something - Demo!
1. Ingest the stream of Wikipedia edits (https://wikitech.wikimedia.org/wiki/Stream.wikimedia.org), create a pipeline and run a Dataflow job to extract the top 10 active editors and top 10 pages edited, then inspect the result set in our data warehouse (BigQuery)
2. Extract words from a Shakespeare corpus, count the occurrences of each word, and write sharded results as blobs into a key-value store (Cloud Storage)
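The second demo pipeline can be sketched in plain Python (the function names and the sharding scheme are illustrative, not the demo's actual code, and real Dataflow writes the shards to Cloud Storage):

```python
import re

# Extract words and count occurrences, word-count style.
def word_count(corpus):
    words = re.findall(r"[a-z']+", corpus.lower())
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

# Deterministic sharding by key, a stand-in for the sharded blobs the
# demo writes to Cloud Storage (a stable byte sum avoids Python's
# per-process str hash randomization).
def shard(counts, num_shards=3):
    shards = [{} for _ in range(num_shards)]
    for word, n in counts.items():
        shards[sum(word.encode()) % num_shards][word] = n
    return shards

counts = word_count("To be, or not to be: that is the question")
# counts["to"] == 2, counts["be"] == 2, 8 distinct words in total
shards = shard(counts)
```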