3. Agenda
1. Big Data the Cloud Way - Why would you?
2. Fully Managed: NoOps Ingest, Process & Analyse
3. Hands-On Demo: Building an Event Streaming Pipeline
Google confidential │ Do not distribute
5. 20-?? billion devices will be connected by 2020
$4-11 trillion economic impact
54% of top-performing companies will invest more in sensors this year
Sources: Gartner, PwC, McKinsey
12. A datacenter is not a collection of computers,
a datacenter is a computer.
The same is happening in the Cloud today
13. State-of-the-art data centers.
For the past 17 years, Google has been building out the
fastest, most powerful, highest-quality cloud
infrastructure on the planet.
14. Google’s Big Data innovations go far back
[Timeline figure, 2002 to 2014. Technologies shown: GFS, MapReduce, Bigtable, Dremel, Colossus, FlumeJava, Spanner, MillWheel, BigQuery, Dataflow]
25. Google BigQuery makes complex data analysis simple
• Scales automatically
• No setup or administration
• Stream up to 100,000 rows per second
• Easily integrates with third-party software
Building what’s next
26. BigQuery Use @Google: DoubleClick Support
Question: Find the root cause of why an ad was or was not delivered in the last 30 days.
SELECT date, rejection_reason, COUNT(*)
FROM line_item_table.last30days
WHERE line_item_id = 56781234
GROUP BY date, rejection_reason
1.2B rows scanned
Result in ~5 seconds!
27. BigQuery scales “Google scale”
Streaming ingest at peak: 2.3 million rows per second
Largest data lake on BigQuery: 38 petabytes
Largest query by data size: 2.1 petabytes
Largest query by rows: 10.5 trillion rows
29. Google Cloud Dataflow makes complex data analysis simple
• Merges batch and stream processing
• Data processing pipelines
• Monitoring interface
• Significantly lower cost
• Runs on Google or Cloudera Spark (GitHub)
30. What is Cloud Dataflow?
Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines.
Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.
31. Cloud Pub/Sub
• Globally redundant
• Low latency (sub-second)
• Batched read/write
• Custom labels
• Push & Pull
• Auto expiration
[Diagram: Publishers A, B and C publish Messages 1, 2 and 3 to Topics A, B and C in Cloud Pub/Sub. Subscriptions XA and XB deliver Messages 1 and 2 to Subscriber X; Subscriptions YC and ZC each deliver Message 3, to Subscriber Y and Subscriber Z respectively.]
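The fan-out on this slide (a topic can have several subscriptions, and each subscription receives its own copy of every message) can be illustrated with a small in-memory model. This is only a sketch: the class FanOutSketch and its methods are hypothetical names made up for illustration and are not the Cloud Pub/Sub client API. It assumes pull-style delivery and ignores acknowledgements, expiration and push.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of topic/subscription fan-out (hypothetical, not the Pub/Sub API).
class FanOutSketch {
    // topic name -> names of the subscriptions attached to it
    private final Map<String, List<String>> subscriptionsByTopic = new LinkedHashMap<>();
    // subscription name -> messages waiting to be pulled
    private final Map<String, List<String>> backlog = new LinkedHashMap<>();

    // Attach a named subscription to a topic.
    void subscribe(String topic, String subscription) {
        subscriptionsByTopic.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscription);
        backlog.putIfAbsent(subscription, new ArrayList<>());
    }

    // Publishing copies the message into every subscription attached to the topic.
    void publish(String topic, String message) {
        List<String> subs = subscriptionsByTopic.getOrDefault(topic, Collections.<String>emptyList());
        for (String sub : subs) {
            backlog.get(sub).add(message);
        }
    }

    // Pull returns the pending messages of one subscription and empties its backlog.
    List<String> pull(String subscription) {
        List<String> pending = backlog.getOrDefault(subscription, new ArrayList<String>());
        List<String> delivered = new ArrayList<>(pending);
        pending.clear();
        return delivered;
    }

    public static void main(String[] args) {
        FanOutSketch bus = new FanOutSketch();
        bus.subscribe("Topic C", "YC"); // Subscriber Y
        bus.subscribe("Topic C", "ZC"); // Subscriber Z
        bus.publish("Topic C", "Message 3");
        System.out.println(bus.pull("YC")); // each subscription gets its own copy
        System.out.println(bus.pull("ZC"));
    }
}
```

Keeping one backlog per subscription, rather than per topic, is what lets Subscribers Y and Z consume Message 3 independently of each other.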
32. Dataflow goodies
• Autoscaling mid-job
• Fully managed - No-Ops
• Intuitive Data Processing Framework
• Batch and Stream Processing in one
• Liquid sharding mid-job
Pipeline p = Pipeline.create();
p.begin()
  // Read lines of text from Cloud Storage
  .apply(TextIO.Read.from("gs://…"))
  // ExtractTags and ExpandPrefixes are user-defined DoFns
  .apply(ParDo.of(new ExtractTags()))
  .apply(Count.create())
  .apply(ParDo.of(new ExpandPrefixes()))
  // Keep the three largest values for each key
  .apply(Top.largestPerKey(3))
  .apply(TextIO.Write.to("gs://…"));
p.run();
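To make the data flow of this pipeline concrete, here is a single-machine sketch of the same four steps using only the Java standard library, not the Dataflow SDK. The slide does not show what ExtractTags and ExpandPrefixes do, so their behavior here is an assumption based on their names: tags are taken to be "#"-prefixed tokens, and each tag is fanned out to every one of its prefixes. TopPrefixSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Single-machine sketch of: extract tags -> count -> expand prefixes -> top n per prefix.
class TopPrefixSketch {
    // Assumed behavior of ExtractTags: collect "#tag" tokens from a line, lowercased.
    static List<String> extractTags(String line) {
        List<String> tags = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (token.startsWith("#") && token.length() > 1) {
                tags.add(token.substring(1).toLowerCase());
            }
        }
        return tags;
    }

    static Map<String, List<String>> topPerPrefix(List<String> lines, int n) {
        // Count occurrences of each tag.
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String tag : extractTags(line)) {
                counts.merge(tag, 1L, Long::sum);
            }
        }
        // Assumed behavior of ExpandPrefixes: key every tag by each of its prefixes.
        Map<String, List<String>> tagsByPrefix = new TreeMap<>();
        for (String tag : counts.keySet()) {
            for (int i = 1; i <= tag.length(); i++) {
                tagsByPrefix.computeIfAbsent(tag.substring(0, i), k -> new ArrayList<>()).add(tag);
            }
        }
        // Keep the n most frequent tags for each prefix.
        Map<String, List<String>> top = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : tagsByPrefix.entrySet()) {
            List<String> best = e.getValue().stream()
                .sorted((a, b) -> Long.compare(counts.get(b), counts.get(a)))
                .limit(n)
                .collect(Collectors.toList());
            top.put(e.getKey(), best);
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("#go #golang #java");
        lines.add("#go #java");
        lines.add("#go");
        System.out.println(topPerPrefix(lines, 3).get("g"));
    }
}
```

In the managed service each of these loops would be a parallel step over a distributed collection; the sketch only shows what each step consumes and produces.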
33. Dataflow goodies
Deploy
Schedule & Monitor
34. Dataflow goodies
[Figure labels: 800 RPS, 1200 RPS, 5000 RPS, 50 RPS]
36. Dataflow goodies
Batch version:
Pipeline p = Pipeline.create();
p.begin()
  .apply(TextIO.Read.from("gs://…"))
  .apply(ParDo.of(new ExtractTags()))
  .apply(Count.create())
  .apply(ParDo.of(new ExpandPrefixes()))
  .apply(Top.largestPerKey(3))
  .apply(TextIO.Write.to("gs://…"));
p.run();
Streaming version: swap the source and sink for Pub/Sub and add a window:
  .apply(PubsubIO.Read.from("input_topic"))
  .apply(Window.<Integer>by(FixedWindows.of(5, MINUTES)))
  .apply(PubsubIO.Write.to("output_topic"));
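The FixedWindows.of(5, MINUTES) step groups an unbounded stream into consecutive five-minute windows so that aggregations such as counts can be emitted per window. A minimal sketch of that assignment rule follows, using only the Java standard library and not the Dataflow SDK. FixedWindowSketch is a hypothetical name, and the sketch assumes timestamps in epoch milliseconds and windows aligned to time zero.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of fixed (tumbling) window assignment; hypothetical, not the Dataflow SDK.
class FixedWindowSketch {
    static final long WINDOW_MILLIS = 5 * 60 * 1000L; // five minutes

    // Start of the window containing the timestamp; floorMod keeps negative timestamps correct.
    static long windowStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, WINDOW_MILLIS);
    }

    // Count events per window, keyed by window start.
    static Map<Long, Integer> countPerWindow(long[] timestampsMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMillis) {
            counts.merge(windowStart(ts), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] events = {0L, 299_999L, 300_000L, 725_000L};
        System.out.println(countPerWindow(events)); // prints {0=2, 300000=1, 600000=1}
    }
}
```

Because every element maps to exactly one window, the same counting logic works whether the input is a bounded file or an unbounded topic; only the source and the point at which results are emitted differ.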
37. Dataflow goodies
[Figure labels: Nighttime, Mid-Day, Nighttime]