The document provides information about Simon Su and his expertise in Google Dataflow. It includes Simon's contact information and links to his online profiles, then covers his areas of specialization, including data science, data engineering, and frontend engineering. The document proceeds to describe how to prepare for a Google Dataflow workshop, including documents and labs to review. It also discusses Google Cloud services for data processing and analysis such as Dataflow, BigQuery, Pub/Sub, and Dataproc. Finally, it outlines the workshop agenda, which includes hands-on labs to deploy a first Dataflow project and build a streaming Dataflow pipeline.
8. Google Focused Cloud
● 1st Wave – Colocation: your kit, someone else's building. Yours to manage.
● 2nd Wave – Virtualized Data Centers: standard virtual kit for rent. Still yours to manage; assembly required.
● 3rd Wave (Next) – True on-demand cloud: an actual, global elastic cloud built on containers and clusters of storage, processing, memory, and network resources, with distributed storage, processing & machine learning on top. Invest your energy in great apps.
9. Google Cloud Family
● Foundation – Infrastructure & Operations: The Gear that Powers Google
● Data Services: Breakthrough Insights, Breakthrough Applications
● Application Runtime Services: Enabling No-Touch Operations
10. GCP tools for data processing and analysis
● Capture: Pub/Sub, Logs, App Engine, BigQuery streaming
● Store: Cloud Storage, Cloud Datastore (NoSQL), Cloud SQL (MySQL), BigQuery Storage
● Process: Dataflow, Dataproc
● Analyze: BigQuery, Larger Hadoop Ecosystem (Dataproc)
11. Common big data processing flow
● Devices: smart devices, IoT devices and sensors as the data sources
● Frontend: physical or virtual servers acting as the frontend data receiver
● Queue: a strong queue service for handling large-scale data ingestion
● Transformation: MapReduce servers for large-scale data transformation, in batch or streaming mode
● Storage: a large-scale data store for storing data and serving query workloads
● Example scale: 1M devices → 16.6K events/sec → 43B events/month
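These figures are consistent with each device emitting roughly one event per minute: 1,000,000 devices / 60 s ≈ 16.6K events/sec, and 16.6K events/sec × 86,400 s/day × 30 days ≈ 43B events/month.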
27. Lab 1: Ready your Dataflow environment and create your first Dataflow project
● Create a GCP project
● Install Eclipse and the Dataflow plugin
● Create your first Dataflow project
● Run your project
31. What we do in Big Data processing…
● Map → ParDo
● Shuffle → GroupByKey
● Reduce → ParDo
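A minimal word-count sketch of this mapping, written in the same Dataflow 1.x SDK style as the Code Review slides (imports are omitted as on the slides; the pipeline p and the literal input are illustrative assumptions):
// "Map": ParDo emits one KV<word, 1L> per word in each line
PCollection<KV<String, Long>> ones = p
    .apply(Create.of("hello world", "hello dataflow"))
    .apply(ParDo.of(new DoFn<String, KV<String, Long>>() {
      @Override
      public void processElement(ProcessContext c) {
        for (String word : c.element().split(" ")) {
          c.output(KV.of(word, 1L));
        }
      }
    }));
// "Shuffle": GroupByKey gathers all values that share a key
PCollection<KV<String, Iterable<Long>>> grouped =
    ones.apply(GroupByKey.<String, Long>create());
// "Reduce": ParDo sums the grouped counts per word
PCollection<KV<String, Long>> counts =
    grouped.apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
      @Override
      public void processElement(ProcessContext c) {
        long sum = 0;
        for (Long n : c.element().getValue()) { sum += n; }
        c.output(KV.of(c.element().getKey(), sum));
      }
    }));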
32. Pipeline
● A directed acyclic graph (DAG) of data processing transformations
● Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
● May include multiple inputs and multiple outputs
● May encompass many logical MapReduce operations
● PCollections flow through the pipeline
33. Code Review
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(Create.of("Hello", "World"))
.apply(ParDo.of(new DoFn<String, String>() {
@Override
public void processElement(ProcessContext c) {
c.output(c.element().toUpperCase());
}
}))
.apply(ParDo.of(new DoFn<String, Void>() {
@Override
public void processElement(ProcessContext c) {
LOG.info(c.element());
}
}));
p.run();
Create the Pipeline object
Run the pipeline
34. Inputs & Outputs
❯ Read from standard Google Cloud Platform data sources
• GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source by teaching Dataflow how to read it in parallel
• Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
• More coming…
❯ Can use a combination of text, JSON, XML, and Avro formatted data
❯ Your source/sink here
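For example, reading from and writing to GCS with the built-in TextIO (a minimal sketch; the bucket paths are placeholders and the pipeline p is assumed to exist):
// Read text lines from Google Cloud Storage (a bounded source)
PCollection<String> lines = p.apply(TextIO.Read.from("gs://my-bucket/input.txt"));
// ... transforms ...
// Write results back to Google Cloud Storage
lines.apply(TextIO.Write.to("gs://my-bucket/output"));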
35. Code Review
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(Create.of("Hello", "World"))
.apply(ParDo.of(new DoFn<String, String>() {
@Override
public void processElement(ProcessContext c) {
c.output(c.element().toUpperCase());
}
}))
.apply(ParDo.of(new DoFn<String, Void>() {
@Override
public void processElement(ProcessContext c) {
LOG.info(c.element());
}
}));
p.run();
Create the input
Parallel Do: log the output
36. PCollection
❯ A collection of data of type T in a pipeline – PCollection<T>
❯ May be either bounded or unbounded in size
❯ Created by using a PTransform to:
• Build from a java.util.Collection
• Read from a backing data store
• Transform an existing PCollection
❯ Often contains key-value pairs using KV
❯ Examples: {Seahawks, NFC, Champions, Seattle, ...} or {..., “NFC Champions #GreenBay”, “Green Bay #superbowl!”, ..., “#GoHawks”, ...}
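A minimal sketch of the first and last points, using the slide's example values (the pipeline p is assumed to exist):
// Build a bounded PCollection from an in-memory java.util.Collection
PCollection<String> teams =
    p.apply(Create.of(Arrays.asList("Seahawks", "NFC", "Champions", "Seattle")));
// PCollections often carry key-value pairs using KV
PCollection<KV<String, String>> keyed =
    p.apply(Create.of(KV.of("S", "Seahawks"), KV.of("N", "NFC")));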
37. Transforms
● A step, or a processing operation, that transforms data
○ convert format, group, filter data
● Types of Transforms
○ ParDo
○ GroupByKey
○ Combine
○ Flatten
■ If multiple PCollection objects contain the same data type, you can merge them into a single logical PCollection using the Flatten transform (see the sketch after this list)
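A minimal Flatten sketch (the collection names are illustrative; the pipeline p is assumed to exist):
// Two PCollections with the same element type
PCollection<String> logsA = p.apply(Create.of("a1", "a2"));
PCollection<String> logsB = p.apply(Create.of("b1", "b2"));
// Merge them into a single logical PCollection
PCollection<String> merged =
    PCollectionList.of(logsA).and(logsB)
        .apply(Flatten.<String>pCollections());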
38. Code Review
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(Create.of("Hello", "World"))
.apply(ParDo.of(new DoFn<String, String>() {
@Override
public void processElement(ProcessContext c) {
c.output(c.element().toUpperCase());
}
}))
.apply(ParDo.of(new DoFn<String, Void>() {
@Override
public void processElement(ProcessContext c) {
LOG.info(c.element());
}
}));
p.run();
Transform the input: using Parallel Do, convert each input element to uppercase in parallel
Parallel Do: log the output
39. ParDo (Parallel Do)
❯ Processes each element of a PCollection independently using a user-provided DoFn
❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo → GBK → ParDo
❯ Useful for:
• Filtering a data set (see the sketch after this slide)
• Formatting or converting the type of each element in a data set
• Extracting parts of each element in a data set
• Performing computations on each element in a data set
❯ Example – KeyBySessionId: {Seahawks, NFC, Champions, Seattle, ...} → {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, …}
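A minimal filtering sketch (the predicate is illustrative; teams is the PCollection<String> from slide 36):
// Keep only the elements that start with "S"; a ParDo may emit zero or more outputs per input
PCollection<String> filtered = teams.apply(ParDo.of(new DoFn<String, String>() {
  @Override
  public void processElement(ProcessContext c) {
    if (c.element().startsWith("S")) {
      c.output(c.element());
    }
  }
}));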
40. GroupByKey
• Takes a PCollection of key-value pairs and gathers up all values with the same key
• Corresponds to the shuffle phase in Hadoop
• Example: {KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...} → {KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
• Wait a minute… how do you do a GroupByKey on an unbounded PCollection? (See the windowing sketch below.)
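The answer, as the streaming example on slide 50 shows, is windowing: apply a Window transform first so GroupByKey can gather values per window rather than waiting for the unbounded collection to end. A minimal sketch (unboundedKvs is an assumed PCollection<KV<String, String>> read from a streaming source):
// Divide the unbounded PCollection into fixed one-minute windows
PCollection<KV<String, String>> windowed = unboundedKvs.apply(
    Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1))));
// GroupByKey now groups values per key within each window
PCollection<KV<String, Iterable<String>>> grouped =
    windowed.apply(GroupByKey.<String, String>create());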
42. Lab 2: Deploy your first project to Google Cloud Platform
● Check that the Lab 1 project is working well
● Deploy to the cloud and watch the Dataflow task dashboard
● Implement the Input/Output/Transform in your project
48. Simple Guide for Pub/Sub
Ref: https://gcpug-tw.gitbooks.io/google-cloud-platform-in-practice/content/pubsub_getting_start.html
49. Dataflow with Cloud Pub/Sub Use Case
● Publishers (A, B, C) publish messages (1, 2, 3) to Topics (A, B, C); Subscriptions (XA, XB, YC, ZC) fan the messages out to Subscribers (X, Y, Z)
● Cloud Pub/Sub characteristics:
○ Globally redundant
○ Low latency (sub-second)
○ N-to-N coupling
○ Batched read/write
○ Push & pull delivery
○ Guaranteed delivery
○ Auto expiration
50. Example of PubSub IO to BigQuery IO
// Enable streaming mode and create the pipeline
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
options.setStreaming(true);
Pipeline p = Pipeline.create(options);
// Read from the Pub/Sub topic as an unbounded PCollection
PCollection<String> input =
    p.apply(PubsubIO.Read.topic("projects/sunny-573/topics/jcconf2016"));
// Divide the stream into fixed-size windows so aggregations can complete
PCollection<String> windowedWords =
    input.apply(Window.<String>into(
        FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))));
// Count words within each window
PCollection<KV<String, Long>> wordCounts =
    windowedWords.apply(new TestMain.MyCountWords());
// Format each count as a BigQuery row and write it out
wordCounts.apply(ParDo.of(new FormatAsTableRowFn())).apply(
    BigQueryIO.Write.to(getTableReference(options)).withSchema(getSchema()));
p.run();
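The FormatAsTableRowFn, getSchema and getTableReference helpers are not shown on the slide; the sketch below is one possible shape for the first two, with illustrative field names (word, count) rather than the presenter's actual code:
// Converts each windowed word count into a BigQuery TableRow
static class FormatAsTableRowFn extends DoFn<KV<String, Long>, TableRow> {
  @Override
  public void processElement(ProcessContext c) {
    c.output(new TableRow()
        .set("word", c.element().getKey())
        .set("count", c.element().getValue()));
  }
}

// A matching BigQuery schema for the rows above
static TableSchema getSchema() {
  List<TableFieldSchema> fields = new ArrayList<>();
  fields.add(new TableFieldSchema().setName("word").setType("STRING"));
  fields.add(new TableFieldSchema().setName("count").setType("INTEGER"));
  return new TableSchema().setFields(fields);
}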