SlideShare uma empresa Scribd logo
1 de 60
Baixar para ler offline
Google Dataflow 小試
Simon Su @ LinkerNetworks
{Google Developer Expert}
https://goo.gl/xkANFT
var simon = {/** I am at GCPUG.TW **/};
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = ‘http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw'
simon.nodejsblog = ‘http://nodejs-in-example.blogspot.tw';
simon.blog = ‘http://peihsinsu.blogspot.com';
simon.slideshare = ‘http://slideshare.net/peihsinsu/';
simon.email = ‘simonsu.mail@gmail.com’;
simon.say(‘Good luck to everybody!');
I’m Simon Su...
What… I’m at Japan!!
https://www.facebook.com/groups/GCPUG.TW/
https://plus.google.com/u/0/communities/116100913832589966421
● Data scientist
● Data engineer
● Frontend engineer
Before Workshop...
Prepare Doc
http://www.slideshare.net/peihsinsu/jcconf2016-dataflow-workshop
LAB Doc
http://www.slideshare.net/peihsinsu/jcconf-2016-dataflow-workshop-labs
Or http://goo.gl/nfdBhV
Google Cloud in Big Data Solution
Google Focused Cloud
Virtualized
Data Centers
Standard virtual kit for
Rent. Still yours to
manage.
2nd
Wave
Colocation
1st
Wave
Your kit, someone
else’s building.
Yours to manage.
Assembly required True On Demand Cloud
Next
Storage Processing Memory Network
Clusters
Distributed Storage, Processing
& Machine Learning
Containers
3rd
Wave
An actual, global
elastic cloud
Invest your energy in
great apps.
Google Cloud Family
Foundation
Infrastructure & Operations
Data Services
Application
Runtime Services
Enabling No-Touch Operations
Breakthrough Insights,
Breakthrough Applications
The Gear that Powers Google
GCP tools for data processing and analysis
Dataflow
StoreCapture Analyze
BigQuery Larger
Hadoop
Ecosystem
Pub/Sub
Logs
App Engine
BigQuery streaming
Process
Cloud
Storage
Cloud
Datastore
(NoSQL)
Cloud SQL
(mySQL)
BigQuery
Storage
Dataproc Dataproc
Common big data processing flow
Devices
Physical or
virtual servers
as frontend
data receiver
MapReduce
servers for large
data
transformation in
batch way or
streaming
Strong queue
service for
handling large
scale data
injection
Large scale data
store for storing
data and serve
query workload
Smart devices,
IoT devices
and Sensors
1M
Devices
16.6K
Events/sec
16.6K
Events/sec
43B
Events/month
SpannerDremelMapReduce
Big Table Colossus
2012 20132002 2004 2006 2008 2010
GFS MillWheel
Flume
Google Big Data Evolution History
Look into Dataflow
GCP
Managed Service
User Code & SDK
Work Manager
Deploy & Schedule
Monitoring UI
Job Manager
Progress & Logs
Dataflow use case
• Movement
• Filtering
• Enrichment
• Shaping
• Reduction
• Batch computation
• Continuous
computation
• Composition
• External
orchestration
• Simulation
OrchestrationAnalysisETL
Getting Start - Installation
Get your GCP project
Install Eclipse Plugin for Dataflow
Verify your installation
Another Choice...
https://github.com/peihsinsu/gcp-dataflow-java
Getting start with GCS
Google Cloud Storage Features
Online cloud
import (Cloud
Storage Transfer
Service)
Object lifecycle
management
ACLs
Object change
notification
Offline import
(third party)
Regional buckets
Object
versioning
Create your bucket for Dataflow use
Run Dataflow in Local
Create dataflow project
Dataflow Sample
Execute Dataflow
Lab 1: Ready your dataflow
environment and create your
first dataflow project
● Create GCP project
● Install Eclipse and Dataflow plugin
● Create first Dataflow project
● Run your project
After lab - What thing in GCS bucket
After lab - The Run Configuration
Dataflow in Batch Mode
What we do in Big Data process...
Map
Shuffle
Reduce
ParDo
GroupByKey
ParDo
Pipeline
● A Direct Acyclic Graph of data processing
transformations
● Can be submitted to the Dataflow Service for
optimization and execution or executed on an
alternate runner e.g. Spark
● May include multiple inputs and multiple outputs
● May encompass many logical MapReduce
operations
● PCollections flow through the pipeline
Code Review
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(Create.of("Hello", "World"))
.apply(ParDo.of(new DoFn<String, String>() {
@Override
public void processElement(ProcessContext c) {
c.output(c.element().toUpperCase());
}
}))
.apply(ParDo.of(new DoFn<String, Void>() {
@Override
public void processElement(ProcessContext c) {
LOG.info(c.element());
}
}));
p.run();
建立pipeline物件
執行pipeline
Inputs & Outputs
Your
Source/Sink
Here
❯ Read from standard Google Cloud Platform
data sources
• GCS, Pub/Sub, BigQuery, Datastore
❯ Write your own custom source by teaching
Dataflow how to read it in parallel
• Currently for bounded sources only
❯ Write to GCS, BigQuery, Pub/Sub
• More coming…
❯ Can use a combination of text, JSON, XML,
Avro formatted data
Code Review
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(Create.of("Hello", "World"))
.apply(ParDo.of(new DoFn<String, String>() {
@Override
public void processElement(ProcessContext c) {
c.output(c.element().toUpperCase());
}
}))
.apply(ParDo.of(new DoFn<String, Void>() {
@Override
public void processElement(ProcessContext c) {
LOG.info(c.element());
}
}));
p.run();
建立Input
Parallel Do,Log輸出
PCollection
❯ A collection of data of type T in a pipeline
- PCollection<K,V>
❯ Maybe be either bounded or unbounded
in size
❯ Created by using a PTransform to:
• Build from a java.util.Collection
• Read from a backing data store
• Transform an existing PCollection
❯ Often contain the key-value pairs using
KV
{Seahawks, NFC, Champions, Seattle,
...}
{...,
“NFC Champions #GreenBay”,
“Green Bay #superbowl!”,
...
“#GoHawks”,
...}
Transforms
● A step, or a processing operation that transforms data
○ convert format , group , filter data
● Type of Transforms
○ ParDo
○ GroupByKey
○ Combine
○ Flatten
■ Multiple PCollection objects that contain the same data type, you can
merge them into a single logical PCollection using the Flatten transform
Code Review
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(Create.of("Hello", "World"))
.apply(ParDo.of(new DoFn<String, String>() {
@Override
public void processElement(ProcessContext c) {
c.output(c.element().toUpperCase());
}
}))
.apply(ParDo.of(new DoFn<String, Void>() {
@Override
public void processElement(ProcessContext c) {
LOG.info(c.element());
}
}));
p.run();
Transform輸入,使用Parallel Do
方式,平行的將輸入物件轉大寫
Parallel Do,Log輸出
Pardo (Parallel do)
❯ Processes each element of a PCollection
independently using a user-provided DoFn
❯ Corresponds to both the Map and Reduce
phases in Hadoop i.e. ParDo->GBK->ParDo
❯ Useful for
Filtering a data set.
Formatting or converting the type of each element
in a data set.
Extracting parts of each element in a data set.
Performing computations on each element in a
data set.
{Seahawks, NFC,
Champions, Seattle, ...}
{
KV<S, Seahawks>,
KV<C,Champions>,
<KV<S, Seattle>,
KV<N, NFC>, …
}
KeyBySessionId
Group by key
Wait a minute…
How do you do a GroupByKey on an unbounded PCollection?
{KV<S, Seahawks>, KV<C,Champions>,
<KV<S, Seattle>, KV<N, NFC>, ...}
{KV<S, Seahawks>, KV<C,Champions>,
<KV<S, Seattle>, KV<N, NFC>, ...}
GroupByKey
• Takes a PCollection of key-value pairs
and gathers up all values with the same
key
• Corresponds to the shuffle phase in
Hadoop
{KV<S, {Seahawks, Seattle, …},
KV<N, {NFC, …}
KV<C, {Champion, …}}
Group by key sample
Lab 2: Deploy your first
project to Google Cloud
Platform
● Checking the Lab 1 project working well
● Deploy to cloud and watch the dataflow
task dashboard
● Implement the Input/Output/Transform
in your project
Dataflow Task Dashboard
Dataflow in Streaming Mode
Pub/Sub working model
Pub/Sub Operations - Topics
Pub/Sub Operations - Subscriber
Simple Guide for Pub/Sub
Ref: https://gcpug-tw.gitbooks.io/google-cloud-platform-in-practice/content/pubsub_getting_start.html
Dataflow with Cloud Pub/Sub Use Case
Publisher A Publisher B Publisher C
Message 1
Topic A Topic B Topic C
Subscription XA Subscription XB
Subscription
YC
Subscription
ZC
Cloud
Pub/Sub
Subscriber X Subscriber Y
Message 2 Message 3
Subscriber Z
Message 1
Message 2
Message 3
Message 3
Globally redundant
Low latency (sub sec.)
N to N coupling
Batched read/write
Push & Pull
Guaranteed Delivery
Auto expiration
Example of PubSub IO to BigQuery IO
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
options.setStreaming(true);
Pipeline p = Pipeline.create(options);
PCollection<String> input =
p.apply(PubsubIO.Read.topic("projects/sunny-573/topics/jcconf2016"));
PCollection<String> windowedWords =
input.apply(Window.<String> into(
FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))));
PCollection<KV<String, Long>> wordCounts =
windowedWords.apply(new TestMain.MyCountWords());
wordCounts.apply(ParDo.of(new FormatAsTableRowFn())).apply(
BigQueryIO.Write.to(getTableReference(options)).withSchema(getSchema()));
p.run();
Lab 3: Create a Streaming
Dataflow model
● Create PubSub topic
● Deploy Dataflow streaming sample
● Watch Dataflow task dashboard
Streaming Dataflow Executing Log
Don’t forget to shut down streaming process...
After Dataflow
Datalab
An easy tool for analysis and report
Analysis tools - BigQuery & Datalab
BigQuery
An interactive analysis service
Google Data Studio
https://datastudio.google.com/
Thinking your data orchestration...
● Google Codelab:
https://codelabs.developers.google.com/?cat=Cloud
● Dataflow Doc:
https://cloud.google.com/dataflow/docs/
More learning resources
--- THANKS ---

Mais conteúdo relacionado

Mais procurados

node.js on Google Compute Engine
node.js on Google Compute Enginenode.js on Google Compute Engine
node.js on Google Compute Engine
Arun Nagarajan
 
Introduction to Google Compute Engine
Introduction to Google Compute EngineIntroduction to Google Compute Engine
Introduction to Google Compute Engine
Colin Su
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
Sub Szabolcs Feczak
 

Mais procurados (20)

GCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow IntroductionGCPUG meetup 201610 - Dataflow Introduction
GCPUG meetup 201610 - Dataflow Introduction
 
Google Cloud Platform Special Training
Google Cloud Platform Special TrainingGoogle Cloud Platform Special Training
Google Cloud Platform Special Training
 
Docker in Action
Docker in ActionDocker in Action
Docker in Action
 
node.js on Google Compute Engine
node.js on Google Compute Enginenode.js on Google Compute Engine
node.js on Google Compute Engine
 
From airflow to google cloud composer
From airflow to google cloud composerFrom airflow to google cloud composer
From airflow to google cloud composer
 
Using Google Compute Engine
Using Google Compute EngineUsing Google Compute Engine
Using Google Compute Engine
 
Hands on Compute Engine
Hands on Compute EngineHands on Compute Engine
Hands on Compute Engine
 
Introduction to Google Compute Engine
Introduction to Google Compute EngineIntroduction to Google Compute Engine
Introduction to Google Compute Engine
 
Intro to the Google Cloud for Developers
Intro to the Google Cloud for DevelopersIntro to the Google Cloud for Developers
Intro to the Google Cloud for Developers
 
Streaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud DataflowStreaming Auto-scaling in Google Cloud Dataflow
Streaming Auto-scaling in Google Cloud Dataflow
 
Google Cloud Platform - Eric Johnson, Joe Selman - ManageIQ Design Summit 2016
Google Cloud Platform - Eric Johnson, Joe Selman - ManageIQ Design Summit 2016Google Cloud Platform - Eric Johnson, Joe Selman - ManageIQ Design Summit 2016
Google Cloud Platform - Eric Johnson, Joe Selman - ManageIQ Design Summit 2016
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Hands on App Engine
Hands on App EngineHands on App Engine
Hands on App Engine
 
Getting started with Google Cloud Training Material - 2018
Getting started with Google Cloud Training Material - 2018Getting started with Google Cloud Training Material - 2018
Getting started with Google Cloud Training Material - 2018
 
A Tour of Google Cloud Platform
A Tour of Google Cloud PlatformA Tour of Google Cloud Platform
A Tour of Google Cloud Platform
 
Google Cloud Platform as a Backend Solution for your Product
Google Cloud Platform as a Backend Solution for your ProductGoogle Cloud Platform as a Backend Solution for your Product
Google Cloud Platform as a Backend Solution for your Product
 
Google Cloud Platform Introduction - 2016Q3
Google Cloud Platform Introduction - 2016Q3Google Cloud Platform Introduction - 2016Q3
Google Cloud Platform Introduction - 2016Q3
 
Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better One
 
Google Dataflow Intro
Google Dataflow IntroGoogle Dataflow Intro
Google Dataflow Intro
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 

Destaque

Destaque (20)

Brocade - Stingray Application Firewall
Brocade - Stingray Application FirewallBrocade - Stingray Application Firewall
Brocade - Stingray Application Firewall
 
GCPNext17' Extend 開始GCP了嗎?
GCPNext17' Extend   開始GCP了嗎?GCPNext17' Extend   開始GCP了嗎?
GCPNext17' Extend 開始GCP了嗎?
 
Google I/O Extended 2016 - 台北場活動回顧
Google I/O Extended 2016 - 台北場活動回顧Google I/O Extended 2016 - 台北場活動回顧
Google I/O Extended 2016 - 台北場活動回顧
 
Google I/O 2016 Recap - Google Cloud Platform News Update
Google I/O 2016 Recap - Google Cloud Platform News UpdateGoogle I/O 2016 Recap - Google Cloud Platform News Update
Google I/O 2016 Recap - Google Cloud Platform News Update
 
中原大學 Shift to cloud
中原大學   Shift to cloud中原大學   Shift to cloud
中原大學 Shift to cloud
 
GCPUG.TW - 2016活動討論
GCPUG.TW - 2016活動討論GCPUG.TW - 2016活動討論
GCPUG.TW - 2016活動討論
 
Google Cloud Platform 2014Q4
Google Cloud Platform 2014Q4Google Cloud Platform 2014Q4
Google Cloud Platform 2014Q4
 
技術單兵作戰及團隊開發流程差異
技術單兵作戰及團隊開發流程差異技術單兵作戰及團隊開發流程差異
技術單兵作戰及團隊開發流程差異
 
html5 & phonegap
html5 & phonegaphtml5 & phonegap
html5 & phonegap
 
中華電信 教育訓練
中華電信 教育訓練中華電信 教育訓練
中華電信 教育訓練
 
Developer team review of 2014
Developer team review of 2014Developer team review of 2014
Developer team review of 2014
 
為 Node.js 專案打造專屬管家進行開發流程整合及健康檢測
為 Node.js 專案打造專屬管家進行開發流程整合及健康檢測為 Node.js 專案打造專屬管家進行開發流程整合及健康檢測
為 Node.js 專案打造專屬管家進行開發流程整合及健康檢測
 
GCPUG.TW - 2015活動回顧
GCPUG.TW - 2015活動回顧GCPUG.TW - 2015活動回顧
GCPUG.TW - 2015活動回顧
 
Web development, from git flow to github flow
Web development, from git flow to github flowWeb development, from git flow to github flow
Web development, from git flow to github flow
 
Docker with Cloud Service GCPUG
Docker with Cloud Service  GCPUGDocker with Cloud Service  GCPUG
Docker with Cloud Service GCPUG
 
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic TrainingGCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
GCP - GCE, Cloud SQL, Cloud Storage, BigQuery Basic Training
 
Google IO - When Bigquery meeet Node.js
Google IO - When Bigquery meeet Node.jsGoogle IO - When Bigquery meeet Node.js
Google IO - When Bigquery meeet Node.js
 
遠端團隊專案建立與管理 remote team management 2016
遠端團隊專案建立與管理 remote team management 2016遠端團隊專案建立與管理 remote team management 2016
遠端團隊專案建立與管理 remote team management 2016
 
Try Cloud Spanner
Try Cloud SpannerTry Cloud Spanner
Try Cloud Spanner
 
從失敗中學習打造技術團隊
從失敗中學習打造技術團隊從失敗中學習打造技術團隊
從失敗中學習打造技術團隊
 

Semelhante a JCConf 2016 - Google Dataflow 小試

Turku loves-storybook-styleguidist-styled-components
Turku loves-storybook-styleguidist-styled-componentsTurku loves-storybook-styleguidist-styled-components
Turku loves-storybook-styleguidist-styled-components
James Stone
 

Semelhante a JCConf 2016 - Google Dataflow 小試 (20)

Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
Google cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache FlinkGoogle cloud Dataflow & Apache Flink
Google cloud Dataflow & Apache Flink
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Untangling - fall2017 - week 9
Untangling - fall2017 - week 9Untangling - fall2017 - week 9
Untangling - fall2017 - week 9
 
Turku loves-storybook-styleguidist-styled-components
Turku loves-storybook-styleguidist-styled-componentsTurku loves-storybook-styleguidist-styled-components
Turku loves-storybook-styleguidist-styled-components
 
Xephon K A Time series database with multiple backends
Xephon K A Time series database with multiple backendsXephon K A Time series database with multiple backends
Xephon K A Time series database with multiple backends
 
How to build unified Batch & Streaming Pipelines with Apache Beam and Dataflow
How to build unified Batch & Streaming Pipelines with Apache Beam and DataflowHow to build unified Batch & Streaming Pipelines with Apache Beam and Dataflow
How to build unified Batch & Streaming Pipelines with Apache Beam and Dataflow
 
How Bitbucket Pipelines Loads Connect UI Assets Super-fast
How Bitbucket Pipelines Loads Connect UI Assets Super-fastHow Bitbucket Pipelines Loads Connect UI Assets Super-fast
How Bitbucket Pipelines Loads Connect UI Assets Super-fast
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
TypeScript and SharePoint Framework
TypeScript and SharePoint FrameworkTypeScript and SharePoint Framework
TypeScript and SharePoint Framework
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
Setting Up a TIG Stack for Your Testing
Setting Up a TIG Stack for Your TestingSetting Up a TIG Stack for Your Testing
Setting Up a TIG Stack for Your Testing
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
Introduction to Django
Introduction to DjangoIntroduction to Django
Introduction to Django
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Serverless Computing with Google Cloud
Serverless Computing with Google CloudServerless Computing with Google Cloud
Serverless Computing with Google Cloud
 
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDKBigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
 

Mais de Simon Su

Mais de Simon Su (14)

Kubernetes Basic Operation
Kubernetes Basic OperationKubernetes Basic Operation
Kubernetes Basic Operation
 
Google IoT Core 初體驗
Google IoT Core 初體驗Google IoT Core 初體驗
Google IoT Core 初體驗
 
JSDC 2017 - 使用google cloud 從雲到端,動手刻個IoT
JSDC 2017 - 使用google cloud 從雲到端,動手刻個IoTJSDC 2017 - 使用google cloud 從雲到端,動手刻個IoT
JSDC 2017 - 使用google cloud 從雲到端,動手刻個IoT
 
GCPUG.TW meetup #28 - GKE上運作您的k8s服務
GCPUG.TW meetup #28 - GKE上運作您的k8s服務GCPUG.TW meetup #28 - GKE上運作您的k8s服務
GCPUG.TW meetup #28 - GKE上運作您的k8s服務
 
GCE Windows Serial Console Usage Guide
GCE Windows Serial Console Usage GuideGCE Windows Serial Console Usage Guide
GCE Windows Serial Console Usage Guide
 
Google Cloud Monitoring
Google Cloud MonitoringGoogle Cloud Monitoring
Google Cloud Monitoring
 
JCConf2016 - Dataflow Workshop Setup
JCConf2016 - Dataflow Workshop SetupJCConf2016 - Dataflow Workshop Setup
JCConf2016 - Dataflow Workshop Setup
 
IThome DevOps Summit - IoT、docker與DevOps
IThome DevOps Summit - IoT、docker與DevOpsIThome DevOps Summit - IoT、docker與DevOps
IThome DevOps Summit - IoT、docker與DevOps
 
GCS - Access Control Lists (中文)
GCS - Access Control Lists (中文)GCS - Access Control Lists (中文)
GCS - Access Control Lists (中文)
 
Google Cloud Platform - for Mobile Solutions
Google Cloud Platform - for Mobile SolutionsGoogle Cloud Platform - for Mobile Solutions
Google Cloud Platform - for Mobile Solutions
 
JCConf 2015 - 輕鬆學google的雲端開發 - Google App Engine入門(下)
JCConf 2015  - 輕鬆學google的雲端開發 - Google App Engine入門(下)JCConf 2015  - 輕鬆學google的雲端開發 - Google App Engine入門(下)
JCConf 2015 - 輕鬆學google的雲端開發 - Google App Engine入門(下)
 
JCConf 2015 - 輕鬆學google的雲端開發 - Google App Engine入門(上)
JCConf 2015  - 輕鬆學google的雲端開發 - Google App Engine入門(上)JCConf 2015  - 輕鬆學google的雲端開發 - Google App Engine入門(上)
JCConf 2015 - 輕鬆學google的雲端開發 - Google App Engine入門(上)
 
CouchDB Getting Start
CouchDB Getting StartCouchDB Getting Start
CouchDB Getting Start
 
Google Cloud Platform專案建立說明
Google Cloud Platform專案建立說明Google Cloud Platform專案建立說明
Google Cloud Platform專案建立說明
 

Último

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 

JCConf 2016 - Google Dataflow 小試

  • 1. Google Dataflow 小試 Simon Su @ LinkerNetworks {Google Developer Expert} https://goo.gl/xkANFT
  • 2. var simon = {/** I am at GCPUG.TW **/}; simon.aboutme = 'http://about.me/peihsinsu'; simon.nodejs = ‘http://opennodes.arecord.us'; simon.googleshare = 'http://gappsnews.blogspot.tw' simon.nodejsblog = ‘http://nodejs-in-example.blogspot.tw'; simon.blog = ‘http://peihsinsu.blogspot.com'; simon.slideshare = ‘http://slideshare.net/peihsinsu/'; simon.email = ‘simonsu.mail@gmail.com’; simon.say(‘Good luck to everybody!'); I’m Simon Su...
  • 5. ● Data scientist ● Data engineer ● Frontend engineer
  • 6. Before Workshop... Prepare Doc http://www.slideshare.net/peihsinsu/jcconf2016-dataflow-workshop LAB Doc http://www.slideshare.net/peihsinsu/jcconf-2016-dataflow-workshop-labs Or http://goo.gl/nfdBhV
  • 7. Google Cloud in Big Data Solution
  • 8. Google Focused Cloud Virtualized Data Centers Standard virtual kit for Rent. Still yours to manage. 2nd Wave Colocation 1st Wave Your kit, someone else’s building. Yours to manage. Assembly required True On Demand Cloud Next Storage Processing Memory Network Clusters Distributed Storage, Processing & Machine Learning Containers 3rd Wave An actual, global elastic cloud Invest your energy in great apps.
  • 9. Google Cloud Family Foundation Infrastructure & Operations Data Services Application Runtime Services Enabling No-Touch Operations Breakthrough Insights, Breakthrough Applications The Gear that Powers Google
  • 10. GCP tools for data processing and analysis Dataflow StoreCapture Analyze BigQuery Larger Hadoop Ecosystem Pub/Sub Logs App Engine BigQuery streaming Process Cloud Storage Cloud Datastore (NoSQL) Cloud SQL (mySQL) BigQuery Storage Dataproc Dataproc
  • 11. Common big data processing flow Devices Physical or virtual servers as frontend data receiver MapReduce servers for large data transformation in batch way or streaming Strong queue service for handling large scale data injection Large scale data store for storing data and serve query workload Smart devices, IoT devices and Sensors 1M Devices 16.6K Events/sec 16.6K Events/sec 43B Events/month
  • 12. SpannerDremelMapReduce Big Table Colossus 2012 20132002 2004 2006 2008 2010 GFS MillWheel Flume Google Big Data Evolution History
  • 13. Look into Dataflow GCP Managed Service User Code & SDK Work Manager Deploy & Schedule Monitoring UI Job Manager Progress & Logs
  • 14. Dataflow use case • Movement • Filtering • Enrichment • Shaping • Reduction • Batch computation • Continuous computation • Composition • External orchestration • Simulation OrchestrationAnalysisETL
  • 15. Getting Start - Installation
  • 16. Get your GCP project
  • 17. Install Eclipse Plugin for Dataflow
  • 21. Google Cloud Storage Features Online cloud import (Cloud Storage Transfer Service) Object lifecycle management ACLs Object change notification Offline import (third party) Regional buckets Object versioning
  • 22. Create your bucket for Dataflow use
  • 27. Lab 1: Ready your dataflow environment and create your first dataflow project ● Create GCP project ● Install Eclipse and Dataflow plugin ● Create first Dataflow project ● Run your project
  • 28. After lab - What thing in GCS bucket
  • 29. After lab - The Run Configuration
  • 31. What we do in Big Data process... Map Shuffle Reduce ParDo GroupByKey ParDo
  • 32. Pipeline ● A Direct Acyclic Graph of data processing transformations ● Can be submitted to the Dataflow Service for optimization and execution or executed on an alternate runner e.g. Spark ● May include multiple inputs and multiple outputs ● May encompass many logical MapReduce operations ● PCollections flow through the pipeline
  • 33. Code Review Pipeline p = Pipeline.create( PipelineOptionsFactory.fromArgs(args).withValidation().create()); p.apply(Create.of("Hello", "World")) .apply(ParDo.of(new DoFn<String, String>() { @Override public void processElement(ProcessContext c) { c.output(c.element().toUpperCase()); } })) .apply(ParDo.of(new DoFn<String, Void>() { @Override public void processElement(ProcessContext c) { LOG.info(c.element()); } })); p.run(); 建立pipeline物件 執行pipeline
  • 34. Inputs & Outputs Your Source/Sink Here ❯ Read from standard Google Cloud Platform data sources • GCS, Pub/Sub, BigQuery, Datastore ❯ Write your own custom source by teaching Dataflow how to read it in parallel • Currently for bounded sources only ❯ Write to GCS, BigQuery, Pub/Sub • More coming… ❯ Can use a combination of text, JSON, XML, Avro formatted data
  • 35. Code Review Pipeline p = Pipeline.create( PipelineOptionsFactory.fromArgs(args).withValidation().create()); p.apply(Create.of("Hello", "World")) .apply(ParDo.of(new DoFn<String, String>() { @Override public void processElement(ProcessContext c) { c.output(c.element().toUpperCase()); } })) .apply(ParDo.of(new DoFn<String, Void>() { @Override public void processElement(ProcessContext c) { LOG.info(c.element()); } })); p.run(); 建立Input Parallel Do,Log輸出
  • 36. PCollection ❯ A collection of data of type T in a pipeline - PCollection<K,V> ❯ Maybe be either bounded or unbounded in size ❯ Created by using a PTransform to: • Build from a java.util.Collection • Read from a backing data store • Transform an existing PCollection ❯ Often contain the key-value pairs using KV {Seahawks, NFC, Champions, Seattle, ...} {..., “NFC Champions #GreenBay”, “Green Bay #superbowl!”, ... “#GoHawks”, ...}
  • 37. Transforms ● A step, or a processing operation that transforms data ○ convert format , group , filter data ● Type of Transforms ○ ParDo ○ GroupByKey ○ Combine ○ Flatten ■ Multiple PCollection objects that contain the same data type, you can merge them into a single logical PCollection using the Flatten transform
  • 38. Code Review Pipeline p = Pipeline.create( PipelineOptionsFactory.fromArgs(args).withValidation().create()); p.apply(Create.of("Hello", "World")) .apply(ParDo.of(new DoFn<String, String>() { @Override public void processElement(ProcessContext c) { c.output(c.element().toUpperCase()); } })) .apply(ParDo.of(new DoFn<String, Void>() { @Override public void processElement(ProcessContext c) { LOG.info(c.element()); } })); p.run(); Transform輸入,使用Parallel Do 方式,平行的將輸入物件轉大寫 Parallel Do,Log輸出
  • 39. Pardo (Parallel do) ❯ Processes each element of a PCollection independently using a user-provided DoFn ❯ Corresponds to both the Map and Reduce phases in Hadoop i.e. ParDo->GBK->ParDo ❯ Useful for Filtering a data set. Formatting or converting the type of each element in a data set. Extracting parts of each element in a data set. Performing computations on each element in a data set. {Seahawks, NFC, Champions, Seattle, ...} { KV<S, Seahawks>, KV<C,Champions>, <KV<S, Seattle>, KV<N, NFC>, … } KeyBySessionId
  • 40. Group by key Wait a minute… How do you do a GroupByKey on an unbounded PCollection? {KV<S, Seahawks>, KV<C,Champions>, <KV<S, Seattle>, KV<N, NFC>, ...} {KV<S, Seahawks>, KV<C,Champions>, <KV<S, Seattle>, KV<N, NFC>, ...} GroupByKey • Takes a PCollection of key-value pairs and gathers up all values with the same key • Corresponds to the shuffle phase in Hadoop {KV<S, {Seahawks, Seattle, …}, KV<N, {NFC, …} KV<C, {Champion, …}}
  • 41. Group by key sample
  • 42. Lab 2: Deploy your first project to Google Cloud Platform ● Checking the Lab 1 project working well ● Deploy to cloud and watch the dataflow task dashboard ● Implement the Input/Output/Transform in your project
  • 47. Pub/Sub Operations - Subscriber
  • 48. Simple Guide for Pub/Sub Ref: https://gcpug-tw.gitbooks.io/google-cloud-platform-in-practice/content/pubsub_getting_start.html
  • 49. Dataflow with Cloud Pub/Sub Use Case Publisher A Publisher B Publisher C Message 1 Topic A Topic B Topic C Subscription XA Subscription XB Subscription YC Subscription ZC Cloud Pub/Sub Subscriber X Subscriber Y Message 2 Message 3 Subscriber Z Message 1 Message 2 Message 3 Message 3 Globally redundant Low latency (sub sec.) N to N coupling Batched read/write Push & Pull Guaranteed Delivery Auto expiration
  • 50. Example of PubSub IO to BigQuery IO Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class); options.setStreaming(true); Pipeline p = Pipeline.create(options); PCollection<String> input = p.apply(PubsubIO.Read.topic("projects/sunny-573/topics/jcconf2016")); PCollection<String> windowedWords = input.apply(Window.<String> into( FixedWindows.of(Duration.standardMinutes(options.getWindowSize())))); PCollection<KV<String, Long>> wordCounts = windowedWords.apply(new TestMain.MyCountWords()); wordCounts.apply(ParDo.of(new FormatAsTableRowFn())).apply( BigQueryIO.Write.to(getTableReference(options)).withSchema(getSchema())); p.run();
  • 51. Lab 3: Create a Streaming Dataflow model ● Create PubSub topic ● Deploy Dataflow streaming sample ● Watch Dataflow task dashboard
  • 53. Don’t forget to shut down streaming process...
  • 55.
  • 56. Datalab An easy tool for analysis and report Analysis tools - BigQuery & Datalab BigQuery An interactive analysis service
  • 58. Thinking your data orchestration...
  • 59. ● Google Codelab: https://codelabs.developers.google.com/?cat=Cloud ● Dataflow Doc: https://cloud.google.com/dataflow/docs/ More learning resources