BIG DATA PROCESSING WITH PUB/SUB,
DATAFLOW AND BIGQUERY
Thuyen Ho – Data Engineer @ KNOREX
© 2018 KNOREX
ABOUT KNOREX
Established in 2010, Knorex provides Precision Performance Marketing products and solutions to leading trading desks, agencies and brands.
Offices and direct business presence across the US, UK, Australia, China, India and Southeast Asia (SEA).
8 OFFICES
110+ STAFF
PROBLEM STATEMENT
Ingest a large volume of streaming user data, transform it based on ever-changing parameters, and store it in a database in real time. This data will be used for two purposes:
1. Targeting users in real time for advertising campaigns
2. Aggregating data to estimate campaign reach
Third-party partner -> KNOREX DMP
Ingest stream events
• QPS: ~1500 - 2000 events
• Event size: 50KB – 100KB
• Data Volume: ~1TB a day
Historical data
• Reprocess: ~30TB each day
• Aggregate: ~60TB each day
AGENDA
• Quick Introduction To Pub/Sub, Dataflow and BigQuery
• KNOREX Approach
• Q&A
Quick Introduction To Pub/Sub, Dataflow and BigQuery
SERVERLESS STREAM PROCESSING PIPELINE WITH GCP
Data events -> Pub/Sub (messaging queue) -> Dataflow (stream processing) -> Processed data -> BigQuery (analytics engine)
CLOUD PUB/SUB
Cloud Pub/Sub is an asynchronous messaging service designed to be highly reliable and scalable.
CLOUD PUB/SUB – PULL SUBSCRIPTION
CLOUD PUB/SUB – PUSH SUBSCRIPTION
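The two subscription slides can be contrasted with a small in-memory sketch. It is not the Cloud Pub/Sub client API: the Subscription class and its methods are hypothetical names used only to show who initiates delivery. With pull, messages wait until the subscriber asks for them; with push, the service calls the subscriber's endpoint.

```python
from collections import deque


class Subscription:
    """Toy in-memory subscription (hypothetical, not the real client API)."""

    def __init__(self, push_endpoint=None):
        self.backlog = deque()
        # A callable standing in for the subscriber's HTTPS endpoint.
        self.push_endpoint = push_endpoint

    def deliver(self, message):
        if self.push_endpoint is not None:
            # Push: the service initiates delivery. A truthy return value
            # plays the role of an acknowledgement.
            if not self.push_endpoint(message):
                self.backlog.append(message)  # kept for a later retry
        else:
            # Pull: the message waits until the subscriber requests it.
            self.backlog.append(message)

    def pull(self, max_messages=10):
        """Subscriber-initiated fetch of up to max_messages messages."""
        batch = []
        while self.backlog and len(batch) < max_messages:
            batch.append(self.backlog.popleft())
        return batch
```

In this sketch a pulled message is removed immediately; a real subscriber acknowledges each message explicitly after processing it.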
LAMBDA ARCHITECTURE
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. (source: wikipedia.org)
To balance:
• Latency
• Throughput
• Fault-tolerance
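A minimal sketch of the idea, assuming events are dicts with a hypothetical "user" field: a batch layer recomputes a view from the full history, a speed layer covers events that arrived since the last batch run, and queries are answered by merging the two.

```python
from collections import Counter


def batch_view(historical_events):
    """Batch layer: recompute per-user counts from the full history."""
    return Counter(e["user"] for e in historical_events)


def speed_view(recent_events):
    """Speed layer: per-user counts for events the batch run has not seen yet."""
    return Counter(e["user"] for e in recent_events)


def serve(batch, speed):
    """Serving layer: merge both views to answer a query."""
    return batch + speed
```

The batch view favours throughput and can be rebuilt after a fault; the speed view keeps latency low for the newest data.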
DATA PROCESSING - TRANSFORMS
Input Data -> Data Processing (Filter, Transform, Group, Aggregate) -> Output Data
Storage
CLOUD DATAFLOW
Cloud Dataflow is a fully managed, autoscaling execution environment for Beam pipelines.
Beam supports the following language-specific SDKs: Java, Python and Go.
Beam: implement batch and streaming data processing jobs that run on any execution engine.
Dataflow: a great execution environment.
BEAM ABSTRACTIONS
• Input Data / Output Data: PCollection (bounded / unbounded)
• Filter, Transform, Group, Aggregate: PTransform
• Data Processing: Pipeline
• Storage
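The mapping above can be illustrated in plain Python. This is not the Beam SDK: each list plays the role of a PCollection, each step plays the role of a PTransform, and the function as a whole plays the role of the Pipeline. The event fields "user" and "bytes" are assumptions made for the example.

```python
from collections import defaultdict


def run_pipeline(events):
    """Toy pipeline: every step takes a collection and returns a new one."""
    # Filter: drop events without a user id.
    valid = [e for e in events if e.get("user")]
    # Transform: keep only the fields needed downstream, as (key, value) pairs.
    pairs = [(e["user"], e["bytes"]) for e in valid]
    # Group: collect the values for each key.
    grouped = defaultdict(list)
    for user, size in pairs:
        grouped[user].append(size)
    # Aggregate: reduce each group to a single value.
    return {user: sum(sizes) for user, sizes in grouped.items()}
```

Here the input is bounded (a finite list); an unbounded input needs windowing before the group and aggregate steps can emit results.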
BEAM - FIXED TIME WINDOWS
[Figure: unbounded events plotted against processing time and grouped into consecutive 30s windows: window 0 (00:00:00 to 00:00:30), window 1 (00:00:30 to 00:01:00), window 2 (00:01:00 to 00:01:30).]
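Fixed windows can be sketched in plain Python (not the Beam API): each event falls into exactly one 30s window, found by rounding its timestamp down to a multiple of the window size. Timestamps are assumed to be integer seconds, and the sample values in the usage below are made up.

```python
from collections import defaultdict

WINDOW_SECONDS = 30


def fixed_window(timestamp, size=WINDOW_SECONDS):
    """Return the [start, end) window that contains the timestamp."""
    start = (timestamp // size) * size
    return (start, start + size)


def sum_per_window(events, size=WINDOW_SECONDS):
    """Sum values per window; events is an iterable of (timestamp, value)."""
    totals = defaultdict(int)
    for ts, value in events:
        totals[fixed_window(ts, size)] += value
    return dict(totals)
```

For example, sum_per_window([(3, 1), (29, 2), (30, 1)]) puts the first two events in window (0, 30) and the third in window (30, 60).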
BEAM – SLIDING TIME WINDOWS
[Figure: unbounded events plotted against processing time and covered by overlapping 30s windows 0, 1 and 2; time axis from 00:00:00 to 00:01:30 in 30s steps.]
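With sliding windows an event can belong to several windows at once. The sketch below is plain Python, not the Beam API. The slide only gives the 30s window size, so the 15s period (how often a new window starts) is an assumption, as is a stream that starts at second 0.

```python
def sliding_windows(timestamp, size=30, period=15):
    """Return every [start, end) window that contains the timestamp.

    Windows are `size` seconds long and a new one starts every `period`
    seconds; windows starting before second 0 are ignored.
    """
    windows = []
    # Latest window start at or before the timestamp.
    start = (timestamp // period) * period
    # Walk back while the window still covers the timestamp.
    while start > timestamp - size and start >= 0:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)
```

With these parameters an event at second 20 falls into both (0, 30) and (15, 45), which is the overlap the figure shows.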
BEAM – SESSION WINDOWS
[Figure: events plotted against processing time and grouped into session windows 0, 1 and 2, separated by a gap duration; time axis from 00:00:00 to 00:01:30 in 30s steps.]
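Session windows have no fixed length: a session stays open while events keep arriving and closes once the pause reaches the gap duration. The sketch below is plain Python for a single key (Beam computes sessions per key). The slide does not give the gap value, so 30s is an assumption.

```python
def session_windows(timestamps, gap=30):
    """Group timestamps (seconds) into sessions.

    A new session starts when the pause since the previous event is at
    least `gap` seconds.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # still inside the current session
        else:
            sessions.append([ts])  # gap reached: open a new session
    return sessions
```

Because the boundaries depend on the data, two users with different activity patterns get sessions of different lengths.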
CLOUD BIGQUERY
A fast, highly scalable, cost-effective, and fully managed enterprise data warehouse for analytics.
Some of the features:
• Serverless
• Real-time Analytics
• Standard SQL
• Storage and Compute Separation
• Flexible Data Ingestion
• Petabyte Scale
BIGQUERY STORAGE IS COLUMNAR
Column1 Column2 Column3
Each column is stored separately. No indexes or keys are required.
INGESTION-TIME PARTITIONED TABLE
Columns: Column1, Column2, Column3, plus the pseudo-columns _PARTITIONTIME and _PARTITIONDATE.
_PARTITIONTIME         _PARTITIONDATE
2018-12-01 00:00:00    2018-12-01
2018-12-01 00:00:00    2018-12-01
2018-12-02 00:00:00    2018-12-02
2018-12-02 00:00:00    2018-12-02
2018-12-02 00:00:00    2018-12-02
2018-12-03 00:00:00    2018-12-03
2018-12-03 00:00:00    2018-12-03
SELECT Column1, Column2
FROM `database.table_name`
WHERE _PARTITIONDATE >= "2018-12-01" AND _PARTITIONDATE < "2018-12-03"
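A date-range filter on the partition column means only the matching partitions need to be read. The sketch below models this in plain Python rather than BigQuery itself: the table is a hypothetical dict with one list of rows per daily partition, using the same row counts per date as the slide (2, 3 and 2), and the row values are made up.

```python
from datetime import date

# Hypothetical table: one list of rows per daily partition.
partitions = {
    date(2018, 12, 1): [("a", 1), ("b", 2)],
    date(2018, 12, 2): [("c", 3), ("d", 4), ("e", 5)],
    date(2018, 12, 3): [("f", 6), ("g", 7)],
}


def scan(partitions, start, end):
    """Read only partitions whose date satisfies start <= date < end."""
    selected = [d for d in sorted(partitions) if start <= d < end]
    rows = [row for d in selected for row in partitions[d]]
    return selected, rows
```

Calling scan(partitions, date(2018, 12, 1), date(2018, 12, 3)) touches the first two partitions and never reads the 2018-12-03 rows.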
PARTITIONED TABLE
Partitioned based on data in a specified TIMESTAMP or DATE column.
Columns: Column1, Column2, Column3 (the partitioning column).
Column3
2018-12-01
2018-12-01
2018-12-02
2018-12-02
2018-12-02
2018-12-03
2018-12-03
SELECT Column1, Column2
FROM `database.table_name`
WHERE Column3 >= "2018-12-01" AND Column3 < "2018-12-03"
KNOREX APPROACH
ARCHITECTURE – STREAMING PIPELINE
Stages: Third-party partner, Event ingest, Processing and analytics, CMS & RTB engine
Components:
• API gateway: Cloud Load Balancing
• API: Compute Engine (autoscaling)
• Cookie: Cloud Pub/Sub (cookie topic)
• Device: Cloud Pub/Sub (device topic)
• Stream processing: Cloud Dataflow (autoscaling)
• Data warehouse: BigQuery (sharding + clustering)
• Segmented users: Cloud Pub/Sub (segment topic)
• Python script: Compute Engine (autoscaling)
• Audience: Cloud Bigtable (3 regions)
• CMS
ARCHITECTURE – EVENT INGEST
The API code runs on Compute Engine (GCE) instances with autoscaling.
It receives about 1,500 events per second from our partner.
The API endpoint puts events into two separate topics: cookie and device.
Cloud Load Balancing -> API on Compute Engine (autoscaling, 1,500 events per second) -> Cloud Pub/Sub cookie topic and Cloud Pub/Sub device topic
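The split into two topics can be sketched as follows. The slides do not say how an event is classified, so the routing rule and the field names device_id and cookie_id are hypothetical, and a dict of lists stands in for Pub/Sub rather than the real client library.

```python
import json
from collections import defaultdict

# Stand-in for Pub/Sub: topic name -> list of published payloads.
topics = defaultdict(list)


def route_event(event):
    """Pick a topic from the identifier the event carries (assumed fields)."""
    if event.get("device_id"):
        return "device"
    if event.get("cookie_id"):
        return "cookie"
    raise ValueError("event has neither device_id nor cookie_id")


def ingest(event):
    """Serialize the event and append it to the chosen topic."""
    topic = route_event(event)
    topics[topic].append(json.dumps(event).encode("utf-8"))
    return topic
```

Keeping the endpoint this thin lets the autoscaled API instances stay stateless, with Pub/Sub absorbing bursts before Dataflow processes them.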
ARCHITECTURE – PROCESSING AND ANALYTICS
Cloud Dataflow transforms and enriches raw events in real time, inserts the processed data into BigQuery, and sends it to the RTB engine through Pub/Sub.
Each region has a subscription that pulls data from the segment topic and inserts it into Bigtable.
BigQuery is the warehouse for analytics. Tables are partitioned by ingestion time, and data is kept for 60 days.
Components:
• Cookie: Cloud Pub/Sub (cookie topic)
• Device: Cloud Pub/Sub (device topic)
• Stream processing: Cloud Dataflow (autoscaling)
• Data warehouse: BigQuery (partition + clustering)
• Segmented users: Cloud Pub/Sub (segment topic)
• Asia region: Compute Engine -> Cloud Bigtable
• JP region: Compute Engine -> Cloud Bigtable
• US region: Compute Engine -> Cloud Bigtable
• CMS
• KNX RTB Engine
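The per-region subscriptions rely on fan-out: every subscription on a topic receives its own copy of each message. The sketch below is a toy in-memory model, not the Pub/Sub or Bigtable client APIs; the Topic class, the message shape (user id, segments) and the dict used as a regional store are all assumptions.

```python
class Topic:
    """Toy topic: each subscription gets its own copy of every message."""

    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        """Create a named subscription and return its backlog list."""
        self.subscriptions[name] = []
        return self.subscriptions[name]

    def publish(self, message):
        for backlog in self.subscriptions.values():
            backlog.append(message)


def drain_into_store(backlog, store):
    """Regional worker: pull everything pending and upsert it by user id."""
    while backlog:
        user_id, segments = backlog.pop(0)
        store[user_id] = segments
```

Because each region drains its own subscription, a slow or unavailable region does not hold back the others.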
ARCHITECTURE – BATCH PIPELINE
Dataflow also reads the past 30 days of data from BigQuery and reprocesses it in a batch job.
Diagram labels: Cloud Dataflow (batch processing), BigQuery (analytics engine), batch pipeline, batch loads, Pub/Sub
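Selecting the input for such a batch run comes down to computing a 30-day date range over the daily partitions. A minimal sketch, assuming the window ends at the start of the current day and excludes it (the slides do not specify the exact boundaries):

```python
from datetime import date, timedelta


def reprocess_range(today, days=30):
    """Return the [start, end) dates covering the `days` days before today."""
    return today - timedelta(days=days), today


def partition_dates(start, end):
    """List each daily partition in [start, end)."""
    return [start + timedelta(days=i) for i in range((end - start).days)]
```

The resulting start and end dates can be used as the bounds of a partition filter when the batch job reads from the warehouse.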
DATAFLOW – PIPELINE VISUALIZATION
Q&A
Building Resilient Streaming Systems Lab
THANK YOU
KNOREX.COM

Ravak dropshipping via API with DroFx.pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 

Big data processing with PubSub, Dataflow, and BigQuery

  • 1. BIG DATA PROCESSING WITH PUB/SUB, DATAFLOW AND BIGQUERY Thuyen Ho – Data Engineer @ KNOREX © 2018 KNOREX
  • 2. ABOUT KNOREX: Established in 2010, Knorex provides Precision Performance Marketing products and solutions to leading trading desks, agencies and brands. Offices and direct business presence across US, UK, Australia, China, India and Southeast Asia (SEA). 8 offices, 110+ staff.
  • 3. PROBLEM STATEMENT: Ingest a large volume of streaming user data, transform it based on ever-changing parameters, and store it in a database in real time. This data is used for two purposes: 1. Targeting users in real time for advertising campaigns. 2. Aggregating data to estimate campaign reach. Diagram labels: third-party partner, KNOREX DMP. Ingested stream events: • QPS: ~1500 - 2000 events • Event size: 50KB – 100KB • Data volume: ~1TB a day. Historical data: • Reprocess: ~30TB each day • Aggregate: ~60TB each day
  • 4. AGENDA: • Quick Introduction To Pub/Sub, Dataflow and BigQuery • KNOREX Approach • Q&A
  • 5. Quick Introduction To Pub/Sub, Dataflow and BigQuery
  • 6. SERVERLESS STREAM PROCESSING PIPELINE WITH GCP: Pub/Sub (messaging queue) → data events → Dataflow (stream processing) → processed data → BigQuery (analytics engine)
  • 7. CLOUD PUB/SUB: Cloud Pub/Sub is an asynchronous messaging service designed to be highly reliable and scalable.
  • 8. CLOUD PUB/SUB – PULL SUBSCRIPTION
  • 9. CLOUD PUB/SUB – PUSH SUBSCRIPTION
  • 10. LAMBDA ARCHITECTURE: Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. (source: wikipedia.org) To balance: • Latency • Throughput • Fault-tolerance
  • 11. DATA PROCESSING - TRANSFORMS: Diagram labels: Input Data, Data Processing (Filter, Transform, Group, Aggregate), Storage, Output Data.
  • 12. CLOUD DATAFLOW: Cloud Dataflow is a fully managed, autoscaling execution environment for Beam pipelines. Beam supports the following language-specific SDKs: Java, Python and Go. With Beam you implement batch and streaming data processing jobs that run on any execution engine; Dataflow is one such execution environment.
  • 13. BEAM ABSTRACTIONS: Diagram labels: Pipeline; PCollection (bounded / unbounded) for Input Data and Output Data; PTransform for each Data Processing step (Filter, Transform, Group, Aggregate); Storage.
  • 14. BEAM - FIXED TIME WINDOWS: [Figure: unbounded events plotted against processing time from 00:00:00 to 00:01:30, grouped into 30s window 0, 30s window 1 and 30s window 2.]
  • 15. BEAM – SLIDING TIME WINDOWS: [Figure: unbounded events plotted against processing time from 00:00:00 to 00:01:30, covered by 30s window 0, 30s window 1 and 30s window 2.]
  • 16. BEAM – SESSION WINDOWS: [Figure: events plotted against processing time from 00:00:00 to 00:01:30, grouped into window 0, window 1 and window 2, with the windows separated by the gap duration.]
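The three window types on slides 14 to 16 can be sketched in plain Python, without the Beam SDK. This is only an illustration of the assignment rules as Beam documents them, not code from the talk: the function names and the 15-second sliding period are assumptions, and timestamps are plain seconds.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # half-open interval [start, end)


def fixed_window(ts: float, size: float) -> Window:
    """Return the single fixed window of length `size` that contains `ts`."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts: float, size: float, period: float) -> List[Window]:
    """Return every window of length `size`, started every `period` seconds,
    that contains `ts`. An element can belong to several windows."""
    windows = []
    start = ts - (ts % period)  # latest window start at or before ts
    while start > ts - size:    # the window [start, start + size) still covers ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps: List[float], gap: float) -> List[Window]:
    """Group timestamps into sessions: each element opens [ts, ts + gap),
    and overlapping intervals are merged into one session."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Element arrives before the current session closes: extend it.
            sessions[-1] = (sessions[-1][0], max(sessions[-1][1], ts + gap))
        else:
            sessions.append((ts, ts + gap))
    return sessions
```

For example, with 30-second windows an event at second 45 lands in the fixed window (30, 60), while with an assumed 15-second period it lands in the two sliding windows (30, 60) and (45, 75).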
  • 17. CLOUD BIGQUERY: A fast, highly scalable, cost-effective, and fully managed enterprise data warehouse for analytics. Some of the features: • Serverless • Real-time Analytics • Standard SQL • Storage and Compute Separation • Flexible Data Ingestion • Petabyte Scale
  • 18. BIGQUERY STORAGE IS COLUMNAR: Column1, Column2, Column3. Each column is stored separately. No indexes or keys are required.
  • 19. INGESTION-TIME PARTITIONED TABLE: SELECT Column1, Column2 FROM `database.table_name` WHERE _PARTITIONDATE >= "2018-12-01" AND _PARTITIONDATE < "2018-12-03". The example table (Column1, Column2, Column3) has seven rows: two in partition 2018-12-01, three in 2018-12-02 and two in 2018-12-03. Each row exposes the pseudo-columns _PARTITIONTIME (for example 2018-12-01 00:00:00) and _PARTITIONDATE (for example 2018-12-01).
  • 21. PARTITIONED TABLE: Partitioned based on data in a specified TIMESTAMP or DATE column. SELECT Column1, Column2 FROM `database.table_name` WHERE Column3 >= "2018-12-01" AND Column3 < "2018-12-03". The example table (Column1, Column2, Column3) has seven rows whose Column3 values are 2018-12-01 (two rows), 2018-12-02 (three rows) and 2018-12-03 (two rows).
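The half-open date filter in the partitioned-table queries above is what lets BigQuery skip partitions outside the range. A minimal sketch of that reasoning follows, using only the standard library. The helper names are hypothetical, the table name is the placeholder from the slides, and the query is only built as a string here, not sent to BigQuery.

```python
from datetime import date, timedelta
from typing import List


def partitions_to_scan(start: date, end: date) -> List[date]:
    """Daily partitions matched by `start <= _PARTITIONDATE < end`."""
    days = max((end - start).days, 0)
    return [start + timedelta(days=i) for i in range(days)]


def build_partition_query(table: str, columns: List[str],
                          start: date, end: date) -> str:
    """Build a query string filtered on the _PARTITIONDATE pseudo-column."""
    return (
        f"SELECT {', '.join(columns)} FROM `{table}` "
        f'WHERE _PARTITIONDATE >= "{start.isoformat()}" '
        f'AND _PARTITIONDATE < "{end.isoformat()}"'
    )


# The slide's example range covers two of the three daily partitions.
scanned = partitions_to_scan(date(2018, 12, 1), date(2018, 12, 3))
query = build_partition_query("database.table_name", ["Column1", "Column2"],
                              date(2018, 12, 1), date(2018, 12, 3))
```

With the slide's dates, only the 2018-12-01 and 2018-12-02 partitions match; the 2018-12-03 partition is excluded because the upper bound is exclusive.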
  • 23. ARCHITECTURE – STREAMING PIPELINE: Diagram labels: Third-Party partner. Event ingest: API gateway (Cloud Load Balancing); API (Compute Engine, autoscaling); Cookie (Cloud Pub/Sub, cookie topic); Device (Cloud Pub/Sub, device topic). Processing and analytics: Stream proc (Cloud Dataflow, autoscaling); Data warehouse (BigQuery, sharding + clustering); Segmented users (Cloud Pub/Sub, Device topic). CMS & RTB engine: Python script (Compute Engine, autoscaling); Audience (Cloud Bigtable, 3 regions); CMS.
  • 24. ARCHITECTURE – EVENT INGEST: Compute Engine (GCE) runs the API code on autoscaling instances. It receives 1500 events a second from our partner. The API endpoint puts events into two separate topics: cookie and device. Diagram labels: Cloud Load Balancing; API (Compute Engine, autoscaling); Cookie (Cloud Pub/Sub, cookie topic); Device (Cloud Pub/Sub, device topic); 1500 events a sec.
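The routing step on the event-ingest slide, where the API endpoint puts each event into either the cookie or the device topic, might look roughly like the sketch below. The talk does not show the routing rule or the event schema, so the `cookie_id` and `device_id` fields are assumptions, and `InMemoryPublisher` is a hypothetical stand-in for a real Pub/Sub publisher client so that the example runs on its own.

```python
import json
from collections import defaultdict
from typing import DefaultDict, List

COOKIE_TOPIC = "cookie"  # topic names taken from the slide
DEVICE_TOPIC = "device"


def choose_topic(event: dict) -> str:
    """Pick a topic from the identifier the event carries (assumed fields)."""
    if event.get("device_id"):
        return DEVICE_TOPIC
    if event.get("cookie_id"):
        return COOKIE_TOPIC
    raise ValueError("event has neither cookie_id nor device_id")


class InMemoryPublisher:
    """Hypothetical stand-in that stores published messages per topic."""

    def __init__(self) -> None:
        self.topics: DefaultDict[str, List[bytes]] = defaultdict(list)

    def publish(self, topic: str, data: bytes) -> None:
        self.topics[topic].append(data)


def ingest(event: dict, publisher: InMemoryPublisher) -> str:
    """Serialize the event as JSON bytes and publish it to its topic."""
    topic = choose_topic(event)
    publisher.publish(topic, json.dumps(event).encode("utf-8"))
    return topic
```

Keeping the routing rule in a small pure function makes it easy to unit test separately from the publisher, which in a real deployment would be replaced by the Pub/Sub client.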
  • 25. ARCHITECTURE – PROCESSING AND ANALYTICS: Cloud Dataflow transforms and enriches raw events in real time, inserts the processed data into BigQuery, and sends it to the RTB engine through Pub/Sub. Each region has a subscription that pulls data from the segment topic and inserts it into Bigtable. BigQuery is the warehouse for analytics. Tables are partitioned by ingestion time and keep data for 60 days. Diagram labels: Cookie (Cloud Pub/Sub, cookie topic); Device (Cloud Pub/Sub, device topic); Stream proc (Cloud Dataflow, autoscaling); Data warehouse (BigQuery, partition + clustering); Segmented users (Cloud Pub/Sub, segment topic); Asia region, JP region and US region, each with Compute Engine and Cloud Bigtable; CMS; KNX RTB Engine.
  • 26. ARCHITECTURE – BATCH PIPELINE: Dataflow also reads the past 30 days of data from BigQuery and reprocesses it in a batch job. Diagram labels: BigQuery (analytics engine); Batch pipeline; Cloud Dataflow (batch processing); Batch loads; BigQuery (analytics engine); Pub/Sub.
  • 27. DATAFLOW – PIPELINE VISUALIZATION