1
David Morin
Big Data Devops
@davAtBzh
Change data capture in
production
How OVH became a Data Driven Business with the help of Apache Flink
Yann Pauly
Software Engineer
@impolitepanda
2
Why did we become Data Driven?
Desire to grow
Need to raise funds from
investors
Investors need assurances and numbers about the business
OVH needs to produce
reliable financial KPIs to
answer regularly
Drive the business with said KPIs
3
How did we become Data Driven?
1999
Only one database for
most products
4
200+ databases
15K+ tables
10M+ events/day
How did we become Data Driven?
5
How did we become Data Driven?
6
Our ingestion pipeline
Databases
7
How did we become Data Driven?
8
Where are all our data stored?
9
Our ingestion pipeline
Databases HDFS
10
How are our data Extracted?
11
Our ingestion pipeline
Data
Collector
Databases HDFS
12
How are our data Transformed and Loaded?
13
How is Flink integrated in our pipeline?
14
How did we customize our storage?
15
How was Flink integrated with Kerberos?
Principal identity + credentials
TGT + service name
Ticket service
Ticket service
Ticket: TGT
16
How was Flink integrated with Kerberos?
Principal identity + credentials
TGT + service name
Ticket service
Ticket service
Ticket: TGT
Expiration!
17
How was Flink integrated with Kerberos?
Get keytab
keytab
TGT + service name
Ticket service
Ticket service
18
How was Flink integrated with Kerberos?
TGT
keytab
TGT + service name
Ticket service
Ticket service
19
How was Flink integrated with Kerberos?
TGT
keytab
Reference keytab in
Flink config
TGT + service name
Ticket service
Ticket service
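The "reference keytab in Flink config" step on this slide boils down to a few `flink-conf.yaml` keys; the paths, principal, and contexts below are illustrative placeholders, not OVH's actual values:

```yaml
# flink-conf.yaml - Kerberos authentication via keytab (illustrative values)
security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /path/to/flink.keytab
security.kerberos.login.principal: flink-user@EXAMPLE.COM
security.kerberos.login.contexts: Client,KafkaClient
```

With a keytab, Flink can re-obtain a TGT itself, so long-running jobs survive ticket expiration.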
20
Our ingestion pipeline
Data
Collector
Databases Flink HDFS
21
How do we analyze our data?
22
How are our data ingested by Hive?
{ api } CLI JDBC Files
JDBC Files
23
How are our data ingested by Hive?
{ api } JDBC Parquet ORC
Streaming
Performance
Transaction
Future proof
24
How are our data ingested by Hive?
25
How are our data ingested by Hive?
{ api } JDBC Parquet ORC
Streaming
Performance
Transaction
Future proof
26
How are our data ingested by Hive?
ORC file layout: each stripe contains Index Data, Row Data (stored column by column: Column 1, Column 2, Column 3 … Column X), and a Stripe Footer; the file ends with a File Footer and a Postscript.
27
Our ingestion pipeline
Data
Collector
Databases Flink HiveHDFS
28
Our Data Sources
MySQL ORC
29
PG
Our Data Sources – Not only MySQL
MySQL ORC
30
Our Data Sources – multiple sources / sinks
PG
MySQL ORC
?
31
Our Data Sources – multiple sources / sinks
PG
MySQL ORC
?
?
32
What’s the problem with that?
Type     | PostgreSQL              | MySQL      | SQL Server | Oracle        | ORC (Hive)
Boolean  | Boolean                 | TinyInt(1) | Bit        | Number(1) 0/1 | Boolean
Float    | Real/Float              | Float      | Float      | Float         | Float
DateTime | Timestamp (no timezone) | DateTime   | DateTime2  | TimeStamp     | TimeStamp
Blob     | ByteA                   | Blob       | Binary(n)  | Blob          | Binary
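The table above can be captured as a lookup from (source DBMS, source type) to the pivot/ORC type. The sketch below is hypothetical (`PIVOT_TYPES` and `to_orc_type` are illustrative names, not OVH's converter), covering only the rows shown:

```python
# Hypothetical (DBMS, source type) -> ORC pivot type lookup,
# built from the mapping table above.
PIVOT_TYPES = {
    ("postgresql", "boolean"): "boolean",
    ("mysql", "tinyint(1)"): "boolean",
    ("sqlserver", "bit"): "boolean",
    ("oracle", "number(1)"): "boolean",
    ("postgresql", "real"): "float",
    ("mysql", "float"): "float",
    ("postgresql", "timestamp"): "timestamp",
    ("mysql", "datetime"): "timestamp",
    ("sqlserver", "datetime2"): "timestamp",
    ("postgresql", "bytea"): "binary",
    ("mysql", "blob"): "binary",
    ("sqlserver", "binary(n)"): "binary",
    ("oracle", "blob"): "binary",
}

def to_orc_type(dbms: str, source_type: str) -> str:
    """Resolve a source column type to its ORC (Hive) pivot type."""
    key = (dbms.lower(), source_type.lower())
    if key not in PIVOT_TYPES:
        raise ValueError(f"no pivot mapping for {key}")
    return PIVOT_TYPES[key]
```

A table-driven mapping like this keeps per-DBMS quirks (e.g. MySQL's TinyInt(1) as Boolean) in one place instead of scattered across converters.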
33
Introducing: a Pivot format
Without a pivot, every source (MySQL, Oracle, MongoDB, PostgreSQL, SQL Server, any DBMS X) needs a dedicated converter to every sink (PostgreSQL, SQL Server, MySQL, HIVE, any DBMS X).
With a Pivot format, each source is converted once to the pivot, and the pivot is converted once to each sink.
34
How do we generate Pivot schemas?
01 Data Collector gets the source RAW schema
02 Our API saves it in our backend
03 We manually launch our converter job
04 Converter runs and generates the pivot schema and the DDL for the target sink
05 Generated pivot schema and target DDL are stored on HDFS
35
Pivot format – Additional conversion data
Update
Insert
Delete
36
Our ingestion pipeline
Data
Collector
Databases Flink
Data
Collector
Databases
Consume
37
How do we push our data into Kafka?
1 Kafka topic = N partitions
What about the table name as the Kafka partition key?
Area table
Ordering preserved by partition
Commands table: bad distribution!
38
Is round-robin the solution?
What about event order?
Any table (partition 1)
Any table (partition 2)
Any table (partition 3)
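The trade-off on these two slides can be sketched in a few lines (illustrative only, not the production code): keying by table name keeps each table's events in order inside one partition but skews load badly when one table dominates, while round-robin balances load at the cost of interleaving a table's events across partitions.

```python
from collections import Counter
from itertools import cycle

# Illustrative stream where one table ("commands") dominates the traffic.
events = [{"table": "commands", "seq": i} for i in range(97)] + \
         [{"table": "area", "seq": i} for i in range(3)]

def partition_by_table(events, n):
    """Key by table name: all of a table's events land in one partition."""
    return Counter(hash(e["table"]) % n for e in events)

def partition_round_robin(events, n):
    """Round-robin: even load, but a table's events are spread out."""
    parts = cycle(range(n))
    return Counter(next(parts) for _ in events)

# by_table uses at most 2 of the 3 partitions, one of them holding 97+
# events (bad distribution); round-robin fills all 3 almost evenly, but
# per-table ordering is no longer guaranteed across partitions.
by_table = partition_by_table(events, 3)
rr = partition_round_robin(events, 3)
```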
39
Our ingestion pipeline
Data
Collector
Databases Flink
Data
Collector
Databases
Consume Sort
40
How do we maintain event order? Watermarks!
Watermark to mark the progress of event time
Watermark based on event timestamp
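The sorting idea behind this slide can be sketched as follows: buffer events, advance a watermark derived from the highest event timestamp seen (minus an allowed out-of-orderness), and release buffered events in timestamp order once the watermark has passed them. This is a minimal stand-alone model, not Flink's actual watermark operator:

```python
import heapq

class WatermarkSorter:
    """Buffer out-of-order events and emit them in event-time order once
    the watermark (max timestamp seen - max_out_of_orderness) has passed
    them. Illustrative sketch of the 'Sort' step, not production code."""

    def __init__(self, max_out_of_orderness):
        self.lateness = max_out_of_orderness
        self.buffer = []              # min-heap keyed on event timestamp
        self.max_ts = float("-inf")

    def push(self, ts, payload):
        self.max_ts = max(self.max_ts, ts)
        heapq.heappush(self.buffer, (ts, payload))
        watermark = self.max_ts - self.lateness
        out = []
        while self.buffer and self.buffer[0][0] <= watermark:
            out.append(heapq.heappop(self.buffer))
        return out  # events now safe to emit, in timestamp order
```

Events arriving within the allowed lateness are re-ordered; events later than that would be dropped or side-outputted in a real pipeline.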
41
Our ingestion pipeline
Data
Collector
Databases Flink
Data
Collector
Databases
Consume Sort Filter
42
Our ingestion pipeline
Data
Collector
Databases Flink
Data
Collector
Databases
Consume Sort Filter Map Convert
43
How do we convert our events to pivot format?
Job spawns: the Flink job retrieves its DB pivot schema from HDFS
Event read: each event corresponds to one table only
Table schema extraction: the current event's table pivot schema is extracted from the database schema
Event conversion: the event is converted from its source format (RAW) to the pivot format
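The four steps above can be sketched as a single conversion function. Everything here is a hypothetical model (event shape, schema shape, and the `CASTS` table are assumptions, not OVH's actual format):

```python
# Hypothetical per-event conversion: look up the event's table in the
# database pivot schema, then cast each raw field to its pivot type.
CASTS = {
    "boolean": lambda v: bool(int(v)),
    "int": int,
    "float": float,
    "string": str,
}

def convert_event(raw_event, db_pivot_schema):
    """raw_event: {"table": ..., "row": {col: raw_value}} - one table per
    event. db_pivot_schema: {table: {col: pivot_type}}, as retrieved
    from HDFS when the job spawns."""
    table_schema = db_pivot_schema[raw_event["table"]]  # schema extraction
    return {
        "table": raw_event["table"],
        "row": {col: CASTS[table_schema[col]](val)
                for col, val in raw_event["row"].items()},
    }
```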
44
Our ingestion pipeline
Data
Collector
Databases Flink
Data
Collector
Databases
Consume Sort Filter Map Convert Aggregate Store
45
Last steps: Windowing and sink
Custom window function based on size and duration
converted events → window aggregation → conversion → ORC files
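The "size and duration" trigger can be sketched like this: accumulate converted events and flush the batch as soon as either the size threshold or the elapsed-time threshold is hit. A stand-alone sketch with an injectable clock, not the production Flink window function:

```python
import time

class SizeOrDurationWindow:
    """Buffer converted events and flush when either the batch size or
    the window duration threshold is reached - an illustrative model of
    a custom size/duration window trigger."""

    def __init__(self, max_size, max_seconds, clock=time.monotonic):
        self.max_size = max_size
        self.max_seconds = max_seconds
        self.clock = clock
        self.batch = []
        self.opened = None            # time the current window opened

    def add(self, event):
        if self.opened is None:
            self.opened = self.clock()
        self.batch.append(event)
        if (len(self.batch) >= self.max_size
                or self.clock() - self.opened >= self.max_seconds):
            flushed, self.batch, self.opened = self.batch, [], None
            return flushed            # would be written as one ORC file
        return None
```

Bounding both dimensions keeps ORC files from being too small (pure time trigger on a quiet table) or too stale (pure size trigger on a quiet table).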
46
Our ingestion pipeline
Data
Collector
Databases Flink
Data
Collector
Databases
Consume Sort Filter Map Convert Aggregate Store
47
Why do we need checkpoints?
Commit
48
How do we manage these anomalies?
Error: Cannot be converted to ORC!
Side output
Write to HDFS (data + error)
Push Metric
Alerting
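The error path on this slide — route the failing record plus its error to a side output, bump a metric for alerting, keep processing — can be sketched like this (all names are illustrative; in Flink this would be a side output tag and a metric counter):

```python
# Illustrative sketch of the anomaly-handling path: records that fail
# conversion go to a side output with the error attached, and an error
# counter is incremented for alerting; good records continue downstream.
def process(events, convert, side_output, metrics):
    good = []
    for event in events:
        try:
            good.append(convert(event))
        except Exception as err:
            side_output.append({"data": event, "error": str(err)})
            metrics["conversion_errors"] = metrics.get("conversion_errors", 0) + 1
    return good
```

Keeping both the data and the error in the side output makes replays possible once the conversion bug is fixed.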
49
How do we monitor our pipeline execution?
Prometheus Push Gateway Reporter → Prometheus Push Gateway → OVH Metrics Data Platform
50
How do we monitor our pipeline execution?
51
Why don’t we see any data in Hive?
52
Why don’t we see any data in Hive?
Hive Managed tables (ACID): ORC files for Table 1, Table 2, Table 3, Table 4 … Table X
External Tables: ORC files for Table 1, Table 2, Table 3, Table 4 … Table X
53
How do we see our data in Hive?
Hive
Hive Managed tables: ORC files for Table 1, Table 2, Table 3, Table 4 … Table X
External Tables: ORC files for Table 1, Table 2, Table 3, Table 4 … Table X
SQL query to merge data
54
Why isn’t this the best solution?
55
What’s a better solution? ORC + Hive Metadata!
SELECT * FROM mytab;
mytab
id | value
1  | test
56
What’s a better solution? ORC + Hive Metadata!
SELECT row__id,* FROM mytab;
row__id                                     | id | value
{"transactionid":10,"bucketid":1,"rowid":0} | 1  | test
57
What’s a better solution? ORC Delta File + Hive Metadata!
INSERT INTO `mytab` VALUES(1,'test');
transaction 1 created
delta_0000001_0000001_0000/bucket_00001
{"operation":0,"originalTransaction":1,"bucket":1,"rowId":0,"currentTransaction":1,"row":{"_col0":1,"_col1":"test"}}
DELETE FROM `mytab` WHERE id=1;
transaction 2 created
delta_0000002_0000002_0000/bucket_00001
{"operation":2,"originalTransaction":1,"bucket":1,"rowId":0,"currentTransaction":2,"row":null}
UPDATE = DELETE + INSERT
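The delta records above compose into the current table state by replaying them in transaction order: operation 0 inserts a row, operation 2 deletes the row identified by (originalTransaction, bucket, rowId), and an UPDATE is just a delete plus an insert. A minimal sketch of that replay (a toy model of Hive's compaction/merge, using the record shape shown above):

```python
# Replay ACID delta records into the current set of live rows.
# operation 0 = insert, 2 = delete; keys follow the delta-file JSON above.
def replay(deltas):
    rows = {}
    for rec in sorted(deltas, key=lambda r: r["currentTransaction"]):
        key = (rec["originalTransaction"], rec["bucket"], rec["rowId"])
        if rec["operation"] == 0:
            rows[key] = rec["row"]
        elif rec["operation"] == 2:
            rows.pop(key, None)
    return list(rows.values())
```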
58
What’s a better solution? ORC Delta File + Hive Metadata!
{"operation":2,"originalTransaction":1,"bucket":1,"rowId":0,"currentTransaction":2,"row":null}
Keep track of several metadata fields
HiveMeta
id
pkValue
operation
originalTxId
bucketId
timestamp
59
Flink State
What’s a better solution? ORC Delta File + Hive Metadata!
{"operation":2,"originalTransaction":1,"bucket":1,"rowId":0,"currentTransaction":2,"row":null}
HiveMeta
id: test.mytab
pkValue: 1
operation: 2
originalTxId: 1
bucketId: 1
timestamp: 1569362895...
Keep track of several metadata fields
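Holding this HiveMeta record per primary-key value is what lets a later event emit a delta targeting the right (originalTxId, bucketId, rowId). A toy model of that keyed state (a plain dict standing in for Flink keyed state; field names follow the HiveMeta record above):

```python
# Toy model of the keyed state: (table, pk value) -> last HiveMeta
# written, so later updates/deletes can reference the original row.
state = {}

def remember(table, pk_value, operation, original_tx, bucket, ts):
    state[(table, pk_value)] = {
        "id": table,
        "pkValue": pk_value,
        "operation": operation,
        "originalTxId": original_tx,
        "bucketId": bucket,
        "timestamp": ts,
    }

def lookup(table, pk_value):
    """Return the last known HiveMeta for this row, or None."""
    return state.get((table, pk_value))
```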
60
How do we store our Flink state?
Local
Scalable
Incremental
RocksDB
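Enabling RocksDB as the state backend, with incremental checkpoints, is a few configuration keys; the checkpoint path below is a placeholder:

```yaml
# flink-conf.yaml - RocksDB state backend (illustrative path)
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: hdfs:///flink/checkpoints
```

Incremental checkpoints upload only the RocksDB SST files that changed since the previous checkpoint, which matters when the keyed state grows large.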
61
Our Flink usage: a summary
Checkpoints, Side output, Windowing, Watermarks, RocksDB state, Metrics
62
Our Flink usage: some numbers…
3+ billion rows
2,500+ synced tables
Up to 300 million query events per dump
200+ Flink containers on YARN
100+ Flink jobs
10+ million streaming events per day
63
What’s next?
Hive 3?
Multiple other sinks
Automate all remaining manual
processes
Make it Open Source?
Rule engine to anonymize data
and perform more complex
transformations
Real-time data merging
64
Questions?
Hive 3?
Multiple other sinks
Automate all remaining manual
processes
Make it Open Source?
Rule engine to anonymize data
and perform more complex
transformations
Real-time data merging

Rennes Meetup 2019-09-26 - Change data capture in production