Rennes Meetup 2019-09-26 - Change data capture in production
1. 1
David Morin
Big Data Devops
@davAtBzh
Change data capture in production
How OVH became a Data Driven Business with the help of Apache Flink
Yann Pauly
Software Engineer
@impolitepanda
2. 2
Why did we become Data Driven?
Desire to grow
Need to raise funds from investors
Investors need assurances and numbers about the business
OVH needs to produce reliable financial KPIs to answer them regularly
Drive the business with said KPIs
3. 3
How did we become Data Driven?
1999
Only one database for most products
15. 15
How was Flink integrated with Kerberos?
[Diagram: standard Kerberos flow — the client presents its principal identity + credentials and receives a TGT; it then presents the TGT + a service name to the ticket-granting service to obtain a service ticket]
16. 16
How was Flink integrated with Kerberos?
[Diagram: the same Kerberos flow, highlighting that the TGT expires — a problem for a long-running job]
17. 17
How was Flink integrated with Kerberos?
[Diagram: the job obtains a keytab instead of relying on a cached, expiring ticket]
18. 18
How was Flink integrated with Kerberos?
[Diagram: the keytab is used to acquire a fresh TGT, which is then exchanged for service tickets]
19. 19
How was Flink integrated with Kerberos?
Reference the keytab in the Flink configuration
[Diagram: Flink uses the referenced keytab to renew TGTs and service tickets on its own]
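Concretely, referencing a keytab in flink-conf.yaml boils down to a handful of entries; this is a minimal sketch with placeholder paths, principal and JAAS contexts (not OVH's actual values):

security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /etc/security/keytabs/flink.keytab
security.kerberos.login.principal: flink@EXAMPLE.COM
security.kerberos.login.contexts: Client,KafkaClient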
25. 25
How are our data ingested by Hive?
Candidates for pushing { api } data into Hive: JDBC, Parquet, ORC
Criteria: Streaming, Performance, Transactions, Future proof
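For reference (not shown in the deck), the Hive side of that choice is a bucketed, transactional ORC table, along these lines:

CREATE TABLE mytab (id INT, value STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');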
26. 26
How are our data ingested by Hive?
[Diagram: ORC file layout — the file is a sequence of stripes, each holding Index Data, Row Data stored column by column (Column 1, Column 2, Column 3, … Column X) and a Stripe Footer, followed by a File Footer and a Postscript]
32. 32
What’s the problem with that?
Type     | PostgreSQL              | MySQL      | SQL Server | Oracle        | ORC (Hive)
Boolean  | Boolean                 | TinyInt(1) | Bit        | Number(1) 0/1 | Boolean
Float    | Real/Float              | Float      | Float      | Float         | Float
DateTime | Timestamp (no timezone) | DateTime   | DateTime2  | TimeStamp     | TimeStamp
Blob     | ByteA                   | Blob       | Binary(n)  | Blob          | Binary
33. 33
Introducing: a Pivot format
[Diagram: without a pivot format, every source (MySQL, Oracle, MongoDB, PostgreSQL, SQL Server, DBMS X, …) needs a dedicated converter for every sink (Hive, PostgreSQL, SQL Server, MySQL, DBMS X, …); with a pivot format, each system only maps to and from the single Pivot representation]
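To make the idea concrete, here is a purely illustrative Java sketch of such a mapping — the enum, method and class names are hypothetical, and OVH's real pivot schema is not shown in the deck:

import java.sql.Types;

public class PivotTypes {
    // Hypothetical neutral type set, inspired by the type table on the previous slide
    public enum PivotType { BOOLEAN, FLOAT, TIMESTAMP, BINARY, STRING }

    // Map a vendor-specific JDBC type to the single pivot type;
    // the target (e.g. Hive/ORC) DDL is then derived from the pivot type only
    public static PivotType toPivot(int jdbcType) {
        switch (jdbcType) {
            case Types.BIT:
            case Types.BOOLEAN:   return PivotType.BOOLEAN;   // MySQL TinyInt(1), SQL Server Bit, Oracle Number(1)
            case Types.REAL:
            case Types.FLOAT:     return PivotType.FLOAT;
            case Types.TIMESTAMP: return PivotType.TIMESTAMP; // DateTime, DateTime2, Timestamp
            case Types.BLOB:
            case Types.BINARY:
            case Types.VARBINARY: return PivotType.BINARY;    // ByteA, Blob, Binary(n)
            default:              return PivotType.STRING;    // conservative fallback
        }
    }
}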
34. 34
How do we generate Pivot schemas?
01 The Data Collector gets the source's RAW schema via the { api }
02 Our API saves it in our backend
03 We manually launch our converter job
04 The converter runs and generates the pivot schema and the DDL for the target sink
05 The generated pivot schema and target DDL are stored on HDFS
37. 37
How do we push our data into Kafka?
1 Kafka topic = N partitions
What about using the table name as the Kafka partition key?
Ordering is preserved within each partition (all events of the Area table, or of the Commands table, stay in order)
But the distribution is bad: partition load follows table size
38. 38
Is round-robin the solution?
Events from any table are spread evenly over partitions 1, 2, 3…
But what about event order?
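A rough sketch of the trade-off with the plain Kafka producer API — the topic name, variable names and the (table, primary key) compromise are illustrative, not necessarily what OVH ended up with:

import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyingExamples {
    public static ProducerRecord<String, byte[]> record(String tableName, String primaryKey, byte[] payload) {
        // Key by table name: per-table ordering, but big tables hot-spot one partition
        //   new ProducerRecord<>("cdc-events", tableName, payload);
        // No key: round-robin, even load, but no ordering guarantee at all
        //   new ProducerRecord<>("cdc-events", null, payload);
        // Key by (table, primary key): events of a given row stay ordered,
        // while the load spreads across partitions
        return new ProducerRecord<>("cdc-events", tableName + ":" + primaryKey, payload);
    }
}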
43. 43
How do we convert our events to pivot format?
Job spawns: the Flink job retrieves its DB's pivot schema from HDFS
Event read: each event corresponds to exactly one table
Table schema extraction: the current event's table pivot schema is extracted from the database schema
Event conversion: the event is converted from its source (RAW) format to the pivot format
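As a sketch only — RawEvent, PivotEvent, PivotSchema, PivotTableSchema and the HDFS path are hypothetical stand-ins, since the deck does not show the actual classes — the conversion step maps naturally onto a Flink RichMapFunction:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class RawToPivot extends RichMapFunction<RawEvent, PivotEvent> {
    private transient PivotSchema dbSchema;

    @Override
    public void open(Configuration parameters) {
        // Job spawns: fetch the database's pivot schema once, e.g. from HDFS
        dbSchema = PivotSchema.loadFromHdfs("hdfs:///schemas/mydb.pivot.json");
    }

    @Override
    public PivotEvent map(RawEvent event) {
        // Each event belongs to exactly one table: pick that table's pivot schema
        PivotTableSchema tableSchema = dbSchema.forTable(event.getTableName());
        // Convert the RAW payload to the pivot representation
        return PivotEvent.convert(event, tableSchema);
    }
}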
55. 55
What’s a better solution? ORC + Hive Metadata!
SELECT * FROM mytab;
mytab:
id | value
1  | test
56. 56
What’s a better solution? ORC + Hive Metadata!
SELECT row__id, * FROM mytab;
row__id                                     | id | value
{"transactionid":10,"bucketid":1,"rowid":0} | 1  | test
57. 57
What’s a better solution? ORC Delta File + Hive Metadata!
INSERT INTO `mytab` VALUES(1,'test');
transaction 1 created
delta_0000001_0000001_0000/bucket_00001
{"operation":0,"originalTransaction":1,"bucket":1,"rowId":0,"currentTransaction":1,"row":{"_col0":1,"_col1":"test"}}
DELETE FROM `mytab` WHERE id=1;
transaction 2 created
delta_0000002_0000002_0000/bucket_00001
{"operation":2,"originalTransaction":1,"bucket":1,"rowId":0,"currentTransaction":2,"row":null}
UPDATE = DELETE + INSERT
58. 58
What’s a better solution? ORC Delta File + Hive Metadata!
{"operation":2,"originalTransaction":1,"bucket":1,"rowId":0,"currentTransaction":2,"row":null}
Keep track of several pieces of metadata
HiveMeta
id
pkValue
operation
originalTxId
bucketId
timestamp
59. 59
Flink State
What’s a better solution? ORC Delta File + Hive Metadata!
{"operation":2,"originalTransaction":1,"bucket":1,"rowId":0,"currentTransaction":2,"row":null}
HiveMeta (kept in Flink state):
id:           test.mytab
pkValue:      1
operation:    2
originalTxId: 1
bucketId:     1
timestamp:    1569362895...
Keep track of several pieces of metadata
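A minimal sketch of how such per-row metadata can live in Flink keyed state — HiveMeta stands for a POJO with the fields listed above, and the job wiring (keying by table + primary key) is omitted:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class TrackHiveMeta extends RichFlatMapFunction<HiveMeta, HiveMeta> {
    private transient ValueState<HiveMeta> lastMeta;

    @Override
    public void open(Configuration parameters) {
        lastMeta = getRuntimeContext().getState(
                new ValueStateDescriptor<>("hive-meta", HiveMeta.class));
    }

    @Override
    public void flatMap(HiveMeta incoming, Collector<HiveMeta> out) throws Exception {
        // Remember the row's latest transaction / bucket / rowId so that later
        // UPDATE and DELETE events can target the right ORC delta record
        lastMeta.update(incoming);
        out.collect(incoming);
    }
}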
60. 60
How do we store our Flink state?
Local
Scalable
Incremental
RocksDB
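For reference, wiring the RocksDB backend with incremental checkpoints into a job looks roughly like this (checkpoint path and interval are placeholders):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Local to each TaskManager, scales beyond heap size, incremental checkpoints
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        env.enableCheckpointing(60_000); // checkpoint every minute
    }
}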
61. 61
Our Flink usage: a summary
Checkpoints
Side outputs
Windowing
Watermarks
RocksDB state
Metrics
62. 62
Our Flink usage: some numbers…
3+ Billion
Rows
2500+
Synced tables
Up to 300 million
Query events per dump
200+
Flink containers on Yarn
100+
Flink jobs
10+ million
Streaming events per day
63. 63
What’s next?
Hive 3?
Multiple other sinks
Automate all remaining manual processes
Make it Open Source?
Rule engine to anonymize data and perform more complex transformations
Real-time data merging
64. 64
Questions?