Tame the Small Files Problem and Optimize
Data Layout for Streaming Ingestion to Iceberg
Steven Wu, Gang Ye, Haizhou Zhao | Apple
THIS IS NOT A CONTRIBUTION
Apache Iceberg is an open table format for huge analytic data
• Time travel
• Advanced filtering
• Serializable isolation
Where does Iceberg fit in the ecosystem
Compute Engine
Table Format (Metadata)
Storage (Data): Cloud Blob Storage
Ingest data to Iceberg data lake in streaming fashion
Kafka (Msg Queue) → Flink Streaming Ingestion → Iceberg Data Lake
Zoom into the Flink Iceberg sink
Records → writer-1, writer-2, …, writer-n → Data Files (DFS)
writers → File Metadata → committer → Iceberg Data Lake
Case 1: event-time partitioned tables
hour=2022-08-03-00/
hour=2022-08-03-01/
…
Long tail problem with late arrival data
[Figure: percentage of data by hour (0, 1, 2, …, N) — a long-tail distribution]
https://en.wikipedia.org/wiki/Long_tail
A data file can’t contain rows across partitions
hour=2022-08-03-00/
|- file-000.parquet
|- file-001.parquet
|- …
hour=2022-08-03-01/
|- …
…
How many data files are generated every hour?
• Assuming the table is partitioned hourly and the event time range is capped at 10 days, records span 24x10 = 240 partitions
• Each of the 500 writers keeps 240 files open
• Commit 120K files (240x500) every checkpoint
• 720K files every hour (with 10 minute checkpoint interval)
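The arithmetic above can be checked with a quick sketch (numbers taken from this slide: hourly partitions over a 10-day event time range, 500 writers, 10-minute checkpoints; the class and method names are illustrative):

```java
// Back-of-the-envelope file count estimate for the unshuffled sink.
public class FileCountEstimate {
    static long partitions(int hoursPerDay, int days) {
        return (long) hoursPerDay * days;          // 24 x 10 = 240 open partitions
    }

    static long filesPerCheckpoint(long partitions, int writers) {
        return partitions * writers;               // each writer opens one file per partition
    }

    static long filesPerHour(long filesPerCheckpoint, int checkpointMinutes) {
        return filesPerCheckpoint * (60 / checkpointMinutes);
    }

    public static void main(String[] args) {
        long p = partitions(24, 10);               // 240
        long perCkpt = filesPerCheckpoint(p, 500); // 120,000
        long perHour = filesPerHour(perCkpt, 10);  // 720,000
        System.out.println(p + " " + perCkpt + " " + perHour);
    }
}
```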
Long-tail hours lead to small files
Percentile File Size
P50 55 KB
P75 77 KB
P90 13 MB
P99 18 MB
What are the implications of too many small files
• Poor read performance
• Request throttling
• Memory pressure
• Longer checkpoint duration and pipeline pause
• Stress on the metadata system
Why not keyBy shuffle
operator-1, operator-2, …, operator-n → keyBy(hour) → writer-1, writer-2, …, writer-n → committer → Iceberg
There are two problems
• Traffic is not evenly distributed across event hours
• keyBy on a low-cardinality column won't be balanced [1]
[1] https://github.com/apache/iceberg/pull/4228
Need smarter shuffling
Case 2: data clustering for non-partition columns
CREATE TABLE db.tbl (
ts timestamp,
data string,
event_type string)
USING iceberg
PARTITIONED BY (hours(ts))
Queries often filter on event_type
SELECT count(1) FROM db.tbl WHERE
ts >= '2022-01-01 08:00:00' AND
ts < '2022-01-01 09:00:00' AND
event_type = 'C'
Iceberg supports file pruning leveraging min-max stats
at column level
|- file-000.parquet (event_type: A-B)
|- file-001.parquet (event_type: C-C)
|- file-002.parquet (event_type: D-F)
…
event_type = 'C'
Wide value range would make pruning ineffective
Wide value range
|- file-000.parquet (event_type: A-Z)
|- file-001.parquet (event_type: A-Z)
|- file-002.parquet (event_type: A-Z)
…
event_type = 'C'
Making event_type a partition column can lead to an explosion in the number of partitions
• Before: 8.8K partitions (365 days x 24 hours) [1]
• After: 4.4M partitions (365 days x 24 hours x 500 event_types) [2]
• Can stress the metadata system and lead to small files
[1] Assuming 12 months retention
[2] Assuming 500 event types
Batch engines solve the clustering problem via shuffle
1. Compute a data sketch of the value distribution (e.g. event_type A: 2%, B: 7%, C: 22%, …, Z: 0.5%)
2. Shuffle between stages to cluster data by value
3. Sort data before writing to files, so each file covers a tight min-max value range (A-B, C-C, X-Z)
Shuffle for better data clustering
Why not compact small files or sort files via background batch maintenance jobs
• Remediation is usually more expensive than prevention
• Doesn't solve the throttling problem in the streaming path
Agenda
• Motivation
• Design
• Evaluation
Introduce a smart shuffling operator in the Flink Iceberg sink
shuffle-1, shuffle-2, …, shuffle-n (smart shuffling) → writer-1, writer-2, …, writer-n → committer → Iceberg
Step 1: calculate traffic distribution
shuffle-1, shuffle-2, …, shuffle-10 compute the weight per hour:
Hour Weight
0 33%
1 14%
2 5%
… …
240 0.001%
Step 2a: shuffle data based on traffic distribution
Hour Weight
0 33%
1 14%
2 5%
… …
240 0.001%
Hour Assigned tasks
0 1, 2, 3, 4
1 4, 5
2 6
… …
238 10
239 10
240 10
Step 2b: range shuffle data for non-partition column
Event type Weight
A 2%
B 7%
C 28%
… …
Z 0.5%
Event type Assigned task
A-B 1
C-C 2, 3, 4
… …
P-Z 10
Range shuffling improves data clustering
Unsorted input records (A B A / C C C / Z Y X / …) → shuffle-1 … shuffle-n → writer-1 … writer-n → data files with tight value ranges
Sorting within a file brings additional benefits of row group and page level skipping
[Figure: a sorted Parquet file with three row groups — row group 1 holds only X values, row group 2 holds the X/Y/Z boundary, row group 3 holds only Z values; the query below only needs to read row group 2]
SELECT * FROM db.tbl WHERE
ts >= … AND ts < … AND
event_type = 'Y'
What if sorting is needed
• Sorting in streaming is possible but expensive
• Use batch sorting jobs
How to calculate traffic distribution
FLIP-27 source interface introduced an operator coordinator component
The Source Coordinator runs on the JobManager; Source Reader-1 … Source Reader-k run on TaskManager-1 … TaskManager-n
Shuffle tasks calculate local stats and send them to the coordinator
Each shuffle task (shuffle-1 … shuffle-n, running next to writer-1 … writer-n) keeps a local count per hour, e.g.:
Hour Count
0 33
1 14
2 5
… …
240 0
The local tables are sent to the shuffle coordinator on the JobManager
Shuffle coordinator does global aggregation
The per-task count tables are summed into one global weight table:
Hour Weight
0 33%
1 14%
2 5%
… …
240 0.001%
Global aggregation addresses the potential problem of different local views
Shuffle coordinator broadcasts the globally aggregated stats to tasks
Every shuffle task receives the same weight table:
Hour Weight
0 33%
1 14%
2 5%
… …
240 0.001%
All shuffle tasks make the same decision based on the same stats
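The merge-and-normalize step the coordinator performs can be sketched as follows (class and method names are illustrative, not the actual sink implementation):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the coordinator's global aggregation: sum per-task counts
// into one weight table that is broadcast back to every shuffle task.
public class StatsAggregator {
    // Merge local counts (key = partition hour) from all shuffle tasks.
    static Map<Integer, Long> merge(List<Map<Integer, Long>> localStats) {
        Map<Integer, Long> global = new HashMap<>();
        for (Map<Integer, Long> local : localStats) {
            local.forEach((hour, count) -> global.merge(hour, count, Long::sum));
        }
        return global;
    }

    // Normalize counts into weights that sum to 1.0.
    static Map<Integer, Double> weights(Map<Integer, Long> counts) {
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        Map<Integer, Double> weights = new HashMap<>();
        counts.forEach((hour, count) -> weights.put(hour, (double) count / total));
        return weights;
    }
}
```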
How to shuffle data
Add a custom partitioner after the shuffle operator
dataStream
.transform("shuffleOperator", shuffleOperatorOutputType, operatorFactory)
.partitionCustom(binPackingPartitioner, keySelector)
public class BinPackingPartitioner<K> implements Partitioner<K> {
  @Override
  public int partition(K key, int numPartitions);
}
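For illustration, a self-contained sketch of what the partition lookup could do, assuming the assignment table (hour → candidate tasks) has already been derived from the broadcast stats; the names are hypothetical and Flink's `Partitioner` interface is omitted so the sketch stands alone:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a bin-packing router: small keys share a task, large keys
// are spread across several candidate tasks.
public class BinPackingRouter {
    private final Map<Integer, List<Integer>> assignedTasks; // hour -> candidate tasks

    BinPackingRouter(Map<Integer, List<Integer>> assignedTasks) {
        this.assignedTasks = assignedTasks;
    }

    int partition(int hour, int numPartitions) {
        List<Integer> tasks = assignedTasks.get(hour);
        if (tasks == null) {
            // unseen key: fall back to a hash-style assignment
            return Math.floorMod(Integer.hashCode(hour), numPartitions);
        }
        // a key split across tasks picks one candidate at random
        return tasks.get(ThreadLocalRandom.current().nextInt(tasks.size()));
    }
}
```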
There are two shuffling strategies
• Bin packing
• Range distribution
Bin packing can combine multiple small keys into a single task or split a single large key across multiple tasks
Task Assigned keys
T0 K0, K2, K4, K6, K8
T1 K7
T2 K3
T3 K3
T4 K3
T5 K3
… …
T9 K1, K5
• Only focuses on balanced weight distribution
• Ignores ordering when assigning keys
• Works well with shuffling by partition columns
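One way to sketch such an assignment is the classic greedy heuristic: sort keys by weight descending and always place the next key on the currently lightest task. This is illustrative only (the talk notes later that the real greedy implementation still shows skew), and splitting one large key across several tasks is left out to keep the sketch short:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Greedy bin packing sketch: heaviest keys first, each onto the
// currently least-loaded task.
public class GreedyBinPacking {
    static Map<String, Integer> assign(Map<String, Double> keyWeights, int numTasks) {
        // min-heap of (taskLoad, taskId)
        PriorityQueue<double[]> tasks = new PriorityQueue<>(Comparator.comparingDouble(t -> t[0]));
        for (int i = 0; i < numTasks; i++) tasks.add(new double[] {0.0, i});

        Map<String, Integer> assignment = new HashMap<>();
        keyWeights.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .forEach(e -> {
                double[] lightest = tasks.poll();
                assignment.put(e.getKey(), (int) lightest[1]);
                lightest[0] += e.getValue();
                tasks.add(lightest);
            });
        return assignment;
    }
}
```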
Range shuffling splits sort values into ranges and assigns them to tasks
• Balances weight distribution with continuous ranges
• Works well with shuffling by non-partition columns
[Table: continuous value ranges A, B, C, …, D assigned to tasks T1-T4]
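The range cuts can be sketched from cumulative weights: walk the keys in sort order and move to the next task every time the accumulated weight fills one task's share. Names are illustrative, and splitting one hot key across several tasks (as the earlier slide shows for C) is omitted:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Range assignment sketch: keys stay in sort order, so each task owns
// a continuous value range of roughly equal total weight.
public class RangeAssignment {
    static Map<String, Integer> assign(Map<String, Double> sortedKeyWeights, int numTasks) {
        double share = 1.0 / numTasks; // target weight per task
        Map<String, Integer> assignment = new LinkedHashMap<>();
        double acc = 0.0;
        for (Map.Entry<String, Double> e : sortedKeyWeights.entrySet()) {
            int task = Math.min((int) (acc / share), numTasks - 1);
            assignment.put(e.getKey(), task);
            acc += e.getValue();
        }
        return assignment;
    }
}
```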
Optimizing for balanced byte rate distribution can lead to file count skew where a task handles many long-tail hours
[Figure: long-tail distribution of data over hours 0, 1, 2, …, N]
https://en.wikipedia.org/wiki/Long_tail
Many long-tail hours can be assigned to a single task, which can become a bottleneck
There are two solutions
• Parallelize file flushing and upload
• Limit the file count skew via close-file-cost (like open-file-cost)
Tune close-file-cost to balance between file count skew and byte rate skew
[Figure: as close-file-cost increases, file count skew decreases while byte rate skew increases]
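The idea can be sketched as adding a fixed per-key cost on top of the byte-rate weight before bin packing, so that many near-zero long-tail keys cannot all pile onto one task; the class name, method, and cost value below are illustrative, not the actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Close-file-cost sketch: every key assigned to a task eventually closes
// at least one file, so each key carries a fixed cost in addition to its
// byte-rate weight. A larger close-file-cost caps file count skew but
// tolerates more byte rate skew, and vice versa.
public class CloseFileCost {
    static Map<String, Double> adjustedWeights(Map<String, Double> byteWeights, double closeFileCost) {
        Map<String, Double> adjusted = new HashMap<>();
        byteWeights.forEach((key, w) -> adjusted.put(key, w + closeFileCost));
        return adjusted;
    }
}
```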
Agenda
• Motivation
• Design
• Evaluation
A: Simple Iceberg ingestion job without shuffling
source-1 … source-n chained with writer-1 … writer-n → committer
• Job parallelism is 60
• Checkpoint interval is 10 min
B: Iceberg ingestion with smart shuffling
source-1 … source-n (chained) → shuffle-1 … shuffle-n (shuffle) → writer-1 … writer-n → committer
• Job parallelism is 60
• Checkpoint interval is 10 min
Test setup
• Sink Iceberg table is partitioned hourly by event time
• Benchmark traffic volume is 250 MB/sec
• Event time range is 192 hours
What are we comparing
• Number of files written in one cycle
• File size distribution
• Checkpoint duration
• CPU utilization
• Shuffling skew
Shuffle reduced the number of files by 20x
• Job parallelism is 60
• Event time range is 192 hours
• Without shuffling, one cycle flushed 10K files
• With shuffling, one cycle flushed 500 files (~2.5x the minimal number of files)
Shuffling greatly improved file size distribution
Percentile  Without shuffling  With shuffling  Improvement
P50         55 KB              913 KB          17x
P75         77 KB              7 MB            90x
P95         13 MB              301 MB          23x
P99         18 MB              306 MB          17x
Shuffling tamed the small files problem
Shuffling tamed the small files problem
During checkpoint, writer tasks flush and upload data files
writer-1, writer-2, …, writer-n → Data Files (DFS); committer
Reduced checkpoint duration by 8x
• Without shuffling, checkpoint takes 64s on average
• With shuffling, checkpoint takes 8s on average
Record handover between chained operators is a simple method call
1. Kafka Source (source-1 … source-n) chained with 2. Iceberg Sink (writer-1 … writer-n → committer)
Shuffling involves significant CPU overhead on serdes and network I/O
1. Kafka Source (source-1 … source-n, chained) → 2. Shuffle (shuffle-1 … shuffle-n) → 3. Iceberg Sink (writer-1 … writer-n → committer)
Shuffling increased CPU usage by 62%
• Without shuffling, avg CPU util is 35%
• With shuffling, avg CPU util is 57%
It is all about tradeoffs!
Without shuffling, checkpoint pause is longer and catch-up spike is bigger
[Figure: throughput over time — without shuffling, the trough caused by the pause is deeper and the catch-up spike is bigger than with shuffling]
Bin packing shuffling won't be perfect in weight distribution
One shuffle task may process data for many small partitions (a, b, c) while another processes a few large ones (y, z)
                           Min writer record rate  Max writer record rate  Skewness (max-min)/min
No shuffling               4.36 K                  4.44 K                  1.8%
Bin packing (greedy algo)  4.02 K                  6.39 K                  59%
Our greedy algo implementation of bin packing introduces higher skew than we hoped for
Future work
• Implement other algorithms
  • Better bin packing with less skew
  • Range partitioner
• Support sketch statistics for high-cardinality keys
• Contribute it to OSS
References
• Design doc: https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/
Q&A
What about new hours as time moves forward?
Absolute hour Weight
2022-08-03-00 0.4
… …
2022-08-03-12 22
2022-08-03-13 27
2022-08-03-14 38
2022-08-03-15 ??
A weight table based on relative hour would be stable:
Relative hour Weight
0 38
1 27
2 22
… …
14 0.4
… …
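Keying the stats by relative hour (age of the event hour) instead of absolute hour could look like this minimal sketch (names are illustrative):

```java
// Relative-hour sketch: stats keyed by "how many hours ago" stay stable
// as wall-clock time advances, while absolute hours keep introducing
// brand-new keys with unknown weights.
public class RelativeHour {
    static int relativeHour(long eventEpochHour, long currentEpochHour) {
        // 0 = current hour, 1 = previous hour, ...; clamp future timestamps to 0
        return (int) Math.max(0, currentEpochHour - eventEpochHour);
    }
}
```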
What about the cold start problem?
• First-time run
• Restart with empty state
• New subtasks from scale-up
Coping with cold start problems
• No shuffle while learning
• Buffer records until the first stats are learned
• New subtasks (scale-up) request stats from the coordinator

Large Scale Real Time Fraudulent Web Behavior DetectionFlink Forward
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
 

Mais de Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scale
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache Flink
 
Large Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior DetectionLarge Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior Detection
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Último (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

  • 15. Case 2: data clustering for non-partition columns

CREATE TABLE db.tbl (
    ts timestamp,
    data string,
    event_type string)
USING iceberg
PARTITIONED BY (hours(ts))
  • 16. Queries often filter on event_type

SELECT count(1) FROM db.tbl
WHERE ts >= '2022-01-01 08:00:00'
  AND ts < '2022-01-01 09:00:00'
  AND event_type = 'C'
  • 17. Iceberg supports file pruning leveraging min-max stats at the column level

event_type = 'C'

|- file-000.parquet (event_type: A-B)
|- file-001.parquet (event_type: C-C)
|- file-002.parquet (event_type: D-F)
…
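The min-max check behind this pruning can be sketched as follows. The `DataFile` record and `prune` helper are hypothetical names for illustration, not Iceberg's actual API; Iceberg reads per-column lower/upper bounds from its manifest files.

```java
import java.util.*;

public class MinMaxPruning {
    // Hypothetical per-file column stats; Iceberg tracks such bounds in manifests.
    record DataFile(String path, String min, String max) {}

    // Keep only files whose [min, max] range can contain the literal.
    static List<DataFile> prune(List<DataFile> files, String literal) {
        List<DataFile> kept = new ArrayList<>();
        for (DataFile f : files) {
            if (f.min().compareTo(literal) <= 0 && f.max().compareTo(literal) >= 0) {
                kept.add(f);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<DataFile> files = List.of(
            new DataFile("file-000.parquet", "A", "B"),
            new DataFile("file-001.parquet", "C", "C"),
            new DataFile("file-002.parquet", "D", "F"));
        // For event_type = 'C', only file-001.parquet survives the check.
        System.out.println(prune(files, "C"));
    }
}
```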
  • 18. A wide value range makes pruning ineffective

event_type = 'C'

|- file-000.parquet (event_type: A-Z)
|- file-001.parquet (event_type: A-Z)
|- file-002.parquet (event_type: A-Z)
…

Every file could contain matching rows, so none can be pruned.
  • 19. Making event_type a partition column can lead to an explosion in the number of partitions
• Before: 8.8K partitions (365 days x 24 hours) [1]
• After: 4.4M partitions (365 days x 24 hours x 500 event_types) [2]
• Can stress the metadata system and lead to small files
[1] Assuming 12 months retention
[2] Assuming 500 event types
  • 20. Batch engines solve the clustering problem via shuffle
1. Compute a data sketch of the traffic distribution (Event Type → Weight: A 2%, B 7%, C 22%, …, Z 0.5%)
2. Shuffle to cluster data
3. Sort data before writing to files, yielding tight min-max value ranges (e.g., A-B, C-C, X-Z)
  • 21. Shuffle for better data clustering
  • 22. Why not compact small files or sort files via background batch maintenance jobs?
• Remediation is usually more expensive than prevention
• Doesn't solve the throttling problem in the streaming path
  • 24. Introduce a smart shuffling operator in the Flink Iceberg sink
(Shuffle tasks shuffle-1 … shuffle-n sit between the record stream and the writers, in front of the committer.)
  • 25. Step 1: calculate traffic distribution

Hour | Weight
0    | 33%
1    | 14%
2    | 5%
…    | …
240  | 0.001%
  • 26. Step 2a: shuffle data based on traffic distribution

Hour | Weight  | Assigned tasks
0    | 33%     | 1, 2, 3, 4
1    | 14%     | 4, 5
2    | 5%      | 6
…    | …       | …
238  |         | 10
239  |         | 10
240  | 0.001%  | 10
  • 27. Step 2b: range shuffle data for a non-partition column

Event type | Weight
A          | 2%
B          | 7%
C          | 28%
…          | …
Z          | 0.5%

Event type range | Assigned tasks
A-B              | 1
C-C              | 2, 3, 4
…                | …
P-Z              | 10
  • 28. Range shuffling improves data clustering
Unsorted input (A, B, A, C, C, C, Z, Y, X, …) is routed through the shuffle tasks so that each writer's data files cover a tight value range.
  • 29. Sorting within a file brings the additional benefit of row-group and page-level skipping

SELECT * FROM db.tbl
WHERE ts >= … AND ts < …
  AND event_type = 'Y'

(In a Parquet file sorted on event_type, e.g. row group 1 holding only X values, row group 2 holding X-Z, and row group 3 holding only Z values, the reader can skip every row group whose min-max range excludes 'Y'.)
  • 30. What if sorting is needed?
• Sorting in streaming is possible but expensive
• Use batch sorting jobs
  • 31. How to calculate traffic distribution
  • 32. The FLIP-27 source interface introduced an operator coordinator component
(The Source Coordinator runs in the JobManager; Source Readers run in the TaskManagers.)
  • 33. Shuffle tasks calculate local stats and send them to the coordinator
Each shuffle task maintains a local Hour → Count histogram (e.g., 0 → 33, 1 → 14, 2 → 5, …, 240 → 0) and sends it to the shuffle coordinator running in the JobManager.
  • 34. The shuffle coordinator does global aggregation
The coordinator merges the local histograms into a global Hour → Weight table (0 → 33%, 1 → 14%, 2 → 5%, …, 240 → 0.001%). Global aggregation addresses the potential problem of different local views.
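The aggregation step can be sketched as a plain merge of the local histograms followed by normalization. The method names below are hypothetical illustrations, not the actual implementation.

```java
import java.util.*;

public class StatsAggregation {
    // Merge per-task key counts into one global histogram, as the shuffle
    // coordinator conceptually does with the local stats it receives.
    static Map<Integer, Long> merge(List<Map<Integer, Long>> localStats) {
        Map<Integer, Long> global = new HashMap<>();
        for (Map<Integer, Long> local : localStats) {
            local.forEach((key, count) -> global.merge(key, count, Long::sum));
        }
        return global;
    }

    // Normalize counts into weights that sum to 1.0.
    static Map<Integer, Double> weights(Map<Integer, Long> global) {
        long total = global.values().stream().mapToLong(Long::longValue).sum();
        Map<Integer, Double> w = new HashMap<>();
        global.forEach((key, count) -> w.put(key, (double) count / total));
        return w;
    }
}
```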
  • 35. The shuffle coordinator broadcasts the globally aggregated stats to tasks
All shuffle tasks make the same decision based on the same stats.
  • 37. Add a custom partitioner after the shuffle operator

dataStream
    .transform("shuffleOperator", shuffleOperatorOutputType, operatorFactory)
    .partitionCustom(binPackingPartitioner, keySelector);

public class BinPackingPartitioner<K> implements Partitioner<K> {
    @Override
    public int partition(K key, int numPartitions) {
        // map the key to a writer subtask based on the aggregated stats
        // (implementation elided)
    }
}
  • 38. There are two shuffling strategies
• Bin packing
• Range distribution
  • 39. Bin packing can combine multiple small keys into a single task or split a single large key across multiple tasks

Task | Assigned keys
T0   | K0, K2, K4, K6, K8
T1   | K7
T2   | K3
T3   | K3
T4   | K3
T5   | K3
…    | …
T9   | K1, K5

• Only focuses on balanced weight distribution
• Ignores ordering when assigning keys
• Works well with shuffling by partition columns
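A greedy bin-packing assignment along these lines might look like the sketch below, which fills each task up to a fixed per-task weight target. This is an illustrative assumption, not the actual algorithm from the talk.

```java
import java.util.*;

public class GreedyBinPacking {
    // Greedy bin-packing sketch (hypothetical): fill tasks up to targetWeight,
    // combining small keys into one task and splitting a key across several
    // tasks when its weight exceeds the remaining space.
    static Map<String, List<Integer>> assign(LinkedHashMap<String, Double> keyWeights, int numTasks) {
        double target = keyWeights.values().stream().mapToDouble(Double::doubleValue).sum() / numTasks;
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        int task = 0;
        double filled = 0.0;
        for (Map.Entry<String, Double> e : keyWeights.entrySet()) {
            List<Integer> tasks = new ArrayList<>();
            double remaining = e.getValue();
            while (remaining > 0) {
                tasks.add(task);
                double space = target - filled;
                if (remaining < space) { // key fits in the current task
                    filled += remaining;
                    remaining = 0;
                } else {                 // spill the rest into the next task
                    remaining -= space;
                    filled = 0.0;
                    task = Math.min(task + 1, numTasks - 1);
                }
            }
            assignment.put(e.getKey(), tasks);
        }
        return assignment;
    }
}
```

With key weights {K0: 0.5, K1: 0.25, K2: 0.25} and two tasks, K0 fills task 0 while K1 and K2 share task 1.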
  • 40. Range shuffling splits the sort values into ranges and assigns them to tasks

Value | Assigned task
A     | T1
B     | T2
C     | T3
…     | …
D     | T4

• Balances weight distribution with contiguous ranges
• Works well with shuffling by non-partition columns
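Range assignment can be sketched by walking the keys in sort order and cutting a boundary each time the cumulative weight crosses the per-task target. The helper below is a hypothetical illustration, not the actual implementation.

```java
import java.util.*;

public class RangeAssignment {
    // Range distribution sketch (hypothetical): each task receives a contiguous
    // range of sorted values with roughly equal total weight.
    static Map<String, Integer> assign(LinkedHashMap<String, Double> sortedKeyWeights, int numTasks) {
        double total = sortedKeyWeights.values().stream().mapToDouble(Double::doubleValue).sum();
        double target = total / numTasks;
        Map<String, Integer> assignment = new LinkedHashMap<>();
        int task = 0;
        double filled = 0.0;
        for (Map.Entry<String, Double> e : sortedKeyWeights.entrySet()) {
            assignment.put(e.getKey(), task);
            filled += e.getValue();
            if (filled >= target && task < numTasks - 1) {
                task++;        // cut a range boundary here
                filled = 0.0;
            }
        }
        return assignment;
    }
}
```

Unlike bin packing, a key's neighbors in sort order land on the same or adjacent tasks, which is what produces the tight min-max ranges per file.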
  • 41. Optimizing for balanced byte-rate distribution can lead to file count skew, where a task handles many long-tail hours
Many long-tail hours can be assigned to a single task, which can become a bottleneck.
https://en.wikipedia.org/wiki/Long_tail
  • 42. There are two solutions
• Parallelize file flushing and upload
• Limit the file count skew via close-file-cost (like open-file-cost)
  • 43. Tune close-file-cost to balance between file count skew and byte rate skew
(As close-file-cost increases, file count skew goes down while byte rate skew goes up.)
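One simple way to model this tradeoff is an assumption for illustration, not the exact formula from the talk: add a fixed close-file-cost to every key's byte weight, since each key a task handles implies at least one data file to close. A higher cost makes long-tail keys more expensive to stack on one task.

```java
import java.util.*;

public class CloseFileCost {
    // Hypothetical weighting: effective weight = byte weight + fixed close-file-cost.
    // Raising closeFileCost spreads long-tail keys across tasks (less file count
    // skew) at the price of a less balanced byte rate (more byte rate skew).
    static Map<String, Double> effectiveWeights(Map<String, Double> byteWeights, double closeFileCost) {
        Map<String, Double> w = new LinkedHashMap<>();
        byteWeights.forEach((key, bytes) -> w.put(key, bytes + closeFileCost));
        return w;
    }
}
```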
  • 45. A: Simple Iceberg ingestion job without shuffling
(Sources are chained to writers: source-1 → writer-1, …, source-n → writer-n, followed by the committer.)
• Job parallelism is 60
• Checkpoint interval is 10 min
  • 46. B: Iceberg ingestion with smart shuffling
(A shuffle stage, shuffle-1 … shuffle-n, is inserted between the sources and the writers.)
• Job parallelism is 60
• Checkpoint interval is 10 min
  • 47. Test setup
• Sink Iceberg table is partitioned hourly by event time
• Benchmark traffic volume is 250 MB/sec
• Event time range is 192 hours
  • 48. What are we comparing
• Number of files written in one cycle
• File size distribution
• Checkpoint duration
• CPU utilization
• Shuffling skew
  • 49. Shuffle reduced the number of files by 20x
• Without shuffling, one cycle flushed 10K files
• With shuffling, one cycle flushed 500 files (~2.5x the minimal number of files)
(Job parallelism is 60; event time range is 192 hours.)
  • 50. Shuffling greatly improved file size distribution

Percentile | Without shuffling | With shuffling | Improvement
P50        | 55 KB             | 913 KB         | 17x
P75        | 77 KB             | 7 MB           | 90x
P95        | 13 MB             | 301 MB         | 23x
P99        | 18 MB             | 306 MB         | 17x
  • 51. Shuffling tamed the small files problem
  • 52. During checkpoint, writer tasks flush and upload data files to the DFS, then hand the file metadata to the committer
  • 53. Reduced checkpoint duration by 8x
• Without shuffling, checkpoint takes 64s on average
• With shuffling, checkpoint takes 8s on average
  • 54. Record handover between chained operators is a simple method call
(1. Kafka Source → 2. Iceberg Sink; source-n is chained with writer-n, followed by the committer.)
  • 55. Shuffling involves significant CPU overhead on serdes and network I/O
(1. Kafka Source → 2. Shuffle → 3. Iceberg Sink)
  • 56. Shuffling increased CPU usage by 62%
• Without shuffling, avg CPU util is 35%
• With shuffling, avg CPU util is 57%
It is all about tradeoffs!
  • 57. Without shuffling, the checkpoint pause is longer and the catch-up spike is bigger
(The throughput graph without shuffling shows a deeper trough caused by the pause, followed by a bigger catch-up spike.)
  • 58. Bin packing shuffling won't be perfect in weight distribution
(e.g., one writer processes data for partitions a, b, c while another processes data for partitions y, z)
  • 59. Our greedy implementation of bin packing introduces higher skew than we hoped for

                          | Min of writer record rate | Max of writer record rate | Skewness (max-min)/min
No shuffling              | 4.36 K                    | 4.44 K                    | 1.8%
Bin packing (greedy algo) | 4.02 K                    | 6.39 K                    | 59%
  • 60. Future work
• Implement other algorithms: better bin packing with less skew; a range partitioner
• Support sketch statistics for high-cardinality keys
• Contribute it to OSS
  • 61. References • Design doc: https://docs.google.com/document/d/13N8cMqPi- ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/
  • 62. Q&A
  • 64. Weight table should be relatively stable
  • 65. What about a new hour as time moves forward?

Absolute hour | Weight
2022-08-03-00 | 0.4
…             | …
2022-08-03-12 | 22
2022-08-03-13 | 27
2022-08-03-14 | 38
2022-08-03-15 | ??
  • 66. A weight table based on relative hour would be stable

Relative hour | Weight
0             | 38
1             | 27
2             | 22
…             | …
14            | 0.4
…             | …
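Mapping an absolute event hour to a relative "hours ago" key could be sketched as follows, assuming processing time is the reference point (a hypothetical helper, not the actual implementation):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class RelativeHour {
    // Convert an absolute event hour into "hours ago" relative to processing
    // time, so a weight table keyed by relative hour stays stable as new
    // absolute hours appear.
    static long relativeHour(Instant eventTime, Instant now) {
        return ChronoUnit.HOURS.between(eventTime.truncatedTo(ChronoUnit.HOURS),
                                        now.truncatedTo(ChronoUnit.HOURS));
    }
}
```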
  • 67. What about the cold start problem?
• First-time run
• Restart with empty state
• New subtasks from scale-up
  • 68. Coping with cold start problems
• No shuffle while learning
• Buffer records until the first stats are learned
• New subtasks (scale-up) request stats from the coordinator