Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) the small files problem, which can hurt read performance, and (2) poor data clustering, which can make file pruning less effective. To address these two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning. This reduces the number of concurrent files that each task writes, and it also improves data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
by Gang Ye & Steven Wu
Tame the small files problem and optimize data layout for streaming ingestion to Iceberg
1. Tame the Small Files Problem and Optimize Data Layout for Streaming Ingestion to Iceberg
Steven Wu, Gang Ye, Haizhou Zhao | Apple
THIS IS NOT A CONTRIBUTION
2. Apache Iceberg is an open table format for huge analytic datasets
• Time travel
• Advanced filtering
• Serializable isolation
3. Where does Iceberg fit in the ecosystem
(Diagram: three layers. A compute engine sits on top; the table format (metadata), where Iceberg lives, sits in the middle; and storage (data) on cloud blob storage sits at the bottom.)
4. Ingest data to Iceberg data lake in streaming fashion
(Diagram: Kafka message queue → Flink streaming ingestion → Iceberg data lake.)
5. Zoom into the Flink Iceberg sink
(Diagram: records flow into parallel writers, writer-1 … writer-n, which write data files to DFS; the writers send file metadata to a single committer, which commits it to the Iceberg data lake.)
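For concreteness, a minimal sketch of wiring a stream into this sink with the Iceberg Flink connector; the table location is hypothetical, and production jobs would configure the builder further:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class IcebergSinkExample {
  // Parallel writers flush data files; a single committer task commits
  // the collected file metadata once per checkpoint.
  static void appendToIceberg(DataStream<RowData> rows) {
    // The table location is hypothetical; point this at your own table.
    TableLoader tableLoader =
        TableLoader.fromHadoopTable("hdfs://namenode:8020/warehouse/db/tbl");
    FlinkSink.forRowData(rows)
        .tableLoader(tableLoader)
        .append();
  }
}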
6. Case 1: event-time partitioned tables
hour=2022-08-03-00/
hour=2022-08-03-01/
…
7. Long-tail problem with late-arriving data
(Chart: percentage of data by event hour; most data arrives for the most recent hours, with a long tail of late data stretching back to hour N. https://en.wikipedia.org/wiki/Long_tail)
8. A data file can’t contain rows across partitions
hour=2022-08-03-00/
|- file-000.parquet
|- file-001.parquet
|- …
hour=2022-08-03-01/
|- …
…
9. How many data files are generated every hour?
Assume the table is partitioned hourly, the event time range is capped at 10 days, and the job runs 500 writers with a 10-minute checkpoint interval.
• Each writer receives records for 24 x 10 = 240 partitions and therefore keeps 240 files open.
• Every checkpoint commits 120K files (240 x 500).
• At 6 checkpoints per hour, that adds up to 720K files every hour.
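The same arithmetic, spelled out with the slide's assumptions:

public class FileCountMath {
  public static void main(String[] args) {
    int partitionsPerWriter = 24 * 10;                   // hourly partitions over 10 days = 240
    int filesPerCheckpoint = partitionsPerWriter * 500;  // 500 writers -> 120,000 files
    int checkpointsPerHour = 60 / 10;                    // 10-minute checkpoint interval -> 6
    System.out.println(filesPerCheckpoint * checkpointsPerHour); // 720,000 files per hour
  }
}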
10. Long-tail hours lead to small files

Percentile | File size
P50 | 55 KB
P75 | 77 KB
P90 | 13 MB
P99 | 18 MB
11. What are the implications of too many small files
• Poor read performance
• Request throttling
• Memory pressure
• Longer checkpoint duration and pipeline pause
• Stress the metadata system
13. There are two problems
• Traffic is not evenly distributed across event hours
• keyBy on a low-cardinality column won't be balanced [1]
[1] https://github.com/apache/iceberg/pull/4228
15. Case 2: data clustering for non-partition columns
CREATE TABLE db.tbl (
ts timestamp,
data string,
event_type string)
USING iceberg
PARTITIONED BY (hours(ts))
16. Queries often filter on event_type
SELECT count(1) FROM db.tbl WHERE
ts >= '2022-01-01 08:00:00' AND
ts < '2022-01-01 09:00:00' AND
event_type = 'C'
18. Wide value range would make pruning ineffective
With a wide value range, the predicate event_type = 'C' cannot skip any file:
|- file-000.parquet (event_type: A-Z)
|- file-001.parquet (event_type: A-Z)
|- file-002.parquet (event_type: A-Z)
…
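The pruning decision behind this picture is simple min/max logic; an illustrative sketch, not Iceberg's actual code:

public class MinMaxPruning {
  // A data file can be skipped for the predicate event_type = value when
  // value falls outside the file's recorded [min, max] column range.
  // With wide ranges like A-Z, no file can ever be skipped; with tight
  // ranges, most files are pruned before any data is read.
  static boolean canSkipFile(String value, String min, String max) {
    return value.compareTo(min) < 0 || value.compareTo(max) > 0;
  }
}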
19. Making event_type a partition column can lead to an explosion in the number of partitions
• Before: 8.8K partitions (365 days x 24 hours) [1]
• After: 4.4M partitions (365 days x 24 hours x 500 event_types) [2]
• Can stress the metadata system and lead to small files
[1] Assuming 12 months retention
[2] Assuming 500 event types
20. Batch engines solve the clustering problem via shuffle
(Diagram: 1. Compute a data sketch of key weights, e.g. event type A = 2%, B = 7%, C = 22%, …, Z = 0.5%. 2. Shuffle to cluster data, so rows with the same event types land on the same tasks. 3. Sort data before writing to files, producing files with tight min-max value ranges such as A-B, C-C, and X-Z.)
22. Why not compact small files or sort files via
background batch maintenance jobs
• Remediation is usually more expensive than prevention
• Doesn’t solve the throttling problem in the streaming path
27. Step 2b: range shuffle data for non-partition column

Event type | Weight
A | 2%
B | 7%
C | 28%
… | …
Z | 0.5%

Event type | Assigned task
A-B | 1
C-C | 2, 3, 4
… | …
P-Z | 10

(Diagram: shuffle-1 … shuffle-10 route records to writer-1 … writer-n according to the range assignment.)
28. Range shuffling improves data clustering
(Diagram: unsorted records flow through shuffle-1 … shuffle-n into writer-1 … writer-n, so the data files each writer produces cover a tight value range instead of holding unsorted values like "Z X A / A C Y / C C B".)
29. Sorting within a file brings additional benefits of row group and page level skipping
(Diagram: a Parquet file sorted on event_type, with its X, Y, and Z rows split across row groups 1-3; the query below only needs to read the row groups whose value range contains 'Y'.)

SELECT * FROM db.tbl WHERE
ts >= … AND ts < … AND
event_type = 'Y'
30. What if sorting is needed
• Sorting in streaming is possible but expensive
• Use batch sorting jobs
37. Add a custom partitioner after the shuffle operator

dataStream
    .transform("shuffleOperator", shuffleOperatorOutputType, operatorFactory)
    .partitionCustom(binPackingPartitioner, keySelector);

public class BinPackingPartitioner<K> implements Partitioner<K> {
  @Override
  public int partition(K key, int numPartitions) { ... }
}
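A fuller sketch of what such a partitioner could look like, assuming the key-to-task assignment computed by the shuffle operator is made available to it; the assignment map and its update path are hypothetical:

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.Partitioner;

// Hypothetical sketch: routes each key to the writer subtask chosen by
// the bin packing assignment; unseen keys fall back to hash partitioning.
public class BinPackingPartitioner<K> implements Partitioner<K> {

  private final Map<K, Integer> assignment = new HashMap<>();

  // Called when fresh statistics produce a new assignment
  // (the delivery mechanism is not shown here).
  public void updateAssignment(Map<K, Integer> newAssignment) {
    assignment.clear();
    assignment.putAll(newAssignment);
  }

  @Override
  public int partition(K key, int numPartitions) {
    Integer task = assignment.get(key);
    if (task != null && task < numPartitions) {
      return task;
    }
    // Fallback for keys not present in the learned statistics.
    return Math.floorMod(key.hashCode(), numPartitions);
  }
}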
38. There are two shuffling strategies
• Bin packing
• Range distribution
39. Bin packing can combine multiple small keys to a single task or split a single large key to multiple tasks

Task | Assigned keys
T0 | K0, K2, K4, K6, K8
T1 | K7
T2 | K3
T3 | K3
T4 | K3
T5 | K3
… | …
T9 | K1, K5

• Only focuses on balanced weight distribution
• Ignores ordering when assigning keys
• Works well with shuffling by partition columns (see the sketch below)
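A minimal sketch of a greedy assignment under these rules; the names are hypothetical, and splitting one large key across several tasks is not shown:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Greedy bin packing: sort keys by descending weight, then repeatedly
// hand the next key to the currently lightest task. Key ordering is
// ignored; only balanced weight matters.
public class GreedyBinPacking {

  public static Map<String, Integer> assign(Map<String, Double> keyWeights, int numTasks) {
    // Min-heap of {taskId, accumulated weight}, ordered by weight.
    PriorityQueue<double[]> tasks =
        new PriorityQueue<>(Comparator.comparingDouble((double[] t) -> t[1]));
    for (int i = 0; i < numTasks; i++) {
      tasks.add(new double[] {i, 0.0});
    }

    List<Map.Entry<String, Double>> sorted = new ArrayList<>(keyWeights.entrySet());
    sorted.sort(Map.Entry.<String, Double>comparingByValue().reversed());

    Map<String, Integer> assignment = new HashMap<>();
    for (Map.Entry<String, Double> e : sorted) {
      double[] lightest = tasks.poll();  // take the least-loaded task
      assignment.put(e.getKey(), (int) lightest[0]);
      lightest[1] += e.getValue();
      tasks.add(lightest);
    }
    return assignment;
  }
}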
40. Range shuffling splits sort values into ranges and assigns them to tasks
• Balances weight distribution with contiguous ranges
• Works well with shuffling by non-partition columns (see the sketch below)

Value | Assigned task
A, B | T1
C | T2, T3, T4
D, … | …
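A minimal sketch of deriving such range assignments from a weight table; the names are hypothetical, and splitting a heavy value like C across several tasks is not shown:

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Range partitioning sketch: walk keys in sorted order, accumulate
// weight, and move to the next task once the current task holds about
// 1/numTasks of the total weight. Contiguous key ranges land on the
// same task, which keeps value ranges tight within each written file.
public class RangePartitionAssigner {

  private final NavigableMap<String, Integer> keyToTask = new TreeMap<>();

  public RangePartitionAssigner(TreeMap<String, Double> sortedKeyWeights, int numTasks) {
    double total = sortedKeyWeights.values().stream().mapToDouble(Double::doubleValue).sum();
    double perTask = total / numTasks;
    double filled = 0.0;
    int task = 0;
    for (Map.Entry<String, Double> e : sortedKeyWeights.entrySet()) {
      keyToTask.put(e.getKey(), task);
      filled += e.getValue();
      // Advance once this task has roughly its fair share of weight.
      while (filled >= perTask && task < numTasks - 1) {
        filled -= perTask;
        task++;
      }
    }
  }

  // Unseen values fall into the range of the nearest smaller known key.
  public int taskFor(String key) {
    Map.Entry<String, Integer> floor = keyToTask.floorEntry(key);
    return floor == null ? 0 : floor.getValue();
  }
}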
41. Optimizing for balanced byte rate distribution can lead to file count skew, where a task handles many long-tail hours
(Chart: long-tail distribution of data volume by hour. https://en.wikipedia.org/wiki/Long_tail)
Many long-tail hours can be assigned to a single task, which can become a bottleneck.
42. There are two solutions
• Parallelize file flushing and upload
• Limit the file count skew via a close-file-cost (like open-file-cost)
43. Tune close-file-cost to balance between file count skew and byte rate skew
(Chart: skewness vs. close-file-cost; as close-file-cost grows, file count skew falls while byte rate skew rises.)
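One plausible way to fold such a cost into the per-key weights, modeled on the open-file-cost idea; the formula and names are illustrative, not the exact implementation:

public class CloseFileCostWeight {
  // Hypothetical weighting: a key's weight is its byte rate plus a fixed
  // cost per data file it forces a task to close at checkpoint time.
  // A higher closeFileCost discourages piling many long-tail hours (many
  // small files) onto one task, at the price of more byte rate skew.
  static double weight(double byteRate, int fileCount, double closeFileCost) {
    return byteRate + closeFileCost * fileCount;
  }
}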
45. A: Simple Iceberg ingestion job without shuffling
(Diagram: source-1 … source-n chained directly to writer-1 … writer-n, followed by a committer.)
• Job parallelism is 60
• Checkpoint interval is 10 min
46. B: Iceberg ingestion with smart shuffling
(Diagram: source-1 … source-n chained to shuffle-1 … shuffle-n, which shuffle records to writer-1 … writer-n, followed by a committer.)
• Job parallelism is 60
• Checkpoint interval is 10 min
47. Test setup
• Sink Iceberg table is partitioned hourly by event time
• Benchmark traffic volume is 250 MB/sec
• Event time range is 192 hours
48. What are we comparing
• Number of files written in one cycle
• File size distribution
• Checkpoint duration
• CPU utilization
• Shuffling skew
49. Shuffling reduced the number of files by 20x
• Job parallelism is 60
• Event time range is 192 hours
Without shuffling, one cycle flushed 10K files. With shuffling, one cycle flushed 500 files, roughly 2.5x the minimal number of files.
52. During checkpoint, writer tasks flush and upload data files
(Diagram: writer-1 … writer-n flush data files to DFS and report to the committer.)
53. Reduced checkpoint duration by 8x
Without shuffling, checkpoints take 64s on average. With shuffling, they take 8s on average.
56. Shuffling increased CPU usage by 62%
It is all about tradeoffs! Without shuffling, average CPU utilization is 35%; with shuffling, it is 57%.
57. Without shuffling, the checkpoint pause is longer and the catch-up spike is bigger
(Chart: throughput over time with and without shuffling; without shuffling there is a deeper trough caused by the pause and a bigger catch-up spike.)
58. Bin packing shuffling won't be perfect in weight distribution
(Diagram: source-1 … source-n chained to shuffle-1 … shuffle-n, then writer-1 … writer-n and a committer; one writer processes data for partitions a, b, c while another processes data for partitions y, z.)
59. Our greedy algo implementation of bin packing introduces higher skew than we hoped for

 | Min of writer record rate | Max of writer record rate | Skewness (max-min)/min
No shuffling | 4.36 K | 4.44 K | 1.8%
Bin packing (greedy algo) | 4.02 K | 6.39 K | 59%
60. Future work
• Implement other algorithms
  • Better bin packing with less skew
  • Range partitioner
• Support sketch statistics for high-cardinality keys
• Contribute it to OSS
65. What about new hours as time moves forward?

Absolute hour | Weight
2022-08-03-00 | 0.4
… | …
2022-08-03-12 | 22
2022-08-03-13 | 27
2022-08-03-14 | 38
2022-08-03-15 | ??
66. A weight table based on relative hour would be stable

Relative hour | Weight
0 | 38
1 | 27
2 | 22
… | …
14 | 0.4
… | …
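A minimal sketch of keying statistics by relative hour; the helper is hypothetical, and the operator's actual bookkeeping may differ:

import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class RelativeHour {
  // Keying statistics by "hours ago" keeps the weight table stable as
  // wall-clock time advances: the current hour is always bucket 0, the
  // previous hour bucket 1, and so on.
  static long relativeHour(Instant eventTime, Instant now) {
    Instant eventHour = eventTime.truncatedTo(ChronoUnit.HOURS);
    Instant currentHour = now.truncatedTo(ChronoUnit.HOURS);
    return Duration.between(eventHour, currentHour).toHours();
  }
}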
67. What about the cold start problem?
• First-time run
• Restart with empty state
• New subtasks from scale-up
68. Coping with cold start problems
• No shuffle while learning
• Buffer records until the first stats are learned
• New subtasks (scale-up) request stats from the coordinator