This document discusses the challenges of writing streaming data from Kafka to Parquet files stored in HDFS. It evaluates several approaches: 1) windowing, which works but uses too much memory; 2) bucketing data into time-based files, which survives failures but requires modifying Flink's sink code; 3) closing files at checkpoints, which Flink later supported natively; and 4) hourly batch jobs, which avoid streaming complexity but limit real-time use. The conclusion is that streaming solutions are not trivial, and for this use case it may be better to use a database or a different tool instead of files. Supporting both real-time and batch processing is challenging.
18. Requirements
• Scalable solution
• 10,000 messages/sec
• Exactly-once
• Data consumers should not deduplicate
• Files in event-time
• Consumers should not worry about late events
!"late events
19. Requirements
• Scalable solution
• 10,000 messages/sec
• Exactly-once
• Data consumers should not deduplicate
• Files in event-time
• Consumers should not worry about late events
• Columnar format (Parquet)
• Optimize for reading, not for writing
!"
slow loading
28. Columnar format (Parquet)
CUSTOMER ID   VISITED PRODUCT   PRODUCT CATEGORY
45182584      370000004333      Books
45182584      300000053536      Games
11538222      857334358658      Electronics
79245368      370000004333      Books
11538222      370000004333      Books
11538222      942000033234      Electronics
78438133      370000004333      Books
Column-oriented → fast loading (illustrated in the sketch below)
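To make the row-vs-column contrast concrete, here is a toy illustration in plain Java (this is not the Parquet API; the array names are made up, and the values mirror the table above):

```java
// Row-oriented layout: all fields of a record are adjacent, so answering
// "how many visits per PRODUCT CATEGORY?" scans every field of every row.
String[][] rowOriented = {
    {"45182584", "370000004333", "Books"},
    {"45182584", "300000053536", "Games"},
    {"11538222", "857334358658", "Electronics"},
};

// Column-oriented layout (what Parquet uses): each column is stored
// contiguously, so the same query reads one dense array and skips the
// other columns. Low-cardinality columns like PRODUCT CATEGORY also
// compress very well.
String[] customerId      = {"45182584", "45182584", "11538222"};
String[] visitedProduct  = {"370000004333", "300000053536", "857334358658"};
String[] productCategory = {"Books", "Games", "Electronics"};
```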
29. Requirements
• Scalable solution
• 10,000 messages/sec
• Exactly-once
• Data consumers should not deduplicate
• Files in event-time
• Consumers should not worry about late events
• Columnar format (Parquet)
• Optimize for reading, not for writing
30. Requirements
• Scalable solution
• 10,000 messages/sec
• Exactly-once
• Data consumers should not deduplicate
• Files in event-time
• Consumers should not worry about late events
• Columnar format (Parquet)
• Optimize for reading, not for writing
Apache Flink?
31. Requirements
• Scalable solution ✓
• 10,000 messages/sec
• Exactly-once ✓
• Data consumers should not deduplicate
• Files in event-time ✓
• Consumers should not worry about late events
• Columnar format (Parquet) ✓
• Optimize for reading, not for writing
Apache Flink?
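A minimal sketch of what the Flink side could look like, assuming Flink's universal Kafka connector and a hypothetical "visits" topic; checkpointing is what underpins Flink's exactly-once guarantees:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class VisitsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        // Exactly-once checkpoints every 60 s; Kafka offsets are stored in
        // the checkpoint, so a restart replays from a consistent point.
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // hypothetical address
        props.setProperty("group.id", "visits-to-parquet");   // hypothetical group

        DataStream<String> visits = env.addSource(
            new FlinkKafkaConsumer<>("visits", new SimpleStringSchema(), props));

        visits.print(); // placeholder for the file sink discussed below
        env.execute("visits-to-parquet");
    }
}
```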
47. Bucketing sink
• Writes data to file "buckets" based on time (sketch below)
(diagram: Kafka → Flink → HDFS, writing 17-00.parquet, 18-00.parquet, 19-00.parquet; events from 18:00-18:59 go to 18-00.parquet, events from 19:00-19:59 to 19-00.parquet)
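A minimal sketch of such a sink using Flink's legacy BucketingSink API, assuming the visits stream from earlier. Note that DateTimeBucketer buckets by processing time out of the box, which is part of why event-time bucketing required changing the sink code, as the next slides explain:

```java
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

// One bucket directory per hour, e.g. .../18-00, .../19-00.
BucketingSink<String> sink = new BucketingSink<>("hdfs:///visits"); // hypothetical path
sink.setBucketer(new DateTimeBucketer<>("HH-00"));
sink.setBatchSize(128L * 1024 * 1024); // roll part files at ~128 MB
visits.addSink(sink);
```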
61. But…
• We need to change the Flink bucketing sink code
• This was also fixed in 1.6.0: StreamingFileSink can close files on checkpoints
• Kudos to the Flink community!
62. But…
• We need to change the Flink bucketing sink code
• This was also fixed in 1.6.0: StreamingFileSink can close files on checkpoints (sketch below)
• Kudos to the Flink community!
• A lot of files
• Small files on HDFS are bad
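With Flink 1.6.0+, a sketch of the same idea on StreamingFileSink, assuming a DataStream<Visit> named visitStream with a hypothetical Visit POJO. For bulk formats such as Parquet the sink rolls part files on every checkpoint, which is exactly what produces the many small files the slide warns about:

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

// Bulk formats roll on checkpoint, so finished Parquet files appear
// exactly-once; frequent checkpoints therefore mean many small files.
StreamingFileSink<Visit> sink = StreamingFileSink
    .forBulkFormat(new Path("hdfs:///visits"),                  // hypothetical path
                   ParquetAvroWriters.forReflectRecord(Visit.class))
    .withBucketAssigner(new DateTimeBucketAssigner<>("HH-00"))  // hourly buckets
    .build();
visitStream.addSink(sink);
```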
76. Proper solution?
• Use a database instead of files, or
• Use a different tool (e.g. Kafka Streams), or
• Write small files and merge them at the end, or
• Skip late events (sketch below)
  • e.g. accept events 5 minutes late, but not 12 hours
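A sketch of the "skip late events" option, again assuming the hypothetical Visit POJO with an eventTimeMillis field; a bounded-out-of-orderness extractor lets the watermark trail event time by 5 minutes, so anything older counts as late:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// Watermark = max event time seen so far minus 5 minutes: events up to
// 5 minutes late are still ahead of the watermark; older ones are late
// and can be dropped instead of reopening already-closed files.
DataStream<Visit> withTimestamps = visitStream.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<Visit>(Time.minutes(5)) {
        @Override
        public long extractTimestamp(Visit visit) {
            return visit.eventTimeMillis; // hypothetical field
        }
    });
```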
77. Support real-time?
• Kappa architecture
  • Streaming-only
• Lambda architecture
  • Batch system + streaming system
  • Late events in daily batches + 5-minute files dropping late events