Large numbers of small files degrade performance in Spark and Hive queries because each file incurs overhead to open, read, and process. The problem is common with event streams and IoT sensor data, which tend to produce many small files. To detect it, check for data skew across Hive partitions, inspect the Spark job writers, and look at ingestion file sizes. Mitigations include a deliberate file hierarchy, repartitioning, Delta Lake optimizations, and Databricks Auto Optimize, which merges small files.
5. Exceptions in Apache Spark Executor logs
Client-Request-ID=------ Retry policy did not allow for a retry: , HTTP status code=Unknown, Exception=HTTPSConnectionPool(host='-----.net', port=443): Max retries exceeded with url: /xxxxxxx?restype=container&comp=list (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at xxxxxxxx>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',)).
HTTPSConnectionPool(host='your_account.blob.core.windows.net', port=443): Read timed out. (read timeout=[your timeout])
@adipolak
8. About Me
M.Sc & B.Sc - BGU University
ML Researcher @ DT & BGU Cyber Security lab
Sr. Big Data Engineer @ Akamai
Sr. Software Developer & Cloud Advocate @ Microsoft
@adipolak
https://www.linkedin.com/in/adi-polak-68548365/
9. Agenda
§ The Problem
§ Why it Happens
§ Detect and Mitigate
§ Delta Lake vs Parquet Demo
16. Where can it happen?
• Event streams
• Events from IoT devices, servers, or applications are written as KB-scale JSON files during ingestion
• Over-parallelized Apache Spark jobs
• Over-partitioned Hive tables
17. What to check?
• Data skew - file sizes across Hive partitions
• Spark job writers in the Spark History Server UI
• Ingestion file size
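The file-size checks above can be approximated with a short script. This is a minimal sketch, assuming the table data is reachable as a local or mounted directory tree with one directory per partition; the function names and the 1 MiB threshold are hypothetical choices for illustration, not values from the talk.

```python
import os
from collections import defaultdict

SMALL_FILE_BYTES = 1024 * 1024  # assumed threshold (1 MiB); tune for your storage


def small_file_report(root, small_bytes=SMALL_FILE_BYTES):
    """Return {partition_dir: (file_count, total_bytes, small_file_count)}."""
    stats = defaultdict(lambda: [0, 0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            entry = stats[os.path.relpath(dirpath, root)]
            entry[0] += 1          # files seen in this partition
            entry[1] += size       # bytes stored in this partition
            if size < small_bytes:
                entry[2] += 1      # files below the small-file threshold
    return {part: tuple(values) for part, values in stats.items()}


def skewed_partitions(report, min_small_ratio=0.5):
    """List partitions where at least min_small_ratio of the files are small."""
    return sorted(
        part
        for part, (count, _total, small) in report.items()
        if count and small / count >= min_small_ratio
    )
```

Comparing `total_bytes` across partitions gives a rough view of skew, while `skewed_partitions` points at the directories that are dominated by small files.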
18. Mitigate
• Use file hierarchy - source/api_type/yyyy/mm/dd/hh/mm
• Design partitions with usage in mind
• Repartition vs. coalesce
• Databricks Auto Optimize
• SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)
• Delta Lake Optimize performance
• Compaction (bin-packing)
• ZORDER BY
• delta.targetFileSize
• delta.tuneFileSizesForRewrites
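To make the ideas in this list concrete, here is a minimal sketch in plain Python. It builds the source/api_type/yyyy/mm/dd/hh/mm prefix, estimates how many output files to aim for when repartitioning, and groups small files into bins up to a target size. The bin-packing shown is a generic first-fit-decreasing heuristic, not Delta Lake's actual OPTIMIZE implementation; all function names and the 128 MiB target are assumptions made for illustration.

```python
import math
from datetime import datetime, timezone

TARGET_FILE_BYTES = 128 * 1024 * 1024  # assumed target size; not a value from the talk


def hierarchy_path(source, api_type, ts):
    """Build a source/api_type/yyyy/mm/dd/hh/mm prefix from a timestamp."""
    return f"{source}/{api_type}/{ts.strftime('%Y/%m/%d/%H/%M')}"


def output_file_count(total_bytes, target_bytes=TARGET_FILE_BYTES):
    """Number of files to aim for, e.g. as the argument to a repartition call."""
    return max(1, math.ceil(total_bytes / target_bytes))


def plan_compaction(file_sizes, target_bytes=TARGET_FILE_BYTES):
    """Group files into bins of at most target_bytes (first-fit decreasing).

    file_sizes maps file name -> size in bytes. Files already at or above
    the target are left alone and do not appear in the plan.
    """
    bins = []  # each bin is [remaining_capacity, [file names]]
    for name, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        if size >= target_bytes:
            continue
        for candidate in bins:
            if candidate[0] >= size:
                candidate[0] -= size
                candidate[1].append(name)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([target_bytes - size, [name]])
    return [names for _remaining, names in bins]
```

Each list returned by `plan_compaction` names the files that would be rewritten together as one larger file. A compaction job would read each group and write it back as a single file.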