Sophia Sun (sophia.sun@intel.com)
Qi Xie (qi.xie@intel.com)
Hao Cheng (hao.cheng@intel.com)
Best Practice of Compression
Codecs in Spark
Legal Disclaimer
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular
purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change
without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications.
Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by
visiting www.intel.com/design/literature.htm.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site
and confirm whether referenced data are accurate.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others
Copyright ©2018 Intel Corporation.
For Performance Claims and Optimization
Notice
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests,
such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change
to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating
your contemplated purchases, including the performance of that product when combined with other products.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique
to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations
in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets
covered by this notice. Notice Revision #20110804.
Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred
to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.
For more information go to http://www.intel.com/performance.
About me
• Big data software engineer at Intel.
• Focus on Spark performance profiling and optimization for Intel Architecture.
Outline
• Compression Needs & Motivations
• Data Compression Pipelines in Spark
• Introduction to the Compression Codecs Tested
• Intel® Codec Accelerator Architecture Overview
• Takeaways
• Future Work
Compression Needs
• Compression Needs
• Reduce data volume and save storage space.
• Speed up disk I/O and data transfer across the network to improve workload performance.
• Trade-off
• Codecs with a high compression ratio add computation overhead.
Motivations
• Understand the popular compression codecs in Spark.
• Take advantage of Intel® optimized libraries or hardware accelerators for data compression and decompression.
Data Compression Pipeline in Spark
[Figure: Spark job data flow. Input splits 0-2 are read from an HDFS file by Map tasks (Input Decompression). Each Map's output is written as intermediate data in Partitions 0-2 (Shuffle Compression). The shuffle, which can run over multiple iterations, feeds the reduce tasks (Shuffle Decompression), which write Outputs 0-2 to an HDFS file (Output Compression).]
Data Compression Pipeline in Spark - I/O Characteristics
• HDFS Storage
• Generally sequential read/write
• Generally one-time read/write
• Operations: Input Read (Data Decompression), Output Write (Data Compression)
• Shuffle Operations
• Random read/write
• Repeated read/write
• Operations: Shuffle Write (Data Compression), Shuffle Read (Data Decompression)
Introduction to the Compression Codecs Tested
Codec | Supported levels | Default level | Degree of compression | Compression speed | CPU usage | Comments
High compression ratio codecs:
ISA-L (igzip) | 0~1 | 1 | Medium | Medium | Medium~High | Based on Intel® ISA-L ver 2.0.19 optimization
Zlib-ipp | 1~9 | Best balance (near to 6) | High | Slow | High | Based on Intel® IPP library optimization
Zlib/gzip | 1~9 | Best balance (near to 6) | High | Slow | High | Open source codec
zstd | 1~22 | 3 | High | Medium | Medium~High | Open source codec
High throughput codecs:
Lz4-ipp | N/A | N/A | Medium | Fast | Low | Based on Intel® IPP library optimization
Lz4 | Lz4 fast / Lz4 hc | Lz4 fast | Low / Medium | Fast / Low | Low / Medium | Open source codec
snappy | N/A | N/A | Low | Fast | Low | Open source codec
Intel® ISA-L reference: https://software.intel.com/en-us/storage/ISA-L ; Intel® IPP reference: https://software.intel.com/en-us/intel-ipp
Compression Level
• zstd, gzip, zlib-ipp and igzip support compression level adjustment, while lz4 and snappy do not.
• In the TPC-DS Parquet data generation test, the compression level made little difference to the data size.
Compression codec | Level 9 data size | Level 1 data size | Default level* data size | Default vs Level 9 | Level 1 vs Level 9
gzip/zlib | 2,500,252,836,007 | 2,528,269,315,543 | 2,502,656,222,082 | 0.096% | 1.12%
zlib-ipp | 2,482,050,449,516 | 2,492,687,484,854 | 2,482,595,509,721 | 0.022% | 0.429%

Compression codec | Default level* data size | Level 6 data size | Level 9 data size | Default vs Level 6 | Default vs Level 9
zstd | 2,472,315,429,619 | 2,446,857,474,146 | 2,440,389,051,782 | 1.04% | 1.31%
[Figure: TPC-DS data size by codec (gzip, zlib-ipp, zstd) and compression level (Default*, Level 1, Level 6, Level 9). Raw data: 10TB. Horizontal axis: data size, 0 to 4,000,000,000,000.]
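The percentage columns in the tables above can be reproduced from the reported data sizes. The sketch below assumes each percentage is the relative size increase over the baseline level, (size - baseline) / baseline, which matches the published figures after rounding. The dictionary layout and the function name are illustrative, not part of the original deck.

```python
# Data sizes copied from the slide tables (units as reported in the slides).
SIZES = {
    "gzip/zlib": {"default": 2_502_656_222_082, "level1": 2_528_269_315_543, "level9": 2_500_252_836_007},
    "zlib-ipp": {"default": 2_482_595_509_721, "level1": 2_492_687_484_854, "level9": 2_482_050_449_516},
    "zstd": {"default": 2_472_315_429_619, "level6": 2_446_857_474_146, "level9": 2_440_389_051_782},
}


def pct_larger(size: int, baseline: int) -> float:
    """Return how much larger `size` is than `baseline`, in percent."""
    return (size - baseline) / baseline * 100.0


# Compare every non-baseline level of each codec against level 9.
for codec, by_level in SIZES.items():
    baseline = by_level["level9"]
    for level, size in by_level.items():
        if level != "level9":
            print(f"{codec}: {level} vs level9 = {pct_larger(size, baseline):.3f}%")

# The zstd table also compares the default level against level 6.
zstd = SIZES["zstd"]
print(f"zstd: default vs level6 = {pct_larger(zstd['default'], zstd['level6']):.3f}%")
```

Every difference is below 1.5%, which is the basis for the slide's observation that the level has little effect on this data set.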
Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results
inapplicable to your device or system. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using
specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks. Tests performed by Intel® company. Configurations: see slides 16
Compression in Parquet Format
[Figure: Parquet file layout. A Parquet File is divided into Row Groups; each Row Group holds one Column Chunk per column (Col 1, Col 2, … Col N).]
• Columnar storage (enables column pruning).
• Compression and decompression are applied to each Column Chunk.
• A Column Chunk holds values of a single data type, often including repeated values, so the default compression level is usually effective.
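To check how much the compression level matters for a homogeneous column on your own data, a small experiment along these lines may help. It is only a sketch: it uses Python's standard `zlib` module as a stand-in for a Parquet codec, and the synthetic low-cardinality column (`make_column`, its sizes and value names) is hypothetical rather than taken from the TPC-DS test.

```python
import random
import zlib


def make_column(n_values: int = 100_000, n_distinct: int = 16, seed: int = 0) -> bytes:
    """Build a synthetic single-type, low-cardinality string column (hypothetical data)."""
    rng = random.Random(seed)
    distinct = [f"category_{i:02d}" for i in range(n_distinct)]
    return "\n".join(rng.choice(distinct) for _ in range(n_values)).encode("utf-8")


def sizes_by_level(data: bytes, levels=(1, 6, 9)) -> dict:
    """Compress `data` at each zlib level and return the compressed size per level."""
    return {level: len(zlib.compress(data, level)) for level in levels}


column = make_column()
sizes = sizes_by_level(column)
for level, size in sizes.items():
    print(f"level {level}: {size} bytes, ratio {len(column) / size:.1f}x")
```

Swapping `make_column` for a sample of real column values gives a quick indication of whether a higher level is worth its extra CPU cost for that column.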
Intel® Codec Accelerator Architecture(1/2)
Notes:
• QAT and ISA-L AVX512 are available on the Intel® Skylake-X platform.
• The open source codec zstd can also be built with Intel® ISA-L AVX512 support to accelerate data compression and decompression.
Intel® Codec Accelerator Architecture(2/2)
Takeaways
• For I/O-intensive workloads, prefer high compression ratio codecs for source data*, such as zstd, zlib-ipp, zlib and igzip.
• For Spark shuffle compression, prefer high throughput codecs, such as lz4-ipp and lz4.
• Higher compression ratio codecs reduce I/O and network pressure but consume more CPU; hardware accelerators such as QAT and FPGA can help offload that CPU work.
• zstd qualifies as both a reasonably strong compressor and a fast one.
• The best codec balance depends on cluster characteristics and workloads.
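As one way to apply the first two takeaways, the sketch below collects the relevant Spark settings into a dictionary and renders them as `spark-submit` arguments. The configuration keys are standard Spark properties; the helper names and the default codec choices are assumptions for illustration. The Intel-optimized codecs named above (lz4-ipp, zlib-ipp, igzip) are not stock Spark codec names, so zstd and lz4 are used here as stand-ins, and zstd support for Parquet depends on the Spark and Parquet versions in use.

```python
def spark_compression_conf(storage_codec: str = "zstd", shuffle_codec: str = "lz4") -> dict:
    """Return Spark settings: a high-ratio codec for stored data, a fast codec for shuffle."""
    return {
        # Codec used when writing Parquet files (source and output data).
        "spark.sql.parquet.compression.codec": storage_codec,
        # Compress map output files and data spilled during shuffles.
        "spark.shuffle.compress": "true",
        "spark.shuffle.spill.compress": "true",
        # Codec for internal data such as shuffle outputs and broadcast variables.
        "spark.io.compression.codec": shuffle_codec,
    }


def to_submit_args(conf: dict) -> list:
    """Render a settings dictionary as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = spark_compression_conf()
print(" ".join(to_submit_args(conf)))
```

`spark.io.compression.codec` also accepts a fully qualified codec class name, which is how a third-party codec would typically be plugged in.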
Future Plan
• Open source the Intel® Codec Accelerator project and make it a well-supported library.
• Add codec compatibility support.
• Integrate more IA-optimized codecs as acceleration libraries are released for different platforms.
• Support more big data frameworks (Cassandra, HBase, etc.).
• Beyond compression and decompression, support other kinds of codecs, such as encryption and decryption.
• Keep releasing new versions alongside new Intel® platform releases and new acceleration library releases.
Thanks!
HiBench Sort Workload Bottleneck – No Data Compression
• Uncompressed data is large, and the map-stage data makes disk I/O the bottleneck in stage0.
• Uncompressed data puts heavy pressure on the shuffle stage (stage1): the 10 Gb/s (~1.2 GB/s) network is the bottleneck in the experiment environment, while the CPU still has plenty of idle capacity.
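The network figure can be checked with simple arithmetic: 10 Gb/s divided by 8 bits per byte is 1.25 GB/s in decimal units, close to the rounded value on the slide. The sketch below also estimates an idealized transfer time for shuffle data over one such link. The 1 TB shuffle volume and the 4x compression ratio are hypothetical inputs, and protocol overhead and parallel links are ignored.

```python
def link_capacity_bytes_per_sec(gigabits_per_sec: float) -> float:
    """Convert a link speed in Gb/s to bytes per second (decimal units)."""
    return gigabits_per_sec * 1e9 / 8


def transfer_seconds(data_bytes: float, gigabits_per_sec: float, compression_ratio: float = 1.0) -> float:
    """Idealized time to move `data_bytes` of raw data after compressing it by `compression_ratio`."""
    return data_bytes / compression_ratio / link_capacity_bytes_per_sec(gigabits_per_sec)


capacity = link_capacity_bytes_per_sec(10)
print(f"10 Gb/s link: {capacity / 1e9:.2f} GB/s")

shuffle_bytes = 1e12  # hypothetical 1 TB of shuffle data
print(f"uncompressed: {transfer_seconds(shuffle_bytes, 10):.0f} s")
print(f"4x compressed: {transfer_seconds(shuffle_bytes, 10, compression_ratio=4.0):.0f} s")
```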
[Figure: Network IO over time (sum of rxkB/s and sum of txkB/s) and CPU utilization over time (average %idle, %steal, %iowait, %nice, %system, %user), with stage0 and stage1 marked.]
HiBench Sort Workload Resource Utilization Examples
[Figure: CPU utilization over time with zlibipp (average %idle, %steal, %iowait, %nice, %system, %user).]
• The CPU is the bottleneck with high compression ratio codecs (such as zstd, zlibipp and igzip).
• The codecs lz4, lz4ipp and snappy have lower compression ratios: the large volume of data read and written makes the disk the bottleneck in stage0, and the large shuffle data makes the network the bottleneck in stage1.
[Figure: CPU utilization over time with lz4ipp (average %idle, %steal, %iowait, %nice, %system, %user) and network IO over time with lz4ipp (sum of rxkB/s and sum of txkB/s).]
The zlibipp chart is the high compression ratio codec example; the lz4ipp charts are the low compression ratio codec example.

Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men  🔝Sambalpur🔝   Esc...
➥🔝 7737669865 🔝▻ Sambalpur Call-girls in Women Seeking Men 🔝Sambalpur🔝 Esc...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 

Best Practice of Compression/Decompression Codecs in Apache Spark with Sophia Sun and Qi Xie

  • 1. Best Practice of Compression Codecs in Spark. Sophia Sun (sophia.sun@intel.com), Qi Xie (qi.xie@intel.com), Hao Cheng (hao.cheng@intel.com)
  • 4. About me
      • Big data software engineer at Intel.
      • Focus on Spark performance profiling and optimization for Intel Architecture.
  • 5. Outline
      • Compression Needs & Motivations
      • Data Compression Pipelines in Spark
      • Experiment Compression Codecs: Introduction
      • Intel® Codec Accelerator Architecture Overview
      • Takeaways
      • Future Work
  • 6. Compression Needs
      • Compression needs
          • Reduce data volume and save storage space.
          • Speed up disk I/O and data transfer across the network, improving workload performance.
      • Trade-off
          • Codecs with a high compression ratio add computation overhead.
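The trade-off on slide 6 can be observed on a small scale with Python's standard `zlib` module. This is only a sketch: the sample data is hypothetical and highly repetitive, the timings depend on the machine, and nothing here reproduces the codecs or cluster used in the talk.

```python
import time
import zlib


def measure(data: bytes, level: int) -> dict:
    """Compress `data` with zlib at `level`; report size, ratio, and elapsed time."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must restore the input exactly.
    assert zlib.decompress(compressed) == data
    return {
        "level": level,
        "compressed_bytes": len(compressed),
        "ratio": len(data) / len(compressed),
        "seconds": elapsed,
    }


# Hypothetical sample: repetitive text, which compresses very well.
sample = b"spark shuffle block 0123456789 " * 50_000

results = [measure(sample, level) for level in (1, 6, 9)]
for r in results:
    print(f"level={r['level']} ratio={r['ratio']:.1f} time={r['seconds'] * 1000:.2f} ms")
```

Higher levels generally spend more CPU time per byte in exchange for a smaller output; how much either number moves depends heavily on the data.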
  • 7. Motivations
      • Understand the popular compression codecs in Spark.
      • Take advantage of Intel® optimized libraries or acceleration hardware for data compression/decompression.
  • 8. Data Compression Pipeline in Spark
      [Figure: data flow of a Spark job. Input (an HDFS file, input splits 0 to 2) → Map tasks → intermediate data (each map's output, partitions 0 to 2) → shuffle (multiple iterations) → Reduce tasks → Output (an HDFS file, outputs 0 to 2). Compression points marked on the flow: input decompression, shuffle compression, shuffle decompression, output compression.]
  • 9. Data Compression Pipeline in Spark: I/O Characteristics
      • HDFS storage: input read (data decompression) and output write (data compression)
          • Generally sequential read/write
          • Generally one-time read/write
      • Shuffle operations: shuffle write (data compression) and shuffle read (data decompression)
          • Random read/write
          • Repeated read/write
  • 10. Experiment Compression Codecs: Introduction
      Columns: codec | supported levels | default level | degree of compression | compression speed | CPU usage | comments
      High compression ratio codecs:
      • ISA-L (igzip) | 0~1 | 1 | Medium | Medium | Medium~High | Based on Intel® ISA-L ver 2.0.19 optimization
      • Zlib-ipp | 1~9 | Best balance (near 6) | High | Slow | High | Based on Intel® IPP library optimization
      • Zlib/gzip | 1~9 | Best balance (near 6) | High | Slow | High | Open source codec
      • zstd | 1~22 | 3 | High | Medium | Medium~High | Open source codec
      High throughput codecs:
      • Lz4-ipp | N/A | N/A | Medium | Fast | Low | Based on Intel® IPP library optimization
      • Lz4 | Lz4 fast, Lz4 hc | Lz4 fast | Low / Medium | Fast / Low | Low / Medium | Open source codec
      • snappy | N/A | N/A | Low | Fast | Low | Open source codec
      Intel® ISA-L reference: https://software.intel.com/en-us/storage/ISA-L ; Intel® IPP reference: https://software.intel.com/en-us/intel-ipp
  • 11. Compression Level
      • zstd, gzip, zlib-ipp, and igzip support compression level adjustment; lz4 and snappy do not.
      • Compression level made little difference to data size in the TPC-DS Parquet data generation test.
      Data size (codec | level 9 | level 1 | default level* | default vs level 9 | level 1 vs level 9):
      • gzip/zlib | 2,500,252,836,007 | 2,528,269,315,543 | 2,502,656,222,082 | 0.096% | 1.12%
      • zlib-ipp | 2,482,050,449,516 | 2,492,687,484,854 | 2,482,595,509,721 | 0.022% | 0.429%
      Data size (codec | default level* | level 6 | level 9 | default vs level 6 | default vs level 9):
      • zstd | 2,472,315,429,619 | 2,446,857,474,146 | 2,440,389,051,782 | 1.04% | 1.31%
      [Figure: bar chart, "TPC-DS Different Codec Compression Level Data Size (Raw data: 10TB)", comparing default*, level 1, level 6, and level 9 for gzip, zlib-ipp, and zstd.]
      Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks. Tests performed by Intel® company. Configurations: see slides 16
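The percentage columns on slide 11 appear to be the relative size increase over the more aggressive level. The sketch below recomputes them from the data sizes quoted on the slide; the helper name `relative_increase` and the dictionary layout are illustrative, not part of the original deck.

```python
def relative_increase(size: int, baseline: int) -> float:
    """Percentage by which `size` exceeds `baseline`."""
    return (size - baseline) / baseline * 100.0


# Data sizes as quoted on slide 11 (TPC-DS Parquet data generation test).
sizes = {
    "gzip/zlib": {
        "level1": 2_528_269_315_543,
        "default": 2_502_656_222_082,
        "level9": 2_500_252_836_007,
    },
    "zlib-ipp": {
        "level1": 2_492_687_484_854,
        "default": 2_482_595_509_721,
        "level9": 2_482_050_449_516,
    },
    "zstd": {
        "default": 2_472_315_429_619,
        "level6": 2_446_857_474_146,
        "level9": 2_440_389_051_782,
    },
}

for codec, by_level in sizes.items():
    baseline = by_level["level9"]
    for level, size in by_level.items():
        if level != "level9":
            pct = relative_increase(size, baseline)
            print(f"{codec}: {level} vs level9 = {pct:.3f}%")
```

Every difference stays below about 1.4%, which is the basis for the slide's observation that the compression level barely changes the output size for this data set.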
  • 12. Compression in Parquet Format
      [Figure: a Parquet file is divided into row groups; each row group holds one column chunk per column (Col 1, Col 2, …, Col N).]
      • Columnar storage (enables column pruning).
      • Compression/decompression is applied to each column chunk.
      • A column chunk holds values of a single data type, often even identical values, so the default compression level is usually effective.
  • 13. Intel® Codec Accelerator Architecture (1/2)
      Notes:
      • QAT and ISA-L AVX512 are available on the Intel® Skylake-X platform.
      • The open source codec zstd can also be built with Intel® ISA-L AVX512 support to accelerate data compression/decompression.
  • 14. Intel® Codec Accelerator Architecture (2/2)
  • 15. Takeaways
      • For I/O-intensive workloads, prefer high-compression codecs such as zstd, zlib-ipp, zlib, or igzip for source data*.
      • For Spark shuffle compression, prefer high-throughput codecs such as lz4-ipp or lz4.
      • Higher-compression codecs reduce I/O and network pressure but consume CPU; acceleration hardware such as QAT and FPGA can help offload that CPU work.
      • Zstd qualifies as both a reasonably strong compressor and a fast one.
      • The best codec balance depends on cluster characteristics and workloads.
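One way to apply the takeaways above in stock Apache Spark is through its configuration properties. The sketch below only assembles `--conf` arguments; the property names (`spark.shuffle.compress`, `spark.io.compression.codec`, `spark.sql.parquet.compression.codec`) are standard Spark settings, while the helper functions and the default codec choices are illustrative assumptions. The Intel-optimized variants from the talk (lz4-ipp, zlib-ipp, igzip) are not stock Spark codec names, so the open source equivalents are used here.

```python
def codec_conf(shuffle_codec: str = "lz4", parquet_codec: str = "gzip") -> dict:
    """Spark properties: a fast codec for shuffle, a stronger one for Parquet output."""
    return {
        "spark.shuffle.compress": "true",
        # Codec used for shuffle and other internal data.
        "spark.io.compression.codec": shuffle_codec,
        # Codec used when writing Parquet files.
        "spark.sql.parquet.compression.codec": parquet_codec,
    }


def to_submit_args(conf: dict) -> list:
    """Turn a property dict into spark-submit style arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = codec_conf()
print(" ".join(to_submit_args(conf)))
```

Whether a given codec name is accepted depends on the Spark and Parquet versions in use, so the values should be checked against the documentation for the deployed release.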
  • 16. Future Plan
      • Open-source the Intel® Codec Accelerator project and make it a well-supported library.
      • Add codec compatibility support.
      • Integrate more IA-optimized codecs as acceleration libraries are released for different platforms.
      • Support more big data frameworks (Cassandra, HBase, etc.).
      • Beyond compression/decompression, support more codec types such as encryption/decryption.
      • Keep releasing new versions alongside new Intel® platform releases or new acceleration library releases.
  • 18. HiBench Sort Workload Bottleneck: No Data Compression
      • Uncompressed data is large, and reading it in the map stage makes disk I/O the bottleneck in stage 0.
      • Uncompressed data puts heavy pressure on the shuffle stage (stage 1): the 10 Gb (~1.2 GB/s) network is the bottleneck in the experiment environment, while the CPU still has plenty of idle capacity.
      [Figure: Network IO over time (sum of rxkB/s, sum of txkB/s).]
      [Figure: CPU utilization over time (average of %idle, %steal, %iowait, %nice, %system, %user), with stage 0 and stage 1 marked.]
      Tests performed by Intel. Configurations: see slide 16.
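The network ceiling on slide 18 follows from simple arithmetic: 10 Gb/s divided by 8 bits per byte is 1.25 GB/s, which the slide rounds to about 1.2 GB/s. The sketch below uses that figure to estimate how long a shuffle would occupy a single link; the 1 TB data size and the compression ratio of 4 are hypothetical inputs, not measurements from the talk.

```python
def link_bytes_per_second(gigabits_per_second: float) -> float:
    """Peak payload rate of a link in bytes/s (decimal units, protocol overhead ignored)."""
    return gigabits_per_second * 1e9 / 8


def transfer_seconds(raw_bytes: float, compression_ratio: float,
                     gigabits_per_second: float = 10.0) -> float:
    """Time to move `raw_bytes` of data after compressing it by `compression_ratio`."""
    return raw_bytes / compression_ratio / link_bytes_per_second(gigabits_per_second)


one_terabyte = 1e12  # hypothetical shuffle volume
uncompressed = transfer_seconds(one_terabyte, compression_ratio=1.0)
compressed = transfer_seconds(one_terabyte, compression_ratio=4.0)
print(f"uncompressed: {uncompressed:.0f} s, 4x compressed: {compressed:.0f} s")
```

The estimate ignores the CPU time spent compressing and decompressing, which is exactly the cost that moves the bottleneck from the network to the CPU for the high-ratio codecs.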
  • 19. HiBench Sort Workload Resource Utilization Examples
      • With high-compression-ratio codecs (such as zstd, zlibipp, and igzip), the CPU is the bottleneck.
      • lz4, lz4ipp, and snappy have lower compression ratios: the large volume of data read and written makes the disk the bottleneck in stage 0, and the large shuffle data makes the network the bottleneck in stage 1.
      [Figure: CPU utilization with zlibipp over time (high-compression-ratio codec example).]
      [Figure: CPU utilization and Network IO (sum of rxkB/s, sum of txkB/s) with lz4ipp over time (low-compression-ratio codec example).]
      Tests performed by Intel. Configurations: see slide 16.