SlideShare uma empresa Scribd logo
1 de 43
Fast Spark Access To Your Data -
Avro, JSON, ORC, and Parquet
Owen O’Malley
owen@hortonworks.com
@owen_omalley
June 2018
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who Am I?
Worked on Hadoop since Jan 2006
MapReduce, Security, Hive, and ORC
Worked on different file formats
–Sequence File, RCFile, ORC File, T-File, and Avro
requirements
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Goal
Benchmark for Spark SQL
Seeking to discover unknowns
–How do the different formats perform?
–What could they do better?
Use real & diverse data sets
–Over-reliance on artifical datasets leads to weakness
Open & reviewed benchmarks
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The File Formats
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Avro
Cross-language file format for Hadoop
Schema evolution was primary goal
Schema segregated from data
–Unlike Protobuf and Thrift
Row major format
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
JSON
Serialization format for HTTP & Javascript
Text-format with MANY parsers
Schema completely integrated with data
Row major format
Compression applied on top
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
ORC
Originally part of Hive to replace RCFile
–Now top-level project
Schema segregated into footer
Column major format with stripes
Rich type model, stored top-down
Integrated compression, indexes, & stats
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Parquet
Design based on Google’s Dremel paper
Schema segregated into footer
Column major format with stripes
Simpler type-model with logical types
All data pushed to leaves of the tree
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Sets
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NYC Taxi Data
Every taxi cab ride in NYC from 2009
–Publically available
–http://tinyurl.com/nyc-taxi-analysis
18 columns with no null values
–Doubles, integers, decimals, & strings
2 months of data – 22.7 million rows
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sales
Generated data
–Real schema from a production Hive deployment
–Random data based on the data statistics
55 columns with lots of nulls
–A little structure
–Timestamps, strings, longs, booleans, list, & struct
25 million rows
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Github Logs
All actions on Github public repositories
–Publically available
–https://www.githubarchive.org/
704 columns with a lot of structure & nulls
–Pretty much the kitchen sink
 1/2 month of data – 10.5 million rows
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Finding the Github Schema
The data is all in JSON.
No schema for the data is published.
We wrote a JSON schema discoverer.
–Scans the document and figures out the types
Available in ORC tool jar.
Schema is huge (12k)
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Software
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Software Versions
All of these projects are evolving rapidly
–Spark 2.3.1
–Avro 1.8.2
–ORC 1.5.1
–Parquet 1.8.2
–Spark-Avro 4.0.0
Dependency hell 👿
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Configuration
Spark Configuration
–spark.sql.orc.filterPushdown = true
–spark.sql.orc.impl = native
Hadoop Configuration
–session.sparkContext().hadoopConfiguration()
–avro.mapred.ignore.inputs.without.extension = false
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark-Avro
Benchmark uses Spark SQL’s FileFormat
–JSON, ORC, and Parquet all in Spark
–Avro is provided by Databricks via spark-avro
It doesn’t support all of the Spark types
–Timestamp as int96
–Decimal as bytes
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Storage costs
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Compression
Data size matters!
–Hadoop stores all your data, but requires hardware
–Is one factor in read speed (HDFS ~15mb/sec)
ORC and Parquet use RLE & Dictionaries
All the formats have general compression
–ZLIB (GZip) – tight compression, slower
–Snappy – some compression, faster
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Taxi Size Analysis
Don’t use JSON
Use either Snappy or Zlib compression
Avro’s small compression window hurts
Parquet Zlib is smaller than ORC
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sales Size Analysis
ORC did better than expected
–String columns have small cardinality
–Lots of timestamp columns
–No doubles 
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Github Size Analysis
Surprising win for JSON and Avro
–Worst when uncompressed
–Best with zlib
Many partially shared strings
–ORC and Parquet don’t compress across columns
Need to investigate Zstd with dictionary
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Use Cases
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Full Table Scans
Read all columns & rows
All formats except JSON are splittable
–Different workers do different parts of file
Taxi schema supports ColumnarBatch
–All primitive types
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
1
2
4
8
16
32
64
128
256
512
orc parquet json orc parquet json orc parquet
taxi taxi taxi taxi taxi taxi taxi taxi
none none none zlib zlib zlib snappy snappy
Taxi Times
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Taxi Read Performance Analysis
JSON is very slow to read
–Large storage size for this data set
–Needs to do a LOT of string parsing
Parquet is faster
–ORC is going through an extra layer
–VectorizedRowBatch -> OrcStruct -> ColumnarBatch
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
1
2
4
8
16
32
64
128
256
512
orc parquet json orc parquet orc parquet
sales sales sales sales sales sales sales
none none none zlib zlib snappy snappy
Sales Times
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sales Read Performance Analysis
Read performance is dominated by format
–Compression matters less for this data set
–Straight ordering: ORC, Parquet, & JSON
Uses Row instead of ColumnarBatch
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
1
2
4
8
16
32
64
128
256
512
1024
orc parquet json orc parquet orc parquet
github github github github github github github
none none none zlib zlib snappy snappy
Github Times
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Github Read Performance Analysis
JSON did really well
A lot of columns needs more space
–We need bigger stripes (add min rows in ORC-190)
–Rows/stripe - ORC: 18.6k, Parquet: 88.1k
Parquet struggles
–Twitter recommends against Parquet for this case
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Column Projection
Often just need a few columns
–Only ORC & Parquet are columnar
–Only read, decompress, & deserialize some columns
Spark FileFormat passes in desired schema
–Drop columns that aren’t needed
–JSON and Avro read first and then drop columns
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
0
2
4
6
8
10
12
14
16
18
20
none snappy zlib none snappy zlib none snappy zlib none snappy zlib none snappy zlib none snappy zlib
orc orc orc parquet parquet parquet orc orc orc parquet parquet parquet orc orc orc parquet parquet parquet
github github github github github github sales sales sales sales sales sales taxi taxi taxi taxi taxi taxi
Column Projection % Sizes
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Predicate Pushdown
Query:
–select first_name, last_name from employees where
hire_date between ‘01/01/2017’ and ‘01/03/2017’
Predicate:
–hire_date between ‘01/01/2017’ and ‘01/03/2017’
Given to FileFormat via filters
For benchmark, filter on a sorted column
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Predicate Pushdown
ORC & Parquet indexes with min & max
–Sorted data is critical!
ORC has optional bloom filters
Reader filters out sections of file
–Entire file
–Stripe
–Row group (only ORC, default 10k rows)
Engine needs to apply row level filter
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
1
4
16
64
256
1024
4096
16384
65536
262144
1048576
4194304
16777216
taxi sales github
Predicate Pushdown Rows
orc parquet total
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Predicate Pushdown
Parquet doesn’t pushdown timestamp filters
–Taxi and Github filters were on timestamps.
Spark defaults ORC predicate pushdown off.
Small ORC stripes for Github lead to sub-10k
row read.
Because predicate pushdown is an optimization,
it isn’t clear when it isn’t used.
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Metadata Access
ORC & Parquet store metadata
–Stored in file footer
–File schema
–Number of records
–Min, max, count of each column
Provides O(1) Access
41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Conclusions
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Recommendations
Disclaimer – Everything changes!
–Both these benchmarks and the formats will change.
Evaluate needs
–Column projection and predicate pushdown are only
in ORC & Parquet
–Determine how to sort data
–Are bloom filters useful?
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank you!
Twitter: @owen_omalley
Email: owen@hortonworks.com

Mais conteúdo relacionado

Mais procurados

Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3Dongjoon Hyun
 
Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview externalmattlieber
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3DataWorks Summit
 
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisApache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisliang chen
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerDataWorks Summit
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4Dongjoon Hyun
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Julien Le Dem
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Databricks
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetDataWorks Summit
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVROairisData
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open houseJulien Le Dem
 
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceDataWorks Summit/Hadoop Summit
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetupt3rmin4t0r
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
 

Mais procurados (20)

Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview external
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
Apache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysisApache CarbonData:New high performance data format for faster data analysis
Apache CarbonData:New high performance data format for faster data analysis
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
Major advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL complianceMajor advancements in Apache Hive towards full support of SQL compliance
Major advancements in Apache Hive towards full support of SQL compliance
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetup
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 

Semelhante a Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet

Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkDataWorks Summit
 
You Can't Search Without Data
You Can't Search Without DataYou Can't Search Without Data
You Can't Search Without DataBryan Bende
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin DataWorks Summit/Hadoop Summit
 
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4DataWorks Summit
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureDataWorks Summit/Hadoop Summit
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsAaron Brooks
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017alanfgates
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
Devnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFiDevnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFiBryan Bende
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4DataWorks Summit
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with ZeppelinHortonworks
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics OptimizationHortonworks
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics OptimizationIsheeta Sanghi
 

Semelhante a Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet (20)

File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and ParquetFile Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
 
You Can't Search Without Data
You Can't Search Without DataYou Can't Search Without Data
You Can't Search Without Data
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
 
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present Future
 
Druid Scaling Realtime Analytics
Druid Scaling Realtime AnalyticsDruid Scaling Realtime Analytics
Druid Scaling Realtime Analytics
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Devnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFiDevnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFi
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Building a Smarter Home with Apache NiFi and Spark
Building a Smarter Home with Apache NiFi and SparkBuilding a Smarter Home with Apache NiFi and Spark
Building a Smarter Home with Apache NiFi and Spark
 

Mais de Owen O'Malley

Running An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemRunning An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemOwen O'Malley
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACIDOwen O'Malley
 
Protect your private data with ORC column encryption
Protect your private data with ORC column encryptionProtect your private data with ORC column encryption
Protect your private data with ORC column encryptionOwen O'Malley
 
Fine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionFine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionOwen O'Malley
 
Strata NYC 2018 Iceberg
Strata NYC 2018  IcebergStrata NYC 2018  Iceberg
Strata NYC 2018 IcebergOwen O'Malley
 
ORC Column Encryption
ORC Column EncryptionORC Column Encryption
ORC Column EncryptionOwen O'Malley
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopOwen O'Malley
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersOwen O'Malley
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to HiveOwen O'Malley
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File IntroductionOwen O'Malley
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduceOwen O'Malley
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroOwen O'Malley
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopOwen O'Malley
 

Mais de Owen O'Malley (16)

Running An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemRunning An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid Them
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACID
 
Protect your private data with ORC column encryption
Protect your private data with ORC column encryptionProtect your private data with ORC column encryption
Protect your private data with ORC column encryption
 
Fine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionFine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column Encryption
 
Strata NYC 2018 Iceberg
Strata NYC 2018  IcebergStrata NYC 2018  Iceberg
Strata NYC 2018 Iceberg
 
ORC Column Encryption
ORC Column EncryptionORC Column Encryption
ORC Column Encryption
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache Hadoop
 
Data protection2015
Data protection2015Data protection2015
Data protection2015
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop Clusters
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to Hive
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File Introduction
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduce
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 Intro
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in Hadoop
 

Último

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 

Último (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 

Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet

  • 1. Fast Spark Access To Your Data - Avro, JSON, ORC, and Parquet Owen O’Malley owen@hortonworks.com @owen_omalley June 2018
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Who Am I? Worked on Hadoop since Jan 2006 MapReduce, Security, Hive, and ORC Worked on different file formats –Sequence File, RCFile, ORC File, T-File, and Avro requirements
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Goal Benchmark for Spark SQL Seeking to discover unknowns –How do the different formats perform? –What could they do better? Use real & diverse data sets –Over-reliance on artifical datasets leads to weakness Open & reviewed benchmarks
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved The File Formats
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Avro Cross-language file format for Hadoop Schema evolution was primary goal Schema segregated from data –Unlike Protobuf and Thrift Row major format
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved JSON Serialization format for HTTP & Javascript Text-format with MANY parsers Schema completely integrated with data Row major format Compression applied on top
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved ORC Originally part of Hive to replace RCFile –Now top-level project Schema segregated into footer Column major format with stripes Rich type model, stored top-down Integrated compression, indexes, & stats
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Parquet Design based on Google’s Dremel paper Schema segregated into footer Column major format with stripes Simpler type-model with logical types All data pushed to leaves of the tree
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data Sets
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved NYC Taxi Data Every taxi cab ride in NYC from 2009 –Publically available –http://tinyurl.com/nyc-taxi-analysis 18 columns with no null values –Doubles, integers, decimals, & strings 2 months of data – 22.7 million rows
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sales Generated data –Real schema from a production Hive deployment –Random data based on the data statistics 55 columns with lots of nulls –A little structure –Timestamps, strings, longs, booleans, list, & struct 25 million rows
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Github Logs All actions on Github public repositories –Publically available –https://www.githubarchive.org/ 704 columns with a lot of structure & nulls –Pretty much the kitchen sink  1/2 month of data – 10.5 million rows
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Finding the Github Schema The data is all in JSON. No schema for the data is published. We wrote a JSON schema discoverer. –Scans the document and figures out the types Available in ORC tool jar. Schema is huge (12k)
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Software
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Software Versions All of these projects are evolving rapidly –Spark 2.3.1 –Avro 1.8.2 –ORC 1.5.1 –Parquet 1.8.2 –Spark-Avro 4.0.0 Dependency hell 👿
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Configuration Spark Configuration –spark.sql.orc.filterPushdown = true –spark.sql.orc.impl = native Hadoop Configuration –session.sparkContext().hadoopConfiguration() –avro.mapred.ignore.inputs.without.extension = false
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark-Avro Benchmark uses Spark SQL’s FileFormat –JSON, ORC, and Parquet all in Spark –Avro is provided by Databricks via spark-avro It doesn’t support all of the Spark types –Timestamp as int96 –Decimal as bytes
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Storage costs
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Compression Data size matters! –Hadoop stores all your data, but requires hardware –Is one factor in read speed (HDFS ~15mb/sec) ORC and Parquet use RLE & Dictionaries All the formats have general compression –ZLIB (GZip) – tight compression, slower –Snappy – some compression, faster
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Taxi Size Analysis Don’t use JSON Use either Snappy or Zlib compression Avro’s small compression window hurts Parquet Zlib is smaller than ORC
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sales Size Analysis ORC did better than expected –String columns have small cardinality –Lots of timestamp columns –No doubles 
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Github Size Analysis Surprising win for JSON and Avro –Worst when uncompressed –Best with zlib Many partially shared strings –ORC and Parquet don’t compress across columns Need to investigate Zstd with dictionary
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Use Cases
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Full Table Scans Read all columns & rows All formats except JSON are splittable –Different workers do different parts of file Taxi schema supports ColumnarBatch –All primitive types
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 1 2 4 8 16 32 64 128 256 512 orc parquet json orc parquet json orc parquet taxi taxi taxi taxi taxi taxi taxi taxi none none none zlib zlib zlib snappy snappy Taxi Times
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Taxi Read Performance Analysis JSON is very slow to read –Large storage size for this data set –Needs to do a LOT of string parsing Parquet is faster –ORC is going through an extra layer –VectorizedRowBatch -> OrcStruct -> ColumnarBatch
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 1 2 4 8 16 32 64 128 256 512 orc parquet json orc parquet orc parquet sales sales sales sales sales sales sales none none none zlib zlib snappy snappy Sales Times
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sales Read Performance Analysis Read performance is dominated by format –Compression matters less for this data set –Straight ordering: ORC, Parquet, & JSON Uses Row instead of ColumnarBatch
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 1 2 4 8 16 32 64 128 256 512 1024 orc parquet json orc parquet orc parquet github github github github github github github none none none zlib zlib snappy snappy Github Times
  • 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Github Read Performance Analysis JSON did really well A lot of columns needs more space –We need bigger stripes (add min rows in ORC-190) –Rows/stripe - ORC: 18.6k, Parquet: 88.1k Parquet struggles –Twitter recommends against Parquet for this case
  • 34. 34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Column Projection Often just need a few columns –Only ORC & Parquet are columnar –Only read, decompress, & deserialize some columns Spark FileFormat passes in desired schema –Drop columns that aren’t needed –JSON and Avro read first and then drop columns
  • 35. 35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 0 2 4 6 8 10 12 14 16 18 20 none snappy zlib none snappy zlib none snappy zlib none snappy zlib none snappy zlib none snappy zlib orc orc orc parquet parquet parquet orc orc orc parquet parquet parquet orc orc orc parquet parquet parquet github github github github github github sales sales sales sales sales sales taxi taxi taxi taxi taxi taxi Column Projection % Sizes
  • 36. 36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Predicate Pushdown Query: –select first_name, last_name from employees where hire_date between ‘01/01/2017’ and ‘01/03/2017’ Predicate: –hire_date between ‘01/01/2017’ and ‘01/03/2017’ Given to FileFormat via filters For benchmark, filter on a sorted column
  • 37. 37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Predicate Pushdown ORC & Parquet indexes with min & max –Sorted data is critical! ORC has optional bloom filters Reader filters out sections of file –Entire file –Stripe –Row group (only ORC, default 10k rows) Engine needs to apply row level filter
  • 38. 38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 1 4 16 64 256 1024 4096 16384 65536 262144 1048576 4194304 16777216 taxi sales github Predicate Pushdown Rows orc parquet total
  • 39. 39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Predicate Pushdown Parquet doesn’t pushdown timestamp filters –Taxi and Github filters were on timestamps. Spark defaults ORC predicate pushdown off. Small ORC stripes for Github lead to sub-10k row read. Because predicate pushdown is an optimization, it isn’t clear when it isn’t used.
  • 40. 40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Metadata Access ORC & Parquet store metadata –Stored in file footer –File schema –Number of records –Min, max, count of each column Provides O(1) Access
  • 41. 41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Conclusions
  • 42. 42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Recommendations Disclaimer – Everything changes! –Both these benchmarks and the formats will change. Evaluate needs –Column projection and predicate pushdown are only in ORC & Parquet –Determine how to sort data –Are bloom filters useful?
  • 43. 43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thank you! Twitter: @owen_omalley Email: owen@hortonworks.com