Why you should care about
data layout in the file system
Cheng Lian, @liancheng
Vida Ha, @femineer
Spark Summit 2017
1
TEAM
About Databricks
Started Spark project (now Apache Spark) at UC Berkeley in 2009
2
PRODUCT
Unified Analytics Platform
MISSION
Making Big Data Simple
Apache Spark is a
powerful framework
with some temper
3
4
Just like
super mario
5
Serve him the
right ingredients
6
Powers up and
gets more efficient
7
Keep serving
8
He even knows
how to Spark!
9
However,
once served
a wrong dish...
10
Meh...
11
And sometimes...
12
It can be messy...
13
Secret sauces
we feed Spark
13
File Formats
14
Choosing a compression scheme
15
The obvious
• Compression ratio: the higher the better
• De/compression speed: the faster the better
Choosing a compression scheme
16
Splittable vs. non-splittable
• Affects parallelism, crucial for big data
• Common splittable compression schemes
• LZ4, Snappy, BZip2, LZO, etc.
• GZip is non-splittable
• Still common if file sizes are << 1GB
• Still applicable inside Parquet, which compresses its internal blocks, so files stay splittable
Columnar formats
Smart, analytics friendly, optimized for big data
• Support for nested data types
• Efficient data skipping
• Column pruning
• Min/max statistics based predicate push-down
• Nice interoperability
• Examples:
• Spark SQL built-in support: Apache Parquet and Apache ORC
• Newly emerging: Apache CarbonData and Spinach
17
Columnar formats
Parquet
• Apache Spark default output format
• Usually the best practice for Spark SQL
• Relatively heavy write path
• Worth the time to encode for repeated analytics scenarios
• Does not support fine-grained appending
• Not ideal for, e.g., collecting logs
• Check out Parquet presentations for more details
18
Semi-structured text formats
Sort of structured but not self-describing
• Excellent write path performance but slow on the read path
• Good candidates for collecting raw data (e.g., logs)
• Subject to inconsistent and/or malformed records
• Schema inference provided by Spark (for JSON and CSV)
• Sampling-based
• Handy for exploratory scenarios but can be inaccurate
• Always specify an accurate schema in production (see the sketch below)
19
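A minimal sketch of specifying an explicit schema on the read path; the column names and input path are hypothetical, and a SparkSession named spark is assumed to be in scope.

import org.apache.spark.sql.types._

// Hypothetical schema; replace with the real one for your dataset
val schema = StructType(Seq(
  StructField("artist", StringType),
  StructField("rating", IntegerType),
  StructField("year", IntegerType)
))

// Skips sampling-based inference and guarantees stable column types
val events = spark.read.schema(schema).json("/path/to/events.json")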
Semi-structured text formats
JSON
• Supported by Apache Spark out of the box
• One JSON object per line for fast file splitting
• JSON object: map or struct?
• Spark schema inference always treats JSON objects as structs
• Watch out for an arbitrary number of keys (may OOM executors)
• Specify an accurate schema if you decide to stick with maps
20
Semi-structured text formats
JSON
• Malformed records
• Bad records are collected into the _corrupt_record column (see the sketch below)
• All other columns are set to null
21
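A hedged sketch of capturing malformed JSON records; the input path and schema are hypothetical, and _corrupt_record is Spark's default corrupt-record column name (configurable via columnNameOfCorruptRecord).

import org.apache.spark.sql.types._

// When a schema is supplied, the corrupt-record column must be part of it
val schema = StructType(Seq(
  StructField("artist", StringType),
  StructField("rating", IntegerType),
  StructField("_corrupt_record", StringType)
))

val parsed = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE") // keep bad rows instead of failing
  .json("/path/to/events.json")

// Bad records keep their raw text here; all other columns are null
val badRecords = parsed.filter(parsed("_corrupt_record").isNotNull)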
Semi-structured text formats
CSV
• Supported by Spark 2.x out of the box
• Check out the spark-csv package for Spark 1.x
• Often used for handling legacy data providers & consumers
• Lacks a standard file specification
– Separator, escaping, quoting, etc.
• Lacks support for nested data types
22
Raw text files
23
Arbitrary line-based text files
• Splitting files into lines using spark.read.text()
• Keep your lines a reasonable size
• Keep file size < 1GB if compressed with a non-splittable
compression scheme (e.g., GZip)
• Handling inevitable malformed data (see the sketch below)
• Use a filter() transformation to drop bad lines, or
• Use a map() transformation to fix bad lines
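A minimal sketch of both approaches, assuming tab-separated log lines; the path and the notion of "malformed" are hypothetical.

import spark.implicits._

// spark.read.text() yields one row per line in a single "value" column
val lines = spark.read.text("/path/to/raw/logs").as[String]

// filter(): drop lines that do not have the expected number of fields
val wellFormed = lines.filter(line => line.split("\t", -1).length == 3)

// map(): fix lines that can be repaired, e.g. strip a stray trailing tab
val repaired = lines.map(line => line.stripSuffix("\t"))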
Directory layout
24
Partitioning
year=2017/genre=classic/albums
year=2017/genre=folk/albums
25
Overview
• Coarse-grained data skipping
• Available for both persisted
tables and raw directories
• Automatically discovers Hive
style partitioned directories
CREATE TABLE ratings
USING PARQUET
PARTITIONED BY (year, genre)
AS SELECT artist, rating, year, genre
FROM music
Partitioning
SQL DataFrame API
spark
  .table("music")
  .select('artist, 'rating, 'year, 'genre)
  .write
  .format("parquet")
  .partitionBy("year", "genre")
  .saveAsTable("ratings")
26
Partitioning
year=2017/genre=classic/albums
year=2017/genre=folk/albums
27
Filter predicates
Use simple filter predicates
containing partition columns to
leverage partition pruning
Partitioning
Filter predicates
• year = 2000 AND genre = 'folk'
• year > 2000 AND rating > 3
• year > 2000 OR genre <> 'rock'
28
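A hedged DataFrame sketch of a pruning-friendly query against the partitioned ratings table defined earlier; it assumes a SparkSession named spark and the spark.implicits._ import.

import spark.implicits._

// Both predicates reference only partition columns, so Spark can skip
// entire year=.../genre=... directories without reading their files
val folkFrom2000 = spark
  .table("ratings")
  .where($"year" === 2000 && $"genre" === "folk")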
Partitioning
29
Filter predicates that cannot leverage partition pruning
• year > 2000 OR rating = 5
• year > rating
Partitioning
Avoid excessive partitions
• They stress the metastore for persisted tables
• They stress the file system when reading directly from the file system
• Suggestions
• Avoid using too many partition columns
• Avoid using partition columns with too many distinct values
– Try hashing the values
– E.g., partition by the first letter of the first name rather than the full first name (see the sketch below)
30
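A hedged sketch of that last suggestion: derive a low-cardinality column from a high-cardinality one and partition by it. The users DataFrame and the first_name column are hypothetical.

import org.apache.spark.sql.functions._

// Partition by the first letter of the name instead of the full name
users
  .withColumn("name_prefix", upper(substring(col("first_name"), 1, 1)))
  .write
  .format("parquet")
  .partitionBy("name_prefix")
  .saveAsTable("users_by_prefix")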
Partitioning
Scalable partition handling
Using persisted partitioned tables with Spark 2.1+
• Per-partition metadata gets persisted into the metastore
• Avoids unnecessary partition discovery (esp. valuable for S3)
Check our blog post for more details
31
Bucketing
Overview
• Pre-shuffles and optionally pre-sorts the data while writing
• Layout information gets persisted in the metastore
• Avoids shuffling and sorting when joining large datasets
• Only available for persisted tables
32
CREATE TABLE ratings
USING PARQUET
PARTITIONED BY (year, genre)
CLUSTERED BY (rating) INTO 5 BUCKETS
SORTED BY (rating)
AS SELECT artist, rating, year, genre
FROM music
Bucketing
SQL
33
DataFrame
ratings
  .select('artist, 'rating, 'year, 'genre)
  .write
  .format("parquet")
  .partitionBy("year", "genre")
  .bucketBy(5, "rating")
  .sortBy("rating")
  .saveAsTable("ratings")
Bucketing
In combo with columnar formats
• Bucketing
• Per-bucket sorting
• Columnar formats
• Efficient data skipping based on min/max statistics
• Works best when the searched columns are sorted
34
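A hedged sketch of where the combination pays off, using the bucketed and sorted ratings table from the previous slides; the second table and the join are illustrative.

import org.apache.spark.sql.functions.col

// Per-bucket sorting plus Parquet min/max statistics let Spark skip most
// row groups for a selective filter on the sorted column
spark.table("ratings").where(col("rating") === 5)

// Joining two tables bucketed the same way on the bucketing column
// avoids shuffling either side (other_ratings is hypothetical)
spark.table("ratings").join(spark.table("other_ratings"), "rating")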
Bucketing
35
Bucketing
36
(diagram: per-file min/max statistics, e.g. min=0/max=99, min=100/max=199, min=200/max=249)
Bucketing
In combo with columnar formats
Perfect combination, makes your Spark jobs FLY!
37
More tips
38
File size and compaction
Avoid small files
• Cause excessive parallelism
• Spark 2.x improves this by packing small files
• Cause extra file metadata operations
• Particularly bad when hosted on S3
39
File size and compaction
40
How to control output file sizes (see the sketch below)
• In general, one task in the output stage writes one file
• Tune the parallelism of the output stage
• coalesce(N), for
• Reducing parallelism for small jobs
• repartition(N), for
• Increasing parallelism for all jobs, or
• Reducing parallelism of the final output stage for large jobs
• Still preserves high parallelism for previous stages
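A minimal sketch of both knobs; df and the output paths are hypothetical.

// coalesce(): merges existing partitions without a shuffle; good for
// shrinking a small job's output down to a few files
df.coalesce(8).write.parquet("/path/to/output-small")

// repartition(): full shuffle; controls the parallelism of the final
// output stage while earlier stages keep their original parallelism
df.repartition(256).write.parquet("/path/to/output-large")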
True story
Customer
• Spark ORC Read Performance is much slower than Parquet
• The same query took
• 3 seconds on a Parquet dataset
• 4 minutes on an equivalent ORC dataset
41
True story
Me
• Ran a simple count(*), which took
• Seconds on the Parquet dataset with a handful of IO requests
• 35 minutes on the ORC dataset with 10,000s of IO requests
• Most task execution threads are reading ORC stripe footers
42
True story
43
True story
44
import org.apache.hadoop.hive.ql.io.orc._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
val conf = new Configuration
def countStripes(file: String): Int = {
val path = new Path(file)
val reader = OrcFile.createReader(path, OrcFile.readerOptions(conf))
val metadata = reader.getMetadata
metadata.getStripeStatistics.size
}
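Hypothetical usage of the helper above against a couple of part files from the dataset; the paths are placeholders.

val files = Seq(
  "s3a://bucket/warehouse/ratings/part-00000.orc",
  "s3a://bucket/warehouse/ratings/part-00001.orc"
)
// Print each file alongside its stripe count
files.map(f => f -> countStripes(f)).foreach(println)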
True story
45
Maximum file size: ~15 MB
Maximum ORC stripe counts: ~1,400
True story
46
Root cause
Malformed (but not corrupted) ORC dataset
• ORC readers read the footer first before consuming a stripe
• ~1,400 stripes within a single file as small as 15 MB
• ~1,400 x 2 read requests issued to S3 for merely 15 MB of data
True story
47
Root cause
Malformed (but not corrupted) ORC dataset
• ORC readers read the footer first before consuming a stripe
• ~1,400 stripes within a single file as small as 15 MB
• ~1,400 x 2 read requests issued to S3 for merely 15 MB of data
Much worse than even CSV, not to mention Parquet
True story
48
Why?
• Tiny ORC files (~10 KB) generated by Streaming jobs
• Resulting in one tiny ORC stripe inside each ORC file
• The footers might take even more space than the actual data!
True story
49
Why?
Tiny files got compacted into larger ones using
ALTER TABLE ... PARTITION (...) CONCATENATE;
The CONCATENATE command just, well, concatenated those tiny
stripes and produced larger (~15 MB) files with a huge number of
tiny stripes.
True story
50
Lessons learned
Again, avoid writing small files in columnar formats
• Output files using CSV or JSON for Streaming jobs
• For better write path performance
• Compact small files into large chunks of columnar files later
• For better read path performance
True story
51
The cure
Simply read the ORC dataset and write it back using
spark.read.orc(input).write.orc(output)
So that stripes get adjusted into more reasonable sizes (see the sketch below).
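A hedged sketch of the rewrite, with an optional repartition() to also control how many output files come out; the paths and partition count are hypothetical.

val input = "/path/to/streaming/orc"
val output = "/path/to/compacted/orc"

// Reading and rewriting through Spark rebuilds the files with healthy,
// reasonably sized stripes instead of thousands of tiny ones
spark.read.orc(input).repartition(16).write.orc(output)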
Schema evolution
Columns come and go
• Never ever change the data type of a published column
• Columns with the same name should have the same data type
• If you really dislike the data type of some column
• Add a new column with a new name and the right data type
• Deprecate the old one
• Optionally, drop it after updating all downstream consumers
52
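A hedged sketch of the "add, deprecate, drop later" approach; the table, column names, and target type are hypothetical.

import org.apache.spark.sql.functions._

// Keep the old column for existing consumers; add a correctly typed twin
val migrated = spark.table("ratings")
  .withColumn("rating_int", col("rating_str").cast("int"))

// Downstream jobs switch to rating_int; rating_str can be dropped later
migrated.write.format("parquet").mode("overwrite").saveAsTable("ratings_v2")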
Schema evolution
Columns come and go
Spark built-in data sources that support schema evolution
• JSON
• Parquet
• ORC
53
Schema evolution
Common columnar formats are less tolerant of data type
mismatch. E.g.:
• INT cannot be promoted to LONG
• FLOAT cannot be promoted to DOUBLE
JSON is more tolerant, though
• LONG → DOUBLE → STRING
54
True story
Customer
Parquet dataset corrupted!!! HALP!!!
55
True story
What happened?
Original schema
• {col1: DECIMAL(19, 4), col2: INT}
Accidentally appended data with schema
• {col1: DOUBLE, col2: DOUBLE}
All files written into the same directory
56
True story
What happened?
Common columnar formats are less tolerant of data type
mismatch. E.g.:
• INT cannot be promoted to LONG
• FLOAT cannot be promoted to DOUBLE
Parquet considered these schemas as incompatible ones and
refused to merge them.
57
True story
BTW
JSON schema inference is more tolerant
• LONG → DOUBLE → STRING
However
• JSON is NOT suitable for analytics scenarios
• Schema inference is unreliable and not suitable for production
58
True story
The cure
Correct the schema
• Filter out all the files with the wrong schema
• Rewrite those files using the correct schema
Exhausting, because all files were appended into a single directory (see the sketch below)
59
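A hedged sketch of that repair, following the story above; the directory listing, paths, and casts are simplified, and the exact column names are assumptions.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.types.DoubleType

val dir = new Path("/path/to/mixed-schema/dataset")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Split part files by whether they carry the wrong (all-DOUBLE) schema
val parts = fs.listStatus(dir).map(_.getPath.toString).filter(_.endsWith(".parquet"))
val (wrong, correct) = parts.partition { file =>
  spark.read.parquet(file).schema("col1").dataType == DoubleType
}

// Rewrite the wrong files with the intended schema into a clean location
wrong.foreach { file =>
  spark.read.parquet(file)
    .selectExpr("CAST(col1 AS DECIMAL(19, 4)) AS col1", "CAST(col2 AS INT) AS col2")
    .write.mode("append").parquet("/path/to/fixed/dataset")
}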
True story
Lessons learned
• Be very careful on the write path
• Consider partitioning when possible
• Better read path performance
• Easier to fix the data when something goes wrong
60
Recap
File formats
• Compression schemes
• Columnar (Parquet, ORC)
• Semi-structured (JSON, CSV)
• Raw text format
Directory layout
• Partitioning
• Bucketing
Other tips
• File sizes and compaction
• Schema evolution
61
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (community edition)
62
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today.
databricks.com
Early draft available
for free today!
go.databricks.com/book
63
Thank you
Q & A
64
