Gengliang Wang, Databricks
Apache Spark’s Built-in File Sources in Depth
About me
• Gengliang Wang (GitHub: gengliangwang)
• Software Engineer at Databricks
• Apache Spark committer
Agenda
• File formats
• Data layout
• File reader internals
• File writer internals
Built-in File formats
• Column-oriented
– Parquet
– ORC
• Row-oriented
– Avro
– JSON
– CSV
– Text
– Binary
Column vs. Row
Example table "music"

artist         genre    year  album
Elvis Presley  Country  1965  Love Me Tender
The Beatles    Rock     1966  Revolver
Nirvana        Rock     1991  Nevermind
Column vs. Row
Column layout (values stored column by column)
artist: Elvis Presley, The Beatles, Nirvana
genre: Country, Rock, Rock
year: 1965, 1966, 1991
album: Love Me Tender, Revolver, Nevermind

Row layout (values stored row by row)
Elvis Presley, Country, 1965, Love Me Tender
The Beatles, Rock, 1966, Revolver
Nirvana, Rock, 1991, Nevermind
Column-oriented formats

SELECT album
FROM music
WHERE artist = 'The Beatles';

Only the artist and album columns need to be read; the remaining columns are skipped entirely.
Row-oriented formats

SELECT * FROM music;
INSERT INTO music ...

Whole rows are stored together, which suits full-row scans and record-at-a-time writes.
Column vs. Row
Column
• Read-heavy workloads
• Queries that require only a subset of the table columns
Row
• Write-heavy workloads
– event-level data
• Queries that require most or all of the table columns
Built-in File formats
• Column-oriented
– Parquet
– ORC
• Row-oriented
– Avro
– JSON
– CSV
– Text
– Binary
Parquet
File Structure
• A list of row groups
• Footer
– Metadata of each row group
• Min/max of each column
– Schema
Parquet
Row group
• A list of column chunks
– One column chunk per column
Column chunk
• A list of data pages
• An optional dictionary page
Parquet
Predicates pushed down
• a = 1
• a < 1
• a > 1
• …
Row group skipping
• Footer stats (min/max)
• Dictionary pages of column chunks
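A minimal sketch of what this looks like from the API (the path and columns below are illustrative, not from the deck): the filter is pushed into the Parquet scan, and row groups whose footer min/max statistics exclude the predicate are skipped.

// Pushdown is enabled by default; set explicitly here only for illustration.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

val df = spark.read.parquet("/data/music")   // hypothetical path
  .where("year > 1970")                      // candidate for row group skipping
  .select("artist", "album")

df.explain()   // the scan node lists the predicate under PushedFilters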
ORC
File Structure
• A list of stripes
• Footer
– Column-level aggregates: count, min, max, and sum
– The number of rows per stripe
ORC
Stripe Structure
• Indexes
– Start positions of each column
– Min/max of each column
• Data
Supports stripe skipping
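A comparable hedged sketch for ORC (path illustrative): the pushed-down predicate is checked against the stripe-level min/max indexes, so whole stripes can be skipped.

spark.conf.set("spark.sql.orc.filterPushdown", "true")   // enabled by default in recent Spark versions

val orcDf = spark.read.orc("/data/music_orc")            // hypothetical path
  .where("year >= 1990")

orcDf.explain()   // pushed predicates show up on the ORC scan node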
Built-in File formats
• Column-oriented
– Parquet
– ORC
• Row-oriented
– Avro
– JSON
– CSV
– Text
– Binary
Avro
• Compact
• Fast
• Robust support for schema evolution
– Adding fields
– Removing fields
– Renaming fields (with aliases)
– Changing default values
– Changing field docs
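A minimal read/write sketch (paths hypothetical). The Avro source ships as a separate module, so a spark-avro package matching your Spark and Scala versions must be on the classpath.

// e.g. spark-submit --packages org.apache.spark:spark-avro_2.12:3.0.0 ...

val avroEvents = spark.read.format("avro").load("/data/events_avro")   // hypothetical path

avroEvents.write
  .format("avro")
  .save("/data/events_avro_copy")                                      // hypothetical path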
Semi-structured text formats: JSON & CSV
• Excellent write-path performance but slow on the read path
– Good candidates for collecting raw data (e.g., logs)
• Subject to inconsistent and/or malformed records
• Schema inference provided by Spark
– Sampling-based
– Handy for exploratory scenarios but can be inaccurate
JSON
• JSON object: map or struct?
– Spark schema inference always treats JSON objects as structs
– Watch out for an arbitrary number of keys (may OOM executors)
– Specify an accurate schema if you decide to stick with maps
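A sketch of providing an explicit schema so that a JSON object with unbounded keys is read as a single map column instead of an ever-growing inferred struct (field names and path are illustrative).

import org.apache.spark.sql.types._

val jsonSchema = StructType(Seq(
  StructField("user_id", StringType),
  StructField("attributes", MapType(StringType, StringType))   // arbitrary keys end up in one map column
))

val json = spark.read.schema(jsonSchema).json("/data/raw_json")   // hypothetical path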
Avro vs. JSON
Avro
• Compact and fast
• Accurate schema inference
• High data quality and robust schema evolution support; a good choice for organizational-scale development
– Event data from Web/iOS/Android clients
JSON
• Repeats every field name with every single record
• Schema inference can be inaccurate
• Lightweight, easy development and debugging
CSV
• Often used for handling legacy data providers & consumers
– Lacks a standard file specification
• Separator, escaping, quoting, etc.
– Lacks support for nested data types
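Because there is no single CSV dialect, it is usually safer to spell out the options and the schema instead of relying on inference. A hedged sketch (separator, schema, and path are illustrative):

val csv = spark.read
  .option("header", "true")
  .option("sep", ";")
  .option("quote", "\"")
  .option("escape", "\\")
  .option("mode", "DROPMALFORMED")   // or PERMISSIVE / FAILFAST
  .schema("artist STRING, genre STRING, year INT, album STRING")
  .csv("/data/music_csv")            // hypothetical path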
Raw text files
Arbitrary line-based text files
• Split files into lines using spark.read.text()
– Keep your lines a reasonable size
• Limited schema with only one field
– value (StringType)
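A small sketch (path illustrative): every line becomes one row in the single value column.

import org.apache.spark.sql.functions.col

val lines = spark.read.text("/data/app_logs")   // hypothetical path
lines.printSchema()                             // root |-- value: string

val errors = lines.filter(col("value").contains("ERROR"))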
Binary
• New file data source in Spark 3.0
• Reads binary files and converts each file into a
single record that contains the raw content and
metadata of the file
Binary
Schema
• path (StringType)
• modificationTime (TimestampType)
• length (LongType)
• content (BinaryType)
Binary
To reads all JPG files recursively from the input
directory and ignores partition discovery
27
spark.read.format("binaryFile")
.option("pathGlobFilter", "*.jpg")
.option("recursiveFileLookup", "true")
.load("/path/to/dir")
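A short follow-up sketch: because the whole file lands in the content column, it can help to inspect sizes via the metadata columns before materializing the bytes (the threshold below is illustrative).

val images = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.jpg")
  .option("recursiveFileLookup", "true")
  .load("/path/to/dir")

images.select("path", "length")
  .where("length < 10 * 1024 * 1024")   // e.g., only files under ~10 MB
  .show(false)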
Agenda
• File formats
• Data layout
• File reader internals
• File writer internals
Data layout
• Partitioning
• Bucketing
Partitioning
• Coarse-grained data skipping
• Available for both persisted tables and raw directories
• Automatically discovers Hive-style partitioned directories
Partitioning
Filter predicates
• Use simple filter predicates containing partition columns to leverage partition pruning
• Avoids unnecessary partition discovery (esp. valuable for S3)

SELECT * FROM music WHERE year = 2019 AND genre = 'folk'
Partitioning
Check our blog post https://tinyurl.com/y2eow2rx for more details
Partitioning
SQL
CREATE TABLE ratings
USING PARQUET
PARTITIONED BY (year, genre)
AS SELECT artist, rating, year, genre
FROM music

DataFrame API
spark
  .table("music")
  .select("artist", "rating", "year", "genre")
  .write
  .format("parquet")
  .partitionBy("year", "genre")
  .saveAsTable("ratings")
Partitioning: new feature in Spark 3.0
More details in the talk "Dynamic Partition Pruning in Apache Spark" by Bogdan Ghit & Juliusz Sompolski
Partitioning
Avoid excessive partitions
• Stresses the metastore when fetching partition columns and values
• Stresses the file system when listing files under partition directories
– Esp. slow on S3
• Suggestions
– Avoid using too many partition columns
– Avoid using partition columns with too many distinct values
• Try hashing the values (see the sketch below)
• E.g., partition by the first letter of the first name rather than the full first name
– Try Delta Lake
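A hedged sketch of the "partition by a derived, lower-cardinality value" suggestion (table, column, and path names are made up):

import org.apache.spark.sql.functions.{col, substring}

spark.table("users")                                    // hypothetical table
  .withColumn("name_prefix", substring(col("first_name"), 1, 1))
  .write
  .format("parquet")
  .partitionBy("name_prefix")                           // a few dozen partitions instead of one per name
  .save("/data/users_by_prefix")                        // hypothetical path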
Partitioning
Delta Lake's transaction logs are analyzed in parallel by Spark
• No need to stress the Hive Metastore for partition values or the file system for listing files
• Able to handle billions of partitions and files
Data layout
• Partitioning
• Bucketing
Bucketing
Sort merge join without bucketing: each side runs File Scan -> Exchange -> Sort before the join. The shuffle (Exchange) is slow.
Bucketing
Bucketing pre-shuffles and pre-sorts the data at write time, standing in for the Exchange and Sort steps of the join.
Bucketing
With both inputs bucketed, the plan is just File Scan -> Sort merge join <- File Scan. Use bucketing on tables that are frequently JOINed with the same keys.
Bucketing
SQL
CREATE TABLE ratings
USING PARQUET
PARTITIONED BY (year, genre)
CLUSTERED BY (rating) INTO 5 BUCKETS
SORTED BY (rating)
AS SELECT artist, rating, year, genre
FROM music

DataFrame API
spark
  .table("music")
  .select("artist", "rating", "year", "genre")
  .write
  .format("parquet")
  .partitionBy("year", "genre")
  .bucketBy(5, "rating")
  .sortBy("rating")
  .saveAsTable("ratings")
Bucketing
In combination with columnar formats: works best when the searched columns are sorted
• Bucketing
– Per-bucket sorting
• Columnar formats
– Efficient data skipping based on min/max statistics
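A hedged sketch of the payoff (table names hypothetical): two tables bucketed and sorted the same way on the join key can be sort-merge joined without an Exchange, which is visible in the physical plan.

// Both tables were written with the same bucket count and key, e.g.
//   .write.bucketBy(16, "user_id").sortBy("user_id").saveAsTable(...)

val clicks = spark.table("clicks_bucketed")   // hypothetical bucketed table
val users  = spark.table("users_bucketed")    // hypothetical bucketed table

val joined = clicks.join(users, "user_id")
joined.explain()                              // no Exchange on either side if the bucketing lines up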
Agenda
• File formats
• Data layout
• File reader internals
• File writer internals
Query example
spark.read.parquet("/data/events")
.where("year = 2019")
.where("city = 'Amsterdam'")
.select("timestamp")
Goal
• Understand data and skip unneeded data
• Split file into partitions for parallel read
Data layout
The /data/events directory is partitioned by year (year=2016 ... year=2019). Each partition holds Parquet files; each file contains row groups 0..N with column chunks for city, timestamp, OS, browser, and the other columns, plus a footer.
Partition pruning

spark
  .read.parquet("/data/events")
  .where("year = 2019")
  .where("city = 'Amsterdam'")
  .select("timestamp")
  .collect()

The filter on the partition column year prunes every directory except year=2019; the other partitions are never listed or read.
Row group skipping

spark
  .read.parquet("/data/events")
  .where("year = 2019")
  .where("city = 'Amsterdam'")
  .select("timestamp")
  .collect()

Within the remaining files, the footer's min/max statistics for city let the reader skip row groups that cannot contain 'Amsterdam'.
Column pruning

spark
  .read.parquet("/data/events")
  .where("year = 2019")
  .where("city = 'Amsterdam'")
  .select("timestamp")
  .collect()

Only the city and timestamp column chunks are read from the surviving row groups; OS, browser, and the other columns are skipped.
Goal
• Understand data and skip unneeded data
• Split file into partitions for parallel read
Partitions of same size
The files on HDFS (File 0, File 1, File 2) are split into Spark input partitions of the same size (Partition 0, Partition 1, Partition 2); partition boundaries do not have to line up with file boundaries.
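The split size is controlled by a couple of SQL configs; a hedged sketch of tuning and inspecting them (the values shown are the usual defaults):

// Target size of each input partition and the cost charged per opened file,
// which keeps Spark from creating lots of tiny partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728L)   // 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", 4194304L)       // 4 MB

val events = spark.read.parquet("/data/events")
println(events.rdd.getNumPartitions)   // number of planned input partitions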
Driver: plan input partitions
1. The Spark driver splits the file into input partitions (Partition 0, Partition 1, Partition 2).
2. The driver launches read tasks on the executors (Executor 0, Executor 1, Executor 2), one task per partition.

Executor: distributed reading
3. Each executor does the actual reading of its partition.
Agenda
• File formats
• Data layout
• File reader internals
• File writer internals
Query example
val trainingData =
spark.read.parquet("/data/events")
.where("year = 2019")
.where("city = 'Amsterdam'")
.select("timestamp")
trainingData.write.parquet("/data/results")
Goal
• Parallel
• Transactional
Distributed and transactional write, step by step
1. Write task: the Spark driver launches a write task on each executor (Executor 0, Executor 1, Executor 2).
2. Write to files: each task writes its output file (part-00000, part-00001, part-00002) to a different temporary path.
Everything should be temporary
results/_temporary

Files should be isolated between jobs
results/_temporary/<job id>

Task output is also temporary
results/_temporary/<job id>/_temporary

Files should be isolated between tasks
results/_temporary/<job id>/_temporary/<task attempt id>/  (parquet files; one directory per task attempt)
3. Commit task: when a task finishes writing its files, it commits the task output.
File layout
results/_temporary/<job id>/_temporary/<task attempt id>/  (in progress)
results/_temporary/<job id>/<task id>/  (committed parquet files)
If a task aborts
3. Abort task: instead of committing, the task is aborted.
On task abort, the task's temporary output path is deleted.
4. Relaunch task: the failed task is relaunched.
5. Commit task: the relaunched task writes to its own temporary path and commits once it succeeds.
Distributed and Transactional Write
6. Commit job: once every task has committed, the driver commits the whole job.
File layout
results/  (the committed parquet files; the _temporary staging directory is gone after job commit)
Almost transactional
Spark stages output files in a temporary location.
• Commit: move the staged files to their final locations.
• Abort: delete the staged files.
The window of failure is small.

See more on concurrent reads and writes in cloud storage:
• Transactional writes to cloud storage – Eric Liang
• Diving Into Delta Lake: Unpacking The Transaction Log – Databricks blog
Recap
File formats
• Column-oriented
– Parquet
– ORC
• Row-oriented
– Avro
– JSON
– CSV
– Text
– Binary
Data layout
• Partitioning
• Bucketing
File reader internals
• Understand data and skip unneeded data
• Split files into partitions for parallel reads
File writer internals
• Parallel
• Transactional
Thank you!
Q & A