New Developments in Spark

Matei Zaharia
August 18th, 2015

About Databricks
Founded by the creators of Spark in 2013; remains the top contributor
End-to-end service for Spark on EC2
• Interactive notebooks, dashboards, and production jobs

Our Goal for Spark
Unified engine across data workloads and platforms
SQL, Streaming, ML, Graph, Batch, …

Past 2 Years
Fast growth in libraries and integration points
• New library for SQL + DataFrames
• 10x growth of ML library
• Pluggable data source API
• R language
Result: very diverse use of Spark
• Only 40% of users on Hadoop YARN
• Most users use at least 2 of Spark’s built-in libraries
• 98% of Databricks customers use SQL, 60% use Python

Beyond Libraries
The best thing about basing Spark’s libraries on a high-level API is that we can also make big changes underneath them
Now working on some of the largest changes to Spark Core since the project began

This Talk
Project Tungsten: CPU and memory efficiency
Network and disk I/O
Adaptive query execution

Hardware Trends
           2010             2015              Change
Storage    50+ MB/s (HDD)   500+ MB/s (SSD)   10x
Network    1 Gbps           10 Gbps           10x
CPU        ~3 GHz           ~3 GHz            none

Tungsten: Preparing Spark for Next 5 Years
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Off-heap memory management
(2) Runtime code generation
(3) Cache-aware algorithms

Interfaces to Tungsten
Spark SQL, DataFrames (Python, Java, Scala, R), RDDs, …
Data schema + query plan
Tungsten backends: JVM, LLVM, GPU, NVRAM, …

DataFrame API
Single-node tabular structure in R and Python, with APIs for:
• relational algebra (filter, join, …)
• math and stats
• input/output (CSV, JSON, …)
[Chart: Google Trends for “data frame”]

DataFrame: lingua franca for “small data”
head(flights)
#> Source: local data frame [6 x 16]
#>
#>   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1 2013     1   1      517         2      830        11      UA  N14228
#> 2 2013     1   1      533         4      850        20      UA  N24211
#> 3 2013     1   1      542         2      923        33      AA  N619AA
#> 4 2013     1   1      544        -1     1004       -18      B6  N804JB
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...

Spark DataFrames
Structured data collections with similar API to R/Python
• DataFrame = RDD + schema
Capture many operations as expressions in a DSL
• Enables rich optimizations
// Assumes an existing SQLContext named sqlContext (Spark 1.4+)
val df = sqlContext.read.json("tweets.json")
df.filter(df("user") === "matei")
  .groupBy("date")
  .sum("retweets")
[Chart: running time for Python RDD, Scala RDD, and DataFrame versions; axis ticks 0, 5, 10, units not shown]

1. Off-Heap Memory Management
Store data outside JVM heap to avoid object overhead & GC
• For RDDs: fast serialization libraries
• For DataFrames & SQL: binary format we compute on directly
2-10x space saving, especially for strings, nested objects
Can use new RAM-like devices, e.g. flash, 3D XPoint
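
As a rough illustration of computing directly on a binary format, the sketch below packs fixed-width rows into one contiguous buffer and sums a column by reading fields at computed offsets, so the data lives in a single buffer rather than as one heap object per record. This is not Tungsten's actual row layout, which the slides do not specify; the record layout and the names ROW, pack_rows, and sum_retweets are assumptions for illustration.

```python
import struct

# Hypothetical fixed-width row: 8-byte user id + 4-byte retweet count,
# little-endian with no padding (12 bytes per row).
ROW = struct.Struct("<qi")

def pack_rows(rows):
    """Pack (user_id, retweets) tuples into one contiguous bytearray."""
    buf = bytearray(ROW.size * len(rows))
    for i, (user_id, retweets) in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, user_id, retweets)
    return buf

def sum_retweets(buf):
    """Sum the retweets column by reading it straight out of the buffer."""
    total = 0
    for offset in range(0, len(buf), ROW.size):
        _, retweets = ROW.unpack_from(buf, offset)
        total += retweets
    return total

rows = [(1, 10), (2, 5), (3, 7)]
buf = pack_rows(rows)
print(len(buf), sum_retweets(buf))
```

The buffer size is exactly the row width times the row count, which is the kind of predictability a per-record object representation does not give.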

2. Runtime Code Generation
Generate Java code for DataFrame and SQL expressions requested by user
Avoids virtual calls and generics/boxing
Can do same in core, ML and graph
• Code-gen serializers, fused functions, math expressions
Evaluating “SELECT a+a+a” (time in seconds)
  Hand-written            9.3
  Codegen                 9.4
  Interpreted projection  36.7
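
The difference between interpreting an expression and generating code for it can be sketched in a few lines. Spark generates Java; the version below uses Python's exec as a stand-in, and the tuple-based expression tree and the names interpret, to_source, and generate are assumptions made for this example only.

```python
# Expression tree for a + a + a as nested tuples:
# ("add", left, right) or ("col", name).
EXPR = ("add", ("add", ("col", "a"), ("col", "a")), ("col", "a"))

def interpret(expr, row):
    """Walk the tree for every row: one dispatch per node per row."""
    if expr[0] == "col":
        return row[expr[1]]
    if expr[0] == "add":
        return interpret(expr[1], row) + interpret(expr[2], row)
    raise ValueError(f"unknown node {expr[0]!r}")

def to_source(expr):
    """Translate the tree into a source string once, before evaluation."""
    if expr[0] == "col":
        return f"row[{expr[1]!r}]"
    if expr[0] == "add":
        return f"({to_source(expr[1])} + {to_source(expr[2])})"
    raise ValueError(f"unknown node {expr[0]!r}")

def generate(expr):
    """Compile the generated source into a plain function of one row."""
    source = f"def projection(row):\n    return {to_source(expr)}\n"
    namespace = {}
    exec(source, namespace)
    return namespace["projection"]

projection = generate(EXPR)
rows = [{"a": 1}, {"a": 2}, {"a": 3}]
print([interpret(EXPR, r) for r in rows])
print([projection(r) for r in rows])
```

Both paths return the same values; the generated function simply skips the per-row tree walk, which is the overhead the slide's "interpreted projection" bar reflects.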

3. Cache-Aware Algorithms
Use custom memory layout to better leverage CPU cache
Example: AlphaSort-style prefix sort
• Store prefixes of sort key inside pointer array
• Compare prefixes to avoid full record fetches + comparisons
Naïve layout: pointer → record
Cache-friendly layout: (key prefix, pointer) → record
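
A minimal sketch of the prefix idea, assuming byte-string records that are their own sort keys and a 4-byte prefix (both assumptions, as are the function names). Python cannot show the cache effect itself, so the example counts how often a comparison has to fall back to the full records.

```python
from functools import cmp_to_key

PREFIX_LEN = 4  # assumed prefix width in bytes

def build_pointer_array(records):
    """Pair each record index (the "pointer") with the first bytes of its key."""
    return [(rec[:PREFIX_LEN], i) for i, rec in enumerate(records)]

def prefix_sort(records):
    """Sort record indices, fetching full records only when prefixes tie."""
    full_fetches = 0

    def compare(x, y):
        nonlocal full_fetches
        (px, ix), (py, iy) = x, y
        if px != py:
            # Decided from the pointer array alone.
            return -1 if px < py else 1
        # Prefixes tie: fall back to comparing the full records.
        full_fetches += 1
        rx, ry = records[ix], records[iy]
        return (rx > ry) - (rx < ry)

    entries = sorted(build_pointer_array(records), key=cmp_to_key(compare))
    return [i for _, i in entries], full_fetches

records = [b"delta-1", b"alpha-2", b"alpha-1", b"charlie"]
order, fetches = prefix_sort(records)
print([records[i] for i in order])
```

Only the two records sharing the prefix b"alph" ever need a full comparison; every other pair is ordered from the prefixes stored next to the pointers.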

Tungsten Performance Results
[Chart: run time (seconds, axis 0 to 1200) vs. relative data set size (1x, 2x, 4x, 8x, 16x) for four configurations: Default, Code Gen, Tungsten on-heap, Tungsten off-heap]

Motivation
Network and storage speeds have improved 10x, but this speed isn’t always easy to leverage!
Many challenges with:
• Keeping disk operations large (even on SSDs)
• Keeping network connections busy & balanced across cluster
• Doing all this on many cores and many disks

Sort Benchmark
Started by Jim Gray in 1987 to measure HW+SW advances
• Many entrants use purpose-built hardware & software
Participated in largest category: Daytona GraySort
• Sort 100 TB of 100-byte records in a fault-tolerant manner
Set a new world record (tied with UCSD)
• Saturated 8 SSDs and 10 Gbps network per node
• 1st time public cloud + open source won

On-Disk Sort Record
Time to sort 100 TB
  2013 Record: Hadoop   2100 machines   72 minutes
  2014 Record: Spark     207 machines   23 minutes
Also sorted 1 PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org
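
To put the two records on a per-machine footing, the arithmetic below divides the 100 TB input by the elapsed time and machine count from the slide. It assumes decimal terabytes, and it measures end-to-end sort throughput over the input size, not peak disk or network rates.

```python
TB = 10**12  # assumes decimal terabytes

def per_node_mb_per_sec(data_bytes, minutes, machines):
    """Average sort throughput per machine, in MB/s."""
    return data_bytes / (minutes * 60) / machines / 10**6

hadoop_2013 = per_node_mb_per_sec(100 * TB, 72, 2100)
spark_2014 = per_node_mb_per_sec(100 * TB, 23, 207)
machine_minutes_ratio = (72 * 2100) / (23 * 207)

print(f"Hadoop 2013: {hadoop_2013:.0f} MB/s per node")
print(f"Spark 2014:  {spark_2014:.0f} MB/s per node")
print(f"Machine-minutes ratio: {machine_minutes_ratio:.1f}x")
```

This works out to roughly 11 MB/s per node for the 2013 run and roughly 350 MB/s per node for the 2014 run, with about 32 times fewer machine-minutes in total.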

Saturating the Network
1.1 GB/sec per node

Motivation
Query planning is crucial to performance in a distributed setting
• Level of parallelism in operations
• Choice of algorithm (e.g. broadcast vs. shuffle join)
Hard to do well for big data even with cost-based optimization
• Unindexed data => don’t have statistics
• User-defined functions => hard to predict
Solution: let Spark change query plan adaptively

Traditional Spark Scheduling
file.map(word => (word, 1)).reduceByKey(_ + _)
  .sortByKey()
Stages: map → reduce → sort

Adaptive Planning
file.map(word => (word, 1)).reduceByKey(_ + _)
  .sortByKey()
Stages enter the plan one at a time: map, then reduce, then sort
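
One way to picture an adaptive choice of parallelism is to pick the reduce-side partition count only after the map stage has reported its output sizes. The sketch below is an illustration under assumed names and thresholds (choose_num_reducers, a per-reducer byte target, a reducer cap); the slides do not say how Spark makes this decision.

```python
import math

def choose_num_reducers(map_output_bytes, target_bytes_per_reducer, max_reducers):
    """Pick a reduce partition count from sizes observed after the map stage."""
    total = sum(map_output_bytes)
    wanted = math.ceil(total / target_bytes_per_reducer)
    # Always run at least one reducer and never exceed the cap.
    return max(1, min(max_reducers, wanted))

MB = 10**6
observed = [300 * MB, 500 * MB, 200 * MB]  # hypothetical map output sizes
print(choose_num_reducers(observed, 64 * MB, 200))
```

A static plan has to guess this number before any data has been read; deferring the choice lets it follow the actual size of the map output.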

Advanced Example: Join
Goal: Bring together data items with the same key

Shuffle join
(good if both
datasets large)

Broadcast join
(good if one dataset is small)

Hybrid join
(broadcast popular
key, shuffle rest)
More details: SPARK-9850
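
The first two join strategies, and a size-based choice between them, can be sketched on plain lists of (key, value) pairs. This is a single-process illustration with assumed names and an assumed row-count threshold, not the design tracked in SPARK-9850; the hybrid join is not implemented here.

```python
from collections import defaultdict

def broadcast_join(small, large):
    """Hash the small side once and stream the large side past it."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    return [(key, sv, lv) for key, lv in large for sv in table.get(key, [])]

def shuffle_join(left, right, num_partitions):
    """Partition both sides by key hash, then join matching partitions."""
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(broadcast_join(lp, rp))  # local hash join per partition
    return out

def adaptive_join(left, right, broadcast_threshold, num_partitions=4):
    """Choose the strategy from the observed row counts of the two inputs."""
    if len(left) <= broadcast_threshold:
        return "broadcast", broadcast_join(left, right)
    if len(right) <= broadcast_threshold:
        # Swap values back so the left value always comes first.
        return "broadcast", [(k, l, r) for k, r, l in broadcast_join(right, left)]
    return "shuffle", shuffle_join(left, right, num_partitions)

users = [("matei", "US"), ("ann", "UK")]
tweets = [("matei", 1), ("bob", 2), ("matei", 3), ("ann", 4)]
print(adaptive_join(users, tweets, broadcast_threshold=2))
```

Both strategies produce the same set of joined rows; they differ in how much data has to move, which is why the choice depends on sizes that are often only known at run time.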

Impact of Adaptive Planning
Level of parallelism: 2-3x
Choice of join algorithm: as much as 10x
Follow it at SPARK-9850

Effect of Optimizations in Core
Often, when we made one optimization, we saw all of the Spark components get faster
• Scheduler optimization for Spark Streaming => SQL 2x faster
• Network optimizations => speed up all comm-intensive libraries
• Tungsten => DataFrames, SQL and parts of ML
Same applies to other changes in core, e.g. debug tools

Conclusion
Spark has grown a lot, but it still remains the most active open source project in big data
Small core + high-level API => can make changes quickly
New hardware => exciting optimizations at all levels

Learn More: sparkhub.databricks.com
