Shark: SQL and Rich Analytics at Scale

Reynold Xin, Josh Rosen, Matei Zaharia, Michael Franklin, Scott
Shenker, Ion Stoica

AMPLab, UC Berkeley

June 25 @ SIGMOD 2013

Challenges

Data size growing

» Processing has to scale out over large
clusters

» Faults and stragglers complicate DB design

Complexity of analysis increasing

» Massive ETL (web crawling)

» Machine learning, graph processing

» Leads to long running jobs

What’s good about
MapReduce?

1.  Scales out to thousands of nodes in a fault-
tolerant manner

2.  Good for analyzing semi-structured data and
complex analytics

3.  Elasticity (cloud computing)

4.  Dynamic, multi-tenant resource sharing

“parallel relational database systems are
significantly faster than those that rely on the
use of MapReduce for their query engines”

“I totally agree.”

This Research

1.  Shows that the MapReduce model can be extended to
support SQL efficiently

»  Started from a powerful MR-like engine (Spark)

»  Extended the engine in various ways

2.  The artifact: Shark, a fast engine on top of MR

»  Performant SQL

»  Complex analytics in the same engine

»  Maintains MR benefits, e.g. fault tolerance

MapReduce Fundamental Properties?

Data-parallel operations

» Apply the same operations on a defined set of data

Fine-grained, deterministic tasks

» Enables fault tolerance and straggler mitigation
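As a rough illustration of why fine-grained, deterministic tasks make recovery cheap, the sketch below simulates a job on plain Scala collections. The names `runTask`, `runJob`, and `failsOnce` are hypothetical and are not part of Spark or Shark; the point is only that a lost task can be rerun on its own and yields the same result.

```scala
object DeterministicTasks {
  // A task is a pure function applied to one partition of the input.
  def runTask(partition: Seq[Int]): Int = partition.map(x => x * x).sum

  // Run one task per partition. If a (simulated) failure hits a partition,
  // rerun only that task; determinism guarantees the same answer.
  def runJob(partitions: Seq[Seq[Int]], failsOnce: Set[Int]): Int = {
    val results = partitions.zipWithIndex.map { case (part, idx) =>
      val firstAttempt = if (failsOnce.contains(idx)) None else Some(runTask(part))
      firstAttempt.getOrElse(runTask(part)) // retry only the lost task
    }
    results.sum
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2), Seq(3, 4), Seq(5))
    println(runJob(parts, failsOnce = Set.empty))
    println(runJob(parts, failsOnce = Set(1)))
  }
}
```

Both calls in `main` print the same total, because rerunning the task for partition 1 reproduces its original output.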

Why Were Databases Faster?

Data representation

» Schema-aware, column-oriented, etc

» Co-partitioning and co-location of data

Execution strategies

» Scheduling/task launching overhead (~20s in Hadoop)

» Cost-based optimization

» Indexing

Lack of mid-query fault tolerance

» MR’s pull model costly compared to DBMS “push”

See Pavlo 2009, Xin 2013.

Data representation and execution strategies: not fundamental to “MapReduce”

Mid-query fault tolerance: can be surprisingly cheap

Introducing Shark

MapReduce-based architecture

» Uses Spark as the underlying execution engine

» Scales out and tolerates worker failures

Performant

» Low-latency, interactive queries

» (Optionally) in-memory query processing

Expressive and flexible

» Supports both SQL and complex analytics

» Hive compatible (storage, UDFs, types, metadata, etc)

Spark Engine

Fast MapReduce-like engine

» In-memory storage for fast iterative computations

» General execution graphs

» Designed for low latency (~100ms jobs)

Compatible with Hadoop storage APIs

» Reads from and writes to any Hadoop-supported system, including
HDFS, HBase, SequenceFiles, etc.

Growing open source platform

» 17 companies contributing code

More Powerful MR Engine

General task DAG

Pipelines functions
within a stage

Cache-aware data
locality and reuse

Partitioning-aware
to avoid shuffles

[Figure: example task DAG over datasets A to G built from map, union, groupBy, and join operations and split into Stages 1, 2, and 3; the legend marks previously computed partitions.]
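The pipelining point can be sketched with plain Scala iterators. This is only an analogy on a local collection, not Spark's scheduler code, and the object and function names are hypothetical: chained narrow transformations are applied record by record, without materializing an intermediate collection between the functions of one stage.

```scala
object PipelinedStage {
  // All three functions run inside one "stage": each record flows through
  // trim -> filter -> length before the next record is read.
  def stage(partition: Iterator[String]): Iterator[Int] =
    partition
      .map(_.trim)        // function 1
      .filter(_.nonEmpty) // function 2
      .map(_.length)      // function 3

  def main(args: Array[String]): Unit = {
    // Nothing is computed until the result is consumed by toList.
    println(stage(Iterator(" a ", "", "bcd ")).toList)
  }
}
```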

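Hive Architecture

[Diagram: a client (CLI, JDBC) talks to the driver, which contains the SQL parser, query optimizer, physical plan, and execution components and consults the metastore; execution runs on MapReduce over Hadoop Storage (HDFS, S3, …).]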

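Shark Architecture

[Diagram: a client (CLI, JDBC) talks to the driver, which contains the SQL parser, query optimizer, physical plan, execution, and cache manager components and consults the metastore; execution runs on Spark over Hadoop Storage (HDFS, S3, …).]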

Extending Spark for SQL

Columnar memory store

Dynamic query optimization

Miscellaneous other optimizations (distributed
top-K, partition statistics pruning a.k.a. coarse-
grained indexes, co-partitioned joins, …)
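To make the "partition statistics pruning" item concrete, here is a minimal sketch that assumes each partition keeps the minimum and maximum of one column. `PartitionStats` and `prune` are hypothetical names, not Shark's API; the idea is that a range predicate can skip any partition whose value range cannot match.

```scala
object PartitionPruning {
  // Hypothetical per-partition summary: min and max of a single column.
  final case class PartitionStats(id: Int, min: Long, max: Long)

  // Keep only the partitions whose [min, max] range overlaps the
  // query range [lo, hi]; all other partitions need not be scanned.
  def prune(stats: Seq[PartitionStats], lo: Long, hi: Long): Seq[Int] =
    stats.filter(s => s.max >= lo && s.min <= hi).map(_.id)

  def main(args: Array[String]): Unit = {
    val stats = Seq(
      PartitionStats(0, 1L, 10L),
      PartitionStats(1, 11L, 20L),
      PartitionStats(2, 21L, 30L))
    // A predicate such as "col BETWEEN 12 AND 15" touches one partition.
    println(prune(stats, 12L, 15L))
  }
}
```

The statistics act as a coarse-grained index: they never identify individual rows, only partitions that can be skipped entirely.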

Columnar Memory Store

Simply caching records as JVM objects is inefficient
(huge overhead in MR’s record-oriented model)

Shark employs column-oriented storage: a
partition of columns is one MapReduce “record”.

Column Storage

1      2      3
john   mike   sally
4.1    3.5    6.4

Row Storage

1   john    4.1
2   mike    3.5
3   sally   6.4

Benefits: compact representation, CPU-efficient
compression, cache locality.
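The two layouts can be sketched in plain Scala using the three example rows from the slide. `Row`, `ColumnPartition`, and `toColumns` are hypothetical names chosen for illustration and do not reflect Shark's actual memory store classes.

```scala
object ColumnarPartition {
  // Row-oriented: one JVM object per record.
  final case class Row(id: Int, name: String, score: Double)

  // Column-oriented: one array per column for a whole partition.
  // Primitive arrays avoid the per-record object overhead.
  final case class ColumnPartition(
      ids: Array[Int], names: Array[String], scores: Array[Double]) {
    def numRows: Int = ids.length
    def row(i: Int): Row = Row(ids(i), names(i), scores(i))
  }

  def toColumns(rows: Seq[Row]): ColumnPartition =
    ColumnPartition(
      rows.map(_.id).toArray,
      rows.map(_.name).toArray,
      rows.map(_.score).toArray)

  // A scan over one column reads a single contiguous array.
  def averageScore(p: ColumnPartition): Double = p.scores.sum / p.numRows

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))
    val cols = toColumns(rows)
    println(cols.row(1))
    println(averageScore(cols))
  }
}
```

In this sketch the whole `ColumnPartition` plays the role of the single MapReduce “record”, so the engine handles one object per partition rather than one per row.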

How do we optimize:

SELECT * FROM table1 a JOIN table2 b ON a.key=b.key
WHERE my_crazy_udf(b.field1, b.field2) = true;
Hard to estimate cardinality!

Partial DAG Execution (PDE)

Lack of statistics for fresh data and the prevalent
use of UDFs necessitate dynamic approaches to
query optimization.

PDE allows dynamic alteration of query plans
based on statistics collected at run-time.

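Shuffle Join

[Diagram: shuffle join plan with Stage 1, Stage 2, and Stage 3 producing the join result.]

Map Join (Broadcast Join)

[Diagram: map join plan with Stage 1 and Stage 2 producing the join result.]

» minimizes network traffic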

PDE Statistics

Gather customizable statistics at per-partition
granularities while materializing map output.

» partition sizes, record counts (skew detection)

» “heavy hitters”

» approximate histograms

Can alter query plan based on such statistics

» map join vs shuffle join

» symmetric vs non-symmetric hash join

» skew handling
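A minimal sketch of how run-time statistics could drive these decisions, assuming the only statistic collected is the byte size of each map output partition. `MapOutputStats`, `choose`, `skewedPartitions`, and the threshold parameters are hypothetical and are not taken from Shark's optimizer.

```scala
object JoinSelection {
  sealed trait JoinStrategy
  case object MapJoin extends JoinStrategy
  case object ShuffleJoin extends JoinStrategy

  // Hypothetical statistics gathered while map output is materialized.
  final case class MapOutputStats(partitionBytes: Seq[Long]) {
    def totalBytes: Long = partitionBytes.sum
  }

  // Broadcast the smaller input if it fits under the limit;
  // otherwise fall back to shuffling both inputs.
  def choose(left: MapOutputStats, right: MapOutputStats,
             broadcastLimitBytes: Long): JoinStrategy =
    if (math.min(left.totalBytes, right.totalBytes) <= broadcastLimitBytes) MapJoin
    else ShuffleJoin

  // Simple skew check: flag partitions larger than factor * mean size.
  def skewedPartitions(stats: MapOutputStats, factor: Double): Seq[Int] = {
    val mean = stats.totalBytes.toDouble / stats.partitionBytes.size
    stats.partitionBytes.zipWithIndex.collect {
      case (bytes, i) if bytes > factor * mean => i
    }
  }

  def main(args: Array[String]): Unit = {
    val big = MapOutputStats(Seq(500L, 700L, 9000L))
    val small = MapOutputStats(Seq(10L, 20L))
    println(choose(big, small, broadcastLimitBytes = 64L))
    println(skewedPartitions(big, factor = 2.0))
  }
}
```

The decision is made only after the map stages have run, which is what lets it work for fresh data and UDF-filtered inputs where no prior cardinality estimate exists.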

Complex Analytics Integration

Unified system for SQL and
machine learning

Both share the same set
of workers and caches

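// Logistic regression by gradient descent over an RDD of labeled points.
def logRegress(points: RDD[Point]): Vector = {
  // Start from a random weight vector of dimension D.
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    // Compute each point's gradient contribution in parallel, then sum.
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

// Run a SQL query and get the result back as an RDD of rows.
val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid")
// Turn each row into a feature vector.
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")),
             ...)}
// Cache the features in memory so every iteration reuses them.
val trainedVector = logRegress(features.cache())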

Pavlo Benchmark

[Figure: bar charts of runtime (seconds) for Shark, Shark (disk), and Hive. Panels: Selection (axis 0 to 90, bar label 1.1) and Aggregation, 1K Groups (axis 0 to 600, bar label 32). A further panel (axis 0 to 2000) compares Hive, Shark (disk), Shark, and Shark Copartitioned.]

Machine Learning Performance

Runtime per iteration (secs)

                       Shark    Hadoop
K-Means Clustering     4.1      157
Logistic Regression    0.96     110

Real Warehouse Benchmark

[Figure: runtime (seconds, axis 0 to 100) of queries Q1 to Q4 for Shark, Shark (disk), and Hive; bar labels 1.1, 0.8, 0.7, and 1.0 for Q1 to Q4.]

1.7 TB of real warehouse data on 100 EC2 nodes

New Benchmark

[Figure: runtime (seconds, axis 0 to 20) for Impala, Impala (mem), Redshift, Shark (disk), and Shark (mem).]

http://tinyurl.com/bigdata-benchmark

Other benefits of MapReduce

Elasticity

» Query processing can scale up and down dynamically

Straggler tolerance

Schema-on-read and easier ETL

Engineering

» MR handles task scheduling / dispatch / launch

» Simpler query processing code base (~10k LOC)

Berkeley Data Analytics Stack

[Diagram: Shark (SQL), Spark Streaming, GraphX, and MLBase sit on top of Spark, which runs on the Mesos resource manager over HDFS / Hadoop storage.]

Community

3000 people attended
online training

800 meetup members

17 companies contributing

Conclusion

Leveraging a modern MapReduce engine and
techniques from databases, Shark supports both
SQL and complex analytics efficiently, while
maintaining fault tolerance.

Growing open source community

» Users observe similar speedups in real use cases

» http://shark.cs.berkeley.edu

» http://www.spark-project.org
