In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. It outperforms Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster on existing Hive warehouses without modification.
These systems have been adopted by many organizations, large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent), to implement data-intensive applications such as ETL, interactive SQL, and machine learning.
4. Apache Spark
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster
(2-10× on disk)
Often 5× less code
5. Shark
An analytic query engine built on top of Spark
» Supports both SQL and complex analytics
» Can be 100× faster than Apache Hive
Compatible with Hive data, metastore, queries
» HiveQL
» UDF / UDAF
» SerDes
» Scripts
9. Why a New Framework?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing
10. Hadoop MapReduce
[Diagram: iterative jobs read the input from the filesystem and write results back to it between every iteration (iter. 1, iter. 2, …); interactive use re-reads the input from the filesystem for each query (query 1, query 2, …), each selecting a subset.]
Slow due to replication and disk I/O, but necessary for fault tolerance
11. Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can optionally be cached in memory across the cluster
» Manipulated through parallel operators
» Automatically recomputed on failure
Programming interface
» Functional APIs in Scala, Java, Python
» Interactive use from Scala shell
12. Example: Log Mining
Exposes RDDs through a functional API in Scala
Usable interactively from Scala shell
lines = spark.textFile("hdfs://...")          // Base RDD
errors = lines.filter(_.startsWith("ERROR"))  // Transformed RDD
errors.persist()
errors.filter(_.contains("foo")).count()      // Action
errors.filter(_.contains("bar")).count()
[Diagram: the master ships tasks to workers and collects results; each worker holds cached Errors partitions (Errors 1-3) built from HDFS blocks (Block 1-3).]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: 1 TB data in 5 sec (vs 170 sec for on-disk data)
13. Data Sharing in Spark
[Diagram: with data shared in memory, iterative jobs (iter. 1, iter. 2, …) and interactive queries (query 1, query 2, …) reuse the input without going back to disk.]
10-100× faster than network/disk
14. Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
E.g.: messages = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))
Lineage graph: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
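As a side note (not from the original slides), the lineage Spark tracks can be printed with the RDD's toDebugString method; a minimal sketch, assuming a SparkContext named sc as in the Spark shell:
// Sketch: rebuild the example above and inspect its lineage.
// If a cached partition of `messages` is lost, Spark replays exactly this chain to rebuild it.
val messages = sc.textFile("hdfs://...")   // base RDD read from HDFS
  .filter(_.contains("error"))             // keep only error lines
  .map(_.split('\t')(2))                   // project the third tab-separated field
println(messages.toDebugString)            // prints the chain of parent RDDs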
16.
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
17. Word Count
val docs = sc.textFile("hdfs://…")
docs.flatMap(line => line.split("\\s"))
    .map(word => (word, 1))
    .reduceByKey((a, b) => a + b)
18. Word Count
val docs = sc.textFile("hdfs://…")
docs.flatMap(line => line.split("\\s"))
    .map(word => (word, 1))
    .reduceByKey((a, b) => a + b)

docs.flatMap(_.split("\\s"))
    .map((_, 1))
    .reduceByKey(_ + _)
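As a hedged follow-up (not on the original slides): the pipeline above only defines the computation, and an action is needed to run it. A minimal sketch, with an illustrative output path:
// Sketch: materialize the word counts defined above.
val counts = docs.flatMap(_.split("\\s"))
  .map((_, 1))
  .reduceByKey(_ + _)
counts.take(10).foreach(println)          // bring a small sample back to the driver
counts.saveAsTextFile("hdfs://…/counts")  // or write the full result to HDFS (path is illustrative)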
19. Spark in Java and Python
Python API
lines = spark.textFile(…)
errors = lines.filter(lambda s: "ERROR" in s)
errors.count()

Java API
JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      return s.contains("ERROR");
    }
  });
errors.count();
20. Scheduler
General task DAGs
Pipelines functions within a stage
Cache-aware data locality & reuse
Partitioning-aware to avoid shuffles (see the sketch below)
Launches jobs in ms
[Diagram: a task DAG over RDDs A-G built from groupBy, map, union, and join, split into Stage 1, Stage 2, and Stage 3; shaded boxes mark previously computed partitions that are skipped.]
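A rough sketch of the partitioning-aware behavior mentioned above (not from the original slides; the dataset, path, and partition count are illustrative): pre-partitioning a pair RDD with a HashPartitioner and caching it lets the scheduler plan later joins against it without re-shuffling it.
import org.apache.spark.HashPartitioner
// Sketch: `links` is reused across many joins, so hash-partition it once and cache it.
val links = sc.textFile("hdfs://…")                        // assumed "url<TAB>neighbor" lines
  .map { line => val p = line.split('\t'); (p(0), p(1)) }
  .partitionBy(new HashPartitioner(100))                   // 100 partitions, illustrative
  .cache()
// mapValues preserves the partitioner, so this join needs no shuffle at all;
// joining against an RDD partitioned elsewhere would shuffle only that other side.
val ranks = links.mapValues(_ => 1.0)
val contribs = links.join(ranks)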
22. Logistic Regression: Prep the Data
val sc = new SparkContext("spark://master", "LRExp")   // start a new SparkContext

def loadDict(): Map[String, Double] = { … }            // build a dictionary for file parsing

val dict = sc.broadcast( loadDict() )                  // broadcast the dictionary to the cluster

// Line parser that uses the broadcast dictionary
def parser(s: String): (Array[Double], Double) = {
  val parts = s.split('\t')
  val y = parts(0).toDouble
  val x = parts.drop(1).map(dict.value.getOrElse(_, 0.0))
  (x, y)
}

val data = sc.textFile("hdfs://d").map(parser).cache() // load the data in memory once
23. Logistic Regression: Grad. Desc.
val sc = new SparkContext("spark://master", "LRExp")
// Parsing …
val data: spark.RDD[(Array[Double], Double)] = …

var w = Vector.random(D)                    // initial parameter vector
val lambda = 1.0 / data.count               // set the learning rate

for (i <- 1 to ITERATIONS) {                // iterate
  val gradient = data.map {                 // compute the gradient using map & reduce
    case (x, y) => (1.0 / (1 + exp(-y * (w dot x))) - 1) * y * x
  }.reduce( (g1, g2) => g1 + g2 )
  w -= (lambda / sqrt(i)) * gradient        // apply the gradient on the master
}
28. Shark
A data analytics system that
» builds on Spark,
» scales out and tolerates worker failures,
» supports low-latency, interactive queries through (optional) in-memory computation,
» supports both SQL and complex analytics,
» is compatible with Hive (storage, SerDes, UDFs, types, metadata).
31. Shark Query Language
HiveQL plus a few additional DDL add-ons:
Creating an in-memory table:
» CREATE TABLE src_cached AS SELECT * FROM …
32. Engine Features
Columnar Memory Store
Data Co-partitioning & Co-location
Partition Pruning based on Statistics (range, bloom filter and others)
Spark Integration
…
33. Columnar Memory Store
Simply caching Hive records as JVM objects is inefficient.
Shark employs column-oriented storage.

Row storage:    (1, john, 4.1)  (2, mike, 3.5)  (3, sally, 6.4)
Column storage: [1, 2, 3]  [john, mike, sally]  [4.1, 3.5, 6.4]

Benefit: compact representation, CPU-efficient compression, cache locality.
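A rough sketch of the contrast (illustrative only; the class and field names are made up, and Shark's real column store adds compression and type-specific encodings): storing each column as its own array avoids one JVM object per field and keeps values of the same type contiguous in memory.
// Row-oriented: one JVM object per record, with boxed fields and object headers.
case class Record(id: Int, name: String, score: Double)
val rows = Array(Record(1, "john", 4.1), Record(2, "mike", 3.5), Record(3, "sally", 6.4))

// Column-oriented: one array per column; primitives stay unboxed, and a scan over a
// single column (e.g. an average of score) touches only the bytes it needs.
class ColumnBatch(val ids: Array[Int], val names: Array[String], val scores: Array[Double])
val batch = new ColumnBatch(Array(1, 2, 3), Array("john", "mike", "sally"), Array(4.1, 3.5, 6.4))
val avgScore = batch.scores.sum / batch.scores.length   // scans only the score column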
34. Spark Integration
Unified system for query processing and Spark analytics
Query processing and ML share the same set of workers and caches

def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")

val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")),
             ...)
}

val trainedVector = logRegress(features.cache())
36. More Information
Download and docs: www.spark-project.org
» Easy to run locally, on EC2, or on Mesos/YARN
Email: rxin@apache.org
Twitter: @rxin
37. Behavior with Insufficient RAM
[Chart: iteration time (s) vs. percent of working set in memory: 0% → 68.8 s, 25% → 58.1 s, 50% → 40.7 s, 75% → 29.7 s, 100% → 11.5 s.]
38. Example: PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page's rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
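The two input RDDs are only sketched above; one plausible way to build them (an assumption, not from the slides: input lines of the form "url<TAB>neighbor", path illustrative):
// Sketch: construct the (url, neighbors) and (url, rank) RDDs used by the loop.
val edges = sc.textFile("hdfs://…/links")
  .map { line => val p = line.split('\t'); (p(0), p(1)) }
val links = edges.groupByKey().cache()   // (url, Iterable[neighbor]), reused every iteration
var ranks = links.mapValues(_ => 1.0)    // rule 1: start each page with a rank of 1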
Editor's Notes
FT = Fourier Transform
NOT a modified version of Hadoop
Note that the dataset is reused on each gradient computation
Key idea: add "variables" to the "functions" in functional programming
100 GB of data on 50 m1.xlarge EC2 machines
Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join