SCALA + BIG DATA
PARIS SCALA MEETUP, 05/29/2013
Sam BESSALAH
Outline
 Scala in the Hadoop World
Hadoop and Map Reduce Basics
Scalding
A word about other Scala DSLs: Scrunch and Scoobi
 Spark and Co.
Spark
Spark Streaming
 More projects using Scala for Data Analysis
SCALA and HADOOP
The new darling of data crunchers at scale
Hadoop
- Redundant, fault-tolerant data storage
- Parallel computation framework
- Job coordination
MapReduce
- A programming model for expressing distributed computations at massive scale
- An execution framework for organizing and performing those computations in an efficient and fault-tolerant way
- Bundled within the Hadoop framework
MapReduce redux ..
- Implements two functions at a high level:
  Map(k1, v1) → List(k2, v2)
  Reduce(k2, List(v2)) → List(k3, v3)
- The framework takes care of all the plumbing and the distribution, sorting, shuffling ...
- Values with the same key flow to the same reducer
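Schematically, word count fits this model as the two functions below. This is a minimal Scala sketch of the model itself, not Hadoop's Java API; the full Java implementation appears as image slides in the original deck and takes far more code:

// Word count expressed as the two MapReduce functions above (schematic sketch only).
def wcMap(docId: String, text: String): List[(String, Int)] =
  text.split("""\W+""").filter(_.nonEmpty).toList.map(word => (word, 1))

def wcReduce(word: String, counts: List[Int]): List[(String, Int)] =
  List((word, counts.sum))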
- Way too long (in plain MapReduce Java code) for a simple word count
- This gave birth to new tools like Hive or Pig
- Pig : a scripting language for dataflow
text = LOAD 'text' USING TextLoader();
tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) as word;
wordcount = FOREACH (GROUP tokens BY word) GENERATE
    group as word,
    COUNT_STAR($1) as ct;
Cascading
- Open source, created by Chris Wensel, now developed at @Concurrent.
- Written in Java; revolves around the concept of Pipes, or data flows, eventually transformed into MapReduce jobs
- Cascading changes the MR programming model into a generic data-flow-oriented programming model
- A Flow is composed of a Source, a Sink and a Pipe to connect them
- A Pipe is a set of transformations over the input data
- Pipes can be combined to create more complex workflows
- Contains a flow optimizer that converts a user data flow into an optimized data flow, which can in turn be converted into an efficient MapReduce job
- We can think of pipes as distributed collections
Word Count redux ..
But ...
- Cascading makes use of FP idioms.
- Functions are wrapped in objects
- Constructors (new) define composition between pipes
- The MapReduce paradigm itself derives from FP
Why not use functional programming?
SCALDING
- A Scala DSL on top of Cascading
- Open source project developed at Twitter by
  Avi Bryant (@avibryant), Oscar Boykin (@posco), Argyris Zymnis (@argyris)
- http://github.twitter.com/twitter/scalding
Scalding
- Two APIs :
  * Fields API : the primary API, using Cascading Fields; dynamic, with errors at runtime
  * Type-safe API : uses Scala types, with errors at compile time. We'll focus on this one
- Both can be bridged using pipe.typed and TypedPipe.from
Scalding word count
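The word-count code itself is an image slide in the original deck. A minimal sketch of it in the type-safe API (paths are illustrative and exact method signatures vary slightly across Scalding versions) ends with the groupedWords value used below:

import com.twitter.scalding._

// Hedged sketch of a Typed-API word count; input/output paths are illustrative.
class WordCountJob(args: Args) extends Job(args) {
  val words: TypedPipe[String] =
    TypedPipe.from(TextLine(args("input")))
      .flatMap(_.split("""\W+""").filter(_.nonEmpty))

  val groupedWords = words.groupBy(identity)   // group identical words together

  groupedWords.size                            // (word, count) pairs
    .write(TypedTsv[(String, Long)](args("output")))
}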
In reality :
val countedWords = groupedWords.size
val countedWords = groupedWords.mapValues(x => 1L).sum
val countedWords = groupedWords.mapValues(x => 1L)
                               .reduce((l, r) => implicitly[Monoid[Long]].plus(l, r))
Fields Based API
# pipe.flatMap(existingFields -> additionalFields){function}
# pipe.map(existingFields -> additionalFields){function}
# pipe.project(fields)
# pipe.discard(fields)
# pipe.mapTo(existingFields -> additionalFields){function}
# pipe.groupBy(fields){ group => ... }
# group.reduce(field){function}
# group.foldLeft(field){function}
…
https://github.com/twitter/scalding/wiki/Fields-based-API-Reference
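A minimal end-to-end example of the Fields API in use (this is the canonical Scalding word count; input and output paths are passed as arguments and are illustrative):

import com.twitter.scalding._

// Fields-based word count: 'line comes from TextLine, 'word and 'count are created here.
class FieldsWordCount(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("""\W+""").filter(_.nonEmpty) }
    .groupBy('word) { _.size('count) }
    .write(Tsv(args("output")))
}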
Grouping and Mapping
GroupBuilder :
Builder-pattern object that operates over groups of rows in a pipe.
Helps build several parallel aggregations (counting, summing, ...) in one pass (see the sketch below). Awesome for stream aggregation.
Used by groupBy; adds fields which are reductions of existing ones.
mapReduceMap : map-side aggregation, derived from Cascading, using combiners instead of reducers.
Gotcha : doesn't work with foldLeft, which is pushed to reducers
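A hedged sketch of that one-pass, multi-aggregation pattern (field names and file paths are illustrative, not from the talk):

import com.twitter.scalding._

// Several parallel aggregations computed in a single pass over each group.
class UserStats(args: Args) extends Job(args) {
  Tsv(args("input"), ('userId, 'amount, 'timestamp)).read
    .groupBy('userId) { group =>
      group.size('nbEvents)                 // count rows per user
           .sum('amount -> 'total)          // sum a field in the same pass
           .max('timestamp -> 'lastSeen)    // and keep the latest timestamp
    }
    .write(Tsv(args("output")))
}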
Type Safe API
Two concepts :
TypedPipe[T]
- Wraps a Cascading Pipe object. Instances are distributed over the cluster, and transformations are applied on top of them.
- Similar interface to scala.collection.Iterator[T]
KeyedList[K,V]
- Sharding of key-value objects. Two implementations:
  Grouped[K,V] : usual grouping on key K
  CoGrouped[K,V,W,Result] : a co-group over two grouped pipes, used for joins (see the sketch below).
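A hedged sketch of how a CoGrouped arises from joining two grouped pipes (the sources and field meanings are illustrative):

// Joining two grouped TypedPipes yields a CoGrouped under the hood.
val clicks: TypedPipe[(Long, String)] = TypedPipe.from(TypedTsv[(Long, String)]("clicks.tsv"))  // (userId, url)
val users:  TypedPipe[(Long, String)] = TypedPipe.from(TypedTsv[(Long, String)]("users.tsv"))   // (userId, country)

val joined: TypedPipe[(Long, (String, String))] =
  clicks.group.join(users.group).toTypedPipe   // (userId, (url, country))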
Optimized Joins
joinWithTiny : map-side join (see the sketch below)
  Left-side asymmetric join with a smaller pipe.
  Uses Cascading HashJoin, a non-blocking asymmetric join where the smaller pipe fits in memory.
blockJoinWithSmaller : performs a block join, by replicating data.
skewJoinWithSmaller|Larger : deals with skewed pipes.
crossWithTiny : cross product with a moderately sized pipe; can create a huge output.
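A hedged sketch of a map-side join with joinWithTiny in the Fields API (file and field names are illustrative, not from the talk):

import com.twitter.scalding._

// The tiny pipe is replicated to every mapper, so no reduce phase is needed.
class MapSideJoinJob(args: Args) extends Job(args) {
  val clicks = Tsv("clicks.tsv", ('userId, 'url)).read      // large pipe
  val users  = Tsv("users.tsv", ('uid, 'country)).read      // small enough to fit in memory

  clicks
    .joinWithTiny('userId -> 'uid, users)
    .write(Tsv("joined.tsv"))
}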
MATRIX API
Generic Matrix API built using abstract algebra (Monoids, Rings, ...)
Value operations : mapValues, filterValues, binarizeAs
Vector operations : getRow, reduceRowVectors, …
  mapRows, rowL2Normalize, rowMeanCentering, ...
Usual matrix operations : transpose, product, ...
pipe.toMatrix, pipe.flatMapToMatrix(fields) with a mapping function, ...
Scalding is not the only Scala DSL for MR
- Scrunch
  Built on top of Crunch, a MapReduce pipelining library in Java developed by Cloudera.
- Scoobi, built at NICTA
  Same idea as Crunch, except fully written in Scala; uses distributed lists (DList) to mimic pipelines.
Scrunch Style
object WordCount extends PipelineApp {
  def countWords(file: String) = {
    read(from.textFile(file))
      .flatMap(_.split("\\W+").filter(!_.isEmpty))
      .count
  }
  val counts = join(countWords(args(0)), countWords(args(1)))
  write(counts, to.textFile(args(2)))
}
Spark
In-Memory Interactive and Real time
Analytics for Large DataSets
Sam Bessalah
@samklr
Adapted Slides from Matei Zaharia, UC Berkeley
What is Spark?
Fast, expressive cluster computing system compatible with Apache Hadoop
Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
Improves efficiency through:
  In-memory computing primitives
  General computation graphs
  → up to 100× faster
Improves usability through:
  Rich APIs in Java, Scala, Python
  Interactive shell
  → often 2-10× less code
Key Idea
Work with distributed collections as if they were local
Concept: Resilient Distributed Datasets (RDDs)
- Immutable collections of objects spread across a
cluster
- Built through parallel transformations (map, filter, etc)
- Automatically rebuilt on failure
- Controllable persistence (like caching in RAM)
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()
[Diagram: the base RDD (HDFS blocks 1-3) is filtered and mapped into a transformed RDD whose partitions are cached on the workers (Cache 1, 2, 3); the driver ships tasks to the workers and collects results whenever an action runs]

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .
Result: full-text search of Wikipedia in <1 sec (vs
20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec
(vs 170 sec for on-disk data)
Fault Tolerance
RDDs track lineage information that can be used
to efficiently reconstruct lost partitions
Ex: messages = textFile(...).filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))

HDFS File → filter (func = _.startsWith(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD
Spark in Java and Scala

Java API:
JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      return s.contains("ERROR");
    }
  });
errors.count()

Scala API:
val lines = spark.textFile(…)
val errors = lines.filter(s => s.contains("ERROR"))
// can also write: lines.filter(_.contains("ERROR"))
errors.count
Which Language Should I Use?
Standalone programs can be written in any, but
console is only Python & Scala
Python developers: can stay with Python for both
Java developers: consider using Scala for console
(to learn the API)
Performance: Java / Scala will be faster (statically
typed), but Python can do well for numerical work
with NumPy
Scala Cheat Sheet
Variables:
var x: Int = 7
var x = 7 // type inferred
val y = "hi" // read-only
Functions:
def square(x: Int): Int = x*x
def square(x: Int): Int = {
x*x // last line returned
}
Collections and closures:
val nums = Array(1, 2, 3)
nums.map((x: Int) => x + 2) // => Array(3, 4, 5)
nums.map(x => x + 2) // => same
nums.map(_ + 2) // => same
nums.reduce((x, y) => x + y) // => 6
nums.reduce(_ + _) // => 6
Learning Spark
Easiest way: Spark interpreter (spark-shell or
pyspark)
Special Scala and Python consoles for cluster use
Runs in local mode on 1 thread by default, but can
control with MASTER environment var:
MASTER=local ./spark-shell # local, 1 thread
MASTER=local[2] ./spark-shell # local, 2 threads
MASTER=spark://host:port ./spark-shell # Spark standalone cluster
First Stop: SparkContext
Main entry point to Spark functionality
Created for you in Spark shells as variable sc
In standalone programs, you'd make your own (see later for details)
Creating RDDs
# Turn a local collection into an RDD
sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")
# Use any existing Hadoop InputFormat
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Basic Transformations
nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
squares = nums.map(lambda x: x*x) # => {1, 4, 9}
# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0) # => {4}
# Map each element to zero or more others
nums.flatMap(lambda x: range(0, x)) # => {0, 0, 1, 0, 1, 2}
# range(0, x) is a sequence of numbers 0, 1, …, x-1
Basic Actions
nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
nums.collect() # => [1, 2, 3]
# Return first K elements
nums.take(2) # => [1, 2]
# Count number of elements
nums.count() # => 3
# Merge elements with an associative function
nums.reduce(lambda x, y: x + y) # => 6
# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
Working with Key-Value Pairs
Spark's "distributed reduce" transformations act on RDDs of key-value pairs
Python: pair = (a, b)
        pair[0] # => a
        pair[1] # => b
Scala:  val pair = (a, b)
        pair._1 // => a
        pair._2 // => b
Java:   Tuple2 pair = new Tuple2(a, b); // class scala.Tuple2
        pair._1 // => a
        pair._2 // => b
Some Key-Value Operations
pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}
pets.groupByKey()
# => {(cat, Seq(1, 2)), (dog, Seq(1))}
pets.sortByKey()
# => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also automatically implements combiners on the map side
Multiple Datasets
visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1")])
pageNames = sc.parallelize([("index.html", "Home"), ("about.html", "About")])
visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))
visits.cogroup(pageNames)
# ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
# ("about.html", (Seq("3.4.5.6"), Seq("About")))
Closure Mishap Example
class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...
  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param)
       .reduce(...)
  }
}
// Throws NotSerializableException: MyCoolRddApp (or Log),
// because referencing param drags the whole object into the closure

How to get around it:
class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param          // references only a local variable instead of this.param
    rdd.map(x => x + param_)
       .reduce(...)
  }
}
SPARK INTERNALS
Components
Your program:
sc = new SparkContext
f = sc.textFile("…")
f.filter(…)
 .count()
...
[Diagram: the program talks to the Spark client (app master), which holds the RDD graph, the scheduler, the block tracker and the shuffle tracker; a cluster manager assigns work to Spark workers, each running task threads and a block manager backed by HDFS, HBase, …]
Example Job
val sc = new SparkContext(
  "spark://...", "MyJob", home, jars)
val file = sc.textFile("hdfs://...")
val errors = file.filter(_.contains("ERROR"))
errors.cache()
errors.count()

file and errors are resilient distributed datasets (RDDs); count() is an action.
RDD Graph
Dataset-level view:
  file:   HadoopRDD (path = hdfs://...)
  errors: FilteredRDD (func = _.contains(…), shouldCache = true)
Partition-level view: each partition of the RDD becomes a task (Task 1, Task 2, ...)
Data Locality
First run: data not in cache, so use HadoopRDD’s
locality prefs (from HDFS)
Second run: FilteredRDD is in cache, so use its
locations
If something falls out of cache, go back to HDFS
Broadcast Variables
When one creates a broadcast variable b with a
value v, v is saved to a file in a shared file
system. The serialized form of b is a path to this
file. When b’s value is queried on a worker
node, Spark first checks whether v is in a local
cache, and reads it from the file system if it isn’t.
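A minimal usage sketch (the lookup table is illustrative, and lines is the RDD from the log-mining example):

// Ship a read-only lookup table to all workers once, instead of
// capturing it in every task's closure.
val severity = Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1)
val b = sc.broadcast(severity)               // v is written to shared storage once

val levels = lines.map(l => b.value.getOrElse(l.split(" ")(0), 0))  // workers read b.value from their local cache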
Accumulators
Each accumulator is given a unique ID when it is
created. When the accumulator is saved, its
serialized form contains its ID and the “zero” value
for its type.
On the workers, a separate copy of the accumulator
is created for each thread that runs a task using
thread-local variables, and is reset to zero when a
task begins. After each task runs, the worker
sends a message to the driver program containing
the updates it made to various accumulators.
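A minimal usage sketch (the malformed-line check and the lines RDD are illustrative):

// Count malformed lines on the workers; read the merged total back on the driver.
val badLines = sc.accumulator(0)                  // starts from the "zero" value for Int

lines.foreach { l =>
  if (l.split('\t').length < 3) badLines += 1     // each task updates its own copy
}
println("malformed lines: " + badLines.value)     // per-task updates are sent back and merged on the driver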
Scheduling Process
rdd1.join(rdd2)
    .groupBy(…)
    .filter(…)

RDD Objects: build the operator DAG
DAGScheduler: splits the DAG into stages of tasks and submits each stage as ready (agnostic to operators)
TaskScheduler: launches the TaskSets via the cluster manager; retries failed or straggling tasks (doesn't know about stages)
Worker: executes tasks in threads; stores and serves blocks through its block manager
Example: HadoopRDD
partitions = one per HDFS block
dependencies = none
compute(partition) = read corresponding block
preferredLocations(part) = HDFS block location
partitioner = none
Example: FilteredRDD
partitions = same as parent RDD
dependencies = “one-to-one” on parent
compute(partition) = compute parent and filter it
preferredLocations(part) = none (ask parent)
partitioner = none
Example: JoinedRDD
partitions = one per reduce task
dependencies = “shuffle” on each parent
compute(partition) = read and join shuffled data
preferredLocations(part) = none
partitioner = HashPartitioner(numTasks)
(Spark will now know this data is hashed!)
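Taken together, these three examples cover the interface every RDD implements. A schematic sketch of that interface (hypothetical trait names, not Spark's actual classes):

// Schematic view of the five properties listed above; not Spark's real API.
trait Partition
trait Dependency
trait Partitioner

trait SchematicRDD[T] {
  def partitions: Seq[Partition]                     // e.g. one per HDFS block, or one per reduce task
  def dependencies: Seq[Dependency]                  // none, "one-to-one", or "shuffle"
  def compute(p: Partition): Iterator[T]             // read a block, or compute the parent and transform it
  def preferredLocations(p: Partition): Seq[String]  // locality hints such as HDFS block locations
  def partitioner: Option[Partitioner]               // e.g. Some(HashPartitioner) after a join
}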
Dependency Types
"Narrow" deps: map, filter, union, join with inputs co-partitioned
"Wide" (shuffle) deps: groupByKey, join with inputs not co-partitioned
DAG Scheduler
Interface: receives a “target” RDD, a function to
run on each partition, and a listener for results
Roles:
Build stages of Task objects (code + preferred loc.)
Submit them to TaskScheduler as ready
Resubmit failed stages if outputs are lost
Scheduler Optimizations
Pipelines narrow ops. within a stage
Picks join algorithms based on partitioning (minimize shuffles)
Reuses previously cached data
[Diagram: a job over RDDs A-G combining map, union, groupBy and join is split into Stage 1, Stage 2 and Stage 3; previously computed (cached) partitions are not recomputed]
Example: K-Means Clustering using Spark
Clustering
Grouping data according to similarity
[Scatter plot: distance east vs. distance north, e.g. points from an archaeological dig]
K-Means Algorithm
Benefits:
• Popular
• Fast
• Conceptually straightforward
K-Means: preliminaries
Data: a collection of values (one point per line)
data = lines.map(line => parseVector(line))

Dissimilarity: squared Euclidean distance
dist = p.squaredDist(q)
K = number of clusters
Data assignments to clusters: S1, S2, . . ., SK
K-Means Algorithm
• Initialize K cluster centers
• Repeat until convergence:
  Assign each data point to the cluster with the closest center.
  Assign each cluster center to be the mean of its cluster's data points.
In Spark, each of these steps is built up from RDD operations:
centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))
pointsGroup = closest.groupByKey()
newCenters = pointsGroup.mapValues(ps => average(ps))
while (dist(centers, newCenters) > ɛ)
K-Means Source
centers = data.takeSample(false, K, seed)
while (d > ɛ) {
  closest = data.map(p => (closestPoint(p, centers), p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps))
  d = distance(centers, newCenters)
  centers = newCenters   // use the new centers for the next iteration
}
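The snippets above assume two small helpers, closestPoint and average. A hedged sketch of both, using plain Array[Double] vectors (the talk's code presumably uses Spark's vector utilities instead):

// Illustrative helpers assumed by the K-Means code above.
def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def closestPoint(p: Array[Double], centers: Array[Array[Double]]): Int =
  centers.indices.minBy(i => squaredDist(p, centers(i)))

def average(ps: Seq[Array[Double]]): Array[Double] =
  ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / ps.size)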
Ease of use
- Interactive shell: useful for featurization, pre-processing data
- Lines of code for K-Means:
  Spark ~ 90 lines (part of the hands-on tutorial!)
  Hadoop/Mahout ~ 4 files, > 300 lines
Example: PageRank
Why PageRank?
Good example of a more complex algorithm
Multiple stages of map & reduce
Benefits from Spark’s in-memory caching
Multiple iterations over the same data
Basic Idea
Give pages ranks (scores) based on links to them
Links from many pages → high rank
Link from a high-rank page → high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
[Diagram: on a four-page example, all ranks start at 1.0; after successive iterations they move through values such as (0.58, 1.0, 1.85, 0.58) and (0.39, 1.72, 1.31, 0.58) before converging to the final state (0.46, 1.37, 1.44, 0.73)]
Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
val contribs = links.join(ranks).flatMap {
case (url, (links, rank)) =>
links.map(dest => (dest, rank/links.size))
}
ranks = contribs.reduceByKey(_ + _)
.mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
PageRank Performance
SPARK STREAMING
What is Spark Streaming?
Framework for large scale stream processing
Scales to 100s of nodes
Can achieve second scale latencies
Integrates with Spark’s batch and interactive processing
Provides a simple batch-like API for implementing complex algorithms
Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
Requirements
Scalable to large clusters
Second-scale latencies
Simple programming model
Integrated with batch & interactive processing
Stateful Stream Processing
Traditional streaming systems have an event-driven, record-at-a-time processing model:
  Each node has mutable state
  For each record, update state & send new records
State is lost if a node dies!
Making stateful stream processing fault-tolerant is challenging
[Diagram: input records flow through nodes 1, 2 and 3, each holding mutable state]
Existing Streaming Systems
Storm
  Replays a record if it was not processed by a node
  Processes each record at least once
  May update mutable state twice!
  Mutable state can be lost due to failure!
Trident – uses transactions to update state
  Processes each record exactly once
  Per-state transaction updates are slow
Requirements
Scalable to large clusters
Second-scale latencies
Simple programming model
Integrated with batch & interactive processing
Efficient fault-tolerance in stateful computations
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
[Diagram: live data stream → Spark Streaming (batches of X seconds) → Spark → processed results]
- Batch sizes as low as ½ second, latency ~ 1 second
- Potential for combining batch processing and streaming processing in the same system
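The examples below assume a StreamingContext ssc. A minimal setup sketch (package and constructor names follow the 2013 pre-Apache releases and are given as an assumption, as is the master URL):

import spark.streaming.{Seconds, StreamingContext}

// One-second batches: every RDD in a DStream covers one second of data.
val ssc = new StreamingContext("spark://host:7077", "HashTagApp", Seconds(1))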
Example – Get hashtags from Twitter
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

DStream: a sequence of RDDs representing a stream of data
[Diagram: the tweets DStream is a series of batches (batch @ t, t+1, t+2) fed by the Twitter Streaming API, each stored in memory as an RDD (immutable, distributed)]
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream
[Diagram: flatMap is applied to every batch of the tweets DStream, producing the hashTags DStream; new RDDs are created for every batch, e.g. [#cat, #dog, …]]
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage
[Diagram: flatMap produces each batch of the hashTags DStream, and every batch is saved to HDFS]
Java Example
Scala
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Java
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })  // Function object to define the transformation
hashTags.saveAsHadoopFiles("hdfs://...")
Fault-tolerance
RDDs remember the sequence of operations that created them from the original fault-tolerant input data
Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant
Data lost due to worker failure can be recomputed from the replicated input data
[Diagram: lost partitions of the hashTags RDD are recomputed via flatMap, on other workers, from the replicated tweets RDD]
Key concepts
DStream – sequence of RDDs representing a stream of data
  Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
Transformations – modify data from one DStream to another
  Standard RDD operations – map, countByValue, reduce, join, …
  Stateful operations – window, countByValueAndWindow, …
Output Operations – send data to an external entity
  saveAsHadoopFiles – saves to HDFS
  foreach – do anything with each batch of results
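A hedged sketch of one of the stateful operations listed above: hashtag counts over a sliding window, reusing the hashTags DStream from the earlier example.

// Counts over the last 60 seconds, recomputed every 10 seconds.
val windowedCounts = hashTags.countByValueAndWindow(Seconds(60), Seconds(10))
windowedCounts.print()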
Example 2 – Count the hashtags
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.countByValue()
[Diagram: for every batch, countByValue expands into map and reduceByKey over that batch's RDD, producing the tagCounts DStream, e.g. [(#cat, 10), (#dog, 25), ...]]
Fault-tolerant Stateful Processing
All intermediate data are RDDs, hence they can be recomputed if lost
[Diagram: the hashTags and tagCounts DStreams across batches t-1, t, t+1, t+2, t+3]
State data is not lost even if a worker node dies
  Does not change the value of your result
Exactly-once semantics for all transformations
  No double counting!
Other Interesting Operations
Maintaining arbitrary state, tracking sessions
  Maintain per-user mood as state, and update it with his/her tweets
  tweets.updateStateByKey(tweet => updateMood(tweet))
Do arbitrary Spark RDD computation within a DStream
  Join incoming tweets with a spam file to filter out bad tweets
  tweets.transform(tweetsRDD => {
    tweetsRDD.join(spamHDFSFile).filter(...)
  })
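The updateStateByKey call above is shorthand; a hedged sketch of what the real signature looks like, keeping a running count per hashtag as a stand-in for the mood example (it also requires a checkpoint directory to be set on ssc):

// The update function receives this batch's new values for a key plus the previous state.
val runningCounts = hashTags
  .map(tag => (tag, 1))
  .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
    Some(state.getOrElse(0) + newValues.sum)
  }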
Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency
Tested with 100 streams of data on 100 EC2 instances with 4 cores each
OTHER PROJECTS
Scoobi
Scrunch
Scala NLP
Breeze
Saddle
Factorie …
**Scala Notebook
THANKS
Bibliography
Slides for Scalding shamelessly inspired by
- Mario Pasteurelli
  http://fr.slideshare.net/melrief/scalding-programming-model-for-hadoop
- Dean Wampler (@deanwampler)
  Scalding workshop code: https://github.com/ThinkBigAnalytics/scalding-workshop
  Slides: http://polyglotprogramming.com/papers/ScaldingForHadoop.pdf
SPARK: http://spark-project.org/documentation/

More Related Content

What's hot

Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsCheng Min Chi
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...CloudxLab
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingToni Cebrián
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with ScalaHimanshu Gupta
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Аліна Шепшелей
 
Data analysis scala_spark
Data analysis scala_sparkData analysis scala_spark
Data analysis scala_sparkYiguang Hu
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraRussell Spitzer
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxData
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0Sigmoid
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with CassandraJacek Lewandowski
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 

What's hot (20)

Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using Scalding
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
 
Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.
 
Data analysis scala_spark
Data analysis scala_sparkData analysis scala_spark
Data analysis scala_spark
 
Zero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and CassandraZero to Streaming: Spark and Cassandra
Zero to Streaming: Spark and Cassandra
 
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
 
SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0SORT & JOIN IN SPARK 2.0
SORT & JOIN IN SPARK 2.0
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Sqoop | Big Data Hadoop Spark Tutorial | CloudxLab
 

Similar to Scala+data

Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfWalmirCouto3
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkVincent Poncet
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An OverviewMohit Jain
 

Similar to Scala+data (20)

Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
Artigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdfArtigo 81 - spark_tutorial.pdf
Artigo 81 - spark_tutorial.pdf
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Spark core
Spark coreSpark core
Spark core
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 

More from Samir Bessalah

Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
Eventual Consitency with CRDTS
Eventual Consitency with CRDTSEventual Consitency with CRDTS
Eventual Consitency with CRDTSSamir Bessalah
 
Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Samir Bessalah
 
High Performance RPC with Finagle
High Performance RPC with FinagleHigh Performance RPC with Finagle
High Performance RPC with FinagleSamir Bessalah
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Samir Bessalah
 
Structures de données exotiques
Structures de données exotiquesStructures de données exotiques
Structures de données exotiquesSamir Bessalah
 

More from Samir Bessalah (6)

Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
Eventual Consitency with CRDTS
Eventual Consitency with CRDTSEventual Consitency with CRDTS
Eventual Consitency with CRDTS
 
Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015Deep learning for mere mortals - Devoxx Belgium 2015
Deep learning for mere mortals - Devoxx Belgium 2015
 
High Performance RPC with Finagle
High Performance RPC with FinagleHigh Performance RPC with Finagle
High Performance RPC with Finagle
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
 
Structures de données exotiques
Structures de données exotiquesStructures de données exotiques
Structures de données exotiques
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Scala+data

  • 1. SCALA + BIG DATA PARIS SCALA MEETUP, 05/29/2013 Sam BESSALAH
  • 2. Outline  Scala in the Hadoop World Hadoop and Map Reduce Basics Scalding A word about other Scala DSL : Scrunch and Scoobi  Spark and Co. Spark Spark Streaming  More projects using Scala for Data Analysis
  • 4. The new darling of data crunchers at scale
  • 5.
  • 6. Hadoop  Redundant , fault tolerant data storage  Parallel computation framework  Job coordination
  • 7. MapReduce  A programming model for expressing distributed computations at massive scale  An execution framework for organizing and performing those computations in an efficient and fault tolerant way,  Bundled within the hadoop framework
  • 8. MapReduce redux ..  Implements two functions at a high level Map(k1, v1) → List(k2, v2) Reduce (k2, List(v2)) → List(v3,k3)  The framework takes care of all the plumbing and the distribution, sorting, shuffling ...  Values with the same key flowed to the same reducer
  • 9.
  • 10.
  • 11.
  • 12.  Way too long for a simple word counting  This gave birth too new tools like Hive or Pig  Pig : Script language for dataflow text = LOAD 'text' USING TextLoader(); tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) as word; wordcount = FOREACH (GROUP tokens BY word) GENERATE Group as word, COUNT_STAR($1) as ct ;
  • 13. Cascading  Open source created by Chris Wensel, now developped at @Concurrent.  Written in Java, evolves around the concept of Pipes or Data flow eventually transformed into MapReduce jobs
  • 14.  Cascading change the MR programming model to a generic data flow oriented programming model  A Flow is composed of a Source, a Sink and a Pipe to connect them  A pipe is a set of transformations over the input data  Pipes can be combined to create more complex workflow  Contains a flow Optimizer that converts a user data flow to an optimized data flow, that can be converted in its turn to an efficient map reduce job.  We could think of pipes as distributed collections
  • 16.
  • 17. But ... - Cascading makes use of FP idioms. - Functions are wrapped in Objects - Constructors (New) define composition between pipes - Map Reduce paradigm itself derive from FP Why not use functional programming ?
  • 18. SCALDING - A Scala DSL on top of Cascading - Open Source project developed at Twitter By Avi Bryant (@avibryant) Oscar Boykin (@posco) Argyris Zymnis (@argyris) -http://github.twitter.com/twitter/scalding
  • 19. Scalding - Two APIs : * Field API : Primary API, using Cascading Fields, dynamic with errors at runtime * TypeSafe API : Uses Scala Types, errors at compile time. We’ll focus on this one - Both can be joined using pipe.Typed and TypedPipe.from
  • 20.
  • 22. In reality : val countedWords = groupedWord.size val countedWords = groupedWords.mapValues(x=>1L).sum val countedWords = groupedWords.mapValues(x =>1L) .reduce(implicit mon:Monoid[Long] ((l,r) => mon.plus(l,r))
  • 23. Fields Based API # pipe.flatMap(existingFields -> additionalFields){function} # pipe.map(existingFields -> additionalFields){function} # pipe.project(fields) # pipe.discard(fields) # pipe.mapTo(existingFields -> additionalFields){function} # pipe.groupBy(fields){ group => ... } # group.reduce(field){function} # group.foldLeft(field){function} … https://github.com/twitter/scalding/wiki/Fields-based-API-Reference
  • 24. Grouping and Mapping GroupBuilder : a builder-pattern object that operates over groups of rows in a pipe. Helps build several parallel aggregations (counting, summing, ...) in one pass; great for stream aggregation. Used by groupBy; adds fields which are reductions of existing ones. MapReduceMap : map-side aggregation, derived from Cascading, using combiners instead of reducers. Gotcha : doesn't work with foldLeft, which is pushed to the reducers
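As a concrete illustration, here is a minimal hedged sketch of several aggregations computed in a single groupBy pass. The source layout, field names and types are illustrative (not from the deck), and exact GroupBuilder method signatures vary slightly across Scalding versions:

import com.twitter.scalding._

class UserStats(args: Args) extends Job(args) {
  Tsv(args("input"), ('user, 'amount))            // one purchase per row
    .read
    .groupBy('user) { group =>
      group.size('numPurchases)                   // how many purchases per user
           .sum[Double]('amount -> 'totalSpent)   // total spent per user
           .average('amount -> 'avgSpent)         // average basket per user
    }
    .write(Tsv(args("output")))
}

All three aggregations are assembled by the GroupBuilder and computed in the same reduce pass.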
  • 25. Type Safe API Two concepts : TypedPipe[T] - Wraps a Cascading Pipe object. Instances are distributed on the cluster, and transformations occur on top of them. - Similar interface to scala.collection.Iterator[T] KeyedList[K,V] - A sharding of key-value objects. Two implementations: Grouped[K,V] : the usual grouping on a key K CoGrouped[K,V,W,Result] : a co-group over two grouped pipes, used for joins.
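Since the rest of the deck leans on the type-safe API, here is a minimal hedged word-count sketch in that style (input/output paths are illustrative; written against a Scalding 0.8-era API, so details may differ in other versions):

import com.twitter.scalding._

class TypedWordCount(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))            // one String per line
    .flatMap(_.split("\\s+"))                        // tokenize
    .filter(_.nonEmpty)
    .groupBy(identity)                               // Grouped[String, String]
    .size                                            // (word, count) pairs
    .write(TypedTsv[(String, Long)](args("output")))
}

Grouped pipes are also what joins operate on: joining two Grouped pipes on the same key type produces the CoGrouped described above.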
  • 26. Optimized Joins joinWithTiny : map-side join. A left-side asymmetric join with a smaller pipe. Uses Cascading HashJoin, a non-blocking asymmetric join where the smaller side fits in memory. blockJoinWithSmaller : performs a block join, by replicating data. skewJoinWithSmaller|Larger : deals with skewed pipes. crossWithTiny : a cross product with a moderately sized pipe; can create a huge output.
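A hedged sketch of the map-side join (sources, field names and the surrounding Job are illustrative; the tiny right side must fit in a mapper's memory):

import com.twitter.scalding._

class EnrichClicks(args: Args) extends Job(args) {
  val clicks = Tsv(args("clicks"), ('userId, 'url)).read
  val users  = Tsv(args("users"),  ('uid, 'country)).read   // small dimension table

  // HashJoin under the hood: no reduce phase, the tiny side is replicated to every mapper
  clicks
    .joinWithTiny('userId -> 'uid, users)
    .project('url, 'country)
    .write(Tsv(args("output")))
}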
  • 27.
  • 28.
  • 29. MATRIX API A generic Matrix API built using abstract algebra (Monoids, Rings, ...) Value operations : mapValues, filterValues, binarizeAs Vector operations : getRow, reduceRowVectors, ... mapRows, rowL2Normalize, rowMeanCentering, ... Usual matrix operations : transpose, product, ... pipe.toMatrix, pipe.flatMapToMatrix(fields) with a mapping function, ...
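A hedged sketch in the spirit of the Scalding matrix tutorial. The source layout, field names and types are illustrative, and the exact matrix methods available vary by Scalding version:

import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix._

class ItemSimilarity(args: Args) extends Job(args) {
  // (user, item, rating) triples loaded as a sparse matrix
  val ratings = Tsv(args("input"), ('user, 'item, 'rating))
    .read
    .toMatrix[Long, Long, Double]('user, 'item, 'rating)

  // L2-normalize each user's row, then item-item similarities via the product with the transpose
  val normalized = ratings.rowL2Normalize
  (normalized.transpose * normalized).write(Tsv(args("output")))
}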
  • 30.
  • 31. Scalding is not the only Scala DSL for MR - Scrunch : built on top of Crunch, an MR pipelining library in Java developed by Cloudera. - Scoobi : built at NICTA. Same idea as Crunch, except fully written in Scala; uses distributed lists (DList) to mimic pipelines.
  • 32.
  • 33. Scrunch Style
object WordCount extends PipelineApp {
  def countWords(file: String) = {
    read(from.textFile(file))
      .flatMap(_.split("\\W+").filter(!_.isEmpty))
      .count
  }
  val counts = join(countWords(args(0)), countWords(args(1)))
  write(counts, to.textFile(args(2)))
}
  • 34. Spark In-Memory Interactive and Real time Analytics for Large DataSets Sam Bessalah @samklr Adapted Slides from Matei Zaharia, UC Berkeley
  • 35. What is Spark? A fast, expressive cluster computing system compatible with Apache Hadoop Works with any Hadoop-supported storage system (HDFS, S3, Avro, …) Improves efficiency through: in-memory computing primitives, general computation graphs Improves usability through: rich APIs in Java, Scala, Python, and an interactive shell Up to 100× faster, often 2-10× less code
  • 36. Key Idea Work with distributed collections as if they were local Concept: Resilient Distributed Datasets (RDDs) - Immutable collections of objects spread across a cluster - Built through parallel transformations (map, filter, etc) - Automatically rebuilt on failure - Controllable persistence (like caching in RAM)
  • 37. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)                 // base RDD
errors = lines.filter(_.startsWith(“ERROR”))         // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains(“foo”)).count           // action: the driver ships tasks to workers; results come back while the cached partitions stay on the workers
cachedMsgs.filter(_.contains(“bar”)).count
. . .
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
  • 38. Fault Tolerance RDDs track lineage information that can be used to efficiently reconstruct lost partitions Ex: messages = textFile(...).filter(_.startsWith(“ERROR”)).map(_.split('\t')(2)) Lineage: HDFS file → filtered RDD (filter) → mapped RDD (map)
  • 39. Spark in Java and Scala Java API: JavaRDD<String> lines = spark.textFile(…); JavaRDD<String> errors = lines.filter( new Function<String, Boolean>() { public Boolean call(String s) { return s.contains(“ERROR”); } }); errors.count(); Scala API: val lines = spark.textFile(…) val errors = lines.filter(s => s.contains(“ERROR”)) // can also write filter(_.contains(“ERROR”)) errors.count
  • 40. Which Language Should I Use? Standalone programs can be written in any, but console is only Python & Scala Python developers: can stay with Python for both Java developers: consider using Scala for console (to learn the API) Performance: Java / Scala will be faster (statically typed), but Python can do well for numerical work with NumPy
  • 41. Scala Cheat Sheet
Variables:
var x: Int = 7
var x = 7        // type inferred
val y = “hi”     // read-only
Functions:
def square(x: Int): Int = x*x
def square(x: Int): Int = { x*x } // last line returned
Collections and closures:
val nums = Array(1, 2, 3)
nums.map((x: Int) => x + 2)  // => Array(3, 4, 5)
nums.map(x => x + 2)         // => same
nums.map(_ + 2)              // => same
nums.reduce((x, y) => x + y) // => 6
nums.reduce(_ + _)           // => 6
  • 42. Learning Spark Easiest way: Spark interpreter (spark-shell or pyspark) Special Scala and Python consoles for cluster use Runs in local mode on 1 thread by default, but can control with MASTER environment var: MASTER=local ./spark-shell # local, 1 thread MASTER=local[2] ./spark-shell # local, 2 threads MASTER=spark://host:port ./spark-shell # Spark standalone cluster
  • 43. First Stop: SparkContext Main entry point to Spark functionality Created for you in Spark shells as variable sc In standalone programs, you'd make your own (see the sketch below and later slides for details)
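A hedged sketch of a standalone program creating its own SparkContext, written against the pre-1.0 spark package naming used elsewhere in these slides (import spark.SparkContext rather than org.apache.spark); the master, app name and computation are illustrative:

import spark.SparkContext
import spark.SparkContext._   // brings in the implicits for pair-RDD operations

object MyApp {
  def main(args: Array[String]) {
    // master can be "local", "local[N]" or "spark://host:port", as on the Learning Spark slide
    val sc = new SparkContext("local[2]", "MyApp")
    val nums = sc.parallelize(1 to 1000)
    println(nums.filter(_ % 2 == 0).count())
    sc.stop()
  }
}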
  • 44. Creating RDDs # Turn a local collection into an RDD sc.parallelize([1, 2, 3]) # Load text file from local FS, HDFS, or S3 sc.textFile(“file.txt”) sc.textFile(“directory/*.txt”) sc.textFile(“hdfs://namenode:9000/path/file”) # Use any existing Hadoop InputFormat sc.hadoopFile(keyClass, valClass, inputFmt, conf)
  • 45. Basic Transformations nums = sc.parallelize([1, 2, 3]) # Pass each element through a function squares = nums.map(lambda x: x*x) # => {1, 4, 9} # Keep elements passing a predicate even = squares.filter(lambda x: x % 2 == 0) # => {4} # Map each element to zero or more others nums.flatMap(lambda x: range(0, x)) # => {0, 0, 1, 0, 1, 2} (range(0, x) is a sequence of numbers 0, 1, …, x-1)
  • 46. Basic Actions nums = sc.parallelize([1, 2, 3]) # Retrieve RDD contents as a local collection nums.collect() # => [1, 2, 3] # Return first K elements nums.take(2) # => [1, 2] # Count number of elements nums.count() # => 3 # Merge elements with an associative function nums.reduce(lambda x, y: x + y) # => 6 # Write elements to a text file nums.saveAsTextFile(“hdfs://file.txt”)
  • 47. Working with Key-Value Pairs Spark’s “distributed reduce” transformations act on RDDs of key-value pairs Python: pair = (a, b) pair[0] # => a pair[1] # => b Scala: val pair = (a, b) pair._1 // => a pair._2 // => b Java: Tuple2 pair = new Tuple2(a, b); // class scala.Tuple2 pair._1 // => a pair._2 // => b
  • 48. Some Key-Value Operations pets = sc.parallelize([(“cat”, 1), (“dog”, 1), (“cat”, 2)]) pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)} pets.groupByKey() # => {(cat, Seq(1, 2)), (dog, Seq(1))} pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)} reduceByKey also automatically implements combiners on the map side
  • 49. Multiple Datasets visits = sc.parallelize([(“index.html”, “1.2.3.4”), (“about.html”, “3.4.5.6”), (“index.html”, “1.3.3.1”)]) pageNames = sc.parallelize([(“index.html”, “Home”), (“about.html”, “About”)]) visits.join(pageNames) # (“index.html”, (“1.2.3.4”, “Home”)) # (“index.html”, (“1.3.3.1”, “Home”)) # (“about.html”, (“3.4.5.6”, “About”)) visits.cogroup(pageNames) # (“index.html”, (Seq(“1.2.3.4”, “1.3.3.1”), Seq(“Home”))) # (“about.html”, (Seq(“3.4.5.6”), Seq(“About”)))
  • 50. Closure Mishap Example
class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...
  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param)   // NotSerializableException: MyCoolRddApp (or Log)
       .reduce(...)
  }
}
How to get around it:
class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param        // references only a local variable instead of this.param
    rdd.map(x => x + param_)
       .reduce(...)
  }
}
  • 52. Components (architecture diagram) Your program (sc = new SparkContext; f = sc.textFile(“…”); f.filter(…).count() ...) talks to the Spark client (app master), which holds the RDD graph, scheduler, block tracker and shuffle tracker; a cluster manager assigns work to Spark workers, each running task threads and a block manager on top of HDFS, HBase, …
  • 53. Example Job val sc = new SparkContext( “spark://...”, “MyJob”, home, jars) val file = sc.textFile(“hdfs://...”) val errors = file.filter(_.contains(“ERROR”)) // resilient distributed datasets (RDDs) errors.cache() errors.count() // action
  • 54. RDD Graph Dataset-level view: file: HadoopRDD (path = hdfs://...) → errors: FilteredRDD (func = _.contains(…), shouldCache = true) Partition-level view: one task per partition (Task 1, Task 2, ...)
  • 55. Data Locality First run: data not in cache, so use HadoopRDD’s locality prefs (from HDFS) Second run: FilteredRDD is in cache, so use its locations If something falls out of cache, go back to HDFS
  • 56. Broadcast Variables When one creates a broadcast variable b with a value v, v is saved to a file in a shared file system. The serialized form of b is a path to this file. When b’s value is queried on a worker node, Spark first checks whether v is in a local cache, and reads it from the file system if it isn’t.
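A hedged usage sketch, given a SparkContext sc as in the standalone example earlier; the lookup table and variable names are illustrative:

// a small (url, ip) dataset and a lookup table we want on every node
val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"), ("about.html", "3.4.5.6")))
val pageTitles = sc.broadcast(Map("index.html" -> "Home", "about.html" -> "About"))

val titledVisits = visits.map { case (url, ip) =>
  (pageTitles.value.getOrElse(url, "unknown"), ip)   // reads the broadcast value on the worker
}

The table is shipped and cached once per node instead of once per task.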
  • 57. Accumulators Each accumulator is given a unique ID when it is created. When the accumulator is saved, its serialized form contains its ID and the “zero” value for its type. On the workers, a separate copy of the accumulator is created for each thread that runs a task using thread-local variables, and is reset to zero when a task begins. After each task runs, the worker sends a message to the driver program containing the updates it made to various accumulators.
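A hedged usage sketch that counts malformed records during parsing (the input path, record format and names are illustrative):

val lines = sc.textFile("hdfs://.../logs")
val badLines = sc.accumulator(0)      // starts from the “zero” value described above

val fields = lines.flatMap { line =>
  val cols = line.split("\t")
  if (cols.length == 3) Some(cols)
  else { badLines += 1; None }        // each task updates its own local copy
}
fields.count()                        // updates are merged back on the driver after tasks finish
println("malformed lines: " + badLines.value)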
  • 58. Scheduling Process rdd1.join(rdd2).groupBy(…).filter(…) → RDD Objects: build the operator DAG → DAGScheduler: splits the graph into stages of tasks and submits each stage as ready (agnostic to operators) → TaskScheduler: launches tasks via the cluster manager and retries failed or straggling tasks (doesn't know about stages) → Worker: executes tasks, stores and serves blocks (block manager, threads); a failed stage is resubmitted
  • 59. Example: HadoopRDD partitions = one per HDFS block dependencies = none compute(partition) = read corresponding block preferredLocations(part) = HDFS block location partitioner = none
  • 60. Example: FilteredRDD partitions = same as parent RDD dependencies = “one-to-one” on parent compute(partition) = compute parent and filter it preferredLocations(part) = none (ask parent) partitioner = none
  • 61. Example: JoinedRDD partitions = one per reduce task dependencies = “shuffle” on each parent compute(partition) = read and join shuffled data preferredLocations(part) = none partitioner = HashPartitioner(numTasks) (so Spark will now know this data is hashed!)
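The three examples above all instantiate the same conceptual interface; here is an illustrative sketch of its shape (stub types and simplified signatures, not the real Spark code):

trait Partition { def index: Int }
trait Dependency          // narrow ("one-to-one") or wide ("shuffle")
trait Partitioner

trait RDDLike[T] {
  def partitions: Seq[Partition]                       // e.g. one per HDFS block
  def dependencies: Seq[Dependency]                    // how this RDD relates to its parents
  def compute(p: Partition): Iterator[T]               // how to materialize one partition
  def preferredLocations(p: Partition): Seq[String]    // data-locality hints
  def partitioner: Option[Partitioner]                 // e.g. a hash partitioner after a shuffle
}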
  • 62. Dependency Types “Narrow” deps: map, filter, union, join with inputs co-partitioned “Wide” (shuffle) deps: groupByKey, join with inputs not co-partitioned
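A tiny hedged illustration of the two kinds of dependency, reusing the pets pair RDD from the key-value slides (and assuming the pair-RDD implicits are imported, as in the SparkContext sketch earlier):

val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))

val narrow = pets.mapValues(_ + 1)   // narrow: each output partition depends on a single parent partition
val wide   = pets.groupByKey()       // wide: every output partition may need data from every parent partition (shuffle)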
  • 63. DAG Scheduler Interface: receives a “target” RDD, a function to run on each partition, and a listener for results Roles: Build stages of Task objects (code + preferred loc.) Submit them to TaskScheduler as ready Resubmit failed stages if outputs are lost
  • 64. Scheduler Optimizations Pipelines narrow ops. within a stage Picks join algorithms based on partitioning (minimize shuffles) Reuses previously cached data (diagram: a job with map, union, groupBy and join split into three stages around shuffle boundaries; previously computed partitions are skipped)
  • 65. Example : K-Means Clustering using Spark
  • 66. Clustering Grouping data according to similarity (e.g. an archaeological dig, plotted as distance east vs. distance north)
  • 67. Clustering Grouping data according to similarity (same example)
  • 69. K-Means: preliminaries Data: a collection of values; data = lines.map(line => parseVector(line))
  • 70. K-Means: preliminaries Dissimilarity: squared Euclidean distance; dist = p.squaredDist(q)
  • 71-72. K-Means: preliminaries K = number of clusters; data assignments to clusters S1, S2, ..., SK
  • 73-86. K-Means Algorithm (built up over several slides) • Initialize K cluster centers • Repeat until convergence: assign each data point to the cluster with the closest center, then set each cluster center to be the mean of its cluster's data points:
centers = data.takeSample(false, K, seed)
closest = data.map(p => (closestPoint(p, centers), p))
pointsGroup = closest.groupByKey()
newCenters = pointsGroup.mapValues(ps => average(ps))
while (dist(centers, newCenters) > ɛ)
  • 87. K-Means Source The per-step snippets above assemble into the driver loop (initialize the centers, then iterate until they stop moving):
centers = data.takeSample(false, K, seed)
while (d > ɛ) {
  closest = data.map(p => (closestPoint(p, centers), p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps))
  d = distance(centers, newCenters)
  centers = newCenters   // collected back to the driver for the next iteration
}
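For completeness, a self-contained hedged sketch of the same loop, written against the pre-1.0 spark package and its spark.util.Vector as used at the time of these slides; the file layout, K and the convergence threshold are illustrative, and it assumes every cluster keeps at least one point on each iteration:

import spark.SparkContext
import spark.SparkContext._
import spark.util.Vector

object SparkKMeans {
  def parseVector(line: String): Vector =
    new Vector(line.split(' ').map(_.toDouble))

  def closestPoint(p: Vector, centers: Array[Vector]): Int =
    centers.indices.minBy(i => p.squaredDist(centers(i)))

  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "SparkKMeans")
    val data = sc.textFile(args(0)).map(parseVector).cache()
    val K = 3
    val epsilon = 1e-4

    var centers = data.takeSample(false, K, 42).toArray
    var delta = Double.PositiveInfinity

    while (delta > epsilon) {
      // pair each point with its cluster id, carrying (point, 1) so sums and counts travel together
      val closest = data.map(p => (closestPoint(p, centers), (p, 1)))
      val newCenters = closest
        .reduceByKey { case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2) }
        .mapValues { case (sum, n) => sum / n }   // mean of each cluster
        .collectAsMap()

      delta = centers.indices.map(i => centers(i).squaredDist(newCenters(i))).sum
      centers = centers.indices.map(i => newCenters(i)).toArray
    }

    centers.foreach(println)
    sc.stop()
  }
}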
  • 88. Ease of use  Interactive shell: Useful for featurization, pre-processing data  Lines of code for K-Means - Spark ~ 90 lines – (Part of hands-on tutorial !) - Hadoop/Mahout ~ 4 files, > 300 lines
  • 90. Why PageRank? Good example of a more complex algorithm Multiple stages of map & reduce Benefits from Spark’s in-memory caching Multiple iterations over the same data
  • 91. Basic Idea Give pages ranks (scores) based on links to them: links from many pages → high rank; a link from a high-rank page → high rank Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
  • 92-97. Algorithm 1. Start each page at a rank of 1 2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors 3. Set each page's rank to 0.15 + 0.85 × contribs (the example graph's ranks evolve over the iterations and converge to a final state of 0.46, 1.37, 1.44 and 0.73)
  • 98. Scala Implementation
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
  • 101. What is Spark Streaming? Framework for large-scale stream processing Scales to 100s of nodes Can achieve second-scale latencies Integrates with Spark's batch and interactive processing Provides a simple batch-like API for implementing complex algorithms Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
  • 102. Requirements Scalable to large clusters Second-scale latencies Simple programming model
  • 103. Requirements Scalable to large clusters Second-scale latencies Simple programming model Integrated with batch & interactive processing
  • 104. Stateful Stream Processing Traditional streaming systems have an event-driven, record-at-a-time processing model: each node has mutable state, and for each record it updates that state and sends new records. State is lost if a node dies! Making stateful stream processing fault-tolerant is challenging.
  • 105. Existing Streaming Systems Storm: replays a record if it is not processed by a node; processes each record at least once, so it may update mutable state twice, and mutable state can be lost due to failure! Trident: uses transactions to update state; processes each record exactly once, but per-state transaction updates are slow
  • 106. Requirements Scalable to large clusters Second-scale latencies Simple programming model Integrated with batch & interactive processing Efficient fault-tolerance in stateful computations
  • 107. Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs:  Chop up the live stream into batches of X seconds  Spark treats each batch of data as RDDs and processes them using RDD operations  Finally, the processed results of the RDD operations are returned in batches (live data stream → Spark Streaming → batches of X seconds → Spark → processed results)
  • 108. Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs:  Batch sizes as low as ½ second, latency ~ 1 second  Potential for combining batch processing and streaming processing in the same system
  • 109. Example – Get hashtags from Twitter val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) DStream: a sequence of RDDs representing a stream of data; each batch (@ t, @ t+1, @ t+2, …) of the tweets DStream coming from the Twitter Streaming API is stored in memory as an RDD (immutable, distributed)
  • 110. Example – Get hashtags from Twitter val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) val hashTags = tweets.flatMap (status => getTags(status)) transformation: modify data in one DStream to create another DStream; flatMap is applied to every batch, so new RDDs (e.g. [#cat, #dog, …]) are created for every batch of the new hashTags DStream
  • 111. Example – Get hashtags from Twitter val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) val hashTags = tweets.flatMap (status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...") output operation: pushes data to external storage; every batch of the hashTags DStream is saved to HDFS
  • 112. Java Example Scala: val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) val hashTags = tweets.flatMap (status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...") Java: JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { }) hashTags.saveAsHadoopFiles("hdfs://...") // a Function object defines the transformation
  • 113. Fault-tolerance RDDs remember the sequence of operations that created them from the original fault-tolerant input data Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant Data lost due to worker failure can be recomputed from the replicated input data (e.g. lost partitions of the hashTags RDD are recomputed, via flatMap, from the tweets RDD on other workers)
  • 114. Key concepts DStream – sequence of RDDs representing a stream of data Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets Transformations – modify data from one DStream to another Standard RDD operations – map, countByValue, reduce, join, … Stateful operations – window, countByValueAndWindow, … Output Operations – send data to an external entity saveAsHadoopFiles – saves to HDFS foreach – do anything with each batch of results
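A hedged sketch of a stateful window operation, continuing the hashtag example; the Twitter credentials and getTags are placeholders as on the earlier slides, the batch and window durations are illustrative, and it uses the early, pre-1.0 Spark Streaming package naming:

import spark.streaming.{Seconds, StreamingContext}
import spark.streaming.StreamingContext._

val ssc = new StreamingContext("local[2]", "HashTagWindows", Seconds(1))
val tweets = ssc.twitterStream(twitterUsername, twitterPassword)
val hashTags = tweets.flatMap(status => getTags(status))

// hashtag frequencies over the last 60 seconds, recomputed every 5 seconds
val windowedCounts = hashTags.countByValueAndWindow(Seconds(60), Seconds(5))
windowedCounts.print()

ssc.start()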
  • 115. Example 2 – Count the hashtags val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>) val hashTags = tweets.flatMap (status => getTags(status)) val tagCounts = hashTags.countByValue() (each batch runs flatMap, map and reduceByKey, and the tagCounts DStream yields results such as [(#cat, 10), (#dog, 25), ...])
  • 116. Fault-tolerant Stateful Processing All intermediate data are RDDs, hence can be recomputed if lost (this holds for the hashTags and tagCounts DStreams across batches t-1, t, t+1, t+2, t+3)
  • 117. Fault-tolerant Stateful Processing State data is not lost even if a worker node dies, and this does not change the value of your result Exactly-once semantics for all transformations No double counting!
  • 118. Other Interesting Operations Maintaining arbitrary state, track sessions Maintain per-user mood as state, and update it with his/her tweets tweets.updateStateByKey(tweet => updateMood(tweet)) Do arbitrary Spark RDD computation within DStream Join incoming tweets with a spam file to filter out bad tweets tweets.transform(tweetsRDD => { tweetsRDD.join(spamHDFSFile).filter(...) })
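As a hedged aside on how such a state update is usually wired up: updateStateByKey takes a function from the batch's new values plus the previous state to the new state. The example below keeps a running tweet count per user; tweetsByUser (a DStream of (userId, tweetText) pairs) and the names are illustrative, not from the deck:

// new tweets for a user in this batch, plus the previously stored count (None on the first batch)
def updateCount(newTweets: Seq[String], previous: Option[Long]): Option[Long] =
  Some(previous.getOrElse(0L) + newTweets.size)

val tweetTotals = tweetsByUser.updateStateByKey(updateCount _)   // DStream of (userId, total tweets so far)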
  • 119. Performance Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency Tested with 100 streams of data on 100 EC2 instances with 4 cores each
  • 121. THANKS
  • 122. Bibliography Slides for Scalding shamelessly inspired by: - Mario Pasteurelli, http://fr.slideshare.net/melrief/scalding-programming-model-for-hadoop - Dean Wampler (@deanwampler), Scalding workshop code: https://github.com/ThinkBigAnalytics/scalding-workshop and slides: http://polyglotprogramming.com/papers/ScaldingForHadoop.pdf SPARK: http://spark-project.org/documentation/