Created at the University of California, Berkeley, Apache Spark combines a distributed computing system running on computer clusters with a simple and elegant way of writing programs. Spark is considered the first open source framework to make distributed programming truly accessible to data scientists. Here you can find an introduction and the basic concepts.
6. Rise of the data center
Huge amounts of data spread out across many commodity servers
Data Processing Requirements
● Lots of data → scale out
● Network bottleneck → Distributed Computing
● Hardware failure → Fault Tolerance
● Abstraction to organize parallelizable tasks → MapReduce
7. MapReduce
Word-count example: Input → Split → Map → [Combine] → Shuffle & Sort → Reduce → Output
● Input, divided into four splits: "AA BB AA", "AA CC DD", "AA EE DD", "BB FF AA"
● Map: each split emits a (word, 1) pair per word, e.g. (AA, 1), (BB, 1), (AA, 1) for the first split
● Combine (optional): pre-aggregates within a split, e.g. (AA, 2), (BB, 1) for the first split
● Shuffle & Sort: groups the pairs by key across all splits
● Reduce: sums the counts for each key
● Output: AA, 5 / BB, 2 / CC, 1 / DD, 2 / EE, 1 / FF, 1
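For comparison, the same word count can be written with Spark's RDD API, which is introduced in the following slides. This is a hedged sketch: it assumes the SparkContext sc provided by spark-shell and an illustrative input file named words.txt.

// Assumes the spark-shell SparkContext `sc`; the file name is an illustrative assumption.
val counts = sc.textFile("words.txt")      // Input / Split
  .flatMap(line => line.split("\\s+"))     // Map: one element per word
  .map(word => (word, 1))                  // emit (word, 1) pairs
  .reduceByKey(_ + _)                      // Shuffle & Sort + Reduce: sum counts per key
counts.collect().foreach(println)          // Output: (AA,5), (BB,2), (CC,1), ...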
9. Spark Components
SparkContext
● Main entry point for Spark functionality
● Represents the connection to a Spark cluster
● Tells Spark how & where to access a cluster
● Can be used to create RDDs, accumulators and broadcast variables on that cluster
Driver program
● The “main” process, coordinated through the SparkContext object
● Allows configuring any Spark process with specific parameters
● Spark actions are executed in the Driver
● spark-shell is itself a driver program
● Application → driver program + executors
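As a hedged sketch of how a standalone driver program creates its SparkContext (the application name and the local master URL are illustrative choices, not taken from the slides):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative configuration: app name and master URL are assumptions.
val conf = new SparkConf()
  .setAppName("IntroToSpark")
  .setMaster("local[*]")        // run locally, one worker thread per core
val sc = new SparkContext(conf) // main entry point: the connection to the cluster

In spark-shell this step is done for you, and the resulting SparkContext is exposed as the variable sc.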
10. Spark Components
Cluster Manager
● External service for acquiring resources on the cluster
● Variety of cluster managers
○ Local
○ Standalone
○ YARN
○ Mesos
● Deploy mode:
○ Cluster → framework launches the driver inside of the cluster
○ Client → submitter launches the driver outside of the cluster
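As a rough illustration, the cluster manager is selected through the master URL when the configuration is built; the host names and ports below are placeholders, only the URL forms are standard.

import org.apache.spark.SparkConf

// Placeholders for hosts/ports; only the URL schemes are standard.
val local      = new SparkConf().setMaster("local[*]")                  // no external cluster manager
val standalone = new SparkConf().setMaster("spark://master-host:7077")  // Spark standalone
val onYarn     = new SparkConf().setMaster("yarn")                      // YARN
val onMesos    = new SparkConf().setMaster("mesos://mesos-host:5050")   // Mesos

The deploy mode (cluster vs. client) is normally chosen when the application is submitted, e.g. via spark-submit's --deploy-mode flag.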
11. Spark Components
Worker Node
● Any node that can run application code in the cluster
● Key Terms
○ Executor: A process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
○ Task: Unit of work that will be sent to one executor
○ Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect)
○ Stage: A smaller set of tasks inside a job
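To connect these terms, here is a hedged sketch (the data and operations are made up for illustration): each action spawns a job, the shuffle introduced by reduceByKey splits that job into stages, and each stage runs as tasks on the executors.

// Assumes an existing SparkContext `sc`; data is illustrative.
val pairs  = sc.parallelize(Seq("AA", "BB", "AA", "CC")).map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)  // shuffle boundary → stage boundary
counts.collect()                       // action: spawns one job, split into stages, executed as tasks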
13. RDD
Resilient Distributed Datasets
● Collection of objects that is distributed across the nodes of a cluster
● Data operations are performed on RDDs
● Once created, RDDs are immutable
● RDDs can be persisted in memory or on disk
● Fault tolerant
Example: numbers = RDD[1,2,3,4,5,6,7,8,9,10], distributed across the executors of three worker nodes as the partitions [1,5,6,9], [2,7,8] and [3,4,10].
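A hedged sketch of the same idea in code; the number of partitions is an illustrative choice, and glom() is used only to make the per-partition contents visible.

// Assumes an existing SparkContext `sc`; 3 partitions is an illustrative choice.
val numbers = sc.parallelize(1 to 10, numSlices = 3)

// glom() turns each partition into an array, so we can print how the data is split,
// e.g. [1,2,3] / [4,5,6] / [7,8,9,10]
numbers.glom().collect().foreach(part => println(part.mkString("[", ",", "]")))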
14. RDD
Main Concepts
● Lazy Evaluation
● Operation: Transformation / Action
● Lineage
● Base RDD
● Partition
● Task
● Level of Parallelism
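A hedged illustration of lazy evaluation and lineage (data and operations are made up): transformations only record how to compute a result, and toDebugString prints the chain of RDDs leading back to the base RDD.

// Assumes an existing SparkContext `sc`; data is illustrative.
val base     = sc.parallelize(1 to 10)   // base RDD
val doubled  = base.map(_ * 2)           // lazy: nothing is computed yet
val filtered = doubled.filter(_ > 10)    // still lazy

println(filtered.toDebugString)          // prints the lineage back to the base RDD
filtered.collect()                       // action: the whole lineage is computed now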
15. RDD
Internally, each RDD is characterized by five main properties:
● A list of partitions
● A function for computing each split
● A list of dependencies on other RDDs
● Optionally, a Partitioner for key-value RDDs
● Optionally, a list of preferred locations to compute each split on

Method              Location   Input      Output
getPartitions()     Driver     -          [Partition]
compute()           Worker     Partition  Iterable
getDependencies()   Driver     -          [Dependency]
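These properties are visible through the public RDD API; a hedged sketch of inspecting them from the driver (the RDD itself is illustrative):

// Assumes an existing SparkContext `sc`; the RDD is illustrative.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)), 2).reduceByKey(_ + _)

println(pairs.partitions.length)   // the list of partitions
println(pairs.dependencies)        // dependencies on other RDDs (a shuffle dependency here)
println(pairs.partitioner)         // Some(HashPartitioner) for this key-value RDD
println(pairs.preferredLocations(pairs.partitions(0)))  // preferred locations for a split (often empty locally)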
16. RDD
Creating RDDs
● Text file → file / set of files (local or distributed)
  val textFile = sc.textFile("README.md")
● Collection → memory
  val input = sc.parallelize(List(1, 2, 3, 4))
● Database → Spark loads and writes data from/to the database
  val casRdd = sc.newAPIHadoopRDD(
    job.getConfiguration(),
    classOf[ColumnFamilyInputFormat],
    classOf[ByteBuffer],
    classOf[SortedMap[ByteBuffer, IColumn]])
● Transformation → another RDD
  val input = rddFather.map(value => value.toString)
19. Data Operations
Transformations
❏ Create a new dataset from an existing one
❏ Lazily evaluated (a transformed RDD is executed only when an action runs on it)
❏ Examples: filter(), map(), flatMap()
Actions
❏ Return a value to the driver program after running a computation on the dataset
❏ Examples: count(), reduce(), take(), collect()
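A small hedged illustration of the difference (data is made up): a transformation returns another RDD and triggers no work, while an action runs the computation and returns a plain value to the driver.

// Assumes an existing SparkContext `sc`; data is illustrative.
val lines   = sc.parallelize(Seq("spark", "hadoop", "spark sql"))
val matched = lines.filter(_.contains("spark"))  // transformation: returns an RDD, nothing runs yet
val n       = matched.count()                    // action: runs the computation, returns 2 to the driver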
20. Transformations
Commonly Used Transformations
map(func)       Return a new distributed dataset formed by passing each element of the source through a function func
filter(func)    Return a new dataset formed by selecting those elements of the source on which func returns true
flatMap(func)   Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
distinct()      Return a new dataset that contains the distinct elements of the source dataset
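A hedged sketch exercising each of these (the input data is made up; the results in the comments follow from that data):

// Assumes an existing SparkContext `sc`; data is illustrative.
val words = sc.parallelize(Seq("aa bb", "aa cc", "aa bb"))

words.map(_.toUpperCase).collect()                // Array("AA BB", "AA CC", "AA BB")
words.filter(_.contains("bb")).collect()          // Array("aa bb", "aa bb")
words.flatMap(_.split(" ")).collect()             // Array("aa", "bb", "aa", "cc", "aa", "bb")
words.flatMap(_.split(" ")).distinct().collect()  // Array("aa", "bb", "cc") - order not guaranteed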
25. Transformations
Operations on mathematical sets
union(otherRDD)         Return a new RDD that contains the union of the elements in the source dataset and the argument
intersection(otherRDD)  Return a new RDD that contains the intersection of the elements in the source dataset and the argument
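A brief hedged example (data is illustrative; note that union keeps duplicates, while intersection removes them):

// Assumes an existing SparkContext `sc`; data is illustrative.
val a = sc.parallelize(Seq(1, 2, 3, 4))
val b = sc.parallelize(Seq(3, 4, 5, 6))

a.union(b).collect()         // Array(1, 2, 3, 4, 3, 4, 5, 6) - duplicates are kept
a.intersection(b).collect()  // Array(3, 4) - result order is not guaranteed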
28. Actions
Commonly Used Actions
count()                     Returns the number of elements in the dataset
reduce(func)                Aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel
collect()                   Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data
take(n)                     Returns an array with the first n elements
first()                     Returns the first element of the dataset
takeOrdered(n, [ordering])  Returns the first n elements of the RDD using their natural order or a custom ordering
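A hedged sketch exercising these actions (data is illustrative; the results in the comments follow from that data):

// Assumes an existing SparkContext `sc`; data is illustrative.
val nums = sc.parallelize(Seq(5, 3, 8, 1, 9, 3))

nums.count()        // 6
nums.reduce(_ + _)  // 29 (sum; + is commutative and associative)
nums.collect()      // Array(5, 3, 8, 1, 9, 3)
nums.take(2)        // Array(5, 3)
nums.first()        // 5
nums.takeOrdered(3) // Array(1, 3, 3) - natural ascending order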
36. WORKSHOP
In order to practice the main concepts, please complete the exercises proposed at our GitHub repository by clicking the following link:
○ Homework