This document provides an overview of Spark, its core abstraction of resilient distributed datasets (RDDs), and common transformations and actions. It discusses how Spark partitions and distributes data across a cluster, its lazy evaluation model, and the concept of dependencies between RDDs. Common use cases such as word counting, bucketing user data, finding top results, and analytics reporting are demonstrated. Key topics include avoiding expensive shuffle operations, choosing the right aggregation method for the use case, and caching data in memory where it helps.
2. Spark
•Spark is a cluster computing engine.
•Provides high-level APIs in Scala, Java, Python and R.
•Provides high-level tools:
•Spark SQL.
•MLlib.
•GraphX.
•Spark Streaming.
3. RDD
•The basic abstraction in Spark is the RDD.
•Stands for: Resilient Distributed Dataset.
•It is a collection of items whose source may be, for example:
•Hadoop (HDFS).
•JDBC.
•ElasticSearch.
•And more…
4. D is for Partitioned
• A partition is a sub-collection of the data that should fit into memory
• Partition + transformation = Task
• This is the distributed part of the RDD
• Partitions are recomputed in case of failure – that is the Resilient part
[Diagram: a text file whose lines are split into partitions, e.g. lines 1–100, 101–200 and 201–300.]
5. D is for Dependency
• RDD can depend on other RDDs.
• RDD is lazy.
• For example, rdd.map(String::toUpperCase)
• Creates a new RDD which depends on the original one.
• Contains only meta-data (i.e., the computing function).
• Only when a specific command (known as an action, like collect) is called will the flow actually be computed.
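A minimal sketch of the lazy flow (the file name is hypothetical):
val lines = sc.textFile("data.txt")   // nothing is read yet
val upper = lines.map(_.toUpperCase)  // still lazy: a new RDD holding only meta-data
val result = upper.collect()          // the action triggers the actual computation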
6. The famous word count example
sc.textFile("src/main/resources/books.txt")
.flatMap(line => line.split(" "))
.map(w => (w,1))
.reduceByKey((c1, c2) => c1 + c2)
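Note that this Scala chain contains only transformations, so nothing runs yet; an action such as collect (as in the Java version on the next slide) or saveAsTextFile is needed to trigger the computation.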
7. The famous word count example
sc.textFile("src/main/resources/book.txt")
.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
.mapToPair(word -> new Tuple2<>(word, 1))
.reduceByKey((a, b) -> a + b)
.collectAsMap();
8. How does it work?
•Driver:
•Executes the main program
•Creates the RDDs
•Collects the results
•Executors:
•Execute the RDD operations
•Participate in the shuffle
Image taken from https://spark.apache.org/docs/latest/cluster-overview.html
9. Spark Terminology
•Application – the main program that runs on the driver, creates the RDDs and collects the results.
•Job – a sequence of transformations on an RDD until an action occurs.
•Stage – a sequence of transformations on an RDD until a shuffle occurs.
•Task – a sequence of transformations on a single partition until a shuffle occurs.
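As a hedged illustration, the word-count pipeline from the earlier slide breaks down like this (one task per partition in each stage):
sc.textFile("books.txt")             // Stage 1 starts here
  .flatMap(line => line.split(" "))  // narrow: stays in Stage 1, pipelined per partition
  .map(w => (w, 1))                  // narrow: still Stage 1
  .reduceByKey(_ + _)                // shuffle: Stage 2 begins here
  .collect()                         // action: triggers the whole job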
10. RDD from Collection
•You can create an RDD from a collection:
sc.parallelize(list)
•Takes a sequence from the driver and distributes it across the nodes.
•Note, the distribution is lazy, so be careful with mutable collections!
•Important: if it is a range collection, use the range method, as it does not create the collection on the driver.
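A short sketch of the two approaches (the values are arbitrary):
val listRDD = sc.parallelize(Seq(1, 2, 3, 4))  // ships the sequence from the driver to the executors
val rangeRDD = sc.range(0, 1000000)            // lazily generates 0..999999 without materializing it on the driver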
11. RDD from file
•Spark supports reading files, directories and compressed files.
•The following methods are available out of the box:
•textFile – returns an RDD[String] (lines).
•wholeTextFiles – returns an RDD[(String, String)] with filename and content.
•sequenceFile – reads Hadoop sequence files into an RDD[(K, V)].
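A minimal sketch of the three methods (the paths are hypothetical):
val lines = sc.textFile("hdfs:///data/book.txt")              // RDD[String]: one element per line
val files = sc.wholeTextFiles("hdfs:///data/books/")          // RDD[(String, String)]: (filename, content)
val pairs = sc.sequenceFile[String, Int]("hdfs:///data/seq")  // RDD[(String, Int)] from a Hadoop sequence file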
12. RDD Actions
•Return values by evaluating the RDD (not lazy):
•collect() – returns a list containing all the elements of the RDD. This is the main method that evaluates the RDD.
•count() – returns the number of elements in the RDD.
•first() – returns the first element of the RDD.
•foreach(f) – performs the function f on each element of the RDD.
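A quick sketch of these actions on a tiny RDD:
val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.collect()                 // Array("a", "b", "c") – brings everything to the driver
rdd.count()                   // 3
rdd.first()                   // "a"
rdd.foreach(x => println(x))  // runs on the executors, so the output appears in their logs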
13. RDD Transformations
•Return a pointer to a new RDD carrying the transformation meta-data:
•map(func) – returns a new distributed dataset formed by passing each element of the source through the function func.
•filter(func) – returns a new dataset formed by selecting those elements of the source on which func returns true.
•flatMap(func) – similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
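A short sketch of the three transformations chained together:
val lines = sc.parallelize(Seq("to be", "or not"))
val words = lines.flatMap(_.split(" "))  // "to", "be", "or", "not"
val caps  = words.map(_.toUpperCase)     // "TO", "BE", "OR", "NOT"
val longW = caps.filter(_.length > 2)    // "NOT"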
14. Shuffle
•Shuffle operations repartition the data across the network.
•They can be very expensive operations in Spark.
•You must be aware of where and why a shuffle happens.
•Order is not guaranteed inside a partition.
•Popular operations that cause a shuffle: groupBy*, reduceBy*, sort*, aggregateBy* and join/intersect operations on multiple RDDs.
16. Transformations that shuffle
*Taken from the official Apache Spark documentation
•distinct([numTasks]) – returns a new dataset that contains the distinct elements of the source dataset.
•groupByKey([numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
•reduceByKey(func, [numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
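A minimal sketch contrasting the two grouping transformations:
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.distinct()          // removes duplicate (K, V) pairs
pairs.groupByKey()        // ("a", [1, 3]), ("b", [2]) – ships every value across the network
pairs.reduceByKey(_ + _)  // ("a", 4), ("b", 2) – combines map-side before the shuffle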
17. Transformations that shuffle
*Taken from the official Apache Spark documentation
•join(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
•sort, sortByKey…
•More at http://spark.apache.org/docs/latest/programming-guide.html
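A small join sketch (the data is made up):
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val visits = sc.parallelize(Seq((1, "page1"), (1, "page2")))
users.join(visits)           // (1, ("alice", "page1")), (1, ("alice", "page2"))
users.leftOuterJoin(visits)  // also keeps (2, ("bob", None))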
18. Use case #1: Bucketing
• Every day we aggregate lots (millions/billions) of user actions
• We want all the actions of a single user to be saved in a file of its own
• Sounds simple..
• But should we use groupByKey?
• Or reduceByKey? (a sketch of the trade-off follows)
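The deck's answer slides are not included here, but here is a minimal sketch of the trade-off, assuming actions is an RDD of (userId, actionLine) pairs (hypothetical names and data):
val actions = sc.parallelize(Seq(("u1", "click"), ("u1", "buy"), ("u2", "click")))

// groupByKey: every action crosses the network, but for per-user files we do need all the values
val grouped = actions.groupByKey()               // ("u1", [click, buy]), ("u2", [click])

// reduceByKey: combines map-side, but here "combining" is just concatenation,
// so roughly the same amount of data is shuffled anyway
val reduced = actions.reduceByKey(_ + "\n" + _)  // ("u1", "click\nbuy"), ("u2", "click")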
21. Spark SQL
•Spark module for structured data
•Spark SQL provides a unified engine (Catalyst) with 3 APIs:
•SQL (out of scope today).
•DataFrames (untyped).
•Datasets (typed) – only Scala and Java as of Spark 2.x
22. Use Case #2: Top results
•Same user activities as before
•Now we want to find the top active users
•Sounds simple:
1. Group by user id
2. Count the actions in each group
3. Sort by count
4. Take the top X users
• But sort is expensive.. (a sketch follows)
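One hedged way to avoid the full sort: count per user with reduceByKey, then use the top action, which keeps only X candidates per partition and merges them on the driver instead of sorting everything (reusing the hypothetical actions RDD from use case #1):
val counts = actions.map { case (user, _) => (user, 1) }.reduceByKey(_ + _)
val topUsers = counts.top(10)(Ordering.by(_._2))  // no global sort: each partition keeps its own top 10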
24. DataFrame and DataSets
• Originally Spark provided only DataFrames.
• A DataFrame is conceptually a table with typed columns.
• However, it is not typed at compile time.
• Starting with Spark 1.6, Datasets were introduced.
• A Dataset also represents a table with columns, but the row is typed at compile time.
25. DataFrame and DataSets
val flightsDF: DataFrame =
  spark.read.option("inferSchema", true).csv("…flights.csv")
VS
val flightsDS: Dataset[Flight] =
  spark.read.option("inferSchema", true)
    .csv("…flights.csv").as[Flight]
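For the Dataset version to compile, a case class and an Encoder are needed; a minimal sketch (the Flight fields are hypothetical, not taken from the deck):
case class Flight(origin: String, destination: String, departure: String, arrival: String)

import spark.implicits._  // brings the Encoder[Flight] that .as[Flight] requires
// note: without .option("header", true) the inferred CSV columns are named _c0, _c1, …
// and would not match the case class fields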
26. Use Case #3: Analytics report
•Our input is a CSV with flight records of different airports
•Each record is a flight from one airport to another, including departure and arrival info
•We want to know how many flights arrived at and departed from every airport
•All we need to do is group by the airports and then count.. (a sketch follows)
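A hedged sketch of the grouping, reusing the flightsDS Dataset and the hypothetical Flight fields from the previous slides:
val departures = flightsDS.groupBy("origin").count()       // flights departing from each airport
val arrivals   = flightsDS.groupBy("destination").count()  // flights arriving at each airport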
27. To summarize
•We learned what Spark is
•Got a taste of RDDs and Spark SQL
•Try to avoid shuffle – shuffle is expensive
•Pick the aggregation method according to the use case
•Caching may help