What Is Apache Spark?
Apache Spark is a cluster computing platform designed to be
fast and general purpose.
On the speed side, Spark extends the popular MapReduce
model to efficiently support more types of computations,
including interactive queries and stream processing.
On the generality side, Spark is designed to cover a wide
range of workloads that previously required separate
distributed systems, including batch applications, iterative
algorithms, interactive queries, and streaming.
Spark is designed to be highly accessible, offering simple APIs
in Python, Java, Scala, and SQL, and rich built-in libraries.
Spark itself is written in Scala, and runs on the Java Virtual
Machine (JVM).
The Spark stack
 Spark Core
 Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems, and
more. Spark Core is also home to the API that defines resilient distributed datasets
(RDDs).
 Spark SQL
 Spark SQL is Spark’s package for working with structured data. It allows querying data via
SQL as well as the Apache Hive variant of SQL, HQL.
 Spark Streaming
 Spark Streaming is a Spark component that enables processing of live streams of data.
 MLlib
 Spark comes with a library containing common machine learning (ML) functionality,
called MLlib. MLlib provides multiple types of machine learning algorithms, including
classification, regression, clustering, and collaborative filtering, as well as supporting
functionality such as model evaluation and data import.
 GraphX
 GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and
performing graph-parallel computations.
 Cluster Managers
 Under the hood, Spark is designed to efficiently scale up from one to many thousands of
compute nodes. To achieve this while maximizing flexibility, Spark can run over a variety
of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster
manager included in Spark itself called the Standalone Scheduler.
Storage Layers for Spark
• Spark can create distributed datasets from any file
stored in the Hadoop distributed file system (HDFS) or
other storage systems supported by the Hadoop APIs
(including your local file system, Amazon S3,
Cassandra, Hive, HBase, etc.).
• It’s important to remember that Spark does not
require Hadoop; it simply has support for storage
systems implementing the Hadoop APIs.
• Spark supports text files, SequenceFiles, Avro,
Parquet, and any other Hadoop InputFormat.
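• For example (a minimal sketch; the paths and bucket name are placeholders, and the matching Hadoop filesystem libraries and credentials are assumed to be configured):
 val localLines = sc.textFile("file:///tmp/data.txt") // local file system
 val hdfsLines = sc.textFile("hdfs://namenode:8020/data/input.txt") // HDFS
 val s3Lines = sc.textFile("s3n://my-bucket/data/input.txt") // Amazon S3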
Installing Spark
• Download and extract
 Download a compressed TAR file, or tar ball.
 You don’t need to have Hadoop, but if you have an existing Hadoop cluster or
HDFS installation, download the matching version.
 Extract the tar.
 Update your .bashrc file (for example, setting SPARK_HOME and adding its bin directory to PATH).
• Spark directory contents
 README.md
 Contains short instructions for getting started with Spark.
 bin
 Contains executable files used to interact with Spark in various ways, such as the shells.
 core, streaming, python, …
 Contains the source code of major components of the Spark project.
 examples
 Contains some helpful Spark standalone jobs that you can look at and run to learn about
the Spark API.
Core Spark Concepts
• Every Spark application consists of a driver program
that launches various parallel operations on a cluster.
The driver program contains your application’s main
function and defines distributed datasets on the cluster,
then applies operations to them.
 Driver programs access Spark through a
SparkContext object, which represents a
connection to a computing cluster.
 Once you have a SparkContext, you can
use it to build RDDs.
 To run these operations, the driver
program typically manages a number of
worker processes called executors.
Spark’s Python and Scala Shells
• Spark comes with interactive shells that enable ad hoc data
analysis.
• Unlike most other shells, however, which let you manipulate
data using the disk and memory on a single machine, Spark’s
shells allow you to interact with data that is distributed on
disk or in memory across many machines, and Spark takes
care of automatically distributing this processing.
• bin/pyspark and bin/spark-shell to open the respective shell.
• When the shell starts, you will see a lot of log messages. The verbosity
can be controlled by copying conf/log4j.properties.template to
conf/log4j.properties and changing the log4j.rootCategory line (for example, from INFO to WARN).
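• For example, inside bin/spark-shell (a minimal sketch; sc is created for you, and README.md is assumed to sit in the directory you started the shell from):
 scala> val lines = sc.textFile("README.md")
 scala> lines.count() // number of lines in the file
 scala> lines.first() // first line of the file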
Standalone Applications
• The main difference from using it in the shell is that you need to
initialize your own SparkContext. After that, the API is the same.
• Initializing Spark in Scala
 import org.apache.spark.SparkConf
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 val conf = new SparkConf().setMaster("local").setAppName("WordCount")
 val sc = new SparkContext(conf)
• Once we have our build defined, we can easily package and run our
application using the bin/spark-submit script.
• The spark-submit script sets up a number of environment variables
used by Spark.
 Maven build and run (mvn clean && mvn compile && mvn package)
 $SPARK_HOME/bin/spark-submit \
 --class com.oreilly.learningsparkexamples.mini.java.WordCount \
 ./target/learning-spark-mini-example-0.0.1.jar \
 ./README.md ./wordcounts
Word count Scala application
• // Create a Scala Spark Context.
• val conf = new SparkConf().setAppName("wordCount")
• val sc = new SparkContext(conf)
• // Load our input data.
• val input = sc.textFile(inputFile)
• // Split it up into words.
• val words = input.flatMap(line => line.split(" "))
• // Transform into pairs and count.
• val counts = words.map(word => (word, 1)).reduceByKey{case (x, y)
=> x + y}
• // Save the word count back out to a text file, causing evaluation.
• counts.saveAsTextFile(outputFile)
Resilient Distributed Dataset
• An RDD is simply an immutable distributed collection of
elements.
• Each RDD is split into multiple partitions, which may be
computed on different nodes of the cluster.
• Once created, RDDs offer two types of operations:
transformations and actions.
• Transformations construct a new RDD from a previous one.
• Actions, on the other hand, compute a result based on an
RDD, and either return it to the driver program or save it to
an external storage system.
• In Spark all work is expressed as either creating new RDDs,
transforming existing RDDs, or calling operations on RDDs
to compute a result.
Create RDDs
• Spark provides two ways to create RDDs:
loading an external dataset
val lines = sc.textFile("/path/to/README.md")
parallelizing a collection in your driver program.
The simplest way to create RDDs is to take an existing
collection in your program and pass it to SparkContext’s
parallelize() method.
Beyond prototyping and testing, this is not widely used
since it requires that you have your entire dataset in
memory on one machine.
val lines = sc.parallelize(List(1,2,3))
Lazy Evaluation
• Lazy evaluation means that when we call a transformation
on an RDD (for instance, calling map()), the operation is not
immediately performed. Instead, Spark internally records
metadata to indicate that this operation has been
requested.
• Rather than thinking of an RDD as containing specific data,
it is best to think of each RDD as consisting of instructions
on how to compute the data that we build up through
transformations. Loading data into an RDD is lazily
evaluated in the same way transformations are. So, when
we call sc.textFile(), the data is not loaded until it is
necessary.
• Spark uses lazy evaluation to reduce the number of passes
it has to take over our data by grouping operations
together.
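• As a minimal sketch (assuming a local README.md), nothing is read or computed until the action on the last line runs:
 val lines = sc.textFile("README.md") // nothing is read yet
 val sparkLines = lines.filter(line => line.contains("Spark")) // still nothing computed
 sparkLines.first() // only now does Spark read the file, and only as far as needed to find the first match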
RDD Operations(Transformations)
• Transformations are operations on RDDs that return a new RDD, such as
map() and filter().
• Transformed RDDs are computed lazily, only when you use them in an
action.
 val inputRDD = sc.textFile("log.txt")
 val errorsRDD = inputRDD.filter(line => line.contains("error"))
• Transformation operations do not mutate the existing inputRDD. Instead,
they return a pointer to an entirely new RDD.
• inputRDD can still be reused later in the program.
• Transformations can operate on any number of input RDDs; for example,
union() merges two RDDs.
• As you derive new RDDs from each other using transformations, Spark keeps
track of the set of dependencies between different RDDs, called the lineage
graph. It uses this information to compute each RDD on demand and to
recover lost data if part of a persistent RDD is lost.
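• Continuing the log example above (a minimal sketch), a second filter plus union(), with toDebugString printing the lineage Spark has recorded:
 val warningsRDD = inputRDD.filter(line => line.contains("warning"))
 val badLinesRDD = errorsRDD.union(warningsRDD)
 println(badLinesRDD.toDebugString) // shows the chain of dependencies back to log.txt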
Common Transformations
• Element-wise transformations
 MAP
The map() transformation takes in a function and applies it to each
element in the RDD with the result of the function being the new
value of each element in the resulting RDD.
 val input = sc.parallelize(List(1, 2, 3, 4))
 val result = input.map(x => x * x)
 println(result.collect().mkString(","))
 Filter
The filter() transformation takes in a function and returns an RDD
that only has elements that pass the filter() function (a short sketch follows at the end of this slide).
 FLATMAP
Sometimes we want to produce multiple output elements for each
input element. The operation to do this is called flatMap().
 val lines = sc.parallelize(List("hello world", "hi"))
 val words = lines.flatMap(line => line.split(" "))
 words.first() // returns "hello"
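• A minimal filter() sketch, mirroring the map() example above:
 val input = sc.parallelize(List(1, 2, 3, 4))
 val bigOnes = input.filter(x => x > 2)
 println(bigOnes.collect().mkString(",")) // 3,4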
Examples(Transformation)
• Create RDD / collect(Action)
 val seq = Seq(4,2,5,3,1,4)
 val rdd = sc.parallelize(seq)
 rdd.collect()
• Filter
 val filteredrdd = rdd.filter(x=>x>2)
• Distinct
 val rdddist = rdd.distinct()
 rdddist.collect()
• Map (square a number)
 val rddsquare = rdd.map(x=>x*x);
• Flatmap
 val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"),
("Hadoop Hadoop Hadoop")))
 val rdd2 = rdd1.flatMap(x => x.split(" "))
• Sample an RDD
 rdd.sample(false,0.5).collect()
Examples(Transformation) Cont…
• Create RDD
 val x = Seq(1,2,3)
 val y = Seq(3,4,5)
 val rdd1 = sc.parallelize(x)
 val rdd2 = sc.parallelize(y)
• Union
 rdd1.union(rdd2).collect()
• Intersection
 rdd1.intersection(rdd2).collect()
• subtract
 rdd1.subtract(rdd2).collect()
• Cartesian
 rdd1.cartesian(rdd2).collect()
RDD Operations(Actions)
• Actions are operations that return a result to the driver
program or write it to storage, and kick off a
computation, such as count() and first().
• Actions force the evaluation of the transformations
required for the RDD they were called on, since they
need to actually produce output.
 println("Input had " + badLinesRDD.count() + " concerning lines")
 println("Here are 10 examples:")
 badLinesRDD.take(10).foreach(println)
• It is important to note that each time we call a new
action, the entire RDD must be computed “from
scratch.”
Examples(Action)
• Reduce
 The most common action on basic RDDs you will likely use is reduce(), which
takes a function that operates on two elements of the type in your RDD and
returns a new element of the same type. A simple example of such a function
is +, which we can use to sum our RDD. With reduce(), we can easily sum the
elements of our RDD, count the number of elements, and perform other types
of aggregations.
 val seq = Seq(4,2,5,3,1,4)
 val rdd = sc.parallelize(seq)
 val sum = rdd.reduce((x,y)=> x+y)
• Fold
 Similar to reduce() is fold(), which also takes a function with the same
signature as needed for reduce(), but in addition takes a “zero value” to be
used for the initial call on each partition. The zero value you provide should be
the identity element for your operation;
 val seq = Seq(4,2,5,3,1,4)
 val rdd = sc.parallelize(seq,2)
 val rdd2 = rdd.fold(1)((x,y)=>x+y) // yields 22, not 19: the zero value 1 is applied once per partition and once more when merging, which is why an identity element (0 for +) is normally used
Examples(Action) cont.…
• Aggregate
val rdd = sc.parallelize(List(1,2,3,4,5,6),5)
val rdd2 = rdd.aggregate((0, 0))((acc, value) => (acc._1 + value, acc._2 + 1),
(p1, p2) => (p1._1 + p2._1, p1._2 + p2._2))
val avg = rdd2._1 / rdd2._2.toDouble // running sum and count, then the average
Examples(Action) cont.…
• takeOrdered(num)(ordering)
 Reverse Order (Highest number)
 val seq = Seq(3,9,2,3,5,4)
 val rdd = sc.parallelize(seq,2)
 rdd.takeOrdered(1)(Ordering[Int].reverse)
 Custom Order (Highest based on age)
 case class Person(name:String,age:Int)
 val rdd = sc.parallelize(Array(("x",10),("y",14),("z",12)))
 val rdd2 = rdd.map(x=>Person(x._1,x._2))
 rdd2.takeOrdered(1)(Ordering[Int].reverse.on(x=>x.age))
 (highest/lowest repeated word in word count program)
 val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"), ("Hadoop
Hadoop Hadoop")))
 val rdd2 = rdd1.flatMap(x => x.split(" ")).map(x => (x,1))
 val rdd3 = rdd2.reduceByKey((x,y) => (x+y))
 rdd3.takeOrdered(3)(Ordering[Int].on(x=>x._2)) // Lowest values
 rdd3.takeOrdered(3)(Ordering[Int].reverse.on(x=>x._2)) // Highest values
Persist(Cache) RDD
• Spark’s RDDs are by default recomputed each time you run an action on them. If
you would like to reuse an RDD in multiple actions, you can ask Spark to persist it
using RDD.persist().
• After computing it the first time, Spark will store the RDD contents in memory
(partitioned across the machines in your cluster), and reuse them in future actions.
Persisting RDDs on disk instead of memory is also possible.
• If a node that has data persisted on it fails, Spark will recompute the lost partitions
of the data when needed.
• We can also replicate our data on multiple nodes if we want to be able to handle
node failure without slowdown.
• If you attempt to cache too much data to fit in memory, Spark will automatically
evict old partitions using a Least Recently Used (LRU) cache policy. Caching
unnecessary data can lead to eviction of useful data and more recomputation
time.
• RDDs come with a method called unpersist() that lets you manually remove them
from the cache.
• cache() is the same as calling persist() with the default storage level.
Persistence levels
Note: cache() and persist() only mark an RDD for persistence; the data is actually stored
the first time an action computes that RDD (or an RDD derived from it). Until then,
nothing is cached.
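• A minimal persistence sketch (reusing the squared-numbers RDD from the earlier map() example; StorageLevel.MEMORY_ONLY is the level cache() uses by default):
 import org.apache.spark.storage.StorageLevel
 val input = sc.parallelize(List(1, 2, 3, 4))
 val result = input.map(x => x * x)
 result.persist(StorageLevel.MEMORY_ONLY) // only marks the RDD; nothing is computed yet
 println(result.count()) // first action computes the RDD and caches its partitions
 println(result.collect().mkString(",")) // served from the cache
 result.unpersist()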
Spark Summary
• To summarize, every Spark program and shell
session will work as follows:
Create some input RDDs from external data.
Transform them to define new RDDs using
transformations like filter().
Ask Spark to persist() any intermediate RDDs that will
need to be reused. cache() is the same as calling
persist() with the default storage level.
Launch actions such as count() and first() to kick off a
parallel computation, which is then optimized and
executed by Spark.
Pair RDD (Key/Value Pair)
• Spark provides special operations on RDDs containing
key/value pairs, called pair RDDs.
• Key/value RDDs are commonly used to perform
aggregations, and often we will do some initial ETL to get
our data into a key/value format.
• Key/value RDDs expose new operations (e.g., counting up
reviews for each product, grouping together data with the
same key, and grouping together two different RDDs)
• Creating Pair RDDs
 Some loading formats directly return the Pair RDDs.
 Using map function
 Scala
 val pairs = lines.map(x => (x.split(" ")(0), x))
Transformations on Pair RDDs
• Pair RDDs are allowed to use all the transformations
available to standard RDDs.
• Aggregations
When datasets are described in terms of key/value
pairs, it is common to want to aggregate statistics
across all elements with the same key. We have
looked at the fold(), aggregate(), and reduce() actions
on basic RDDs, and similar per-key transformations
exist on pair RDDs: Spark has a set of operations
that combine the values that share a key. These
operations return RDDs and thus are transformations
rather than actions.
Transformations...cont…
• reduceByKey()
 is quite similar to reduce(); both take a function and use it to combine values.
reduceByKey() runs several parallel reduce operations, one for each key in the
dataset, where each operation combines values that have the same key.
Because datasets can have very large numbers of keys, reduceByKey() is not
implemented as an action that returns a value to the user program. Instead, it
returns a new RDD consisting of each key and the reduced value for that key.
• Example Word count in Scala
 val input = sc.textFile("s3://...")
 val words = input.flatMap(x => x.split(" "))
 val result = words.map(x => (x, 1)).reduceByKey((acc, value) => acc + value)
• We can actually implement word count even faster by using the
countByValue() function on the first RDD: input.flatMap(x =>x.split("
")).countByValue().
Transformations...cont…
• Per-key average with reduceByKey() and mapValues() in Scala
 rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
(Figure: per-key average data flow)
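• Putting it together as a minimal sketch (the data is made up), a second mapValues() finishes the average:
 val rdd = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))
 val sumCount = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
 val avg = sumCount.mapValues{ case (sum, count) => sum.toDouble / count }
 avg.collect().foreach(println) // (panda,0.5), (pink,3.5), (pirate,3.0)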
Transformations...cont…
• foldByKey()
is quite similar to fold(); both use a zero value of
the same type of the data in our RDD and
combination function. As with fold(), the provided
zero value for foldByKey() should have no impact
when added with your combination function to
another element.
Those familiar with the combiner concept from MapReduce should note that
calling reduceByKey() and foldByKey() will automatically perform combining locally
on each machine before computing global totals for each key. The user does not
need to specify a combiner. The more general combineByKey() interface allows you
to customize combining behavior.
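• A minimal foldByKey() sketch (made-up data; 0 is the identity zero value for addition):
 val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
 pairs.foldByKey(0)((x, y) => x + y).collect() // Array((a,4), (b,2)) -- order may vary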
Transformations...cont…
• combineByKey()
 is the most general of the per-key aggregation functions. Most of the other
per-key combiners are implemented using it. Like aggregate(), combineBy
Key() allows the user to return values that are not the same type as our input
data.
 As combineByKey() goes through the elements in a partition, each element
either has a key it hasn’t seen before or has the same key as a previous
element.
 If it’s a new element, combineByKey() uses a function we provide, called
createCombiner(), to create the initial value for the accumulator on that key. It’s
important to note that this happens the first time a key is found in each partition, rather
than only the first time the key is found in the RDD.
 If it is a value we have seen before while processing that partition, it will instead use the
provided function, mergeValue(), with the current value for the accumulator for that key
and the new value.
 Since each partition is processed independently, we can have multiple
accumulators for the same key. When we are merging the results from each
partition, if two or more partitions have an accumulator for the same key we
merge the accumulators using the user-supplied mergeCombiners() function.
Transformations...cont…
• Per-key average using combineByKey() in Scala
 val result = input.combineByKey(
 (v) => (v, 1),
 (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
 (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
 ).map{ case (key, value) => (key, value._1 / value._2.toFloat) }
 result.collectAsMap().map(println(_))
Transformations...cont…
• GroupByKey()
 With keyed data a common use case is grouping our data by key—for
example, viewing all of a customer’s orders together.
 If our data is already keyed in the way we want, groupByKey() will
group our data using the key in our RDD. On an RDD consisting of keys
of type K and values of type V, we get back an RDD of type [K,
Iterable[V]].
 If you find yourself writing code where you groupByKey() and then use
a reduce() or fold() on the values, you can probably achieve the same
result more efficiently by using one of the per-key aggregation
functions. Rather than reducing the RDD to an in-memory value, we
reduce the data per key and get back an RDD with the reduced values
corresponding to each key. For example, rdd.reduceByKey(func)
produces the same RDD as rdd.groupByKey().mapValues(value =>
value.reduce(func)) but is more efficient as it avoids the step of
creating a list of values for each key.
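• A minimal sketch contrasting the two approaches (made-up data):
 val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
 pairs.groupByKey().mapValues(values => values.reduce(_ + _)).collect() // Array((a,4), (b,2))
 pairs.reduceByKey(_ + _).collect() // same result, without materializing the per-key value lists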
Transformations...cont…
• cogroup
 In addition to grouping data from a single RDD, we can group
data sharing the same key from multiple RDDs using a function
called cogroup(). cogroup() over two RDDs sharing the same key
type, K, with the respective value types V and W gives us back
RDD[(K, (Iterable[V], Iterable[W]))]. If one of the RDDs doesn’t
have elements for a given key that is present in the other RDD,
the corresponding Iterable is simply empty.
 cogroup() gives us the power to group data from multiple RDDs.
 cogroup() is used as a building block for the joins. However
cogroup() can be used for much more than just implementing
joins. We can also use it to implement intersect by key.
 Additionally, cogroup() can work on three or more RDDs at
once.
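• A minimal cogroup() sketch with made-up data (the Iterable implementation printed by the shell may vary):
 val addresses = sc.parallelize(List(("Philz", "748 Van Ness Ave"), ("Starbucks", "Seattle")))
 val ratings = sc.parallelize(List(("Philz", 4.8), ("Ritual", 4.9)))
 addresses.cogroup(ratings).collect()
 // e.g. (Philz,(CompactBuffer(748 Van Ness Ave),CompactBuffer(4.8))), (Starbucks,(CompactBuffer(Seattle),CompactBuffer())), (Ritual,(CompactBuffer(),CompactBuffer(4.9)))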
Transformations...cont…
• Joins
 Joining data together is probably one of the most common operations on a
pair RDD, and we have a full range of options including right and left outer
joins, cross joins, and inner joins.
• Scala shell inner join
 storeAddress = {(Store("Ritual"), "1026 Valencia St"), (Store("Philz"), "748 Van
Ness Ave"), (Store("Philz"), "3101 24th St"), (Store("Starbucks"), "Seattle")}
 storeRating = {(Store("Ritual"), 4.9), (Store("Philz"), 4.8))}
 storeAddress.join(storeRating) == {(Store("Ritual"), ("1026 Valencia St", 4.9)),
(Store("Philz"), ("748 Van Ness Ave", 4.8)), (Store("Philz"), ("3101 24th St",
4.8))}
• leftOuterJoin() and rightOuterJoin()
 storeAddress.leftOuterJoin(storeRating) == {(Store("Ritual"),("1026 Valencia
St",Some(4.9))), (Store("Starbucks"),("Seattle",None)), (Store("Philz"),("748
Van Ness Ave",Some(4.8))), (Store("Philz"),("3101 24th St",Some(4.8)))}
 storeAddress.rightOuterJoin(storeRating) == {(Store("Ritual"),(Some("1026
Valencia St"),4.9)), (Store("Philz"),(Some("748 Van Ness Ave"),4.8)),
(Store("Philz"), (Some("3101 24th St"),4.8))}
Tuning the level of parallelism
• When performing aggregations or grouping operations, we can ask Spark
to use a specific number of partitions. Spark will always try to infer a
sensible default value based on the size of your cluster, but in some cases
you will want to tune the level of parallelism for better performance.
 val data = Seq(("a", 3), ("b", 4), ("a", 1))
 sc.parallelize(data).reduceByKey((x, y) => x + y) // Default parallelism
 sc.parallelize(data).reduceByKey((x, y) => x + y,10) // Custom parallelism
• Repartitioning your data is a fairly expensive operation. Spark also has an
optimized version of repartition() function called coalesce() that allows
avoiding data movement, but only if you are decreasing the number of
RDD partitions.
• To know whether you can safely call coalesce(), you can check the size of
the RDD using rdd.partitions.size or rdd.getNumPartitions.
• To see the data in each partition of an RDD:
 scala> rdd.mapPartitionsWithIndex( (index: Int, it: Iterator[(Int,Int)]) =>
it.toList.map(x => index + ":" + x).iterator).collect
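• A minimal sketch of checking and reducing the partition count:
 val big = sc.parallelize(1 to 100, 10)
 big.getNumPartitions // 10
 val smaller = big.coalesce(2) // shrink to 2 partitions without a full shuffle
 smaller.getNumPartitions // 2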
Sorting Data
• Having sorted data is quite useful in many cases, especially when
you’re producing downstream output. We can sort an RDD with
key/value pairs provided that there is an ordering defined on the
key. Once we have sorted our data, any subsequent call on the
sorted data to collect() or save() will result in ordered data.
• Since we often want our RDDs in reverse order, the sortByKey()
function takes a boolean parameter called ascending indicating whether we
want the result in ascending order (it defaults to true).
• Sometimes we want a different sort order entirely, and to support
this we can provide our own comparison function.
 Custom ordering in Scala (here, ordering word-count pairs by their counts with takeOrdered())
 val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"),
("Hadoop Hadoop Hadoop")))
 val rdd2 = rdd1.flatMap(x => x.split(" ")).map(x => (x,1))
 val rdd3 = rdd2.reduceByKey((x,y) => (x+y))
 rdd3.takeOrdered(3)(Ordering[Int].on(x=>x._2))
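• A minimal sortByKey() sketch on the word-count RDD above (the local implicit Ordering overriding the default String ordering is our own example of a custom comparison):
 rdd3.sortByKey().collect() // words in ascending (alphabetical) order
 rdd3.sortByKey(ascending = false).collect() // descending order
 implicit val caseInsensitive = new Ordering[String] {
 override def compare(a: String, b: String) = a.toLowerCase.compareTo(b.toLowerCase)
 }
 rdd3.sortByKey().collect() // now sorted ignoring case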
Actions available on Pair RDDs
• As with the transformations, all of the traditional actions available on the
base RDD are also available on pair RDDs. Some additional actions are
available on pair RDDs to take advantage of the key/value nature of the
data.
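• A minimal sketch of three such actions with made-up data:
 val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
 pairs.countByKey() // Map(a -> 2, b -> 1): number of elements per key
 pairs.collectAsMap() // one value per key; which duplicate survives is not guaranteed
 pairs.lookup("a") // Seq(1, 3): all values associated with the key "a"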
Accumulators
• Accumulators, one of Spark's shared variable types, provide a simple syntax for aggregating values
from worker nodes back to the driver program.
• One of the most common uses of accumulators is to count events that occur
during job execution for debugging purposes.
 val sc = new SparkContext(...)
 val file = sc.textFile("file.txt")
 val blankLines = sc.accumulator(0) // Create Accumulator[Int] initialized to 0
 val callSigns = file.flatMap(line => {
if (line == "") {
blankLines += 1 // Add to the accumulator
}
line.split(" ")})
 callSigns.saveAsTextFile("output.txt")
 println("Blank lines: " + blankLines.value)
 Note that we will see the right count only after we run the saveAsTextFile() action, because the
transformation above it, flatMap(), is lazy, so the side effect of incrementing the accumulator will
happen only when the flatMap() transformation is forced to occur by the saveAsTextFile() action.
 Also note that tasks on worker nodes cannot access the accumulator’s value()—from the point of view
of these tasks, accumulators are write-only variables. This allows accumulators to be implemented
efficiently, without having to communicate every update.
Accumulators cont…
• Accumulators and Fault Tolerance
 Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. For
example, if the node running a partition of a map() operation crashes, Spark will rerun it on
another node; and even if the node does not crash but is simply much slower than other
nodes, Spark can preemptively launch a “speculative” copy of the task on another node, and
take its result if that finishes. Even if no nodes fail, Spark may have to rerun a task to rebuild a
cached value that falls out of memory. The net result is therefore that the same function may
run multiple times on the same data depending on what happens on the cluster.
 The end result is that for accumulators used in actions, Spark applies each task’s update to
each accumulator only once. Thus, if we want a reliable absolute value counter, regardless of
failures or multiple evaluations, we must put it inside an action like foreach().
• Custom Accumulators
 Spark supports accumulators of type Int, Double, Long, and Float.
 Spark also includes an API to define custom accumulator types and custom aggregation
operations.
 Custom accumulators need to extend AccumulatorParam.
 we can use any operation for add, provided that operation is commutative and associative.
 An operation op is commutative if a op b = b op a for all values a, b.
 An operation op is associative if (a op b) op c = a op (b op c) for all values a, b, and c.
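• A minimal custom-accumulator sketch using the legacy AccumulatorParam API this deck targets (tracking a maximum is our own choice of operation; max is both commutative and associative):
 import org.apache.spark.AccumulatorParam
 object MaxParam extends AccumulatorParam[Int] {
 def zero(initialValue: Int): Int = Int.MinValue
 def addInPlace(a: Int, b: Int): Int = math.max(a, b)
 }
 val maxSeen = sc.accumulator(Int.MinValue)(MaxParam)
 sc.parallelize(List(3, 9, 2)).foreach(x => maxSeen += x)
 println(maxSeen.value) // 9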
Broadcast Variables
• Broadcast variables, the other type of shared variable, allow the program to efficiently send a large, read-only
value to all the worker nodes for use in one or more Spark operations.
 It is expensive to send a large array from the driver with each task, and if we used the same
object later (for example, running the same code on other files), it would be sent to each node again.
 val c = sc.broadcast(Array(100,200))
 val a = sc.parallelize(List("This is Krishna", "Sureka learning the Spark"))
 val d = a.flatMap(x=>x.split(" ")).map(x=>x+c.value(1))
• Optimizing Broadcasts
 When we are broadcasting large values, it is important to choose a data serialization format
that is both fast and compact.
Working on a Per-Partition Basis
 Working with data on a per-partition basis allows us to avoid redoing
setup work for each data item. Operations like opening a database
connection or creating a random number generator are examples of
setup steps that we wish to avoid doing for each element.
 Spark has per-partition versions of map and foreach to help reduce the
cost of these operations by letting you run code only once for each
partition of an RDD.
 Example-1 (mapPartitions)
 val a = sc.parallelize(List(1,2,3,4,5,6),2)
 scala> val b = a.mapPartitions((x:Iterator[Int])=>{println("Hello"); x.toList.map(y=>y+1).iterator})
 scala> val b = a.mapPartitions((x:Iterator[Int])=>{println("Hello"); x.map(y=>y+1)})
 scala> b.collect
 Hello
 Hello
 res20: Array[Int] = Array(2, 3, 4, 5, 6, 7)
 scala> a.mapPartitions(x=>x.filter(y => y > 1)).collect
 res35: Array[Int] = Array(2, 3, 4, 5, 6)
Working on a Per-Partition Basis
 Example-2 (mapPartitionsWithIndex)
 scala> val b = a.mapPartitionsWithIndex((index:Int,x:Iterator[Int])=>{println("Hello from "+ index); if(index==1){x.map(y=>y+1)}else{(x.map(y=>y+200))}})
 b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[28] at mapPartitionsWithIndex at <console>:29
 scala> val b = a.mapPartitionsWithIndex((index,x)=>{println("Hello from "+ index); if(index==1){x.map(y=>y+1)}else{(x.map(y=>y+200))}})
 scala> b.collect
 Hello from 0
 Hello from 1
 res44: Array[Int] = Array(201, 202, 203, 5, 6, 7)
Practice Session
SPARK SQL
• scala> case class person(id:Int,name:String,sal:Int)
• defined class person
•
• scala> person
• res38: person.type = person
•
• scala> val a = sc.textFile("file:///home/vm4learning/Desktop/spark/emp")
• a: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[36] at textFile at
<console>:27
•
• scala> a.collect
• res39: Array[String] = Array(837798 Krishna 5000000, 843123 Sandipta
10000000, 844066 Abhishek 40000, 843289 ZohriSahab 9999999)
•
• scala> val b = a.map(_.split(" "))
• b: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[37] at map at <console>:29
•
• scala> b.collect
• res40: Array[Array[String]] = Array(Array(837798, Krishna, 5000000), Array(843123, Sandipta,
10000000), Array(844066, Abhishek, 40000), Array(843289, ZohriSahab, 9999999))
•
• scala> val c = b.map (x=>person(x(0).toInt,x(1),x(2).toInt))
• c: org.apache.spark.rdd.RDD[person] = MapPartitionsRDD[38] at map at <console>:33
•
• scala> c.collect
• res41: Array[person] = Array(person(837798,Krishna,5000000),
person(843123,Sandipta,10000000), person(844066,Abhishek,40000),
person(843289,ZohriSahab,9999999))
•
• scala> val d = c.toDF()
• d: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int]
•
• scala> d.collect
• res42: Array[org.apache.spark.sql.Row] = Array([837798,Krishna,5000000],
[843123,Sandipta,10000000], [844066,Abhishek,40000], [843289,ZohriSahab,9999999])
•
• scala> d.registerTempTable("Emp")
•
• scala> val e = sqlContext.sql("""select id,name,sal from Emp where sal < 9999999""")
• e: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int]
•
• scala> e.collect
• res44: Array[org.apache.spark.sql.Row] = Array([837798,Krishna,5000000],
[844066,Abhishek,40000])
•
• scala> e
• res46: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int]
•
• scala> val a1 = sc.textFile("file:///home/vm4learning/Desktop/spark/dept")
• a1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile at <console>:27
•
• scala> a1.collect
• res49: Array[String] = Array(843123 CEO, 844066 SparkCommiter, 843289
BoardHSBC)
•
• scala> case class pdept(id:Int,dept:String)
• defined class pdept
•
• scala> val a2 = a1.map(x=>x.split("
")).map(x=>pdept(x(0).toInt,x(1))).toDF()
• a2: org.apache.spark.sql.DataFrame = [id: int, dept: string]
•
• scala> a2.collect
• res50: Array[org.apache.spark.sql.Row] = Array([843123,CEO],
[844066,SparkCommiter], [843289,BoardHSBC])
•
• scala> a2.registerTempTable("Dept")
•
• scala> val e = sqlContext.sql("""select * from Emp join Dept on
Emp.id = Dept.id where sal < 9999999""")
• e: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int,
id: int, dept: string]
•
• scala> e.collect
• res52: Array[org.apache.spark.sql.Row] =
Array([844066,Abhishek,40000,844066,SparkCommiter])
•
• scala> sqlContext.cacheTable("Dept")
•
• scala> sqlContext.uncacheTable("Dept")
Thank You
• Question?
• Feedback?
explorehadoop@gmail.com
More Related Content

What's hot

Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 

What's hot (20)

Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
6.hive
6.hive6.hive
6.hive
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Spark
SparkSpark
Spark
 
Apache spark
Apache sparkApache spark
Apache spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 

Similar to Spark core

Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxAishg4
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to SparkKyle Burke
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 

Similar to Spark core (20)

Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Intro to Spark
Intro to SparkIntro to Spark
Intro to Spark
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 

More from Prashant Gupta

More from Prashant Gupta (9)

Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Sqoop
SqoopSqoop
Sqoop
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Mongodb - NoSql Database
Mongodb - NoSql DatabaseMongodb - NoSql Database
Mongodb - NoSql Database
 
Sonar Tool - JAVA code analysis
Sonar Tool - JAVA code analysisSonar Tool - JAVA code analysis
Sonar Tool - JAVA code analysis
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Spark core

  • 1. What Is Apache Spark? Apache Spark is a cluster computing platform designed to be fast and general purpose. On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. On the generality side, Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming. Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries. Spark itself is written in Scala, and runs on the Java Virtual Machine (JVM).
  • 3. The Spark stack  Spark Core  Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs).  Spark SQL  Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL, HQL.  Spark Streaming  Spark Streaming is a Spark component that enables processing of live streams of data.  MLlib  Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.  GraphX  GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations.  Cluster Managers  Under the hood, Spark is designed to efficiently scale up from one to many thousands of compute nodes. To achieve this while maximizing flexibility, Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler.
  • 4. Storage Layers for Spark • Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by the Hadoop APIs (including your local file system, Amazon S3, Cassandra, Hive, HBase, etc.). • It’s important to remember that Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. • Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat.
  • 5. Installing Spark • Download and extract  Download a compressed TAR file, or tar ball.  You don’t need to have Hadoop, but if you have an existing Hadoop cluster or HDFS installation, download the matching version.  Extract the tar.  Update the bashrc file • Spark directory Contents.  README.md  Contains short instructions for getting started with Spark.  Bin  Contains executable files that can be used to interact with Spark in various ways like shell.  core, streaming, python, …  Contains the source code of major components of the Spark project.  examples  Contains some helpful Spark standalone jobs that you can look at and run to learn about the Spark API.
  • 6. Core Spark Concepts • Every Spark application consists of a driver program that launches various parallel operations on a cluster. The driver program contains your application’s main function and defines distributed datasets on the cluster, then applies operations to them.  Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.  Once you have a SparkContext, you can use it to build RDDs.  To run operations on RDD, driver programs typically manage a number of nodes called executors.
  • 7. Spark’s Python and Scala Shells • Spark comes with interactive shells that enable ad hoc data analysis. • Unlike most other shells, however, which let you manipulate data using the disk and memory on a single machine, Spark’s shells allow you to interact with data that is distributed on disk or in memory across many machines, and Spark takes care of automatically distributing this processing. • bin/pyspark and bin/spark-shell to open the respective shell. • When the shell starts, you will notice a lot of log messages. It can be controlled by changing log4j.rootCategory=INFO in conf/log4j.properties.
  • 8. StandaloneApplications • The main difference from using it in the shell is that you need to initialize your own SparkContext. After that, the API is the same. • Initializing Spark in Scala  import org.apache.spark.SparkConf  import org.apache.spark.SparkContext  import org.apache.spark.SparkContext._  val conf = new SparkConf().setMaster("local").setAppName(“WordCount")  val sc = new SparkContext(conf) • Once we have our build defined, we can easily package and run our application using the bin/spark-submit script. • The spark-submit script sets up a number of environment variables used by Spark.  Maven build and run (mvn clean && mvn compile && mvn package)  $SPARK_HOME/bin/spark-submit  --class com.oreilly.learningsparkexamples.mini.java.WordCount  ./target/learning-spark-mini-example-0.0.1.jar  ./README.md ./wordcounts
  • 9. Word count Scala application
  • // Create a Scala Spark Context.
  • val conf = new SparkConf().setAppName("wordCount")
  • val sc = new SparkContext(conf)
  • // Load our input data.
  • val input = sc.textFile(inputFile)
  • // Split it up into words.
  • val words = input.flatMap(line => line.split(" "))
  • // Transform into pairs and count.
  • val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
  • // Save the word count back out to a text file, causing evaluation.
  • counts.saveAsTextFile(outputFile)
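  Wrapped in a standalone application, the snippet above might look like this sketch (the object name and the assumption that inputFile and outputFile come from command-line arguments are illustrative):

    // WordCount.scala -- a minimal standalone version of the snippet above (sketch)
    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val inputFile = args(0)    // e.g. ./README.md
        val outputFile = args(1)   // e.g. ./wordcounts
        val conf = new SparkConf().setAppName("wordCount")
        val sc = new SparkContext(conf)
        val input = sc.textFile(inputFile)
        val words = input.flatMap(line => line.split(" "))
        val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
        counts.saveAsTextFile(outputFile)
        sc.stop()
      }
    }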
  • 10. Resilient Distributed Dataset • An RDD is simply an immutable distributed collection of elements. • Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. • Once created, RDDs offer two types of operations: transformations and actions. • Transformations construct a new RDD from a previous one. • Actions, on the other hand, compute a result based on an RDD, and either return it to the driver program or save it to an external storage system. • In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.
  • 11. Create RDDs • Spark provides two ways to create RDDs:
   Loading an external dataset
   val lines = sc.textFile("/path/to/README.md")
   Parallelizing a collection in your driver program. The simplest way to create RDDs is to take an existing collection in your program and pass it to SparkContext’s parallelize() method. Beyond prototyping and testing, this is not widely used, since it requires that you have your entire dataset in memory on one machine.
   val lines = sc.parallelize(List(1, 2, 3))
  • 12. Lazy Evaluation • Lazy evaluation means that when we call a transformation on an RDD (for instance, calling map()), the operation is not immediately performed. Instead, Spark internally records metadata to indicate that this operation has been requested. • Rather than thinking of an RDD as containing specific data, it is best to think of each RDD as consisting of instructions on how to compute the data that we build up through transformations. Loading data into an RDD is lazily evaluated in the same way transformations are. So, when we call sc.textFile(), the data is not loaded until it is necessary. • Spark uses lazy evaluation to reduce the number of passes it has to take over our data by grouping operations together.
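  A minimal sketch of this laziness (the input values are made up for illustration):

    val data = sc.parallelize(List(1, 2, 3, 4))
    val doubled = data.map(_ * 2)   // only recorded as metadata; nothing is computed yet
    doubled.collect()               // the action triggers the actual computation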
  • 13. RDD Operations (Transformations) • Transformations are operations on RDDs that return a new RDD, such as map() and filter(). • Transformed RDDs are computed lazily, only when you use them in an action.
   val inputRDD = sc.textFile("log.txt")
   val errorsRDD = inputRDD.filter(line => line.contains("error"))
  • A transformation does not mutate the existing inputRDD. Instead, it returns a pointer to an entirely new RDD, and inputRDD can still be reused later in the program. • Transformations can operate on any number of input RDDs; for example, union() merges two RDDs. • As you derive new RDDs from each other using transformations, Spark keeps track of the set of dependencies between different RDDs, called the lineage graph. It uses this information to compute each RDD on demand and to recover lost data if part of a persistent RDD is lost.
  • 14. Common Transformations • Element-wise transformations
   map() takes in a function and applies it to each element in the RDD, with the result of the function becoming the new value of each element in the resulting RDD.
   val input = sc.parallelize(List(1, 2, 3, 4))
   val result = input.map(x => x * x)
   println(result.collect().mkString(","))
   filter() takes in a function and returns an RDD that only has elements that pass the filter() function.
   flatMap() is used when we want to produce multiple output elements for each input element.
   val lines = sc.parallelize(List("hello world", "hi"))
   val words = lines.flatMap(line => line.split(" "))
   words.first() // returns "hello"
  • 16. Examples (Transformation)
  • Create an RDD / collect (action)
   val seq = Seq(4, 2, 5, 3, 1, 4)
   val rdd = sc.parallelize(seq)
   rdd.collect()
  • Filter
   val filteredrdd = rdd.filter(x => x > 2)
  • Distinct
   val rdddist = rdd.distinct()
   rdddist.collect()
  • Map (square each number)
   val rddsquare = rdd.map(x => x * x)
  • FlatMap
   val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"), ("Hadoop Hadoop Hadoop")))
   val rdd2 = rdd1.flatMap(x => x.split(" "))
  • Sample an RDD
   rdd.sample(false, 0.5).collect()
  • 17. Examples (Transformation) cont…
  • Create RDDs
   val x = Seq(1, 2, 3)
   val y = Seq(3, 4, 5)
   val rdd1 = sc.parallelize(x)
   val rdd2 = sc.parallelize(y)
  • Union
   rdd1.union(rdd2).collect()
  • Intersection
   rdd1.intersection(rdd2).collect()
  • Subtract
   rdd1.subtract(rdd2).collect()
  • Cartesian
   rdd1.cartesian(rdd2).collect()
  • 18. RDD Operations(Actions) • Actions are operations that return a result to the driver program or write it to storage, and kick off a computation, such as count() and first(). • Actions force the evaluation of the transformations required for the RDD they were called on, since they need to actually produce output.  println("Input had " + badLinesRDD.count() + " concerning lines")  println("Here are 10 examples:")  badLinesRDD.take(10).foreach(println) • It is important to note that each time we call a new action, the entire RDD must be computed “from scratch.”
  • 20. Examples (Action)
  • Reduce
   The most common action on basic RDDs is reduce(), which takes a function that operates on two elements of the type in your RDD and returns a new element of the same type. A simple example of such a function is +, which we can use to sum our RDD. With reduce(), we can easily sum the elements of our RDD, count the number of elements, and perform other types of aggregations.
   val seq = Seq(4, 2, 5, 3, 1, 4)
   val rdd = sc.parallelize(seq)
   val sum = rdd.reduce((x, y) => x + y)
  • Fold
   Similar to reduce() is fold(), which takes a function with the same signature as reduce() but, in addition, takes a “zero value” to be used for the initial call on each partition. The zero value you provide should be the identity element for your operation (for example, 0 for addition).
   val seq = Seq(4, 2, 5, 3, 1, 4)
   val rdd = sc.parallelize(seq, 2)
   val sum = rdd.fold(0)((x, y) => x + y)
  • 21. Examples (Action) cont…
  • Aggregate (per-partition sum and count, combined to compute an average)
   val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6), 5)
   val rdd2 = rdd.aggregate((0, 0))((acc, v) => (acc._1 + v, acc._2 + 1), (p1, p2) => (p1._1 + p2._1, p1._2 + p2._2))
   val avg = rdd2._1 / rdd2._2.toDouble
  • 22. Examples (Action) cont…
  • takeOrdered(num)(ordering)
   Reverse order (highest number)
   val seq = Seq(3, 9, 2, 3, 5, 4)
   val rdd = sc.parallelize(seq, 2)
   rdd.takeOrdered(1)(Ordering[Int].reverse)
   Custom order (highest based on age)
   case class Person(name: String, age: Int)
   val rdd = sc.parallelize(Array(("x", 10), ("y", 14), ("z", 12)))
   val rdd2 = rdd.map(x => Person(x._1, x._2))
   rdd2.takeOrdered(1)(Ordering[Int].reverse.on(x => x.age))
   Highest/lowest repeated word in the word count program
   val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"), ("Hadoop Hadoop Hadoop")))
   val rdd2 = rdd1.flatMap(x => x.split(" ")).map(x => (x, 1))
   val rdd3 = rdd2.reduceByKey((x, y) => (x + y))
   rdd3.takeOrdered(3)(Ordering[Int].on(x => x._2)) // Lowest counts
   rdd3.takeOrdered(3)(Ordering[Int].reverse.on(x => x._2)) // Highest counts
  • 23. Persist(Cache) RDD • Spark’s RDDs are by default recomputed each time you run an action on them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using RDD.persist(). • After computing it the first time, Spark will store the RDD contents in memory (partitioned across the machines in your cluster), and reuse them in future actions. Persisting RDDs on disk instead of memory is also possible. • If a node that has data persisted on it fails, Spark will recompute the lost partitions of the data when needed. • We can also replicate our data on multiple nodes if we want to be able to handle node failure without slowdown. • If you attempt to cache too much data to fit in memory, Spark will automatically evict old partitions using a Least Recently Used (LRU) cache policy. Caching unnecessary data can lead to eviction of useful data and more recomputation time. • RDDs come with a method called unpersist() that lets you manually remove them from the cache. • cache() is the same as calling persist() with the default storage level.
  • 24. Persistence levels • Note: cache() and persist() only mark an RDD for storage; the data is actually cached the first time an action causes the RDD (or an RDD derived from it) to be computed. If the RDD was already computed before persist() was called, it will only be cached once it is recomputed.
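  A minimal sketch of persisting an RDD with an explicit storage level (the input values are illustrative):

    import org.apache.spark.storage.StorageLevel

    val input = sc.parallelize(List(1, 2, 3, 4, 5))
    val result = input.map(x => x * x)
    result.persist(StorageLevel.MEMORY_AND_DISK)   // mark for caching; nothing is stored yet
    println(result.count())                        // first action computes and caches the RDD
    println(result.collect().mkString(","))        // reuses the cached partitions
    result.unpersist()                             // manually remove it from the cache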
  • 25. Spark Summary • To summarize, every Spark program and shell session will work as follows: Create some input RDDs from external data. Transform them to define new RDDs using transformations like filter(). Ask Spark to persist() any intermediate RDDs that will need to be reused. cache() is the same as calling persist() with the default storage level. Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.
  • 26. Pair RDD (Key/Value Pair) • Spark provides special operations on RDDs containing key/value pairs, called pair RDDs. • Key/value RDDs are commonly used to perform aggregations, and often we will do some initial ETL to get our data into a key/value format. • Key/value RDDs expose new operations (e.g., counting up reviews for each product, grouping together data with the same key, and grouping together two different RDDs) • Creating Pair RDDs  Some loading formats directly return the Pair RDDs.  Using map function  Scala  val pairs = lines.map(x => (x.split(" ")(0), x))
  • 27. Transformations on Pair RDDs • Pair RDDs are allowed to use all the transformations available to standard RDDs. • Aggregations: when datasets are described in terms of key/value pairs, it is common to want to aggregate statistics across all elements with the same key. We have looked at the reduce(), fold(), and aggregate() operations on basic RDDs, and similar per-key transformations exist on pair RDDs. Spark has a similar set of operations that combine values that have the same key. These operations return RDDs and thus are called transformations rather than actions.
  • 28. Transformations...cont…
  • reduceByKey()
   is quite similar to reduce(); both take a function and use it to combine values. reduceByKey() runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key. Because datasets can have very large numbers of keys, reduceByKey() is not implemented as an action that returns a value to the user program. Instead, it returns a new RDD consisting of each key and the reduced value for that key.
  • Example: word count in Scala
   val input = sc.textFile("s3://...")
   val words = input.flatMap(x => x.split(" "))
   val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
  • We can actually implement word count even faster by using the countByValue() function on the first RDD: input.flatMap(x => x.split(" ")).countByValue().
  • 29. Transformations...cont… Per-key average with reduceByKey() and mapValues() in Scala – rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) Per-key average data flow
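  Completing the example, the per-key sums and counts can be turned into averages with a second mapValues(); the input data below is made up for illustration:

    val rdd = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))
    // (sum, count) per key
    val sumCount = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
    // divide sum by count to get the average for each key
    val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
    averages.collect().foreach(println)   // e.g. (panda,0.5), (pink,3.5), (pirate,3.0)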
  • 30. Transformations...cont… • foldByKey() is quite similar to fold(); both use a zero value of the same type of the data in our RDD and combination function. As with fold(), the provided zero value for foldByKey() should have no impact when added with your combination function to another element. Those familiar with the combiner concept from MapReduce should note that calling reduceByKey() and foldByKey() will automatically perform combining locally on each machine before computing global totals for each key. The user does not need to specify a combiner. The more general combineByKey() interface allows you to customize combining behavior.
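  A small foldByKey() sketch with made-up data:

    val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
    // 0 is the identity element for addition, so it does not affect the result
    val sums = pairs.foldByKey(0)((x, y) => x + y)
    sums.collect()   // Array((a,4), (b,2)) (order may vary)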
  • 31. Transformations...cont… • combineByKey()
   is the most general of the per-key aggregation functions. Most of the other per-key combiners are implemented using it. Like aggregate(), combineByKey() allows the user to return values that are not the same type as our input data.
   As combineByKey() goes through the elements in a partition, each element either has a key it hasn’t seen before or has the same key as a previous element.
   If it’s a new element, combineByKey() uses a function we provide, called createCombiner(), to create the initial value for the accumulator on that key. It’s important to note that this happens the first time a key is found in each partition, rather than only the first time the key is found in the RDD.
   If it is a value we have seen before while processing that partition, it will instead use the provided function, mergeValue(), with the current value for the accumulator for that key and the new value.
   Since each partition is processed independently, we can have multiple accumulators for the same key. When we are merging the results from each partition, if two or more partitions have an accumulator for the same key we merge the accumulators using the user-supplied mergeCombiners() function.
  • 32. Transformations...cont… • Per-key average using combineByKey() in Scala
   val result = input.combineByKey(
     (v) => (v, 1),
     (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
     (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
   ).map{ case (key, value) => (key, value._1 / value._2.toFloat) }
   result.collectAsMap().map(println(_))
  • 33. Transformations...cont… • groupByKey()
   With keyed data a common use case is grouping our data by key—for example, viewing all of a customer’s orders together.
   If our data is already keyed in the way we want, groupByKey() will group our data using the key in our RDD. On an RDD consisting of keys of type K and values of type V, we get back an RDD of type [K, Iterable[V]].
   If you find yourself writing code where you groupByKey() and then use a reduce() or fold() on the values, you can probably achieve the same result more efficiently by using one of the per-key aggregation functions. Rather than reducing the RDD to an in-memory value, we reduce the data per key and get back an RDD with the reduced values corresponding to each key. For example, rdd.reduceByKey(func) produces the same RDD as rdd.groupByKey().mapValues(value => value.reduce(func)) but is more efficient, as it avoids the step of creating a list of values for each key.
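  A sketch of that equivalence with made-up data:

    val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
    val grouped = pairs.groupByKey()                               // RDD[(String, Iterable[Int])]
    val summedViaGroup = grouped.mapValues(values => values.reduce(_ + _))
    val summedDirectly = pairs.reduceByKey(_ + _)                  // same result, more efficient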
  • 34. Transformations...cont… • cogroup  In addition to grouping data from a single RDD, we can group data sharing the same key from multiple RDDs using a function called cogroup(). cogroup() over two RDDs sharing the same key type, K, with the respective value types V and W gives us back RDD[(K, (Iterable[V], Iterable[W]))]. If one of the RDDs doesn’t have elements for a given key that is present in the other RDD, the corresponding Iterable is simply empty.  cogroup() gives us the power to group data from multiple RDDs.  cogroup() is used as a building block for the joins. However cogroup() can be used for much more than just implementing joins. We can also use it to implement intersect by key.  Additionally, cogroup() can work on three or more RDDs at once.
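  A small cogroup() sketch (the store data is illustrative and mirrors the join examples on the next slide):

    val storeAddress = sc.parallelize(List(("Ritual", "1026 Valencia St"), ("Philz", "748 Van Ness Ave")))
    val storeRating = sc.parallelize(List(("Ritual", 4.9), ("Starbucks", 4.1)))
    // RDD[(String, (Iterable[String], Iterable[Double]))]; keys missing from one side give empty Iterables
    storeAddress.cogroup(storeRating).collect().foreach(println)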
  • 35. Transformations...cont… • Joins
   Joining data together is probably one of the most common operations on a pair RDD, and we have a full range of options including inner joins, left and right outer joins, and cross joins.
  • Scala shell inner join
   storeAddress = {(Store("Ritual"), "1026 Valencia St"), (Store("Philz"), "748 Van Ness Ave"), (Store("Philz"), "3101 24th St"), (Store("Starbucks"), "Seattle")}
   storeRating = {(Store("Ritual"), 4.9), (Store("Philz"), 4.8)}
   storeAddress.join(storeRating) == {(Store("Ritual"), ("1026 Valencia St", 4.9)), (Store("Philz"), ("748 Van Ness Ave", 4.8)), (Store("Philz"), ("3101 24th St", 4.8))}
  • leftOuterJoin() and rightOuterJoin()
   storeAddress.leftOuterJoin(storeRating) == {(Store("Ritual"),("1026 Valencia St",Some(4.9))), (Store("Starbucks"),("Seattle",None)), (Store("Philz"),("748 Van Ness Ave",Some(4.8))), (Store("Philz"),("3101 24th St",Some(4.8)))}
   storeAddress.rightOuterJoin(storeRating) == {(Store("Ritual"),(Some("1026 Valencia St"),4.9)), (Store("Philz"),(Some("748 Van Ness Ave"),4.8)), (Store("Philz"), (Some("3101 24th St"),4.8))}
  • 38. Tuning the level of parallelism
  • When performing aggregations or grouping operations, we can ask Spark to use a specific number of partitions. Spark will always try to infer a sensible default value based on the size of your cluster, but in some cases you will want to tune the level of parallelism for better performance.
   val data = Seq(("a", 3), ("b", 4), ("a", 1))
   sc.parallelize(data).reduceByKey((x, y) => x + y) // Default parallelism
   sc.parallelize(data).reduceByKey((x, y) => x + y, 10) // Custom parallelism
  • Repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that allows avoiding data movement, but only if you are decreasing the number of RDD partitions.
  • To know whether you can safely call coalesce(), you can check the current number of partitions using rdd.partitions.size or rdd.getNumPartitions.
  • To see the data in each partition of an RDD:
   scala> rdd.mapPartitionsWithIndex((index: Int, it: Iterator[(Int, Int)]) => it.toList.map(x => index + ":" + x).iterator).collect
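  A sketch of checking and changing the number of partitions (the sizes are made up):

    val rdd = sc.parallelize(1 to 100, 10)
    println(rdd.getNumPartitions)     // 10
    val more = rdd.repartition(20)    // full shuffle; can increase or decrease partitions
    val fewer = rdd.coalesce(5)       // avoids a full shuffle; only for decreasing partitions
    println(fewer.getNumPartitions)   // 5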
  • 39. Sorting Data
  • Having sorted data is quite useful in many cases, especially when you’re producing downstream output. We can sort an RDD with key/value pairs provided that there is an ordering defined on the key. Once we have sorted our data, any subsequent call on the sorted data to collect() or save() will result in ordered data.
  • Since we often want our RDDs in the reverse order, sortByKey() takes a boolean parameter, ascending, indicating whether we want the result in ascending order (it defaults to true).
  • Sometimes we want a different sort order entirely, and to support this we can provide our own comparison function (see the sortByKey() sketch below). The takeOrdered() example from earlier also accepts a custom Ordering:
   val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"), ("Hadoop Hadoop Hadoop")))
   val rdd2 = rdd1.flatMap(x => x.split(" ")).map(x => (x, 1))
   val rdd3 = rdd2.reduceByKey((x, y) => (x + y))
   rdd3.takeOrdered(3)(Ordering[Int].on(x => x._2))
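  A sketch of sortByKey() with a custom comparison, sorting integer keys as if they were strings (sortByKey() picks up the implicit Ordering in scope; the sample data is made up):

    val rdd = sc.parallelize(List((3, "c"), (1, "a"), (20, "t")))
    // Compare keys by their string representation instead of their numeric value
    implicit val sortIntegersByString = new Ordering[Int] {
      override def compare(a: Int, b: Int): Int = a.toString.compare(b.toString)
    }
    rdd.sortByKey().collect()   // Array((1,a), (20,t), (3,c)) because "20" < "3" as strings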
  • 40. Actions available on Pair RDDs • As with the transformations, all of the traditional actions available on the base RDD are also available on pair RDDs. Some additional actions are available on pair RDDs to take advantage of the key/value nature of the data.
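  A few of these pair-RDD actions in use (a sketch with made-up data):

    val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
    pairs.countByKey()     // Map(a -> 2, b -> 1): number of elements per key
    pairs.collectAsMap()   // collect the RDD as a map (only one value is kept per key)
    pairs.lookup("a")      // Seq(1, 3): all values associated with the key "a"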
  • 41. Accumulators
  • The shared variable accumulator provides a simple syntax for aggregating values from worker nodes back to the driver program.
  • One of the most common uses of accumulators is to count events that occur during job execution for debugging purposes.
   val sc = new SparkContext(...)
   val file = sc.textFile("file.txt")
   val blankLines = sc.accumulator(0) // Create an Accumulator[Int] initialized to 0
   val callSigns = file.flatMap(line => {
     if (line == "") {
       blankLines += 1 // Add to the accumulator
     }
     line.split(" ")
   })
   callSigns.saveAsTextFile("output.txt")
   println("Blank lines: " + blankLines.value)
   Note that we will see the right count only after we run the saveAsTextFile() action, because the transformation above it, flatMap(), is lazy; the side effect of incrementing the accumulator happens only when the flatMap() transformation is forced to occur by the saveAsTextFile() action.
   Also note that tasks on worker nodes cannot access the accumulator’s value(); from the point of view of these tasks, accumulators are write-only variables. This allows accumulators to be implemented efficiently, without having to communicate every update.
  • 42. Accumulators cont… • Accumulators and Fault Tolerance  Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. For example, if the node running a partition of a map() operation crashes, Spark will rerun it on another node; and even if the node does not crash but is simply much slower than other nodes, Spark can preemptively launch a “speculative” copy of the task on another node, and take its result if that finishes. Even if no nodes fail, Spark may have to rerun a task to rebuild a cached value that falls out of memory. The net result is therefore that the same function may run multiple times on the same data depending on what happens on the cluster.  The end result is that for accumulators used in actions, Spark applies each task’s update to each accumulator only once. Thus, if we want a reliable absolute value counter, regardless of failures or multiple evaluations, we must put it inside an action like foreach(). • Custom Accumulators  Spark supports accumulators of type Int, Double, Long, and Float.  Spark also includes an API to define custom accumulator types and custom aggregation operations.  Custom accumulators need to extend AccumulatorParam.  we can use any operation for add, provided that operation is commutative and associative.  An operation op is commutative if a op b = b op a for all values a, b.  An operation op is associative if (a op b) op c = a op (b op c) for all values a, b, and c.
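  A sketch of a custom accumulator using the AccumulatorParam API referred to above (the pre-2.0 API these slides use); it tracks a maximum, which is both commutative and associative:

    import org.apache.spark.AccumulatorParam

    // max is commutative and associative, so it is a valid accumulator operation
    object MaxAccumulatorParam extends AccumulatorParam[Int] {
      def zero(initialValue: Int): Int = Int.MinValue
      def addInPlace(a: Int, b: Int): Int = math.max(a, b)
    }

    val maxSeen = sc.accumulator(Int.MinValue)(MaxAccumulatorParam)
    sc.parallelize(List(3, 9, 2, 7)).foreach(x => maxSeen += x)   // foreach is an action
    println(maxSeen.value)   // 9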
  • 43. Broadcast Variables
  • The shared variable broadcast variable allows the program to efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations.
   It is expensive to send a large value from the master with each task, and if we used the same object later (for example, if we ran the same code for other files), it would be sent to each node again.
   val c = sc.broadcast(Array(100, 200))
   val a = sc.parallelize(List("This is Krishna", "Sureka learning the Spark"))
   val d = a.flatMap(x => x.split(" ")).map(x => x + c.value(1))
  • Optimizing Broadcasts
   When we are broadcasting large values, it is important to choose a data serialization format that is both fast and compact.
  • 44. Working on a Per-Partition Basis
   Working with data on a per-partition basis allows us to avoid redoing setup work for each data item. Operations like opening a database connection or creating a random-number generator are examples of setup steps that we wish to avoid doing for each element.
   Spark has per-partition versions of map and foreach to help reduce the cost of these operations by letting you run code only once for each partition of an RDD.
   Example 1 (mapPartitions)
   val a = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
   scala> val b = a.mapPartitions((x: Iterator[Int]) => {println("Hello"); x.toList.map(y => y + 1).iterator})
   scala> val b = a.mapPartitions((x: Iterator[Int]) => {println("Hello"); x.map(y => y + 1)})
   scala> b.collect
   Hello
   Hello
   res20: Array[Int] = Array(2, 3, 4, 5, 6, 7)
   scala> a.mapPartitions(x => x.filter(y => y > 1)).collect
   res35: Array[Int] = Array(2, 3, 4, 5, 6)
  • 45. Working on a Per-Partition Basis
   Example 2 (mapPartitionsWithIndex)
   scala> val b = a.mapPartitionsWithIndex((index: Int, x: Iterator[Int]) => {println("Hello from " + index); if (index == 1) {x.map(y => y + 1)} else {(x.map(y => y + 200))}})
   b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[28] at mapPartitionsWithIndex at <console>:29
   scala> val b = a.mapPartitionsWithIndex((index, x) => {println("Hello from " + index); if (index == 1) {x.map(y => y + 1)} else {(x.map(y => y + 200))}})
   scala> b.collect
   Hello from 0
   Hello from 1
   res44: Array[Int] = Array(201, 202, 203, 5, 6, 7)
  • 48.
   scala> case class person(id: Int, name: String, sal: Int)
   defined class person

   scala> person
   res38: person.type = person

   scala> val a = sc.textFile("file:///home/vm4learning/Desktop/spark/emp")
   a: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[36] at textFile at <console>:27

   scala> a.collect
   res39: Array[String] = Array(837798 Krishna 5000000, 843123 Sandipta 10000000, 844066 Abhishek 40000, 843289 ZohriSahab 9999999)
  • 49.
   scala> val b = a.map(_.split(" "))
   b: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[37] at map at <console>:29

   scala> b.collect
   res40: Array[Array[String]] = Array(Array(837798, Krishna, 5000000), Array(843123, Sandipta, 10000000), Array(844066, Abhishek, 40000), Array(843289, ZohriSahab, 9999999))

   scala> val c = b.map(x => person(x(0).toInt, x(1), x(2).toInt))
   c: org.apache.spark.rdd.RDD[person] = MapPartitionsRDD[38] at map at <console>:33

   scala> c.collect
   res41: Array[person] = Array(person(837798,Krishna,5000000), person(843123,Sandipta,10000000), person(844066,Abhishek,40000), person(843289,ZohriSahab,9999999))

   scala> val d = c.toDF()
   d: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int]
  • 50.
   scala> d.collect
   res42: Array[org.apache.spark.sql.Row] = Array([837798,Krishna,5000000], [843123,Sandipta,10000000], [844066,Abhishek,40000], [843289,ZohriSahab,9999999])

   scala> d.registerTempTable("Emp")

   scala> val e = sqlContext.sql("""select id,name,sal from Emp where sal < 9999999""")
   e: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int]

   scala> e.collect
   res44: Array[org.apache.spark.sql.Row] = Array([837798,Krishna,5000000], [844066,Abhishek,40000])

   scala> e
   res46: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int]

   scala> val a1 = sc.textFile("file:///home/vm4learning/Desktop/spark/dept")
   a1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile at <console>:27
  • 51.
   scala> a1.collect
   res49: Array[String] = Array(843123 CEO, 844066 SparkCommiter, 843289 BoardHSBC)

   scala> case class pdept(id: Int, dept: String)
   defined class pdept

   scala> val a2 = a1.map(x => x.split(" ")).map(x => pdept(x(0).toInt, x(1))).toDF()
   a2: org.apache.spark.sql.DataFrame = [id: int, dept: string]

   scala> a2.collect
   res50: Array[org.apache.spark.sql.Row] = Array([843123,CEO], [844066,SparkCommiter], [843289,BoardHSBC])
  • 52.
   scala> a2.registerTempTable("Dept")

   scala> val e = sqlContext.sql("""select * from Emp join Dept on Emp.id = Dept.id where sal < 9999999""")
   e: org.apache.spark.sql.DataFrame = [id: int, name: string, sal: int, id: int, dept: string]

   scala> e.collect
   res52: Array[org.apache.spark.sql.Row] = Array([844066,Abhishek,40000,844066,SparkCommiter])

   scala> sqlContext.cacheTable("Dept")

   scala> sqlContext.uncacheTable("Dept")
  • 53. Thank You • Question? • Feedback? explorehadoop@gmail.com