This Edureka Spark Tutorial will help you understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners and professionals who want to learn or brush up on Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
IoT: 50 Billion Devices By 2020
➢ Rapid adoption rate of digital infrastructure: 5x faster than electricity & telephony
➢ By 2020, smart objects (tablets, laptops, phones, sensors, smart devices and clustered systems) are projected to reach 50 billion, or roughly "~6 things online" per person
Figure: Growth of smart objects versus world population (in billions) from 2003 to 2020, showing the inflection point where connected devices overtook people
Big Data Analytics
➢ Big Data Analytics is the process of examining large data sets to uncover
hidden patterns, unknown correlations, market trends, customer
preferences and other useful business information.
➢ Big Data Analytics is of two types:
1. Batch Analytics
2. Real-Time Analytics
Batch vs Real Time Analytics
➢ Analytics based on the data collected over a period of time is Batch Analytics
➢ Analytics based on immediate data for instant results is Real-Time (Stream) Analytics
Figure: Batch analytics stores client data over time and processes it later via ETL, while real-time analytics serves clients within milliseconds
Batch Processing In Hadoop
➢ Hadoop processes the data stored over a period of time using MapReduce
Figure: Input data collected on Day 1 through Day N is processed later in batches, introducing a time lag
Real Time Processing In Spark
➢ Spark processes the input data as it arrives and thus overcomes the time lag issue
Figure: Input data from Day 1 through Day N is processed as it is received, with no time lag
Spark Vs Hadoop
Hadoop implements batch processing on Big Data. It thus cannot deliver on our real-time use case needs.
Our Requirements:
➢ Process data in real-time
➢ Handle input from multiple sources
➢ Easy to use
➢ Faster processing
Spark Success Story
➢ Twitter Sentiment Analysis with Spark: trending topics can be used to create campaigns and attract a larger audience, while sentiment helps in crisis management, service adjusting and target marketing
➢ NYSE: real-time analysis of stock market data
➢ Banking: credit card fraud detection
➢ Genomic sequencing
What Is Spark?
➢ Apache Spark is an open-source cluster-computing framework for real-time processing developed by the Apache Software Foundation
➢ Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
➢ It was built on top of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations
Figure: Data parallelism in Spark - serial vs. parallel execution, showing the reduction in time
Figure: Real-time processing in Spark
Why Spark?
Features:
➢ Speed: up to 100x faster than Hadoop MapReduce for large-scale data processing
➢ Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities
➢ Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
➢ Polyglot: can be programmed in Scala, Java, Python and R
Spark And Hadoop
➢ Spark can run on top of the Hadoop Distributed File System (HDFS) to leverage its distributed, replicated storage
➢ Spark can be used along with MapReduce in the same Hadoop cluster, or can be used alone as a processing framework
➢ Spark applications can also be run on YARN (Hadoop NextGen)
Spark And Hadoop
➢ Spark is not intended to replace Hadoop; rather, it can be regarded as an extension to it
➢ MapReduce and Spark are used together, where MapReduce handles batch processing and Spark handles real-time processing
Spark Ecosystem
➢ Spark Core Engine: the core engine for the entire Spark framework; provides utilities and architecture for the other components
➢ Spark SQL (SQL): used for structured data; can run unmodified Hive queries on an existing Hadoop deployment
➢ Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data
➢ MLlib (Machine Learning): machine learning libraries built on top of Spark
➢ GraphX (Graph Computation): graph computation engine (similar to Giraph); combines data-parallel and graph-parallel concepts
➢ SparkR (R on Spark): package for the R language that enables R users to leverage Spark's power from the R shell
Spark Ecosystem
On top of these components, Spark also provides:
➢ DataFrames: a tabular data abstraction introduced by Spark SQL
➢ ML Pipelines: make it easier to combine multiple algorithms or workflows
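As a rough illustration of how ML Pipelines chain multiple stages together, here is a minimal sketch using the spark.ml API; the column names and the tiny training DataFrame are illustrative assumptions, not part of the original slides.
//Minimal ML Pipeline sketch: the column names and sample rows below are illustrative
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineDemo {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("PipelineDemo").master("local[2]").getOrCreate()

    //A tiny labelled DataFrame of (id, text, label) rows
    val training = spark.createDataFrame(Seq(
      (0L, "spark is fast", 1.0),
      (1L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    //A Pipeline combines multiple stages: tokenizer -> term frequencies -> logistic regression
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    //Fitting the pipeline runs all stages in sequence and produces a single model
    val model = pipeline.fit(training)
    spark.stop()
  }
}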
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data
processing
It is responsible for:
Memory management and fault recovery
Scheduling, distributing and monitoring jobs on a cluster
Interacting with storage systems
Figure: Spark Core job on a cluster, transforming the rows of a table into a result
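To make Spark Core concrete, here is a minimal sketch (not from the original slides) that creates a SparkContext and runs a simple parallel transformation; the app name and sample numbers are illustrative.
//Minimal Spark Core sketch: the app name and sample numbers are illustrative
import org.apache.spark.{SparkConf, SparkContext}

object SparkCoreDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkCoreDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    //Distribute a local collection as an RDD, transform it in parallel and collect the result
    val numbers = sc.parallelize(1 to 10)
    val squares = numbers.map(n => n * n)
    println(squares.collect().mkString(", "))

    sc.stop()
  }
}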
Spark Streaming
Spark Streaming is used for processing real-time streaming data
It is a useful addition to the core Spark API
Spark Streaming enables high-throughput and fault-tolerant
stream processing of live data streams
The fundamental stream unit is the DStream, which is basically a series of RDDs used to process the real-time data
Figure: Streams in Spark Streaming
Spark Streaming
➢ Spark Streaming ingests data from a variety of sources (Kafka, Flume, HDFS/S3, Twitter, Kinesis) and pushes processed results out to HDFS, databases and live dashboards
Figure: Data flows from a variety of sources through the streaming engine to various storage systems
➢ The streaming engine divides the incoming input data stream into batches of input data and produces batches of processed data
Figure: Incoming streams of data divided into batches
➢ A DStream is the sequence of RDDs produced from these batches: the data from each time interval becomes one RDD (RDD @ Time 1, RDD @ Time 2, and so on)
Figure: Input data stream divided into discrete chunks of data
➢ Transformations such as flatMap apply to every RDD in a DStream; a flatMap operation turns a DStream of input data into a DStream of words
Figure: Extracting words from an InputStream using the flatMap operation
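A minimal sketch of the flatMap operation illustrated above; it assumes (illustratively) that lines of text arrive on a local socket and uses a 1-second batch interval.
//Minimal Spark Streaming sketch: the host, port and batch interval are illustrative
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordsDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("StreamingWordsDemo").setMaster("local[2]")
    //Each batch of the DStream covers a 1-second window of input data
    val ssc = new StreamingContext(conf, Seconds(1))

    //Lines DStream from a socket source; each underlying RDD holds one batch interval of data
    val lines = ssc.socketTextStream("localhost", 9999)
    //flatMap turns the lines DStream into a words DStream, as in the figure above
    val words = lines.flatMap(_.split(" "))
    words.print()

    ssc.start()
    ssc.awaitTermination()
  }
}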
Spark SQL Features
➢ SQL queries can be converted into RDDs for transformations
➢ Support for various data formats
Figure: Invoking RDD 2, produced from RDD 1 by a shuffle transformation, computes all the partitions of RDD 1
Spark SQL Flow Diagram
Spark SQL has the following libraries:
1. Data Source API
2. DataFrame API
3. Interpreter & Optimizer
4. SQL Service
The flow diagram represents a Spark SQL process using all four libraries in sequence
Figure: Data flows from the Data Source API into the DataFrame API (named columns over a Resilient Distributed Dataset), then through the Interpreter & Optimizer to the SQL Service
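A minimal sketch of this flow using the DataFrame API and a SQL query; it assumes a Spark 2.x SparkSession and an illustrative JSON file path, neither of which comes from the original slides.
//Minimal Spark SQL sketch: the SparkSession API and the file path are illustrative
import org.apache.spark.sql.SparkSession

object SparkSQLDemo {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("SparkSQLDemo")
      .master("local[2]")
      .getOrCreate()

    //The Data Source API reads the file; the DataFrame API exposes it as named columns
    val people = spark.read.json("/home/edureka/Downloads/people.json")
    people.createOrReplaceTempView("people")

    //The SQL Service with the Interpreter & Optimizer turns the query into RDD operations
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()
  }
}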
MLlib
Machine learning may be broken down into two classes of
algorithms:
1. Supervised algorithms use labelled data in which both the
input and output are provided to the algorithm
2. Unsupervised algorithms do not have the outputs in
advance. These algorithms are left to make sense of the data
without labels.
Machine Learning
➢ Supervised
  • Classification: Naïve Bayes, SVM
  • Regression: Linear, Logistic
➢ Unsupervised
  • Clustering: K-Means
  • Dimensionality Reduction: Principal Component Analysis, SVD
MLlib - Techniques
1. Classification: a family of supervised machine learning algorithms that designate an input as belonging to one of several pre-defined classes
Some common use cases for classification include:
i) Credit card fraud detection
ii) Email spam detection
2. Clustering: an algorithm groups objects into categories by analyzing similarities between input examples
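As a rough illustration of clustering, here is a minimal K-Means sketch with MLlib; the sample points and the choice of k = 2 are illustrative assumptions, not part of the original slides.
//Minimal K-Means clustering sketch: the sample points and k are illustrative
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("KMeansDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    //Group unlabelled 2-D points into 2 clusters over 20 iterations
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ))
    val model = KMeans.train(points, 2, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}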
MLlib - Techniques
3. Collaborative Filtering: collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part)
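A minimal collaborative-filtering sketch using MLlib's ALS; the user/product ratings and the training parameters below are illustrative assumptions, not part of the original slides.
//Minimal ALS collaborative-filtering sketch: the ratings and parameters are illustrative
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ALSDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ALSDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    //(user, product, rating) triples gathered from many users -- the "collaborative" part
    val ratings = sc.parallelize(Seq(
      Rating(1, 101, 5.0), Rating(1, 102, 3.0),
      Rating(2, 101, 4.0), Rating(2, 103, 1.0)
    ))
    //Train a matrix-factorisation model and recommend ("filter") two items for user 1
    val model = ALS.train(ratings, 10, 10, 0.01)
    model.recommendProducts(1, 2).foreach(println)

    sc.stop()
  }
}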
GraphX
Graph Concepts
A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that
connect them. The vertices are the objects and the edges are the relationships between them.
A directed graph is a graph where the edges have a direction associated with them. E.g. User Bob follows Carol on Twitter.
Figure: Vertices Bob and Carol connected by an edge with the relationship "Friends"; in the directed graph, the "Follows" edge points from Bob to Carol
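A minimal GraphX sketch of the directed "follows" relationship shown above; the vertex ids are illustrative assumptions.
//Minimal GraphX sketch: the vertex ids and the single "follows" edge are illustrative
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("GraphXDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    //Vertices are the objects (users), edges are the directed relationships between them
    val users = sc.parallelize(Seq((1L, "Bob"), (2L, "Carol")))
    val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
    val graph = Graph(users, follows)

    //Print each directed edge as "Bob follows Carol"
    graph.triplets.collect().foreach { t =>
      println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
    }

    sc.stop()
  }
}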
GraphX Use Cases
➢ PageRank: used to find the influencers in any network, such as a paper-citation network or a social media network
➢ Event Detection System: used to detect disasters such as hurricanes, earthquakes, tsunamis, forest fires and volcanoes, so as to provide warnings to alert people
➢ Financial Fraud Detection: used to monitor financial transactions and detect people involved in financial fraud and money laundering
➢ Analyze Business Trends: used along with machine learning to understand customer purchase trends, e.g. Uber, McDonald's, etc.
➢ Geographic Information Systems: used to develop functionalities on geographic information systems, such as watershed delineation
➢ Google Pregel: Pregel is Google's scalable and fault-tolerant platform with an API that is sufficiently flexible to express arbitrary graph algorithms
Use Case - Earthquake Detection Using Spark
Big Data Analytics is widely used in detection and prevention systems for disasters such as hurricanes, earthquakes, tsunamis, forest fires, volcanoes, etc.
We will be using earthquake detection for our use case.
An earthquake is the shaking of the surface of the Earth, resulting from the sudden release of energy in
the Earth's lithosphere that creates seismic waves.
Figure: Earth's tectonic plate movement
Figure: Earthquake-prone areas around the world
Use Case – Japan Earthquake Warning Model
At 2:46 p.m. on March 11, 2011, Japan's Earthquake Early Warning System
detected the Tohoku quake.
It immediately sent computer-generated alerts across the country to cell
phones, TVs, schools, factories, and transit systems.
As a result, schools had time to get all their students under desks, bullet
trains slowed to a stop and more than 16,000 elevators automatically shut
down when the alarm system went off.
In the sixty seconds before the giant temblor struck, roughly 52 million
people received text-message warnings that the quake was fast
approaching and that they needed to get out of harm's way.
Figure: Japan's Earthquake Early Warning System; the Tohoku quake and tsunami
Use Case - Problem Statement
To design a Real-Time Earthquake Detection Model to send life-saving alerts, which should improve its machine learning to provide near real-time computation results.
Requirements:
➢ Process data in real-time
➢ Handle input from multiple sources
➢ Easy to use
➢ Bulk transmission of alerts
Use Case - Dataset
The following is the dataset we are using for our earthquake prediction system.
Figure: Earthquake_ROC_Dataset.txt
The attributes of each row are as below:
1. Classification Index
2. First Activity Time*
3. Time Taken
4. Acceleration
5. Building Strength
6. Velocity
7. Sa
8. Sd
9. First Activity Time*
10. Time Taken
11. Acceleration
12. Building Strength
13. Velocity
14. Sa
15. Sd
*Columns 2-8 represent the Secondary Wave and columns 9-15 represent the Primary
Wave.
Use Case – ROC Curve
➢ A Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied
➢ There are points on the graph where the earthquake metrics exceed the ROC curve
➢ Such points represent those earthquakes with magnitudes greater than 6.0, and are categorized as a major hazard
➢ We will be calculating the ROC for our earthquake dataset and then visualizing the results to detect the hazardous occurrences of earthquakes
Use Case – Spark Machine Learning
//Imports needed for the Spark context, MLlib utilities, SVM and evaluation metrics
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

//Creating an object earth
object earth {
  def main(args: Array[String]) {
    //Creating a Spark Configuration and Spark Context
    val sparkConf = new SparkConf().setAppName("earth").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    //Loading the earthquake dataset as a LibSVM file
    val data = MLUtils.loadLibSVMFile(sc, "/home/edureka/Downloads/Earthquake_ROC_Dataset.txt")
    //Splitting the data into training (60%) and test (40%) sets
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)
    //Creating a model from the training data
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)
    model.clearThreshold()
    //Using the map transformation on the test RDD to score each point
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
    //Using Binary Classification Metrics on scoreAndLabels
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()
    //Displaying the area under the Receiver Operating Characteristic curve
    println("Area under ROC = " + auROC)
  }
}
Use Case - Visualizing Results
Figure: Earthquake ROC Dataset - Area Under ROC (y-axis, 0 to 700) plotted against Earthquake Points (x-axis, 1 to ~1600)
A Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.
Use Case - Visualizing Results
Figure: Earthquake ROC Dataset - two series showing the Area Under ROC against the Earthquake Points
Figure: Zoomed view of the points crossing the ROC curve
As we can observe from the chart, there are points on the graph that exceed the ROC curve.
Such points represent those earthquakes with magnitudes greater than 6.0 and are categorized as a major hazard.
Thus, armed with this knowledge, we could use Spark SQL to query an existing Hive table to retrieve email addresses and send people personalized warning emails.
Conclusion
Congrats!
We have thus demonstrated the power of Spark in Real-Time Data Analytics.
These hands-on examples should give you the confidence to work on any future Apache Spark projects you encounter.