This Edureka Spark Tutorial will help you understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners and professionals who want to learn or brush up on Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
IoT: 50 Billion Devices By 2020
➢ Rapid adoption rate of digital infrastructure: 5x faster than electricity & telephony
➢ By 2020, smart objects (tablets, laptops, phones, sensors, smart devices and clustered systems) are projected to reach 50 billion, or roughly "~6 things online" per person
Figure: Growth of smart objects versus world population (in billions) from 2003 to 2020, showing the inflection point where connected devices overtook people
Big Data Analytics
➢ Big Data Analytics is the process of examining large data sets to uncover
hidden patterns, unknown correlations, market trends, customer
preferences and other useful business information.
➢ Big Data Analytics is of two types:
1. Batch Analytics
2. Real-Time Analytics
Batch vs Real Time Analytics
➢ Analytics based on the data collected over a period of time is Batch Analytics
➢ Analytics based on immediate data for instant results is Real-Time (Stream) Analytics
Figure: Batch analytics stores client data over time and processes it later via ETL, while real-time analytics serves clients within milliseconds
Batch Processing In Hadoop
➢ Hadoop processes the data stored over a period of time using MapReduce
Figure: Input data collected on Day 1 through Day N is processed later in batches, introducing a time lag
Real Time Processing In Spark
➢ Spark processes the input data as it arrives and thus overcomes the time lag issue
Figure: Input data from Day 1 through Day N is processed as it is received, with no time lag
Spark Vs Hadoop
Hadoop implements batch processing on Big Data. It thus cannot deliver on our real-time use case needs.
Our Requirements:
➢ Process data in real-time
➢ Handle input from multiple sources
➢ Easy to use
➢ Faster processing
Spark Success Story
➢ Twitter Sentiment Analysis with Spark: trending topics can be used to create campaigns and attract a larger audience, while sentiment helps in crisis management, service adjusting and target marketing
➢ NYSE: real-time analysis of stock market data
➢ Banking: credit card fraud detection
➢ Genomic sequencing
What Is Spark?
➢ Apache Spark is an open-source cluster-computing framework for real-time processing developed by the Apache Software Foundation
➢ Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
➢ It was built on top of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations
Figure: Data parallelism in Spark - serial vs. parallel execution, showing the reduction in time
Figure: Real-time processing in Spark
Why Spark?
Features:
➢ Speed: up to 100x faster than Hadoop MapReduce for large-scale data processing
➢ Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities
➢ Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
➢ Polyglot: can be programmed in Scala, Java, Python and R
Spark And Hadoop
➢ Spark can run on top of the Hadoop Distributed File System (HDFS) to leverage its distributed, replicated storage
➢ Spark can be used along with MapReduce in the same Hadoop cluster, or can be used alone as a processing framework
➢ Spark applications can also be run on YARN (Hadoop NextGen)
Spark And Hadoop
➢ Spark is not intended to replace Hadoop; rather, it can be regarded as an extension to it
➢ MapReduce and Spark are used together, where MapReduce handles batch processing and Spark handles real-time processing
Spark Ecosystem
➢ Spark Core Engine: the core engine for the entire Spark framework; provides utilities and architecture for the other components
➢ Spark SQL (SQL): used for structured data; can run unmodified Hive queries on an existing Hadoop deployment
➢ Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data
➢ MLlib (Machine Learning): machine learning libraries built on top of Spark
➢ GraphX (Graph Computation): graph computation engine (similar to Giraph); combines data-parallel and graph-parallel concepts
➢ SparkR (R on Spark): package for the R language that enables R users to leverage Spark's power from the R shell
Spark Ecosystem
On top of these components, Spark also provides:
➢ DataFrames: a tabular data abstraction introduced by Spark SQL
➢ ML Pipelines: make it easier to combine multiple algorithms or workflows
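As a rough illustration of how ML Pipelines chain multiple stages together, here is a minimal sketch using the spark.ml API; the column names and the tiny training DataFrame are illustrative assumptions, not part of the original slides.
//Minimal ML Pipeline sketch: the column names and sample rows below are illustrative
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineDemo {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("PipelineDemo").master("local[2]").getOrCreate()

    //A tiny labelled DataFrame of (id, text, label) rows
    val training = spark.createDataFrame(Seq(
      (0L, "spark is fast", 1.0),
      (1L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    //A Pipeline combines multiple stages: tokenizer -> term frequencies -> logistic regression
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    //Fitting the pipeline runs all stages in sequence and produces a single model
    val model = pipeline.fit(training)
    spark.stop()
  }
}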
Spark Core
Spark Core is the base engine for large-scale parallel and distributed data
processing
It is responsible for:
Memory management and fault recovery
Scheduling, distributing and monitoring jobs on a cluster
Interacting with storage systems
Figure: Spark Core job on a cluster, transforming the rows of a table into a result
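To make Spark Core concrete, here is a minimal sketch (not from the original slides) that creates a SparkContext and runs a simple parallel transformation; the app name and sample numbers are illustrative.
//Minimal Spark Core sketch: the app name and sample numbers are illustrative
import org.apache.spark.{SparkConf, SparkContext}

object SparkCoreDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkCoreDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    //Distribute a local collection as an RDD, transform it in parallel and collect the result
    val numbers = sc.parallelize(1 to 10)
    val squares = numbers.map(n => n * n)
    println(squares.collect().mkString(", "))

    sc.stop()
  }
}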
Spark Streaming
Spark Streaming is used for processing real-time streaming data
It is a useful addition to the core Spark API
Spark Streaming enables high-throughput and fault-tolerant
stream processing of live data streams
The fundamental stream unit is the DStream, which is basically a series of RDDs used to process the real-time data
Figure: Streams in Spark Streaming
Spark Streaming
➢ Spark Streaming ingests data from a variety of sources (Kafka, Flume, HDFS/S3, Twitter, Kinesis) and pushes processed results out to HDFS, databases and live dashboards
Figure: Data flows from a variety of sources through the streaming engine to various storage systems
➢ The streaming engine divides the incoming input data stream into batches of input data and produces batches of processed data
Figure: Incoming streams of data divided into batches
➢ A DStream is the sequence of RDDs produced from these batches: the data from each time interval becomes one RDD (RDD @ Time 1, RDD @ Time 2, and so on)
Figure: Input data stream divided into discrete chunks of data
➢ Transformations such as flatMap apply to every RDD in a DStream; a flatMap operation turns a DStream of input data into a DStream of words
Figure: Extracting words from an InputStream using the flatMap operation
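A minimal sketch of the flatMap operation illustrated above; it assumes (illustratively) that lines of text arrive on a local socket and uses a 1-second batch interval.
//Minimal Spark Streaming sketch: the host, port and batch interval are illustrative
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordsDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("StreamingWordsDemo").setMaster("local[2]")
    //Each batch of the DStream covers a 1-second window of input data
    val ssc = new StreamingContext(conf, Seconds(1))

    //Lines DStream from a socket source; each underlying RDD holds one batch interval of data
    val lines = ssc.socketTextStream("localhost", 9999)
    //flatMap turns the lines DStream into a words DStream, as in the figure above
    val words = lines.flatMap(_.split(" "))
    words.print()

    ssc.start()
    ssc.awaitTermination()
  }
}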
Spark SQL Features
➢ SQL queries can be converted into RDDs for transformations
➢ Support for various data formats
Figure: Invoking RDD 2, produced from RDD 1 by a shuffle transformation, computes all the partitions of RDD 1
Spark SQL Flow Diagram
Spark SQL has the following libraries:
1. Data Source API
2. DataFrame API
3. Interpreter & Optimizer
4. SQL Service
The flow diagram represents a Spark SQL process using all four libraries in sequence
Figure: Data flows from the Data Source API into the DataFrame API (named columns over a Resilient Distributed Dataset), then through the Interpreter & Optimizer to the SQL Service
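A minimal sketch of this flow using the DataFrame API and a SQL query; it assumes a Spark 2.x SparkSession and an illustrative JSON file path, neither of which comes from the original slides.
//Minimal Spark SQL sketch: the SparkSession API and the file path are illustrative
import org.apache.spark.sql.SparkSession

object SparkSQLDemo {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("SparkSQLDemo")
      .master("local[2]")
      .getOrCreate()

    //The Data Source API reads the file; the DataFrame API exposes it as named columns
    val people = spark.read.json("/home/edureka/Downloads/people.json")
    people.createOrReplaceTempView("people")

    //The SQL Service with the Interpreter & Optimizer turns the query into RDD operations
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()
  }
}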
MLlib
Machine learning may be broken down into two classes of
algorithms:
1. Supervised algorithms use labelled data in which both the
input and output are provided to the algorithm
2. Unsupervised algorithms do not have the outputs in
advance. These algorithms are left to make sense of the data
without labels.
Machine Learning
➢ Supervised
  • Classification: Naïve Bayes, SVM
  • Regression: Linear, Logistic
➢ Unsupervised
  • Clustering: K-Means
  • Dimensionality Reduction: Principal Component Analysis, SVD
MLlib - Techniques
1. Classification: a family of supervised machine learning algorithms that designate an input as belonging to one of several pre-defined classes
Some common use cases for classification include:
i) Credit card fraud detection
ii) Email spam detection
2. Clustering: an algorithm groups objects into categories by analyzing similarities between input examples
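As a rough illustration of clustering, here is a minimal K-Means sketch with MLlib; the sample points and the choice of k = 2 are illustrative assumptions, not part of the original slides.
//Minimal K-Means clustering sketch: the sample points and k are illustrative
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("KMeansDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    //Group unlabelled 2-D points into 2 clusters over 20 iterations
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ))
    val model = KMeans.train(points, 2, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}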
MLlib - Techniques
3. Collaborative Filtering: collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part)
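A minimal collaborative-filtering sketch using MLlib's ALS; the user/product ratings and the training parameters below are illustrative assumptions, not part of the original slides.
//Minimal ALS collaborative-filtering sketch: the ratings and parameters are illustrative
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ALSDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ALSDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    //(user, product, rating) triples gathered from many users -- the "collaborative" part
    val ratings = sc.parallelize(Seq(
      Rating(1, 101, 5.0), Rating(1, 102, 3.0),
      Rating(2, 101, 4.0), Rating(2, 103, 1.0)
    ))
    //Train a matrix-factorisation model and recommend ("filter") two items for user 1
    val model = ALS.train(ratings, 10, 10, 0.01)
    model.recommendProducts(1, 2).foreach(println)

    sc.stop()
  }
}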
GraphX
Graph Concepts
A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that
connect them. The vertices are the objects and the edges are the relationships between them.
A directed graph is a graph where the edges have a direction associated with them. E.g. User Bob follows Carol on Twitter.
Figure: Vertices Bob and Carol connected by an edge with the relationship "Friends"; in the directed graph, the "Follows" edge points from Bob to Carol
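A minimal GraphX sketch of the directed "follows" relationship shown above; the vertex ids are illustrative assumptions.
//Minimal GraphX sketch: the vertex ids and the single "follows" edge are illustrative
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("GraphXDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    //Vertices are the objects (users), edges are the directed relationships between them
    val users = sc.parallelize(Seq((1L, "Bob"), (2L, "Carol")))
    val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
    val graph = Graph(users, follows)

    //Print each directed edge as "Bob follows Carol"
    graph.triplets.collect().foreach { t =>
      println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
    }

    sc.stop()
  }
}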
GraphX Use Cases
➢ PageRank: used to find the influencers in any network, such as a paper-citation network or a social media network
➢ Event Detection System: used to detect disasters such as hurricanes, earthquakes, tsunamis, forest fires and volcanoes, so as to provide warnings to alert people
➢ Financial Fraud Detection: used to monitor financial transactions and detect people involved in financial fraud and money laundering
➢ Analyze Business Trends: used along with machine learning to understand customer purchase trends, e.g. Uber, McDonald's, etc.
➢ Geographic Information Systems: used to develop functionalities on geographic information systems, such as watershed delineation
➢ Google Pregel: Pregel is Google's scalable and fault-tolerant platform with an API that is sufficiently flexible to express arbitrary graph algorithms
Use Case - Earthquake Detection Using Spark
Big Data Analytics is widely used in detection and prevention systems for disasters such as hurricanes, earthquakes, tsunamis, forest fires, volcanoes, etc.
We will be using earthquake detection for our use case.
An earthquake is the shaking of the surface of the Earth, resulting from the sudden release of energy in
the Earth's lithosphere that creates seismic waves.
Figure: Earth's tectonic plate movement
Figure: Earthquake-prone areas around the world
Use Case – Japan Earthquake Warning Model
At 2:46 p.m. on March 11, 2011, Japan's Earthquake Early Warning System
detected the Tohoku quake.
It immediately sent computer-generated alerts across the country to cell
phones, TVs, schools, factories, and transit systems.
As a result, schools had time to get all their students under desks, bullet
trains slowed to a stop and more than 16,000 elevators automatically shut
down when the alarm system went off.
In the sixty seconds before the giant temblor struck, roughly 52 million
people received text-message warnings that the quake was fast
approaching and that they needed to get out of harm's way.
Figure: Japan's Earthquake Early Warning System; the Tohoku quake and tsunami
Use Case - Problem Statement
To design a Real-Time Earthquake Detection Model to send life-saving alerts, which should improve its machine learning to provide near real-time computation results.
Requirements:
➢ Process data in real-time
➢ Handle input from multiple sources
➢ Easy to use
➢ Bulk transmission of alerts
Use Case - Dataset
The following is the dataset we are using for our earthquake prediction system.
Figure: Earthquake_ROC_Dataset.txt
The attributes of each row are as below:
1. Classification Index
2. First Activity Time*
3. Time Taken
4. Acceleration
5. Building Strength
6. Velocity
7. Sa
8. Sd
9. First Activity Time*
10. Time Taken
11. Acceleration
12. Building Strength
13. Velocity
14. Sa
15. Sd
*Columns 2-8 represent the Secondary Wave and columns 9-15 represent the Primary
Wave.
Use Case – ROC Curve
➢ A Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied
➢ There are points on the graph where the earthquake metrics exceed the ROC curve
➢ Such points represent those earthquakes with magnitudes greater than 6.0, and are categorized as a major hazard
➢ We will be calculating the ROC for our earthquake dataset and then visualizing the results to detect the hazardous occurrences of earthquakes
Use Case – Spark Machine Learning
//Imports needed for the Spark context, MLlib utilities, SVM and evaluation metrics
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

//Creating an object earth
object earth {
  def main(args: Array[String]) {
    //Creating a Spark Configuration and Spark Context
    val sparkConf = new SparkConf().setAppName("earth").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    //Loading the earthquake dataset as a LibSVM file
    val data = MLUtils.loadLibSVMFile(sc, "/home/edureka/Downloads/Earthquake_ROC_Dataset.txt")
    //Splitting the data into training (60%) and test (40%) sets
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)
    //Creating a model from the training data
    val numIterations = 100
    val model = SVMWithSGD.train(training, numIterations)
    model.clearThreshold()
    //Using the map transformation on the test RDD to score each point
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
    //Using Binary Classification Metrics on scoreAndLabels
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()
    //Displaying the area under the Receiver Operating Characteristic curve
    println("Area under ROC = " + auROC)
  }
}
Use Case - Visualizing Results
Figure: Earthquake ROC Dataset - Area Under ROC (y-axis, 0 to 700) plotted against Earthquake Points (x-axis, 1 to ~1600)
A Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.
Use Case - Visualizing Results
Figure: Earthquake ROC Dataset - two series showing the Area Under ROC against the Earthquake Points
Figure: Zoomed view of the points crossing the ROC curve
As we can observe from the chart, there are points on the graph that exceed the ROC curve.
Such points represent those earthquakes with magnitudes greater than 6.0 and are categorized as a major hazard.
Thus, armed with this knowledge, we could use Spark SQL to query an existing Hive table to retrieve email addresses and send people personalized warning emails.
Conclusion
Congrats!
We have thus demonstrated the power of Spark in Real-Time Data Analytics.
These hands-on examples should give you the confidence to work on any future Apache Spark projects you encounter.