Machine Learning (on Spark)
BDS Session
Session objective
Introduction Quick Recap Spark
Machine Learning (on Spark)
Spark: Introduction, Architecture and Features
Spark: Resilient Distributed Datasets, Transformation, Examples
Regression, Classification, Collaborative Filtering, & Clustering.
SUMMARY
Shortcomings of MapReduce
 Forces your data processing into Map and Reduce
 Other workflows are missing, e.g. join, filter, flatMap, groupByKey,
union, intersection, …
 Based on “acyclic data flow” from disk to disk (HDFS)
 Reads and writes to disk before and after Map and Reduce (stateless
machine)
 Not efficient for iterative tasks, e.g. machine learning
 Only Java natively supported
 Support for other languages needed
 Only for batch processing
 Interactivity, streaming data
Introduction
One Solution is Apache Spark
 A new general framework, which solves many of the shortcomings
of MapReduce
 Capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN,
HBase, S3, …
 Has many other workflows, e.g. join, filter, flatMap, distinct,
groupByKey, reduceByKey, sortByKey, collect, count, first, …
 (around 30 efficient distributed operations)
 In-memory caching of data (for iterative, graph, and machine
learning algorithms, etc.)
 Native Scala, Java, Python, and R support
 Supports interactive shells for exploratory data analysis
 Spark API is extremely simple to use
 Developed at AMPLab, UC Berkeley; now maintained by Databricks.com
Introduction Quick Recap Spark
 expressive computing system, not limited to the map-
reduce model
 exploits system memory
 avoids saving intermediate results to disk
 caches data for repetitive queries (e.g. for machine
learning; see the sketch below)
 compatible with Hadoop
Spark Ideas
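To make these ideas concrete, here is a minimal PySpark sketch, assuming a running SparkContext `sc`; the input path is a placeholder. Caching keeps a dataset in memory so repeated queries avoid re-reading it from disk.

# Minimal sketch, assuming a live SparkContext `sc`; the path is hypothetical.
lines = sc.textFile("hdfs://.../events.log")
words = lines.flatMap(lambda line: line.split())

words.cache()              # keep the dataset in memory for repeated queries
words.count()              # first action reads from disk and fills the cache
words.distinct().count()   # later queries reuse the in-memory data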
Spark Uses Memory instead of Disk
[Diagram: data sharing across iterations.
Hadoop (Use Disk for Data Sharing): each iteration begins with an HDFS read and
ends with an HDFS write.
Spark (In-Memory Data Sharing): one initial HDFS read, then iterations pass
data through memory.]
Spark: Introduction
Sort competition (Daytona Gray sort benchmark: sort 100 TB of data, 1 trillion records)

                          Hadoop MR Record (2013)         Spark Record (2014)
Data Size                 102.5 TB                        100 TB
Elapsed Time              72 mins                         23 mins
# Nodes                   2100                            206
# Cores                   50400 physical                  6592 virtualized
Cluster disk throughput   3150 GB/s (est.)                618 GB/s
Network                   dedicated data center, 10Gbps   virtualized (EC2), 10Gbps network
Sort rate                 1.42 TB/min                     4.27 TB/min
Sort rate/node            0.67 GB/min                     20.7 GB/min

Spark: 3x faster with 1/10 the nodes.
Source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Spark: Introduction
Apache Spark supports data analysis, machine learning, graph processing, streaming
data, and more. It can read from and write to a range of data sources and allows
development in multiple languages.
[Stack diagram:
Languages: Scala, Java, Python, R, SQL
Libraries: Spark SQL and DataFrames, Spark Streaming, MLlib and ML Pipelines, GraphX
Engine: Spark Core
Data sources: Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL,
and HPC-style file systems (GlusterFS, Lustre)]
Spark: Introduction
Spark: Architecture
Spark: Features
 Streaming Data
 Machine Learning
 Interactive Analysis
 Data Warehousing
 Batch Processing
 Exploratory Data Analysis
 Graph Data Analysis
 Spatial (GIS) Data Analysis
 And many more
Spark’s Main Use Cases
 Fingerprint Matching
 Developed a Spark-based fingerprint minutia detection and fingerprint
matching code
 Twitter Sentiment Analysis
 Developed a Spark-based sentiment analysis code for a Twitter dataset
Spark’s Main Use Cases
 Uber – the online taxi company gathers terabytes of event data from its mobile
users every day
 Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL
pipeline
 Converts raw unstructured event data into structured data as it is collected
 Uses it further for more complex analytics and optimization of operations
 Pinterest – uses a Spark ETL pipeline
 Leverages Spark Streaming to gain immediate insight into how users all over
the world are engaging with Pins, in real time
 Can make more relevant recommendations as people navigate the site
 Recommends related Pins
 Determines which products to buy, or destinations to visit
Spark’s Main Use Cases
Here are a few other real-world use cases:
 Conviva – 4 million video feeds per month
 This streaming-video company is second only to YouTube
 Uses Spark to reduce customer churn by optimizing video streams and managing live
video traffic
 Maintains a consistently smooth, high-quality viewing experience
 Capital One – uses Spark and data science algorithms to understand customers
better
 Developing the next generation of financial products and services
 Finding attributes and patterns of increased probability for fraud
 Netflix – leverages Spark for insight into user viewing habits and then
recommends movies to users
 User data is also used for content creation
Spark’s Main Use Cases
 Even though Spark is versatile, that doesn’t mean Spark’s in-memory
capabilities are the best fit for all use cases:
 For many simple use cases Apache MapReduce and Hive might be a
more appropriate choice
 Spark was not designed as a multi-user environment
 Spark users need to know whether the memory they have is sufficient
for a dataset
 Adding more users adds complications, since the users have to
coordinate memory usage to run code
Spark: when not to use
 Clouds and supercomputers are collections of
computers networked together in a
datacenter
 Clouds have different networking, I/O, CPU
and cost trade-offs than supercomputers
 Cloud workloads are data oriented vs.
computation oriented and are less closely
coupled than supercomputers
 Principles of parallel computing same on both
 Apache Hadoop and Spark vs. Open MPI
HPC and Big Data Convergence
 RDDs (Resilient Distributed Datasets) are data containers
 All the different processing components in Spark share
the same abstraction, called the RDD
 As applications share the RDD abstraction, you can mix
different kinds of transformations to create new RDDs
 Created by parallelizing a collection or reading a file (see the
sketch below)
 Fault tolerant
Resilient Distributed Datasets (RDDs)
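As a quick illustration of the two creation paths above, here is a sketch assuming a running SparkContext `sc`, with a placeholder file path:

nums = sc.parallelize([1, 2, 3, 4, 5])        # RDD from an in-memory collection
lines = sc.textFile("hdfs://.../input.txt")   # RDD from a file (placeholder path)

# Fault tolerance comes from lineage: an RDD remembers how it was derived,
# so a lost partition can be recomputed from its parents.
squares = nums.map(lambda x: x * x)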
RDD abstraction
 Resilient Distributed Datasets
 partitioned collection of records
 spread across the cluster
 read-only
 caching dataset in memory
 different storage levels available
 fallback to disk possible
Introduction Quick Recap Spark
RDD operations
 transformations build RDDs through deterministic
operations on other RDDs
 transformations include map, filter, join
 lazy operations
 actions return a value or export data
 actions include count, collect, save
 actions trigger execution (see the sketch below)
Introduction Quick Recap Spark
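A small sketch of the lazy/eager split, assuming a live `sc`: the transformations only build a plan, and the actions at the end force the computation.

rdd = sc.parallelize(range(10))
evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet
doubled = evens.map(lambda x: x * 2)       # still just building the lineage

print(doubled.count())     # action: triggers execution -> 5
print(doubled.collect())   # action: returns [0, 4, 8, 12, 16] to the driver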
Job example
val log = sc.textFile("hdfs://...")
val errors = log.filter(_.contains("ERROR"))
errors.cache()
errors.filter(_.contains("I/O")).count()
errors.filter(_.contains("timeout")).count()

[Diagram: the driver distributes tasks to three workers; each worker reads one
HDFS block (Block1, Block2, Block3) and caches its filtered partition; the
count() action triggers execution.]
Introduction Quick Recap Spark
RDD partition-level view
[Diagram. Dataset-level view: log (HadoopRDD, path = hdfs://...) is filtered into
errors (FilteredRDD, func = _.contains(…), shouldCache = true).
Partition-level view: each cached partition of errors is computed by its own task
(Task 1, Task 2, …).]
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
Introduction Quick Recap Spark
Job scheduling
rdd1.join(rdd2)
    .groupBy(…)
    .filter(…)

[Diagram of the scheduling pipeline:
RDD Objects: build the operator DAG.
DAGScheduler: split the graph into stages of tasks; submit each stage as ready
(DAG in, TaskSet out).
TaskScheduler: launch tasks via the cluster manager; retry failed or straggling
tasks.
Worker: execute tasks in threads; store and serve blocks through the Block
manager.]
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
Introduction Quick Recap Spark
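From user code, one way to peek at what the DAGScheduler will do is `toDebugString`, which prints an RDD's lineage with its shuffle (stage) boundaries. A sketch, assuming a live `sc`:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
grouped = pairs.groupByKey().mapValues(list)

# Print the lineage graph; indentation steps mark stage (shuffle) boundaries.
# PySpark returns the string as bytes, hence the decode().
print(grouped.toDebugString().decode())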
Available APIs
 You can write in Java, Scala or Python
 interactive interpreter: Scala & Python only
 standalone applications: any
 performance: Java & Scala are faster thanks to static
typing
Introduction Quick Recap Spark
Summary
 concept not limited to single-pass map-reduce
 avoids storing intermediate results on disk or HDFS
 speeds up computations when reusing datasets
Introduction Quick Recap Spark
 DataFrames (DFs) are another kind of distributed dataset, organized into
named columns
 Similar to a relational database table, a Python Pandas DataFrame, or an R
data frame
 Immutable once constructed
 Track lineage
 Enable distributed computations
 How to construct DataFrames (see the sketch below)
 Read from file(s)
 Transform an existing DF (Spark or Pandas)
 Parallelize a Python collection (list)
 Apply transformations and actions
DataFrames & SparkSQL
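The three construction routes above, sketched in PySpark; a Spark 2.x-style SparkSession is assumed, and the names, path, and data are illustrative:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# 1. Read from file(s) -- placeholder path
df_file = spark.read.json("hdfs://.../users.json")

# 2. Transform an existing DataFrame (here, convert from Pandas)
pdf = pd.DataFrame({"name": ["Ann", "Bob"], "age": [19, 23]})
df_pandas = spark.createDataFrame(pdf)

# 3. Parallelize a Python collection (a list of tuples) with column names
df_list = spark.createDataFrame([("Cid", 20), ("Dee", 25)], ["name", "age"])

# Apply transformations and actions
df_list.filter(df_list.age < 21).show()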
DataFrame example
# Create a new DataFrame that contains "students"
students = users.filter(users.age < 21)
# Alternatively, using Pandas-like syntax
students = users[users.age < 21]
# Count the number of student users by gender
students.groupBy("gender").count()
# Join young students with another DataFrame called logs
students.join(logs, logs.userId == students.userId,
"left_outer")
DataFrames & SparkSQL
 RDDs provide a low-level interface into Spark
 DataFrames have a schema
 DataFrames are cached and optimized by Spark
 DataFrames are built on top of the RDDs and the
core Spark API
Example: performance, RDDs vs. DataFrames (see the sketch below)
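Since the performance comparison itself was a chart, here is a hedged sketch of the same aggregation written both ways; the DataFrame version is declarative, so Spark's Catalyst optimizer can plan it, which is typically faster, especially from Python. A live `sc` and `spark` are assumed:

data = [("gold", 1), ("silver", 2), ("gold", 3)]

# RDD API: opaque functions, no query optimization
rdd_counts = (sc.parallelize(data)
                .map(lambda kv: (kv[0], 1))
                .reduceByKey(lambda a, b: a + b))
print(dict(rdd_counts.collect()))   # {'gold': 2, 'silver': 1}

# DataFrame API: schema-aware and optimized by Catalyst
df = spark.createDataFrame(data, ["label", "value"])
df.groupBy("label").count().show()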
Transformations (create a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey,
intersection, flatMap, union, join, cogroup, cross, mapValues

Actions (return results to the driver program):
collect, first, reduce, take, count, takeOrdered,
takeSample, countByKey, save, lookupKey, foreach
Spark Operations
[Diagram: a DAG of RDDs (nodes A–F) connected by transformation arrows.]
DAGs track dependencies (also known as Lineage)
 nodes are RDDs
 arrows are Transformations
Transformation
 Narrow (e.g. map): each output partition depends on a single input partition
 Wide (e.g. groupByKey): records with the same key are shuffled together,
e.g. (A,1) and (A,2) become (A,[1,2]) (see the sketch below)
Narrow Vs. Wide Transformation
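A sketch contrasting the two, assuming a live `sc`: map is narrow and stays within partitions, while groupByKey is wide and must shuffle records with the same key together.

pairs = sc.parallelize([("A", 1), ("A", 2), ("B", 3)])

# Narrow: each output partition depends on exactly one input partition
mapped = pairs.map(lambda kv: (kv[0], kv[1] * 10))

# Wide: a shuffle gathers all values of a key into one partition,
# e.g. ("A", 1) and ("A", 2) -> ("A", [1, 2])
grouped = pairs.groupByKey().mapValues(list)
print(sorted(grouped.collect()))   # [('A', [1, 2]), ('B', [3])]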
Spark Workflow
[Diagram: the Spark Context in the Driver Program drives a workflow of
flatMap → map → groupByKey transformations, finishing with a collect action.]
Narrow Vs. Wide Transformation
 What is an action?
 The final stage of the workflow
 Triggers the execution of the DAG
 Returns the results to the driver
 Or writes the data to HDFS or to a file
Actions
 Word count
text_file = sc.textFile("hdfs://usr/godil/text/book.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://usr/godil/output/wordCount.txt")
Examples from http://spark.apache.org/
RDD API Examples
Logistic Regression
# Assumes a SQLContext `sqlContext` and a `data` collection of
# (label, features) records are already available.
from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])
# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()
 RDD Persistence
 RDD.persist()
 Storage level:
 MEMORY_ONLY, MEMORY_AND_DISK,
MEMORY_ONLY_SER, DISK_ONLY, …
 RDD Removal
 RDD.unpersist()
RDD Persistence and Removal
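In PySpark the storage levels live in `pyspark.StorageLevel`; a short sketch, with a placeholder path and a live `sc` assumed:

from pyspark import StorageLevel

rdd = sc.textFile("hdfs://.../big.txt")       # placeholder path

rdd.persist(StorageLevel.MEMORY_AND_DISK)     # cache() is shorthand for MEMORY_ONLY
rdd.count()                                   # first action materializes the cache

rdd.unpersist()                               # free the memory/disk when done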
Broadcast Variables and Accumulators
(Shared Variables)
 Broadcast variables allow the programmer to keep a read-only
variable cached on each node, rather than sending a copy of it
with each task

>>> broadcastV1 = sc.broadcast([1, 2, 3, 4, 5, 6])
>>> broadcastV1.value
[1, 2, 3, 4, 5, 6]

 Accumulators are variables that are only “added” to through an
associative operation, and so can be efficiently supported in
parallel

accum = sc.accumulator(0)
accum.add(x)   # x is whatever value a task contributes
accum.value
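A worked sketch using both kinds of shared variable together; the lookup table and validation rule are made up for illustration:

lookup = sc.broadcast({"a": 1, "b": 2})   # read-only table, shipped once per node
missing = sc.accumulator(0)               # counter that tasks add to

# Count records that are absent from the broadcast table
sc.parallelize(["a", "b", "x", "y"]).foreach(
    lambda rec: missing.add(1) if rec not in lookup.value else None)

print(missing.value)   # 2 -- only the driver can read an accumulator's value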
Regression, Classification, Collaborative Filtering, &
Clustering.
Regression
Logistic Regression is a popular supervised machine learning
algorithm which can be used to predict a categorical response. It can
be used to solve classification-type machine learning
problems.
Classification involves looking at data and assigning a class (or a
label) to it. Usually there is more than one class; when there are two
classes (0 or 1), the task is known as Binary Classification.
Here, the Logistic Regression algorithm is used to build a machine
learning model with Apache Spark.
The Logistic Regression model builds a Binary Classifier to predict a
student's exam pass/fail result based on past exam scores, as sketched
below.
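A hedged sketch of such a pass/fail classifier in Spark ML; the exam scores below are invented for illustration, and a SparkSession `spark` is assumed:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# Hypothetical training data: label 1.0 = pass, 0.0 = fail; features = past scores
train = spark.createDataFrame([
    (0.0, Vectors.dense([34.0, 43.0])),
    (0.0, Vectors.dense([30.0, 51.0])),
    (1.0, Vectors.dense([75.0, 85.0])),
    (1.0, Vectors.dense([60.0, 86.0])),
], ["label", "features"])

lr = LogisticRegression(maxIter=10)
model = lr.fit(train)
model.transform(train).select("features", "label", "prediction").show()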
π = Proportion of “Success”
In ordinary regression the model predicts the mean Y for any
combination of predictors.
What’s the “mean” of a 0/1 indicator variable?

$\bar{y} = \frac{\sum y_i}{n} = \frac{\#\text{ of 1's}}{\#\text{ of trials}} = \text{proportion of “success”}$

For example, if 3 of 4 students pass (y = 1, 1, 0, 1), then $\bar{y} = 3/4 = 0.75$,
the observed proportion of success.

Goal of logistic regression: predict the “true” proportion of success, π, at
any value of the predictor.
Regression
Binary Logistic Regression Model
Y = binary response; X = quantitative predictor
π = proportion of 1’s (yes, success) at any X

Equivalent forms of the logistic regression model:

Logit form: $\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X$

Probability form: $\pi = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$

What does this look like?
N.B.: this is the natural log (aka “ln”)
Regression
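To check that the two forms agree, a tiny numeric sketch; the coefficients are illustrative, not fitted values (b0, b1 play the role of β0, β1):

import numpy as np

b0, b1 = -4.0, 0.1          # illustrative coefficients, not fitted values

def prob(x):
    # Probability form: pi = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))
    z = b0 + b1 * x
    return np.exp(z) / (1.0 + np.exp(z))

p = prob(50.0)              # b0 + b1*50 = 1.0, so p = e/(1+e), about 0.731
print(p)
print(np.log(p / (1 - p)))  # logit form recovers b0 + b1*x = 1.0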
[Plot: the Logit Function $y = \frac{\exp(b_0 + b_1 x)}{1 + \exp(b_0 + b_1 x)}$,
an S-shaped curve rising from 0 toward 1 over x ∈ [−10, 12].]
Regression
Collaborative Filtering, & Clustering.
THANK YOU
Image Source
searchenterpriseai.techtarget.com
wikipedia