SlideShare uma empresa Scribd logo
1 de 31
‹#›© Cloudera, Inc. All rights reserved.
Introduction to
Apache Spark
& Spark MlLib
Juliet Hougland
‹#›© Cloudera, Inc. All rights reserved.
What are we doing here?
‹#›© Cloudera, Inc. All rights reserved.
Spark Overview
‹#›© Cloudera, Inc. All rights reserved.
Spark: Easy and Fast Big Data
• Easy to Develop
• Rich APIs in Java,
Scala, Python
• Interactive shell
• Fast to Run
• General execution
graphs
• In-memory storage
‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved.
Apache Spark: Ecosystem
•Dataframes
•ML Lib
•Streaming
‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved.
Spark in CDH
YARN
Spark
Streaming
GraphX MLlib
HDFS, HBase
HivePig
MapReduce2
SQL
SearchImpala
Spark
Pyspark
‹#›© Cloudera, Inc. All rights reserved.
Spark Execution Model
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved.
What do you mean, lazy evaluation?
‹#›© Cloudera, Inc. All rights reserved.
• map
• flatmap
• filter
• distinct
• sample
• union
• intersection
• subtract
• cartesian
Transformations
‹#›© Cloudera, Inc. All rights reserved.
• collect()
• count()
• take(num)
• takeOrdered(num)(ordering)
• reduce(function)
• aggregate(zeroValue)(seqOp,
combOp)
• foreach(function)
Actions
‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved.
RDDs
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved.
RDDs
…RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved.
RDDs
…RDD …RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved.
RDDs
…RDD …RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
…RDD
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved.
…RDD …RDD
RDDs
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
…RDD
Partition 1
Partition 2
Partition 3
Partition 4
Count
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
‹#›© Cloudera, Inc. All rights reserved.
Complex, In-Memory Processing
join
filter
groupBy
B: B:
C: D: E:
F:
Ç√Ω
map
A:
map
take
‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved.
Predicting Churn:
Machine Learning with
Spark MlLib
‹#›© Cloudera, Inc. All rights reserved.
Modeling Churn
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False.
OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False.
NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False.
OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False.
OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False
The Dataset
‹#›© Cloudera, Inc. All rights reserved.
Create Dataframe
‹#›© Cloudera, Inc. All rights reserved.
Why DataFrames?
‹#›© Cloudera, Inc. All rights reserved.
Modeling Lifecycle
‹#›© Cloudera, Inc. All rights reserved.
Specify Feature Extraction
‹#›© Cloudera, Inc. All rights reserved.
Specify Feature Extraction
‹#›© Cloudera, Inc. All rights reserved.
Evaluating Classifiers: ROC
‹#›© Cloudera, Inc. All rights reserved.
Model Evaluation
‹#›© Cloudera, Inc. All rights reserved.
Evaluating Classifiers
‹#›© Cloudera, Inc. All rights reserved.
Model Evaluation
‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved.
Find the entire example at
github.com/jhlch/ds-for-telco
‹#›© Cloudera, Inc. All rights reserved.
Thank You
Juliet Hougland
@j_houg
github.com/jhlch/ds-for-telco

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloud
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certification
 
Hadoop for the Data Scientist: Spark in Cloudera 5.5
Hadoop for the Data Scientist: Spark in Cloudera 5.5Hadoop for the Data Scientist: Spark in Cloudera 5.5
Hadoop for the Data Scientist: Spark in Cloudera 5.5
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
dplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Datadplyr Interfaces to Large-Scale Data
dplyr Interfaces to Large-Scale Data
 
How to use Impala query plan and profile to fix performance issues
How to use Impala query plan and profile to fix performance issuesHow to use Impala query plan and profile to fix performance issues
How to use Impala query plan and profile to fix performance issues
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
 
Concur Discovers the True Value of Data
Concur Discovers the True Value of DataConcur Discovers the True Value of Data
Concur Discovers the True Value of Data
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!Ignite Your Big Data With a Spark!
Ignite Your Big Data With a Spark!
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud? Where to Deploy Hadoop: Bare Metal or Cloud?
Where to Deploy Hadoop: Bare Metal or Cloud?
 
Facial recognition
Facial recognitionFacial recognition
Facial recognition
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 

Semelhante a Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Senior Data Scientist, Cloudera

Semelhante a Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Senior Data Scientist, Cloudera (20)

PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Juliet Hougland, Data Scientist, Cloudera at MLconf NYC
Juliet Hougland, Data Scientist, Cloudera at MLconf NYCJuliet Hougland, Data Scientist, Cloudera at MLconf NYC
Juliet Hougland, Data Scientist, Cloudera at MLconf NYC
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Spark etl
Spark etlSpark etl
Spark etl
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
 
5 Apache Spark Tips in 5 Minutes
5 Apache Spark Tips in 5 Minutes5 Apache Spark Tips in 5 Minutes
5 Apache Spark Tips in 5 Minutes
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingWhy Your Apache Spark Job is Failing
Why Your Apache Spark Job is Failing
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
Simplifying Hadoop: A Secure and Unified Data Access Path for Computer Framew...
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark Operations
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Apache spark online training - GoLogica
Apache spark online training - GoLogicaApache spark online training - GoLogica
Apache spark online training - GoLogica
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
 

Mais de Cloudera, Inc.

Mais de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Último

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 

Último (20)

Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 

Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Senior Data Scientist, Cloudera

  • 1. ‹#›© Cloudera, Inc. All rights reserved. Introduction to Apache Spark & Spark MlLib Juliet Hougland
  • 2. ‹#›© Cloudera, Inc. All rights reserved. What are we doing here?
  • 3. ‹#›© Cloudera, Inc. All rights reserved. Spark Overview
  • 4. ‹#›© Cloudera, Inc. All rights reserved. Spark: Easy and Fast Big Data • Easy to Develop • Rich APIs in Java, Scala, Python • Interactive shell • Fast to Run • General execution graphs • In-memory storage
  • 5. ‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved. Apache Spark: Ecosystem •Dataframes •ML Lib •Streaming
  • 6. ‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved. Spark in CDH YARN Spark Streaming GraphX MLlib HDFS, HBase HivePig MapReduce2 SQL SearchImpala Spark Pyspark
  • 7. ‹#›© Cloudera, Inc. All rights reserved. Spark Execution Model sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count()
  • 8. ‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved. What do you mean, lazy evaluation?
  • 9. ‹#›© Cloudera, Inc. All rights reserved. • map • flatmap • filter • distinct • sample • union • intersection • subtract • cartesian Transformations
  • 10. ‹#›© Cloudera, Inc. All rights reserved. • collect() • count() • take(num) • takeOrdered(num)(ordering) • reduce(function) • aggregate(zeroValue)(seqOp, combOp) • foreach(function) Actions
  • 11. ‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved. RDDs sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count() HDFS Partition 1 Partition 2 Partition 3 Partition 4 Thanks: Kostas Sakellis
  • 12. ‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved. RDDs …RDD HDFS Partition 1 Partition 2 Partition 3 Partition 4 Thanks: Kostas Sakellis sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count()
  • 13. ‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved. RDDs …RDD …RDD HDFS Partition 1 Partition 2 Partition 3 Partition 4 Partition 1 Partition 2 Partition 3 Partition 4 Thanks: Kostas Sakellis sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count()
  • 14. ‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved. RDDs …RDD …RDD HDFS Partition 1 Partition 2 Partition 3 Partition 4 Partition 1 Partition 2 Partition 3 Partition 4 …RDD Partition 1 Partition 2 Partition 3 Partition 4 Thanks: Kostas Sakellis sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count()
  • 15. ‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved. …RDD …RDD RDDs HDFS Partition 1 Partition 2 Partition 3 Partition 4 Partition 1 Partition 2 Partition 3 Partition 4 …RDD Partition 1 Partition 2 Partition 3 Partition 4 Count Thanks: Kostas Sakellis sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count()
  • 16. ‹#›© Cloudera, Inc. All rights reserved. Complex, In-Memory Processing join filter groupBy B: B: C: D: E: F: Ç√Ω map A: map take
  • 17. ‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved. Predicting Churn: Machine Learning with Spark MlLib
  • 18. ‹#›© Cloudera, Inc. All rights reserved. Modeling Churn
  • 19. ‹#›© Cloudera, Inc. All rights reserved.
  • 20. ‹#›© Cloudera, Inc. All rights reserved. KS, 128, 415, 382-4657, no, yes, 25, 265.1, 110, 45.07, 197.4, 99, 16.78, 244.7, 91, 11.01, 10, 3, 2.7, 1, False. OH, 107, 415, 371-7191, no, yes, 26, 161.6, 123, 27.47, 195.5, 103, 16.62, 254.4, 103, 11.45, 13.7, 3, 3.7, 1, False. NJ, 137, 415, 358-1921, no, no, 0, 243.4, 114, 41.38, 121.2, 110, 10.3, 162.6, 104, 7.32, 12.2, 5, 3.29, 0, False. OH, 84, 408, 375-9999, yes, no, 0, 299.4, 71, 50.9, 61.9, 88, 5.26, 196.9, 89, 8.86, 6.6, 7, 1.78, 2, False. OK, 75, 415, 330-6626, yes, no, 0, 166.7, 113, 28.34, 148.3, 122, 12.61, 186.9, 121, 8.41, 10.1, 3, 2.73, 3, False The Dataset
  • 21. ‹#›© Cloudera, Inc. All rights reserved. Create Dataframe
  • 22. ‹#›© Cloudera, Inc. All rights reserved. Why DataFrames?
  • 23. ‹#›© Cloudera, Inc. All rights reserved. Modeling Lifecycle
  • 24. ‹#›© Cloudera, Inc. All rights reserved. Specify Feature Extraction
  • 25. ‹#›© Cloudera, Inc. All rights reserved. Specify Feature Extraction
  • 26. ‹#›© Cloudera, Inc. All rights reserved. Evaluating Classifiers: ROC
  • 27. ‹#›© Cloudera, Inc. All rights reserved. Model Evaluation
  • 28. ‹#›© Cloudera, Inc. All rights reserved. Evaluating Classifiers
  • 29. ‹#›© Cloudera, Inc. All rights reserved. Model Evaluation
  • 30. ‹#›© Cloudera, Inc. All rights reserved. ‹#›© Cloudera, Inc. All rights reserved. Find the entire example at github.com/jhlch/ds-for-telco
  • 31. ‹#›© Cloudera, Inc. All rights reserved. Thank You Juliet Hougland @j_houg github.com/jhlch/ds-for-telco

Notas do Editor

  1. Each of these records is a person