SlideShare uma empresa Scribd logo
1 de 26
Baixar para ler offline
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Apache Spark and R
A (big data) love story?
Mark Sellors - Technical Architect @ Mango Solutions
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
About me.
• Technical Architect
• Design and deploy analytic computing
environments
• Not really an R user but have broad knowledge of
the analytic computing ecosystem
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Overview
• The rise of big data
• Barriers to big data
• Big Data vs R
• Spark
• Spark and Hadoop
• SparkR
• Is it a love story?
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
The rise of ‘big data’
• Storage prices
• Commodity
• compute
infrastructure
• Volume of data
• Hadoop ties the
two together
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Barriers to ‘big data’
• Hadoop is complex ecosystem
• Primary programming paradigm,
Map/Reduce jobs, largely written in Java
• Map/Reduce unsuited to exploratory,
interactive analysis
• Map/Reduce is slow
• RHadoop is built on top of Map/Reduce
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Some problems with ‘big data’
• Many hadoop deployments do not
achieve an appreciable ROI.
• Hard to find the staff with crossover skills
• infrastructure
• analysts
• Existing business processes not fit for
purpose
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Hadoop
• Until recently limited to batch based
operations
• MASSIVE data sets
• easy to add storage/compute capacity
But…
• Map/Reduce operations can be quite slow
• Hard to find/deploy appropriate talent
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
R
• Interactive
• fast
• great for exploratory or batch
But…
• Single threaded
• Limited by available memory
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
The value of your
data is in what you
do with it.
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
What is it?
• Open source cluster computing
framework
• Relies heavily on in memory processing
• One of the most contributed-to big data
projects of the past year
• Started in the AMPLab at UC Berkeley in
2009
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
What problem does it solve
• In memory makes for very fast data
processing
• minimal disk IO
• High level programming abstraction
reduces the amount of code
• In turn makes it more suitable for
exploratory work.
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
How does it do it?
• Provides a core programming abstraction
called RDD
• The RDD API has been extended to
include DataFrames
• Can deploy ad-hoc processing clusters as
well as integrate with HDFS,
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
“Will Spark replace Hadoop?”
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Hadoop is an
Ecosystem!
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Spark and Hadoop
• Very Complimentary.
• Spark already comes with all the major
Hadoop distributions
• easier to use and faster than map/reduce
• suitable for exploratory work, which
previously was difficult in hadoop
deployments
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
How does this fit with R
• Originally supported languages Scala,
Java and Python
• SparkR was a separate project
• Integrated into Spark as of v1.4
• Support is still evolving - v1.5 released
last week
• MASSIVE data frames
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
SparkR Features
• Designed to be familiar
• Massive DataFrames
• SQL operations on those DataFrames
• Fitting of GLM’s
• Works on top of Hadoop or as a stand
alone cluster
• Load data from a variety of sources
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Spark SQL
• Arbitrary SQL operations on massive in-
memory data frames
• Treats the data frame as though it were a
database table
• Useful for exploring your data set
• Also great for creating subsets
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
# Create the DataFrame
df <- createDataFrame(sqlContext, iris)
# Fit a linear model over the dataset.
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family =
"gaussian")
# Model coefficients are returned in a similar format to R's native glm().
summary(model)
##$coefficients
## Estimate
##(Intercept) 2.2513930
##Sepal_Width 0.8035609
##Species_versicolor 1.4587432
##Species_virginica 1.9468169
# Make predictions based on the model.
predictions <- predict(model, newData = df)
head(select(predictions, "Sepal_Length", "prediction"))
## Sepal_Length prediction
##1 5.1 5.063856
##2 4.9 4.662076
##3 4.7 4.822788
##4 4.6 4.742432
##5 5.0 5.144212
##6 5.4 5.385281 Source: http://spark.apache.org
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Lowering the barrier to adoption
• Hadoop can be tricky to
get started with.
• Spark can run locally on
your laptop
• Can build ad-hoc
processing clusters
• Supports pulling data
from a variety of sources
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
What’s in it for me?
• Currently supports:
• DataFrames
• SparkSQL
• limited subset of MLlib
• Is missing any native R support for:
• Spark Streaming
• GraphX
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
Is it a love story?
• It wasn’t originally, but things are heating
up
• Data not getting any smaller
• Dramatically lowers the barrier to entry
• Evolving rapidly
Mark Sellors - Technical Architect @ Mango Solutions
msellors@mango-solutions.com
@MangoTheCat / @sellorm

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Optimizing Your Cloud Applications in RightScale
Optimizing Your Cloud Applications in RightScaleOptimizing Your Cloud Applications in RightScale
Optimizing Your Cloud Applications in RightScale
 
Journey to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big DataJourney to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big Data
 
IT HealthCheck
IT HealthCheckIT HealthCheck
IT HealthCheck
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from Pivotal
 
AWS Outage Analysis
AWS Outage AnalysisAWS Outage Analysis
AWS Outage Analysis
 
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud StrategyMulti-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy
 
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
Amazon Redshift, Customer Acquisition Cost & Advertising ROI presented with A...
 
Correlate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a JediCorrelate Log Data with Business Metrics Like a Jedi
Correlate Log Data with Business Metrics Like a Jedi
 
Cloud and Analytics - From Platforms to an Ecosystem
Cloud and Analytics - From Platforms to an EcosystemCloud and Analytics - From Platforms to an Ecosystem
Cloud and Analytics - From Platforms to an Ecosystem
 
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACLPerformance Optimization of Cloud Based Applications by Peter Smith, ACL
Performance Optimization of Cloud Based Applications by Peter Smith, ACL
 
Deploying Big Data Platforms
Deploying Big Data PlatformsDeploying Big Data Platforms
Deploying Big Data Platforms
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
 
Disrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging TechnologiesDisrupting Risk Management through Emerging Technologies
Disrupting Risk Management through Emerging Technologies
 
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
Best Practices for Data Center Migration Planning - August 2016 Monthly Webin...
 
Process big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public CloudProcess big data within an hour, with the OVH Public Cloud
Process big data within an hour, with the OVH Public Cloud
 
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
How McGraw-Hill Transformed ITOps to Mitigate Digital Risks
 
Cloud Wars: Performance Benchmarking AWS, GCP and Azure
Cloud Wars: Performance Benchmarking AWS, GCP and Azure Cloud Wars: Performance Benchmarking AWS, GCP and Azure
Cloud Wars: Performance Benchmarking AWS, GCP and Azure
 
Informatica + Hadoop = Best of Both Worlds
Informatica + Hadoop = Best of Both WorldsInformatica + Hadoop = Best of Both Worlds
Informatica + Hadoop = Best of Both Worlds
 
Lambda Architecture 2.0 for Reactive AB Testing
Lambda Architecture 2.0 for Reactive AB TestingLambda Architecture 2.0 for Reactive AB Testing
Lambda Architecture 2.0 for Reactive AB Testing
 
Complex Data Transformations Made Easy
Complex Data Transformations Made EasyComplex Data Transformations Made Easy
Complex Data Transformations Made Easy
 

Semelhante a Apache Spark and R: A (Big Data) Love Story?

Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Precisely
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabases
Adi Challa
 

Semelhante a Apache Spark and R: A (Big Data) Love Story? (20)

Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
 
Alex mang patterns for scalability in microsoft azure application
Alex mang   patterns for scalability in microsoft azure applicationAlex mang   patterns for scalability in microsoft azure application
Alex mang patterns for scalability in microsoft azure application
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
 
Data Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudDataData Engineering Course Syllabus - WeCloudData
Data Engineering Course Syllabus - WeCloudData
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
How to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database WorldHow to Survive as a Data Architect in a Polyglot Database World
How to Survive as a Data Architect in a Polyglot Database World
 
Big Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data ModelingBig Challenges in Data Modeling: NoSQL and Data Modeling
Big Challenges in Data Modeling: NoSQL and Data Modeling
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabases
 
Managing Large Amounts of Data with Salesforce
Managing Large Amounts of Data with SalesforceManaging Large Amounts of Data with Salesforce
Managing Large Amounts of Data with Salesforce
 
Are Data Lakes the new Core DWHs?
Are Data Lakes the new Core DWHs?Are Data Lakes the new Core DWHs?
Are Data Lakes the new Core DWHs?
 
Evolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital eraEvolution of Distributed Database Technologies in the Digital era
Evolution of Distributed Database Technologies in the Digital era
 

Último

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Último (20)

VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 

Apache Spark and R: A (Big Data) Love Story?

  • 1. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Apache Spark and R A (big data) love story? Mark Sellors - Technical Architect @ Mango Solutions
  • 2. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com About me. • Technical Architect • Design and deploy analytic computing environments • Not really an R user but have broad knowledge of the analytic computing ecosystem
  • 3. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com
  • 4. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Overview • The rise of big data • Barriers to big data • Big Data vs R • Spark • Spark and Hadoop • SparkR • Is it a love story?
  • 5. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com The rise of ‘big data’ • Storage prices • Commodity • compute infrastructure • Volume of data • Hadoop ties the two together
  • 6. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Barriers to ‘big data’ • Hadoop is complex ecosystem • Primary programming paradigm, Map/Reduce jobs, largely written in Java • Map/Reduce unsuited to exploratory, interactive analysis • Map/Reduce is slow • RHadoop is built on top of Map/Reduce
  • 7. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Some problems with ‘big data’ • Many hadoop deployments do not achieve an appreciable ROI. • Hard to find the staff with crossover skills • infrastructure • analysts • Existing business processes not fit for purpose
  • 8. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com
  • 9. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Hadoop • Until recently limited to batch based operations • MASSIVE data sets • easy to add storage/compute capacity But… • Map/Reduce operations can be quite slow • Hard to find/deploy appropriate talent
  • 10. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com R • Interactive • fast • great for exploratory or batch But… • Single threaded • Limited by available memory
  • 11. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com The value of your data is in what you do with it.
  • 12. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com
  • 13. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com What is it? • Open source cluster computing framework • Relies heavily on in memory processing • One of the most contributed-to big data projects of the past year • Started in the AMPLab at UC Berkeley in 2009
  • 14. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com What problem does it solve • In memory makes for very fast data processing • minimal disk IO • High level programming abstraction reduces the amount of code • In turn makes it more suitable for exploratory work.
  • 15. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com How does it do it? • Provides a core programming abstraction called RDD • The RDD API has been extended to include DataFrames • Can deploy ad-hoc processing clusters as well as integrate with HDFS,
  • 16. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com “Will Spark replace Hadoop?”
  • 17. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Hadoop is an Ecosystem!
  • 18. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Spark and Hadoop • Very Complimentary. • Spark already comes with all the major Hadoop distributions • easier to use and faster than map/reduce • suitable for exploratory work, which previously was difficult in hadoop deployments
  • 19. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com How does this fit with R • Originally supported languages Scala, Java and Python • SparkR was a separate project • Integrated into Spark as of v1.4 • Support is still evolving - v1.5 released last week • MASSIVE data frames
  • 20. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com SparkR Features • Designed to be familiar • Massive DataFrames • SQL operations on those DataFrames • Fitting of GLM’s • Works on top of Hadoop or as a stand alone cluster • Load data from a variety of sources
  • 21. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Spark SQL • Arbitrary SQL operations on massive in- memory data frames • Treats the data frame as though it were a database table • Useful for exploring your data set • Also great for creating subsets
  • 22. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com # Create the DataFrame df <- createDataFrame(sqlContext, iris) # Fit a linear model over the dataset. model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian") # Model coefficients are returned in a similar format to R's native glm(). summary(model) ##$coefficients ## Estimate ##(Intercept) 2.2513930 ##Sepal_Width 0.8035609 ##Species_versicolor 1.4587432 ##Species_virginica 1.9468169 # Make predictions based on the model. predictions <- predict(model, newData = df) head(select(predictions, "Sepal_Length", "prediction")) ## Sepal_Length prediction ##1 5.1 5.063856 ##2 4.9 4.662076 ##3 4.7 4.822788 ##4 4.6 4.742432 ##5 5.0 5.144212 ##6 5.4 5.385281 Source: http://spark.apache.org
  • 23. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Lowering the barrier to adoption • Hadoop can be tricky to get started with. • Spark can run locally on your laptop • Can build ad-hoc processing clusters • Supports pulling data from a variety of sources
  • 24. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com What’s in it for me? • Currently supports: • DataFrames • SparkSQL • limited subset of MLlib • Is missing any native R support for: • Spark Streaming • GraphX
  • 25. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com Is it a love story? • It wasn’t originally, but things are heating up • Data not getting any smaller • Dramatically lowers the barrier to entry • Evolving rapidly
  • 26. Mark Sellors - Technical Architect @ Mango Solutions msellors@mango-solutions.com @MangoTheCat / @sellorm