ANALYZE DATA USING RSTUDIO'S SPARKLYR
R AND SPARK
https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast
Apache Spark
• Huge investments in big data and Hadoop
• Data scientists wanting to analyze data at scale
• Rapid progress and adoption in Spark libraries
R and RStudio
• Wide range of tools and packages
• Powerful ways to share insights
• Interactive notebooks
• Great visualizations
What we hear from our customers
Best of both worlds
If you are investing in Spark,
then there is nothing
stopping you from using it
with the full power of R
Using R with Spark
Benefits of Spark for the R user
Apache Spark…
• Can integrate with Hadoop
• Supports familiar SQL syntax
• Has built-in machine learning
• Is designed for performance
• Is great for interactive data analysis
R users can take advantage
of all these investments
New! Open-source
R package from RStudio
• Integrated with the RStudio IDE
• Sparklyr is a dplyr back-end for Spark
• Extensible foundation for Spark
applications and R
sparklyr
http://spark.rstudio.com/
Create your own R
packages with
interfaces to Spark
• Interfaces to custom
machine learning pipelines
• Interfaces to 3rd party
Spark packages
• Many other R interfaces
sparklyr extensions
Example
Count the number of lines in a file
Extension
library(sparklyr)
count_lines <- function(sc, file) {
  spark_context(sc) %>%
    invoke("textFile", file, 1L) %>%
    invoke("count")
}
Call
sc <- spark_connect(master = "local")
count_lines(sc, "hdfs://path/data.csv")
R for data science toolchain
“You’ll learn how to get your data into R
[with Spark], get it into the most useful
structure, transform it, visualize it and
model it [with Spark].”
Import
Create a connection
sc <- spark_connect()
Import data from file/S3/HDFS/R
spark_read_csv(sc, "table", "hdfs://<path>")
sdf_copy_to(sc, table, "table")
nyct2010_tbl <- tbl(sc, "table")
Write data
spark_write_parquet(table, "hdfs://<path>")
Sparklyr
Connect to Spark.
Read and write data in
CSV, JSON, and Parquet
formats.
Data can be stored in
HDFS, S3, or on the
local filesystem.
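The import and write steps above can be sketched as one round trip. This is a minimal sketch under assumptions, not the presenter's code: `import_roundtrip`, `out_dir`, and the table names `iris` and `iris_back` are hypothetical, `sc` is assumed to be an open sparklyr connection, and the sparklyr calls are namespaced so the function can be defined before the package is attached.

```r
# Sketch only: `sc` is an open sparklyr connection and `out_dir` is a
# hypothetical writable location (local path, HDFS, or S3) chosen by the caller.
import_roundtrip <- function(sc, out_dir) {
  # Copy a local R data frame into Spark and register it as table "iris".
  iris_tbl <- sparklyr::sdf_copy_to(sc, iris, "iris", overwrite = TRUE)

  # Write the Spark table to Parquet, then read it back as a second table.
  parquet_path <- file.path(out_dir, "iris_parquet")
  sparklyr::spark_write_parquet(iris_tbl, parquet_path, mode = "overwrite")
  iris_back <- sparklyr::spark_read_parquet(sc, "iris_back", parquet_path)

  # Row counts are computed in Spark; only two numbers return to R.
  c(copied = sparklyr::sdf_nrow(iris_tbl),
    read_back = sparklyr::sdf_nrow(iris_back))
}
```

With a local Spark installation, it could be called as `import_roundtrip(spark_connect(master = "local"), tempdir())`; the two counts should match if the round trip preserved every row.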
Wrangle
dplyr
my_tbl %>%
  filter(Petal_Width < 0.3) %>%
  select(Petal_Length, Petal_Width)
Spark SQL
select Petal_Length, Petal_Width
from mytable
where Petal_Width < 0.3
Use dplyr to write
Spark SQL
A fast, consistent tool
for working with data
frame-like objects,
both in memory and
out of memory.
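The dplyr-to-SQL translation on this slide can be inspected without a cluster. The sketch below assumes the dbplyr package is installed and uses its `lazy_frame()` stub, so the SQL comes from dbplyr's generic translator rather than from a live Spark connection; quoting and the table name will differ from what sparklyr sends to Spark SQL.

```r
library(dplyr)
library(dbplyr)

# Assumption: a table stub with no database behind it, holding the two
# columns used on the slide. The values are placeholders.
my_tbl <- lazy_frame(Petal_Length = 1.4, Petal_Width = 0.2)

# The same pipeline as on the slide. Nothing is computed yet.
query <- my_tbl %>%
  filter(Petal_Width < 0.3) %>%
  select(Petal_Length, Petal_Width)

# Render the SQL that dplyr would hand to the backend.
sql_text <- as.character(sql_render(query))
cat(sql_text, "\n")
```

Against a real connection, calling `show_query()` on the same pipeline built from a Spark table should print the statement that is actually sent to Spark.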
Visualize
ggplot2
collect(mpg_tbl) %>%
  ggplot() +
  aes(displ, hwy, color = class) +
  geom_point()
Use ggplot2 to
visualize data
collected from Spark
A plotting system for R
that makes it easy to
produce complex multi-
layered graphics.
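Because `collect()` pulls rows into R's memory, it can help to keep only the plotted columns before collecting. The sketch below is one way to write that. `plot_mpg` is a hypothetical helper, and the local `ggplot2::mpg` data frame stands in for a Spark table so the example runs without a cluster (`collect()` leaves a local data frame unchanged).

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: accepts a Spark table or a local data frame that has
# displ, hwy, and class columns.
plot_mpg <- function(mpg_tbl) {
  mpg_tbl %>%
    select(displ, hwy, class) %>%  # keep only the plotted columns
    collect() %>%                  # bring the remaining data into R
    ggplot(aes(displ, hwy, color = class)) +
    geom_point()
}

# Local stand-in for a Spark table.
p <- plot_mpg(ggplot2::mpg)
```

With a Spark connection `sc`, the argument would instead be something like `copy_to(sc, ggplot2::mpg, "mpg")`.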
Model
Models
K-means
Linear regression
Logistic regression
Survival regression
Generalized linear regression
Decision trees
Random forests
Gradient boosted trees
Principal component analysis
Naive Bayes
Multilayer perceptron
Latent Dirichlet allocation
One vs rest
Industry Specific
Chemometrics
ClinicalTrials
Econometrics
Environmetrics
Finance
Genetics
Pharmacokinetics
Phylogenetics
Psychometrics
Social Sciences
Models
GLMNet
Bayesian regression
Multinomial regression
Random Forest
Gradient boosted machine
Decision trees
Multi-Layer Perceptron
Auto-encoder
Restricted Boltzmann
K-Means
LSH
SVD
ALS
ARIMA
Forecasting
Collaborative filtering
Solvers and optimization
General Topics
Machine Learning
Bayesian
Cluster
Design of experiments
ExtremeValue
Meta Analysis
Multivariate
NLP
Robust methods
Spatial
Survival
Time Series
Graphical models
No data movement required.
Native ML algorithms.
Fast growing ecosystem.
Over 10,000 packages.
Time tested, industry specific models.
Integrated with other R packages
Scalable, high performance models.
Wide variety of algorithms.
Useful diagnostics.
MLlib
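As a hedged illustration of "no data movement required", the sketch below fits one of the listed MLlib models (linear regression) through sparklyr and scores the same table inside Spark. `fit_mpg_model` is a hypothetical helper, `sc` is assumed to be an open sparklyr connection, and the formula interface and `ml_predict()` are assumed to be available in the installed sparklyr version.

```r
# Sketch only: fit and score an MLlib linear regression on the mtcars data.
fit_mpg_model <- function(sc) {
  # Copy the local data frame into Spark as table "mtcars".
  mtcars_tbl <- sparklyr::sdf_copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

  # The model is fitted by Spark MLlib; the data stays in Spark.
  fit <- sparklyr::ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)

  # Scoring also runs in Spark; only the scored rows are collected into R.
  preds <- sparklyr::ml_predict(fit, mtcars_tbl)
  dplyr::collect(preds)
}
```

With a local Spark installation, `fit_mpg_model(spark_connect(master = "local"))` should return the mtcars rows with the model's predictions added.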
Communicate
R Markdown Notebooks
Make decisions
Take actions
See results
Weave together text
and code to produce
high quality documents,
apps, and plots.
Share
Demo
Analyzing 1 billion records with Spark and R
http://colorado.rstudio.com:3939/content/262/
https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast
rsparkling extension
Spark is extensible…
sparklyr is extensible
https://github.com/h2oai/rsparkling/blob/master/R/h2o_context.R#L53
Spark
R H2O
rsparkling
sparklyr
h2o
sparkling
water
Where should I model my data?
MLlib
  Benefits: No data movement required. Native ML algorithms. Fast growing ecosystem.
  Limitations: Comparatively fewer algorithms and fewer diagnostics.
H2O
  Benefits: Scalable, high performance models. Wide variety of algorithms. Useful diagnostics.
  Limitations: Data conversion requires 3-4X memory. Added complexity around introducing and learning another tool.
R
  Benefits: Access to CRAN packages, visualization, reporting tools, and time tested algorithms.
  Limitations: Data collection is expensive and collection size is limited (< 10 GB).
Others…
What’s new with sparklyr?
spark.rstudio.com
https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast

R and Spark: How to Analyze Data Using RStudio's Sparklyr and H2O's Rsparkling Packages: Spark Summit East talk by Nathan Stephens
