R and Spark: How to Analyze Data Using RStudio's Sparklyr and H2O's Rsparkling Packages: Spark Summit East talk by Nathan Stephens

ANALYZE DATA USING RSTUDIO'S SPARKLYR
R AND SPARK
https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast

Apache Spark
• Huge investments in big data and Hadoop
• Data scientists wanting to analyze data at scale
• Rapid progress and adoption in Spark libraries
R and RStudio
• Wide range of tools and packages
• Powerful ways to share insights
• Interactive notebooks
• Great visualizations
What we hear from our customers

Best of both worlds
If you are investing in Spark,
then there is nothing
stopping you from using it
with the full power of R
Using R with Spark

Benefits of Spark for the R user
Apache Spark…
• Can integrate with Hadoop
• Supports familiar SQL syntax
• Has built-in machine learning
• Is designed for performance
• Great for interactive data analysis
R users can take advantage
of all these investments

New! Open-source
R package from RStudio
• Integrated with the RStudio IDE
• Sparklyr is a dplyr back-end for Spark
• Extensible foundation for Spark
applications and R
sparklyr
http://spark.rstudio.com/

Create your own R
packages with
interfaces to Spark
• Interfaces to custom
machine learning pipelines
• Interfaces to 3rd party
Spark packages
• Many other R interfaces
sparklyr extensions
Example
Count the number of lines in a file
Extension
library(sparklyr)
# Call Spark's textFile() and count() through the sparklyr invoke API
count_lines <- function(sc, file) {
  spark_context(sc) %>%
    invoke("textFile", file, 1L) %>%
    invoke("count")
}
Call
sc <- spark_connect(master = "local")
count_lines(sc, "hdfs://path/data.csv")

R for data science toolchain
“You’ll learn how to get your data into R
[with Spark], get it into the most useful
structure, transform it, visualize it and
model it [with Spark].”

Import
Create a connection
sc <- spark_connect(master = "local")
Import data from file/S3/HDFS/R
spark_read_csv(sc, "table", "hdfs://<path>")
sdf_copy_to(sc, table, "table")
nyct2010_tbl <- tbl(sc, "table")
Write data
spark_write_parquet(nyct2010_tbl, "hdfs://<path>")
Sparklyr
Connect to Spark.
Read and write data in
CSV, JSON, and Parquet
formats.
Data can be stored in
HDFS, S3, or on the
local filesystem.
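The import and write steps above can be combined into one round trip. The following is a minimal sketch, not the speaker's code: the helper name `round_trip_parquet` and its arguments are hypothetical, and running it assumes a working Spark installation and a writable `path`.

```r
library(sparklyr)
library(dplyr)

# Hypothetical helper: copy a local data frame to Spark, write it out as
# Parquet, then read the Parquet files back as a new Spark table.
round_trip_parquet <- function(sc, df, name, path) {
  # Copy the R data frame into Spark under the given table name
  spark_tbl <- sdf_copy_to(sc, df, name, overwrite = TRUE)
  # Write the Spark table to `path` (HDFS, S3, or local) in Parquet format
  spark_write_parquet(spark_tbl, path, mode = "overwrite")
  # Read the Parquet data back and register it as "<name>_parquet"
  spark_read_parquet(sc, paste0(name, "_parquet"), path)
}

# Not run here, because it needs a Spark installation:
# sc <- spark_connect(master = "local")
# iris_back <- round_trip_parquet(sc, iris, "iris", tempfile("iris_parquet"))
# spark_disconnect(sc)
```

Wrapping the steps in a function keeps the connection `sc` explicit, so the same code can point at a local Spark or a cluster.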

Wrangle
dplyr
my_tbl %>%
filter(Petal_Width < 0.3) %>%
select(Petal_Length, Petal_Width)
Spark SQL
select Petal_Length, Petal_Width
from mytable
where Petal_Width < 0.3
Use dplyr to write
Spark SQL
A fast, consistent tool
for working with data
frame-like objects,
both in memory and
out of memory.
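To see the dplyr-to-SQL translation without a cluster, the sketch below uses the dbplyr package, which sparklyr builds on. It is an illustration under assumptions: `lazy_frame()` simulates a generic database rather than a real Spark connection, so the exact quoting and dialect may differ from what Spark SQL receives.

```r
library(dplyr)
library(dbplyr)

# lazy_frame() creates a table that is never executed; it only records the
# dplyr verbs so they can be rendered as SQL. The column values are
# placeholders, and the simulated connection is an assumption standing in
# for a sparklyr connection.
my_tbl <- lazy_frame(Petal_Length = 1.4, Petal_Width = 0.2)

query <- my_tbl %>%
  filter(Petal_Width < 0.3) %>%
  select(Petal_Length, Petal_Width)

# Render the pipeline as a SQL string and print it
sql_text <- as.character(sql_render(query))
cat(sql_text, "\n")
```

With a real Spark table, `show_query()` on the same pipeline prints the SQL that is sent to Spark.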

Visualize
ggplot2
collect(mpg_tbl) %>%
ggplot() +
aes(displ, hwy, color = class) +
geom_point()
Use ggplot2 to
visualize data
collected from Spark
A plotting system for R
that makes it easy to
produce complex multi-
layered graphics.
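Collecting a whole table only works when it fits in R's memory, so a common pattern is to summarise first and collect the small result. The sketch below is an assumption-laden illustration: `mean_hwy_by_class` is a hypothetical helper, and the local `ggplot2::mpg` data frame stands in for `mpg_tbl` so the code runs without Spark.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: aggregate at the source, then collect. `tbl` may be
# a Spark table from sparklyr or a local data frame; only one row per class
# is brought into R.
mean_hwy_by_class <- function(tbl) {
  tbl %>%
    group_by(class) %>%
    summarise(mean_hwy = mean(hwy, na.rm = TRUE)) %>%
    collect()
}

# Local stand-in for mpg_tbl (ggplot2 ships the mpg data set)
summary_df <- mean_hwy_by_class(ggplot2::mpg)

# Plot the collected summary with ggplot2
p <- ggplot(summary_df, aes(x = class, y = mean_hwy)) +
  geom_col()
```

Against a Spark table the grouping and averaging run inside Spark, and `collect()` transfers only the summary rows.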

Model
Models
K-means
Linear regression
Logistic regression
Survival regression
Generalized linear regression
Decision trees
Random forests
Gradient boosted trees
Principal component analysis
Naive Bayes
Multilayer perceptron
Latent Dirichlet allocation
One vs rest
Industry Specific
Chemometrics
ClinicalTrials
Econometrics
Environmetrics
Finance
Genetics
Pharmacokinetics
Phylogenetics
Psychometrics
Social Sciences
Models
GLMNet
Bayesian regression
Multinomial regression
Random Forest
Gradient boosted machine
Decision trees
Multi-Layer Perceptron
Auto-encoder
Restricted Boltzmann
K-Means
LSH
SVD
ALS
ARIMA
Forecasting
Collaborative filtering
Solvers and optimization
General Topics
Machine Learning
Bayesian
Cluster
Design of experiments
ExtremeValue
Meta Analysis
Multivariate
NLP
Robust methods
Spatial
Survival
Time Series
Graphical models
No data movement required.
Native ML algorithms.
Fast growing ecosystem.
Over 10,000 packages.
Time tested, industry specific models.
Integrated with other R packages.
Scalable, high performance models.
Wide variety of algorithms.
Useful diagnostics.
MLlib
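As an illustration of modeling with MLlib from R, the sketch below fits a linear regression through sparklyr. It is not taken from the talk: the helper `fit_mpg_model`, the `mtcars` data set, and the chosen formula are assumptions made for the example, and running it needs a Spark installation.

```r
library(sparklyr)
library(dplyr)

# Hypothetical helper: copy mtcars into Spark and fit an MLlib linear
# regression there, so the training data stays in Spark.
fit_mpg_model <- function(sc) {
  mtcars_tbl <- sdf_copy_to(sc, mtcars, "mtcars", overwrite = TRUE)
  # Model miles per gallon from weight and cylinder count (example formula)
  ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
}

# Not run here, because it needs a Spark installation:
# sc <- spark_connect(master = "local")
# fit <- fit_mpg_model(sc)
# summary(fit)
# spark_disconnect(sc)
```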

Communicate
R Markdown Notebooks
Make decisions
Take actions
See results
Weave together text
and code to produce
high quality documents,
apps, and plots.
Share

Demo
Analyzing 1 billion records with Spark and R
http://colorado.rstudio.com:3939/content/262/

rsparkling extension
Spark is extensible…
sparklyr is extensible
https://github.com/h2oai/rsparkling/blob/master/R/h2o_context.R#L53
[Diagram: Spark, R, and H2O, linked by sparklyr, h2o, Sparkling Water, and rsparkling]

Where should I model my data?
MLlib
Benefits: No data movement required. Native ML algorithms. Fast growing ecosystem.
Limitations: Comparatively fewer algorithms and fewer diagnostics.
Others…
Benefits: Scalable, high performance models. Wide variety of algorithms. Useful diagnostics.
Limitations: Data conversion requires 3-4X memory. Added complexity around introducing and learning another tool.
R
Benefits: Access to CRAN packages, visualization, reporting tools, and time tested algorithms.
Limitations: Data collection is expensive and collection size is limited (< 10 GB).

spark.rstudio.com
