Scalable Data Science with SparkR

R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? How could a Data Scientist leverage the 10,000+ packages on CRAN and integrate Spark into their existing Data Science toolset?

SparkR is a new language binding for Apache Spark, designed to be familiar to native R users. In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable scalable machine learning on Big Data. In addition to covering the R interface to the ML Pipeline model, we will explore how SparkR supports running user code on large-scale data in a distributed manner, and give examples of how that can be used with your favorite R packages. We will also discuss best practices around using this feature, and look at exciting changes in recent and upcoming Apache Spark 2.x releases.

  1. SCALABLE DATA SCIENCE WITH SPARKR. Felix Cheung, Principal Engineer & Apache Spark Committer
  2. Disclaimer: Apache Spark community contributions
  3. Agenda • Spark + R, Architecture • Features • SparkR for Data Science • Ecosystem • What’s coming
  4. Spark in 5 seconds • General-purpose cluster computing system • Spark SQL + DataFrame/Dataset + data sources • Streaming/Structured Streaming • ML • GraphX
  5. R • A programming language for statistical computing and graphics • S – 1975 • S4 - advanced object-oriented features • R – 1993 = S + lexical scoping • Interpreted • Matrix arithmetic • Comprehensive R Archive Network (CRAN) – 10k+ packages
  6. SparkR • R language APIs for Spark and Spark SQL • Exposes Spark functionality in an R-friendly DataFrame API • Runs as its own REPL sparkR • or as an R package loaded in IDEs like RStudio: library(SparkR) sparkR.session()
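A minimal sketch of getting started with SparkR from an interactive R session, assuming a local Spark 2.x installation; the app name, data set, and filter condition below are only illustrative:

    library(SparkR)

    # start (or connect to) a Spark session from R
    sparkR.session(appName = "sparkr-quickstart")

    # turn a local R data.frame into a distributed SparkDataFrame
    df <- createDataFrame(faithful)

    # run a simple distributed query and pull a few rows back into R
    head(filter(df, df$waiting > 70))

    sparkR.session.stop()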
  7. Architecture • Native R classes and methods • RBackend • Scala “helper” methods (ML pipeline etc.) www.slideshare.net/SparkSummit/07-venkataraman-sun
  8. Key Advantage • JVM processing, full access to DAG capabilities and Catalyst optimizer, predicate pushdown, code generation, etc. databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
  9. Features - What’s new in SparkR • SQL & Data Source (JSON, csv, JDBC, libsvm) • SparkSession & default session • Catalog (database & table management) • Spark packages, spark.addFile() • install.spark() • ML • R-native UDF • Structured Streaming • Cluster support (YARN, Mesos, standalone)
  10. SparkR for Data Science
  11. Decisions, decisions? Distributed? No: native R. Yes: Spark.ml, or a native-R UDF
  12. Spark ML Pipeline on a DataFrame: Transformer → Transformer → Estimator (feature engineering, then modeling)
  13. Spark ML Pipeline • Pre-processing, feature extraction, model fitting, validation stages • Transformer • Estimator • Cross-validation/hyperparameter tuning
  14. SparkR API for ML Pipeline: spark.lda(data = text, k = 20, maxIter = 25, optimizer = "em"). The R call crosses into the JVM, where the LDA API builds an ML Pipeline of RegexTokenizer, StopWordsRemover, CountVectorizer and LDA
  15. Model Operations • summary - print a summary of the fitted model • predict - make predictions on new data • write.ml/read.ml - save/load fitted models (slight layout difference: pipeline model plus R metadata)
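A hedged end-to-end sketch of these model operations in SparkR 2.x; the data set, the choice of spark.glm, and the temporary save path are illustrative:

    irisDF <- createDataFrame(iris)   # column names become Sepal_Length, Species, ...
    model <- spark.glm(irisDF, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")

    summary(model)                    # coefficients, deviance, etc.

    pred <- predict(model, irisDF)    # SparkDataFrame with a "prediction" column
    head(select(pred, "Sepal_Length", "prediction"))

    modelPath <- file.path(tempdir(), "spark-glm-model")
    write.ml(model, modelPath)        # saved as a pipeline model plus R metadata
    model2 <- read.ml(modelPath)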
  16. Spark.ml in SparkR 2.0.0 • Generalized Linear Model (GLM) • Naive Bayes Model • k-means Clustering • Accelerated Failure Time (AFT) Survival Model
  17. Spark.ml in SparkR 2.1.0 • Generalized Linear Model (GLM) • Naive Bayes Model • k-means Clustering • Accelerated Failure Time (AFT) Survival Model • Isotonic Regression Model • Gaussian Mixture Model (GMM) • Latent Dirichlet Allocation (LDA) • Alternating Least Squares (ALS) • Multilayer Perceptron Model (MLP) • Kolmogorov-Smirnov Test (K-S test) • Multiclass Logistic Regression • Random Forest • Gradient Boosted Tree (GBT)
  18. RFormula • Specify modeling in symbolic form: y ~ f0 + f1 means response y is modeled linearly by f0 and f1 • Supports a subset of R formula operators: ~ , . , : , + , - • Implemented as a feature transformer in core Spark, available to Scala/Java and Python • String label column is indexed • String term columns are one-hot encoded
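A small illustration of these formula operators with a SparkR fitting function; the mtcars-based data and the particular formulas are made-up examples:

    carsDF <- createDataFrame(mtcars)

    # response modeled by two terms
    m1 <- spark.glm(carsDF, mpg ~ wt + hp, family = "gaussian")

    # '.' means all other columns, '-' removes a term, ':' adds an interaction
    m2 <- spark.glm(carsDF, mpg ~ . - carb, family = "gaussian")
    m3 <- spark.glm(carsDF, mpg ~ wt + hp + wt:hp, family = "gaussian")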
  19. Generalized Linear Model # R-like glm(Sepal_Length ~ Sepal_Width + Species, gaussianDF, family = "gaussian") spark.glm(binomialDF, Species ~ Sepal_Length + Sepal_Width, family = "binomial") • with family = "binomial", the string label and the predicted label are output
  20. Naive Bayes spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age) • index label, predicted label to string
  21. k-means spark.kmeans(kmeansDF, ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width, k = 3)
  22. Accelerated Failure Time (AFT) Survival Model spark.survreg(aftDF, Surv(futime, fustat) ~ ecog_ps + rx) • formula rewrite for censor
  23. Isotonic Regression spark.isoreg(df, label ~ feature, isotonic = FALSE)
  24. Gaussian Mixture Model spark.gaussianMixture(df, ~ V1 + V2, k = 2)
  25. Alternating Least Squares spark.als(df, "rating", "user", "item", rank = 20, reg = 0.1, maxIter = 10, nonnegative = TRUE)
  26. Multilayer Perceptron Model spark.mlp(df, label ~ features, blockSize = 128, layers = c(4, 5, 4, 3), solver = "l-bfgs", maxIter = 100, tol = 0.5, stepSize = 1)
  27. Multiclass Logistic Regression spark.logit(df, label ~ ., regParam = 0.3, elasticNetParam = 0.8, family = "multinomial", thresholds = c(0, 1, 1)) • binary or multiclass
  28. Random Forest spark.randomForest(df, Employed ~ ., type = "regression", maxDepth = 5, maxBins = 16) spark.randomForest(df, Species ~ Petal_Length + Petal_Width, "classification", numTrees = 30) • “classification” index label, predicted label to string
  29. Gradient Boosted Tree spark.gbt(df, Employed ~ ., type = "regression", maxDepth = 5, maxBins = 16) spark.gbt(df, IndexedSpecies ~ ., type = "classification", stepSize = 0.1) • “classification” index label, predicted label to string • Binary classification
  30. Modeling Parameters spark.randomForest function(data, formula, type = c("regression", "classification"), maxDepth = 5, maxBins = 32, numTrees = 20, impurity = NULL, featureSubsetStrategy = "auto", seed = NULL, subsamplingRate = 1.0, minInstancesPerNode = 1, minInfoGain = 0.0, checkpointInterval = 10, maxMemoryInMB = 256, cacheNodeIds = FALSE)
  31. Spark.ml Challenges • Limited API sets • Keeping up with changes - almost all • Non-trivial to map the spark.ml API to an R API • Simple API, but fixed ML pipeline • Debugging is hard • Not an ML-specific problem - getting better?
  32. Native-R UDF • User-Defined Functions - custom transformation • Apply by Partition • Apply by Group • data.frame → UDF → data.frame
  33. Parallel Processing By Partition [diagram: each partition of the SparkDataFrame is handed to its own R process, which runs the UDF on a data.frame and returns a data.frame]
  34. UDF: Apply by Partition • Similar to R apply • Function to process each partition of a DataFrame • Mapping of Spark/R data types dapply(carsSubDF, function(x) { x <- cbind(x, x$mpg * 1.61) }, schema)
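The schema argument referenced in the call above is not shown on the slide; a hedged sketch of a complete dapply call, where the output schema and the added column name are illustrative:

    carsSubDF <- select(createDataFrame(mtcars), "mpg", "cyl")

    # dapply needs the schema of the data.frame returned by the UDF
    schema <- structType(structField("mpg", "double"),
                         structField("cyl", "double"),
                         structField("kmpg", "double"))

    out <- dapply(carsSubDF, function(x) {
      cbind(x, kmpg = x$mpg * 1.61)   # runs inside an R process on each partition
    }, schema)

    head(collect(out))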
  35. UDF: Apply by Partition + Collect • No schema out <- dapplyCollect( carsSubDF, function(x) { x <- cbind(x, "kmpg" = x$mpg*1.61) })
  36. Example - UDF results <- dapplyCollect(train, function(x) { model <- randomForest::randomForest(as.factor(dep_delayed_15min) ~ Distance + night + early, data = x, importance = TRUE, ntree = 20) predictions <- predict(model, t) data.frame(UniqueCarrier = t$UniqueCarrier, delayed = predictions) }) • closure capture - “t” is serialized & broadcast • the package is accessed via “randomForest::” at each invocation
  37. UDF: Apply by Group • By grouping columns gapply(carsDF, "cyl", function(key, x) { y <- data.frame(key, max(x$mpg)) }, schema)
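As with dapply, the schema for the gapply call above is not spelled out on the slide; a hedged sketch with illustrative column names:

    carsDF <- createDataFrame(mtcars)

    # schema of the per-group result: the grouping key plus the aggregate
    schema <- structType(structField("cyl", "double"),
                         structField("max_mpg", "double"))

    result <- gapply(carsDF, "cyl", function(key, x) {
      data.frame(key, max(x$mpg))     # one row per group
    }, schema)

    head(collect(result))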
  38. UDF: Apply by Group + Collect • No Schema out <- gapplyCollect(carsDF, "cyl", function(key, x) { y <- data.frame(key, max(x$mpg)) names(y) <- c("cyl", "max_mpg") y })
  39. UDF: data type mapping (not a complete list) • R byte → Spark byte • integer → integer • float → float • double, numeric → double • character, string → string • binary, raw → binary • logical → boolean • POSIXct, POSIXlt → timestamp • Date → date • array, list → array • env → map
  40. UDF Challenges • “struct” - No support for nested structures as columns • Scaling up / data skew • What if partition or group too big for single R process? • Not enough data variety to run model? • Performance costs • Serialization/deserialization, data transfer • esp. beware of closure capture • Package management
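A hedged illustration of the closure-capture cost called out above; the lookup table and its column names are hypothetical, and kept tiny here only so the snippet stays runnable:

    carsSubDF <- select(createDataFrame(mtcars), "mpg", "cyl")

    # a local R object (imagine it is large)
    lookup <- data.frame(cyl = c(4, 6, 8), label = c("small", "mid", "large"))

    out <- dapplyCollect(carsSubDF, function(x) {
      # 'lookup' is referenced inside the UDF, so it is captured by the closure,
      # serialized and shipped to every executor along with the function
      merge(x, lookup, by = "cyl")
    })

    # cheaper pattern for big lookups: make it a SparkDataFrame and join in Spark
    lookupDF <- createDataFrame(lookup)
    joined <- join(carsSubDF, lookupDF, carsSubDF$cyl == lookupDF$cyl)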
  41. UDF: lapply • Like R lapply or doParallel • Good for “embarrassingly parallel” tasks • Such as hyperparameter tuning
  42. UDF: lapply • Take a native R list, distribute it • Run the UDF in parallel • vector/list → element-wise UDF → list (each element can be *anything*)
  43. UDF: parallel distributed processing • Output is a list - needs to fit in memory at the driver costs <- exp(seq(from = log(1), to = log(1000), length.out = 5)) train <- function(cost) { model <- e1071::svm(Species ~ ., iris, cost = cost) summary(model) } summaries <- spark.lapply(costs, train)
  44. Demo at felixcheung.github.io
  45. SparkR as a Package (target: ASAP) • Goal: simple one-line installation of SparkR from CRAN install.packages("SparkR") • Spark jar downloaded from the official release and cached automatically, or manually with install.spark(), since Spark 2 • R vignettes • Community can write packages that depend on the SparkR package, e.g. SparkRext • Advanced Spark JVM interop APIs: sparkR.newJObject, sparkR.callJMethod, sparkR.callJStatic
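A minimal sketch of the JVM interop calls listed above, assuming a Spark 2.1+ session; the Java classes used are only illustrative:

    sparkR.session()

    # call a static method on a JVM class
    now <- sparkR.callJStatic("java.lang.System", "currentTimeMillis")

    # construct a JVM object and call instance methods on it
    arr <- sparkR.newJObject("java.util.ArrayList")
    sparkR.callJMethod(arr, "add", 1L)
    sparkR.callJMethod(arr, "size")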
  46. Ecosystem • RStudio sparklyr • RevoScaleR/RxSpark, R Server • H2O R • Apache SystemML (R-like API) • STC R4ML • Renjin (not Spark) • IBM BigInsights Big R (not Spark!)
  47. Recap: SparkR 2.0.0, 2.1.0, 2.1.1 • SparkSession • ML • GLM, LDA • UDF • numPartitions, coalesce • sparkR.uiWebUrl • install.spark
  48. What’s coming in SparkR 2.2.0 • More, richer ML • Bisecting K-means, Linear SVM, GLM - Tweedie, FP-Growth, Decision Tree (2.3.0) • Column functions • to_date/to_timestamp format, approxQuantile columns, from_json/to_json • Structured Streaming • DataFrame - checkpoint, hint • Catalog API - createTable, listDatabases, refreshTable, …
  49. SS in 1 line library(magrittr) kbsrvs <- "kafka-0.broker.kafka.svc.cluster.local:9092" topic <- "test1" read.stream("kafka", kafka.bootstrap.servers = kbsrvs, subscribe = topic) %>% selectExpr("explode(split(CAST(value AS STRING), ' ')) as word") %>% group_by("word") %>% count() %>% write.stream("console", outputMode = "complete")
  50. SS-ML-UDF See my other talk…
  51. Thank You. https://github.com/felixcheung linkedin: http://linkd.in/1OeZDb7
