SlideShare uma empresa Scribd logo
1 de 25
Mahout Scala and Spark Bindings:
Bringing algebraic semantics
Dmitriy Lyubimov
2014
Requirements for an ideal ML Environment
Wanted:
1. Clear R (Matlab)-like semantics and type system that covers
 Linear Algebra, Stats and Data Frames
2. Modern programming language qualities
 Functional programming
 Object Oriented programming
 Sensible byte code Performance
 A Big Plus: Scripting and Interactive shell
3. Distributed scalability
with a sensible performance
4. Collection of off-the-shelf building blocks
and algorithms
5. Visualization
Mahout Scala & Spark Bindings aim to address (1-a), (2), (3),
(4).
Scala & Spark Bindings are:
1. Scala as programming/scripting environment
2. R-like DSL :
val g = bt.t %*% bt - c - c.t +
(s_q cross s_q) * (xi dot xi)
3. Algebraic expression optimizer for distributed
Linear Algebra
 Provides a translation layer to distributed engines: Spark,
(…)
What is Scala and Spark Bindings? (2)
What are the data types?
1. Scalar real values (Double)
2. In-core vectors (2 types of sparse, 1 type of dense)
3. In-core matrices: sparse and dense
 A number of specialized matrices
4. Distributed Row Matrices (DRM)
 Compatible across Mahout MR and Spark solvers via
persistence format
Dual representation of in-memory DRM
// Run LSA
val (drmU, drmV, s) = dssvd(A)
 U inherits row keys of A
automatically
 Special meaning of integer row
keys for physical transpose
Automatic row key
tracking:
Features (1)
 Matrix, vector, scalar
operators: in-core,
out-of- core
 Slicing operators
 Assignments (in-core
only)
 Vector-specific
 Summaries
drmA %*% drmB
A %*% x
A.t %*% A
A * B
A(5 until 20, 3 until 40)
A(5, ::); A(5, 5)
x(a to b)
A(5, ::) := x
A *= B
A -=: B; 1 /=: x
x dot y; x cross y
A.nrow; x.length;
A.colSums; B.rowMeans
x.sum; A.norm …
Features (2) – decompositions
 In-core
 Out-of-core
val (inCoreQ, inCoreR) = qr(inCoreM)
val ch = chol(inCoreM)
val (inCoreV, d) = eigen(inCoreM)
val (inCoreU, inCoreV, s) = svd(inCoreM)
val (inCoreU, inCoreV, s) = ssvd(inCoreM,
k = 50, q = 1)
val (drmQ, inCoreR) = thinQR(drmA)
val (drmU, drmV, s) = dssvd(drmA, k = 50,
q = 1)
Features (3) – construction and collect
 Parallelizing
from an in-
core matrix
 Collecting to
an in-core
val inCoreA = dense(
(1, 2, 3, 4),
(2, 3, 4, 5),
(3, -4, 5, 6),
(4, 5, 6, 7),
(8, 6, 7, 8)
)
val A = drmParallelize(inCoreA,
numPartitions = 2)
val inCoreB = drmB.collect
Features (4) – HDFS persistence
 Load DRM
from HDFS
 Save DRM
to HDFS
val drmB = drmFromHDFS(path = inputPath)
drmA.writeDRM(path = uploadPath)
Delayed execution and actions
 Optimizer action
 Defines
optimization
granularity
 Guarantees the
result will be
formed in its
entirety
 Computational
action
 Actually triggers
Spark action
 Optimizer actions
are implicitly
triggered by
computation
// Example: A = B’U
// Logical DAG:
val drmA = drmB.t %*% drmU
// Physical DAG:
drmA.checkpoint()
drmA.writeDrm(path)
(drmB.t %*% drmU).writeDRM(path)
Common computational paths
Checkpoint caching (maps 1:1 to Spark)
 Checkpoint caching is a combination of None | in-
memory | disk | serialized | replicated options
 Method “checkpoint()” signature:
def checkpoint(sLevel: StorageLevel =
StorageLevel.MEMORY_ONLY): CheckpointedDrm[K]
 Unpin data when no longer needed
drmA.uncache()
Optimization factors
 Geometry (size) of operands
 Orientation of operands
 Whether identically partitioned
 Whether computational paths are shared
E. g.: Matrix multiplication:
 5 physical operators for drmA %*% drmB
 2 operators for drmA %*% inCoreA
 1 operator for drm A %*% x
 1 operator for x %*% drmA
Component Stack
Customization: vertical block operator
 Custom vertical block processing
 must produce blocks of the same height
// A * 5.0
drmA.mapBlock() {
case (keys, block) =>
block *= 5.0
keys -> block
}
Customization: Externalizing RDDs
 Externalizing raw RDD
 Triggers optimizer checkpoint implicitly
val rawRdd:DrmRDD[K] = drmA.rdd
 Wrapping raw RDD into a DRM
 Stitching with data prep pipelines
 Building complex distributed algorithm
val drmA = drmWrap(rdd = rddA [, … ])
Broadcasting an in-core matrix or vector
 We cannot wrap in-core vector or matrix in a closure:
they do not support Java serialization
 Use broadcast api
 Also may improve performance (e.g. set up Spark to
broadcast via Torrent broadcast)
// Example: Subtract vector xi from each row:
val bcastXi = drmBroadcast(xi)
drmA.mapBlock() {
case(keys, block) =>
for (row <- block) row -= bcastXi
keys -> block
}
Guinea Pigs – actionable lines of code
 Thin QR
 Stochastic Singular Value Decomposition
 Stochastic PCA (MAHOUT-817 re-flow)
 Co-occurrence analysis recommender (aka RSJ)
Actionable lines of code (-blanks -comments -CLI)
Thin QR (d)ssvd (d)spca
R prototype n/a 28 38
In-core Scala
bindings
n/a 29 50
DRM Spark
bindings
17 32 68
Mahout/Java/MR n/a ~2581 ~2581
dspca (tail)
… …
val c = s_q cross s_b
val inCoreBBt = (drmBt.t %*% drmBt)
.checkpoint(StorageLevel.NONE).collect -
c - c.t + (s_q cross s_q) * (xi dot xi)
val (inCoreUHat, d) = eigen(inCoreBBt)
val s = d.sqrt
val drmU = drmQ %*% inCoreUHat
val drmV = drmBt %*% (inCoreUHat %*%: diagv(1 /: s))
(drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k))
}
Interactive Shell & Scripting!
Pitfalls
 Side-effects are not like in R
 In-core: no copy-on-write semantics
 Distributed: Cache policies without serialization may
cause cached blocks experience side effects from
subsequent actions
 Use something like MEMORY_DISK_SER for cached parents of
pipelines with side effects
 Beware of naïve and verbatim translations of in-core
methods
Recap: Key Concepts
 High level Math, Algebraic and Data Frames logical semantic
constructs
 R-like (Matlab-like), easy to prototype, read, maintain, customize
 Operator-centric: same operator semantics regardless of
operand types
 Strategical notion: Portability of logical semantic constructs
 Write once, run anywhere
 Cost-based & Rewriting Optimizer
 Tactical notion: low cost POC, sensible in-memory
computation performance
 Spark
 Strong programming language environment (Scala)
 Scriptable & interactive shell (extra bonus)
 Compatibility with the rest of Mahout solvers via DRM
persistence
Similar work
 Breeze:
 Excellent math and linear algebra DSL
 In-core only
 MLLib
 A collection of ML on Spark
 tightly coupled to Spark
 not an environment
 MLI
 Tightly coupled to Spark
 SystemML
 Advanced cost-based optimization
 Tightly bound to a specific resource manager(?)
 + yet another language
 Julia (closest conceptually)
 + yet another language
 + yet another backend
Wanted and WIP
 Data Frames DSL API & physical layer(M-1490)
 E.g. For standardizing feature vectorization in Mahout
 E.g. For custom business rules scripting
 “Bring Your Own Distributed Method” (BYODM) –
build out ScalaBindings’ “write once – run
everywhere” collection of things
 Bindings for http://Stratosphere.eu
 Automatic parallelism adjustments
 Ability scale and balance problem to all available
resources automatically
 For more, see Spark Bindings home page
Links
 Scala and Spark Bindings
http://mahout.apache.org/users/sparkbindings/home.
html
 Stochastic Singular Value Decomposition
http://mahout.apache.org/users/dim-
reduction/ssvd.html
 Blog http://weatheringthrutechdays.blogspot.com
Thank you.

Mais conteúdo relacionado

Mais procurados

A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)
A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)
A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)Menlo Systems GmbH
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsParallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsRevolution Analytics
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbitjakehofman
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with SparkBarak Gitsis
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsJen Aman
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
Exploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal WabbitExploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal WabbitShiladitya Sen
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...MLconf
 
Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkDB Tsai
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CARobert Metzger
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondXiangrui Meng
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmMark Levy
 

Mais procurados (20)

A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)
A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)
A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui Meng
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsParallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear Models
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbit
 
Scaling out logistic regression with Spark
Scaling out logistic regression with SparkScaling out logistic regression with Spark
Scaling out logistic regression with Spark
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Exploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal WabbitExploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal Wabbit
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
 
Multinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache SparkMultinomial Logistic Regression with Apache Spark
Multinomial Logistic Regression with Apache Spark
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Algorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fmAlgorithms on Hadoop at Last.fm
Algorithms on Hadoop at Last.fm
 

Destaque

Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...Amazon Web Services
 
Distributed Machine Learning with Apache Mahout
Distributed Machine Learning with Apache MahoutDistributed Machine Learning with Apache Mahout
Distributed Machine Learning with Apache MahoutSuneel Marthi
 
Scala Programming Introduction
Scala Programming IntroductionScala Programming Introduction
Scala Programming IntroductionairisData
 
Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldDean Wampler
 
Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & ScalaEdureka!
 
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)Jee Vang, Ph.D.
 
Introduction to Functional Programming with Scala
Introduction to Functional Programming with ScalaIntroduction to Functional Programming with Scala
Introduction to Functional Programming with Scalapramode_ce
 

Destaque (11)

Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 
Apache mahout
Apache mahoutApache mahout
Apache mahout
 
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...
 
Distributed Machine Learning with Apache Mahout
Distributed Machine Learning with Apache MahoutDistributed Machine Learning with Apache Mahout
Distributed Machine Learning with Apache Mahout
 
Scala Programming Introduction
Scala Programming IntroductionScala Programming Introduction
Scala Programming Introduction
 
Mahout
MahoutMahout
Mahout
 
Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
 
Apache Spark & Scala
Apache Spark & ScalaApache Spark & Scala
Apache Spark & Scala
 
Why Scala?
Why Scala?Why Scala?
Why Scala?
 
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
 
Introduction to Functional Programming with Scala
Introduction to Functional Programming with ScalaIntroduction to Functional Programming with Scala
Introduction to Functional Programming with Scala
 

Semelhante a Mahout scala and spark bindings

Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelAndrey Lomakin
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingDatabricks
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Introduce spark (by 조창원)
Introduce spark (by 조창원)Introduce spark (by 조창원)
Introduce spark (by 조창원)I Goo Lee.
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and SharkYahooTechConference
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型wang xing
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Automatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSELAutomatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSELJoel Falcou
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Sparkfelixcss
 
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Mark Smith
 
Graph convolutional networks in apache spark
Graph convolutional networks in apache sparkGraph convolutional networks in apache spark
Graph convolutional networks in apache sparkEmiliano Martinez Sanchez
 

Semelhante a Mahout scala and spark bindings (20)

Scala Macros
Scala MacrosScala Macros
Scala Macros
 
Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data model
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Introduce spark (by 조창원)
Introduce spark (by 조창원)Introduce spark (by 조창원)
Introduce spark (by 조창원)
 
Software Security
Software SecuritySoftware Security
Software Security
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
Spark 计算模型
Spark 计算模型Spark 计算模型
Spark 计算模型
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Automatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSELAutomatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSEL
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Spark
 
Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016Tulsa techfest Spark Core Aug 5th 2016
Tulsa techfest Spark Core Aug 5th 2016
 
Graph convolutional networks in apache spark
Graph convolutional networks in apache sparkGraph convolutional networks in apache spark
Graph convolutional networks in apache spark
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 

Último

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Último (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Mahout scala and spark bindings

  • 1. Mahout Scala and Spark Bindings: Bringing algebraic semantics Dmitriy Lyubimov 2014
  • 2. Requirements for an ideal ML Environment Wanted: 1. Clear R (Matlab)-like semantics and type system that covers  Linear Algebra, Stats and Data Frames 2. Modern programming language qualities  Functional programming  Object Oriented programming  Sensible byte code Performance  A Big Plus: Scripting and Interactive shell 3. Distributed scalability with a sensible performance 4. Collection of off-the-shelf building blocks and algorithms 5. Visualization Mahout Scala & Spark Bindings aim to address (1-a), (2), (3), (4).
  • 3. Scala & Spark Bindings are: 1. Scala as programming/scripting environment 2. R-like DSL : val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi) 3. Algebraic expression optimizer for distributed Linear Algebra  Provides a translation layer to distributed engines: Spark, (…) What is Scala and Spark Bindings? (2)
  • 4. What are the data types? 1. Scalar real values (Double) 2. In-core vectors (2 types of sparse, 1 type of dense) 3. In-core matrices: sparse and dense  A number of specialized matrices 4. Distributed Row Matrices (DRM)  Compatible across Mahout MR and Spark solvers via persistence format
  • 5. Dual representation of in-memory DRM // Run LSA val (drmU, drmV, s) = dssvd(A)  U inherits row keys of A automatically  Special meaning of integer row keys for physical transpose Automatic row key tracking:
  • 6. Features (1)  Matrix, vector, scalar operators: in-core, out-of- core  Slicing operators  Assignments (in-core only)  Vector-specific  Summaries drmA %*% drmB A %*% x A.t %*% A A * B A(5 until 20, 3 until 40) A(5, ::); A(5, 5) x(a to b) A(5, ::) := x A *= B A -=: B; 1 /=: x x dot y; x cross y A.nrow; x.length; A.colSums; B.rowMeans x.sum; A.norm …
  • 7. Features (2) – decompositions  In-core  Out-of-core val (inCoreQ, inCoreR) = qr(inCoreM) val ch = chol(inCoreM) val (inCoreV, d) = eigen(inCoreM) val (inCoreU, inCoreV, s) = svd(inCoreM) val (inCoreU, inCoreV, s) = ssvd(inCoreM, k = 50, q = 1) val (drmQ, inCoreR) = thinQR(drmA) val (drmU, drmV, s) = dssvd(drmA, k = 50, q = 1)
  • 8. Features (3) – construction and collect  Parallelizing from an in- core matrix  Collecting to an in-core val inCoreA = dense( (1, 2, 3, 4), (2, 3, 4, 5), (3, -4, 5, 6), (4, 5, 6, 7), (8, 6, 7, 8) ) val A = drmParallelize(inCoreA, numPartitions = 2) val inCoreB = drmB.collect
  • 9. Features (4) – HDFS persistence  Load DRM from HDFS  Save DRM to HDFS val drmB = drmFromHDFS(path = inputPath) drmA.writeDRM(path = uploadPath)
  • 10. Delayed execution and actions  Optimizer action  Defines optimization granularity  Guarantees the result will be formed in its entirety  Computational action  Actually triggers Spark action  Optimizer actions are implicitly triggered by computation // Example: A = B’U // Logical DAG: val drmA = drmB.t %*% drmU // Physical DAG: drmA.checkpoint() drmA.writeDrm(path) (drmB.t %*% drmU).writeDRM(path)
  • 12. Checkpoint caching (maps 1:1 to Spark)  Checkpoint caching is a combination of None | in- memory | disk | serialized | replicated options  Method “checkpoint()” signature: def checkpoint(sLevel: StorageLevel = StorageLevel.MEMORY_ONLY): CheckpointedDrm[K]  Unpin data when no longer needed drmA.uncache()
  • 13. Optimization factors  Geometry (size) of operands  Orientation of operands  Whether identically partitioned  Whether computational paths are shared E. g.: Matrix multiplication:  5 physical operators for drmA %*% drmB  2 operators for drmA %*% inCoreA  1 operator for drm A %*% x  1 operator for x %*% drmA
  • 15. Customization: vertical block operator  Custom vertical block processing  must produce blocks of the same height // A * 5.0 drmA.mapBlock() { case (keys, block) => block *= 5.0 keys -> block }
  • 16. Customization: Externalizing RDDs  Externalizing raw RDD  Triggers optimizer checkpoint implicitly val rawRdd:DrmRDD[K] = drmA.rdd  Wrapping raw RDD into a DRM  Stitching with data prep pipelines  Building complex distributed algorithm val drmA = drmWrap(rdd = rddA [, … ])
  • 17. Broadcasting an in-core matrix or vector  We cannot wrap in-core vector or matrix in a closure: they do not support Java serialization  Use broadcast api  Also may improve performance (e.g. set up Spark to broadcast via Torrent broadcast) // Example: Subtract vector xi from each row: val bcastXi = drmBroadcast(xi) drmA.mapBlock() { case(keys, block) => for (row <- block) row -= bcastXi keys -> block }
  • 18. Guinea Pigs – actionable lines of code  Thin QR  Stochastic Singular Value Decomposition  Stochastic PCA (MAHOUT-817 re-flow)  Co-occurrence analysis recommender (aka RSJ) Actionable lines of code (-blanks -comments -CLI) Thin QR (d)ssvd (d)spca R prototype n/a 28 38 In-core Scala bindings n/a 29 50 DRM Spark bindings 17 32 68 Mahout/Java/MR n/a ~2581 ~2581
  • 19. dspca (tail) … … val c = s_q cross s_b val inCoreBBt = (drmBt.t %*% drmBt) .checkpoint(StorageLevel.NONE).collect - c - c.t + (s_q cross s_q) * (xi dot xi) val (inCoreUHat, d) = eigen(inCoreBBt) val s = d.sqrt val drmU = drmQ %*% inCoreUHat val drmV = drmBt %*% (inCoreUHat %*%: diagv(1 /: s)) (drmU(::, 0 until k), drmV(::, 0 until k), s(0 until k)) }
  • 20. Interactive Shell & Scripting!
  • 21. Pitfalls  Side-effects are not like in R  In-core: no copy-on-write semantics  Distributed: Cache policies without serialization may cause cached blocks experience side effects from subsequent actions  Use something like MEMORY_DISK_SER for cached parents of pipelines with side effects  Beware of naïve and verbatim translations of in-core methods
  • 22. Recap: Key Concepts  High level Math, Algebraic and Data Frames logical semantic constructs  R-like (Matlab-like), easy to prototype, read, maintain, customize  Operator-centric: same operator semantics regardless of operand types  Strategical notion: Portability of logical semantic constructs  Write once, run anywhere  Cost-based & Rewriting Optimizer  Tactical notion: low cost POC, sensible in-memory computation performance  Spark  Strong programming language environment (Scala)  Scriptable & interactive shell (extra bonus)  Compatibility with the rest of Mahout solvers via DRM persistence
  • 23. Similar work  Breeze:  Excellent math and linear algebra DSL  In-core only  MLLib  A collection of ML on Spark  tightly coupled to Spark  not an environment  MLI  Tightly coupled to Spark  SystemML  Advanced cost-based optimization  Tightly bound to a specific resource manager(?)  + yet another language  Julia (closest conceptually)  + yet another language  + yet another backend
  • 24. Wanted and WIP  Data Frames DSL API & physical layer(M-1490)  E.g. For standardizing feature vectorization in Mahout  E.g. For custom business rules scripting  “Bring Your Own Distributed Method” (BYODM) – build out ScalaBindings’ “write once – run everywhere” collection of things  Bindings for http://Stratosphere.eu  Automatic parallelism adjustments  Ability scale and balance problem to all available resources automatically  For more, see Spark Bindings home page
  • 25. Links  Scala and Spark Bindings http://mahout.apache.org/users/sparkbindings/home. html  Stochastic Singular Value Decomposition http://mahout.apache.org/users/dim- reduction/ssvd.html  Blog http://weatheringthrutechdays.blogspot.com Thank you.

Notas do Editor

  1. On the right side is the picture called “Statue of a miner”. I snapped it inKuntaHora, Czech Repulbic about 10 years ago. More specifically, it is a statute of a miner mining for sivler for a state mint. It was a heavy