This deck covers machine learning on large-scale datasets using Apache Spark. It gives an overview of Spark's machine learning library (MLlib), covering algorithms such as logistic regression, linear regression, collaborative filtering, and clustering, and it compares Spark with traditional Hadoop MapReduce, highlighting how in-memory caching makes iterative model training much faster.
1. Large-Scale Machine Learning with Apache Spark
DB Tsai
Machine Learning Engineering Lead @ AlpineDataLabs
Internet of Things Conference @ Moscone Center, SF
http://www.iotaconf.com/
October 20, 2014
2. The Path to Innovation
[Graphic: the evolution of analytics platforms — from traditional desktop and in-database methods, to web-based and collaborative tools, to simplified code-free interfaces on Hadoop & MPP databases, with ongoing innovation]
3. The Path to Innovation
• Iterative algorithms scan through the data each time.
• With Spark, data is cached in memory after the first iteration.
• Quasi-Newton methods enhance the in-memory benefits.
[Chart: 150M-row benchmark — 921s per pass from disk vs. 97s with in-memory caching]
4. Machine Learning in the Big Data Era
• Hadoop MapReduce solutions
• MapReduce scales well for batch processing
• But lots of machine learning algorithms are iterative by nature
• There are lots of tricks people use, like training on sub-samples of the data and then averaging the models. Why have big data if you're only approximating?
5. Lightning-fast cluster computing
• Empower users to iterate through the data by utilizing the in-memory cache.
• Logistic regression runs up to 100x faster than Hadoop M/R in memory.
• We're able to train exact models without doing any approximation.
6. Why MLlib?
• MLlib is a Spark subproject providing machine learning primitives.
• It's built on Apache Spark, a fast and general engine for large-scale data processing.
• Shipped with Apache Spark since version 0.8.
• High-quality engineering design and effort.
• More than 50 contributors since July 2014.
7. Algorithms supported in MLlib
• Classification: SVMs, logistic regression, decision trees, naïve Bayes, and random forests
• Regression: linear regression and random forests
• Collaborative filtering: alternating least squares (ALS)
• Clustering: k-means
• Dimensionality reduction: singular value decomposition (SVD) and principal component analysis (PCA)
• Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, and random data generation
• Feature extraction and transformation: TF-IDF, Word2Vec, StandardScaler, and Normalizer
8. MapReduce Review
• "MapReduce: Simplified Data Processing on Large Clusters," 2004
• Scales linearly
• Data locality
• Fault tolerance in data storage and computation
9. Hadoop MapReduce Review
• Mapper: Loads the data and emits a set of key-value pairs.
• Reducer: Collects the key-value pairs with the same key, processes them, and outputs the result.
• Combiner: Can reduce shuffle traffic by combining key-value pairs locally before they go to the reducer.
• In-Mapper Combiner: Aggregates results on the mapper side, using an LRU cache to avoid running out of heap space. http://alpinenow.com/blog/in-mapper-combiner/
• Good: Built-in fault tolerance, scalable, and production-proven in industry.
• Bad: Optimized for disk IO without leveraging memory well; iterative algorithms go through disk IO again and again; the primitive API is not easy or clean to develop against.
10. Spark MapReduce
• Spark also uses MapReduce as a programming model, but with much richer APIs in Scala, Java, and Python.
• With Scala's expressive APIs, 5-10x less code.
• Not just a distributed computation framework: Spark provides several pre-built components that help users implement applications faster and more easily.
- Spark Streaming
- Spark SQL
- MLlib (Machine Learning)
- GraphX (Graph Processing)
11. Resilient Distributed Datasets (RDDs)
• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• RDDs can be created by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, Hive, or any data source offering a Hadoop InputFormat.
• RDDs can be cached in memory or on disk.
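A minimal sketch of both creation paths described above; the app name and input path are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("rdd-demo"))

  // 1) Parallelize an existing collection in the driver program.
  val numbers = sc.parallelize(1 to 100)

  // 2) Reference a dataset in external storage (any Hadoop-supported URI).
  val lines = sc.textFile("hdfs:///tmp/input.txt")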
12. Hadoop M/R vs Spark M/R
[Side-by-side diagrams comparing the Hadoop and Spark MapReduce execution models; images not preserved]
13. RDD Operations - two types of operations
• Transformations: Create a new dataset from an existing one. They are lazy, in that they do not compute their results right away.
• Actions: Return a value to the driver program after running a computation on the dataset.
14. Transformations
• map(func) - Return a new distributed dataset formed by passing each element of the source through a function func.
• filter(func) - Return a new dataset formed by selecting those elements of the source on which func returns true.
• flatMap(func) - Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
• mapPartitions(func) - Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
• groupByKey([numTasks]) - When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
• reduceByKey(func, [numTasks]) - When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
http://spark.apache.org/docs/latest/programming-guide.html#transformations
15. Actions
• reduce(func) - Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
• collect() - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
• count(), first(), take(n), saveAsTextFile(path), etc.
http://spark.apache.org/docs/latest/programming-guide.html#actions
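A small sketch of how the two interact: transformations only build up the lineage, and the action triggers the actual distributed job.

  val nums = sc.parallelize(1 to 1000000)
  val evens = nums.filter(_ % 2 == 0) // transformation: nothing runs yet
  val doubled = evens.map(_ * 2)      // still lazy: only the lineage grows
  val total = doubled.reduce(_ + _)   // action: the cluster job runs here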
16. RDD Persistence/Cache
• An RDD can be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant - if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
• A persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, or to persist it in memory but as serialized Java objects (to save space).
17. RDD Storage Level
• MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
• MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
• MEMORY_ONLY_SER - Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
• MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
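A minimal sketch of picking a storage level explicitly; cache() is shorthand for persist(MEMORY_ONLY), and the input path here is a placeholder:

  import org.apache.spark.storage.StorageLevel

  val logs = sc.textFile("hdfs:///tmp/logs")
  val errors = logs.filter(_.contains("ERROR"))

  errors.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized; spills to disk
  errors.count() // first action: computes and caches the partitions
  errors.count() // later actions read from the cache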
18. Word Count Example in Scala
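The code slide itself isn't preserved; a minimal sketch of the classic Spark word count, with placeholder input/output paths:

  val textFile = sc.textFile("hdfs:///tmp/input.txt")

  val counts = textFile
    .flatMap(line => line.split(" ")) // transformation: lines -> words
    .map(word => (word, 1))           // transformation: word -> (word, 1)
    .reduceByKey(_ + _)               // transformation: sum counts per key

  counts.saveAsTextFile("hdfs:///tmp/word-counts") // action: runs the job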
22. API design philosophy in MLlib
• Works seamlessly with Spark Core and Spark SQL; users can use the core APIs or Spark SQL for data pre-processing, and then pipe into the training step.
• Algorithms are implemented in Scala. Public interfaces don't use advanced Scala features, to ensure Java compatibility.
• Many MLlib APIs have Python bindings.
• MLlib is under active development. APIs marked Experimental/DeveloperApi may change in future releases, and a migration guide will be provided if they change.
• APIs are well documented and designed to be expressive.
• Code is well tested, with comprehensive unit-test coverage. There are lots of comments in the code, and it's an enjoyable experience to read.
23. Data Types
• MLlib local vectors and local matrices currently wrap the Breeze implementation; as a result, the underlying linear algebra operations are provided by Breeze and jblas. https://github.com/scalanlp/breeze
• However, the methods converting MLlib vectors/matrices to Breeze and back are private to the org.apache.spark.mllib scope. This restriction can be worked around by placing your custom code in an org.apache.spark.mllib.something package.
• A training sample used in supervised learning is stored in a LabeledPoint, which contains a label/response and a feature vector in dense or sparse format.
• Distributed RowMatrix - basically an RDD[Vector] that doesn't have meaningful row indices.
• Distributed IndexedRowMatrix - similar to RowMatrix, but each row is represented by its index and a local vector.
24. Local vector
The base class of local vectors is Vector, and we provide two implementations: DenseVector and SparseVector.
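The accompanying code slide isn't preserved; a minimal sketch of creating both kinds via the Vectors factory:

  import org.apache.spark.mllib.linalg.{Vector, Vectors}

  // Dense vector (1.0, 0.0, 3.0).
  val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)

  // The same vector in sparse form: size 3, non-zeros at indices 0 and 2.
  val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))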
25. Some useful tips related to local vector
• If you want to use native Breeze functionality, you can have your code in the org.apache.spark.mllib package.
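A hypothetical sketch of that trick; the package name, the object, and the private[mllib] toBreeze conversion it relies on are assumptions about the MLlib internals of this era:

  // Hypothetical: living under org.apache.spark.mllib makes the
  // private[mllib] Breeze converters visible to our code.
  package org.apache.spark.mllib.myext

  import org.apache.spark.mllib.linalg.Vector

  object BreezeBridge {
    // Squared L2 norm computed on the underlying Breeze vector.
    def squaredNorm(v: Vector): Double = {
      val bv = v.toBreeze // private[mllib] conversion (assumed name)
      bv.dot(bv)
    }
  }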
26. Real code in MLlib: MultivariateOnlineSummarizer [code slide not preserved; see the usage sketch under slide 49]
27. LabeledPoint
• A Double is used for storing the label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0.0 or 1.0. For N-class classification, labels should be class indices starting from zero: 0.0, 1.0, 2.0, ..., N - 1.
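A minimal sketch of constructing labeled points with dense and sparse features:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Positive example (label 1.0) with a dense feature vector.
  val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

  // Negative example (label 0.0) with the same features in sparse form.
  val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))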
28. Supervised Learning
• Binary Classification: linear SVMs (SGD), logistic regression (L-BFGS and SGD), decision trees, random forests (Spark 1.2), and naïve Bayes.
• Multiclass Classification: decision trees, naïve Bayes (coming soon - multinomial logistic regression in GLMNET).
• Regression: linear least squares (SGD), Lasso (SGD + soft-threshold), ridge regression (SGD), decision trees, and random forests (Spark 1.2).
• Currently, the regularization in linear models penalizes all the weights, including the intercept, which is not desired in some use-cases. Alpine has a GLMNET implementation using OWLQN which can exactly reproduce R's GLMNET package results at scale. We're in the process of merging it into MLlib.
31. SPARK-2934: LogisticRegressionWithLBFGS
• Merged in Spark 1.1
• Contributed by Alpine Data Labs
• Uses L-BFGS to train logistic regression instead of the default gradient descent.
• Users don't have to construct the objective function for logistic regression themselves, or implement the details.
• Together with SPARK-2979, which minimizes the condition number, the convergence rate is further improved.
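A minimal usage sketch, assuming a LIBSVM-format training file at a placeholder path:

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.util.MLUtils

  // Load training data in LIBSVM format as an RDD[LabeledPoint].
  val training = MLUtils.loadLibSVMFile(sc, "hdfs:///tmp/sample_libsvm_data.txt")

  // Train; the L-BFGS optimizer is configured internally.
  val model = new LogisticRegressionWithLBFGS().run(training)

  // Score the training set as (prediction, label) pairs.
  val predictionAndLabels = training.map { point =>
    (model.predict(point.features), point.label)
  }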
32. SPARK-2979: Improve the convergence rate by standardizing the training features
• Merged in Spark 1.1
• Contributed by Alpine Data Labs
• Due to the invariance property of MLEs, the scale of your inputs is irrelevant.
• However, the optimizer will not be happy with the poor condition numbers that can often be improved by scaling.
• The model is trained in the scaled space, but the coefficients are converted back to the original space; as a result, it's transparent to users.
• Without this, some training datasets mixing columns with different scales may not be able to converge.
• scikit-learn and the glmnet package also standardize the features before training to improve convergence.
• Only enabled in logistic regression for now.
40. SPARK-1157: L-BFGS Optimizer
• No, it's not a blender!
41. What is SPARK-1157: L-BFGS Optimizer
• Merged in Spark 1.0
• Contributed by Alpine Data Labs
• One of the most popular algorithms for parameter estimation in machine learning.
• It's a quasi-Newton method.
• The Hessian matrix of second derivatives doesn't need to be evaluated directly.
• The Hessian matrix is approximated using gradient evaluations.
• It converges much faster than the default optimizer in Spark, gradient descent.
• We are contributing OWLQN, a variant of L-BFGS that deals with L1 problems, to Spark. It's a building block of GLMNET.
43. SPARK-2505: Weighted Regularization (ongoing work)
• Each component of the weights can be penalized differently.
• We can exclude the intercept from regularization in this framework.
• Decouples the regularization from the raw gradient update, which is not used in other optimization schemes.
• Allows various update/learning-rate schemes (AdaGrad, normalized adaptive gradient, etc.) to be applied independently of the regularization.
• Smooth and L1 regularization will be handled differently in the optimizer.
44. SPARK-2309: Multinomial Logistic Regression (ongoing work)
• For a K-class multinomial problem, we can generalize it via K-1 linear models with logit link functions.
• As a result, the weights will have dimension (K-1)(N+1), where N is the number of features.
• The MLlib interface is designed for one set of parameters per model, so it requires some interface design changes.
• Expected to be merged in the next release of MLlib, Spark 1.2.
Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
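To make the dimension count concrete, here is the standard K-1 model formulation with class 0 as the pivot (a textbook statement, not quoted from the deck), in LaTeX:

  P(y = k \mid x) = \frac{e^{w_k^\top x + b_k}}{1 + \sum_{j=1}^{K-1} e^{w_j^\top x + b_j}}, \quad k = 1, \dots, K-1

with P(y = 0 | x) taking the remaining probability mass. Each of the K-1 models carries N weights plus one intercept, hence (K-1)(N+1) parameters in total.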
45. SPARK-2272: Transformer
A spark, the soul of a transformer
46. SPARK-2272: Transformer
• Merged in Spark 1.1
• Contributed by Alpine Data Labs
• MLlib data preprocessing pipeline.
• StandardScaler
- Standardizes features by removing the mean and scaling to unit variance.
- The RBF kernel of Support Vector Machines and the L1 and L2 regularizers of linear models typically work better with zero mean and unit variance.
• Normalizer
- Normalizes samples individually to unit L^p norm.
- A common operation for text classification or clustering, for instance.
- For example, the dot product of two L2-normalized TF-IDF vectors is the cosine similarity of the vectors.
48. Normalizer
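The code slides for these transformers aren't preserved; a minimal usage sketch of both, assuming a small RDD[Vector] built inline:

  import org.apache.spark.mllib.feature.{Normalizer, StandardScaler}
  import org.apache.spark.mllib.linalg.Vectors

  val data = sc.parallelize(Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0)))

  // StandardScaler: fit the column means/variances, then transform
  // each vector to zero mean and unit variance.
  val scaler = new StandardScaler(withMean = true, withStd = true).fit(data)
  val scaled = data.map(v => scaler.transform(v))

  // Normalizer: scale each sample to unit L^2 norm (p = 2.0).
  val normalizer = new Normalizer(p = 2.0)
  val normalized = data.map(v => normalizer.transform(v))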
49. SPARK-1969: Online summarizer
• Merged in Spark 1.1
• Contributed by Alpine Data Labs
• Online algorithms for computing the mean, variance, min, and max in a streaming fashion.
• Two online summarizers can be merged, so we can use one summarizer per block of data in the map phase, and merge all of them in the reduce phase to obtain the global summary.
• A numerically stable one-pass algorithm is implemented to avoid the catastrophic cancellation of the naive implementation. Ref: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance (naive and two-pass algorithms)
• Optimized for sparse vectors: the time complexity is O(number of non-zeros) instead of O(numCols) for each sample.
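A minimal sketch of that map-then-merge pattern, assuming an RDD[Vector] named data (the summarizer lives in org.apache.spark.mllib.stat):

  import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

  val summary = data.aggregate(new MultivariateOnlineSummarizer)(
    (summarizer, vector) => summarizer.add(vector), // map side: per partition
    (a, b) => a.merge(b))                           // reduce side: merge partials

  println(summary.mean)     // column-wise means
  println(summary.variance) // column-wise variances
  println(summary.count)    // number of samples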
52. Spark SQL
• Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark. At the core of this component is a new type of RDD, SchemaRDD.
• SchemaRDDs are composed of Row objects, along with a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table in a traditional relational database.
• A SchemaRDD can be created from an existing RDD, a Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.
http://spark.apache.org/docs/latest/sql-programming-guide.html
53. Spark SQL + MLlib
• With Spark SQL, users can easily load Parquet/Avro datasets into Spark and perform the data pre-processing before the training steps.
• MLlib is considering using SchemaRDD as a native typed data format, like R's data frame. This would allow us to create output models with types and column names, and also make it easier to create PMML models.
55-56. Example: Prepare training data using Spark SQL
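The original code slides aren't preserved; a hedged sketch of the pattern with placeholder paths and column names, against the Spark 1.1-era SQLContext API:

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  val sqlContext = new SQLContext(sc)

  // Load a Parquet dataset as a SchemaRDD and register it for SQL.
  val records = sqlContext.parquetFile("hdfs:///tmp/training.parquet")
  records.registerTempTable("training")

  // Pre-process with SQL, then map each Row into MLlib's LabeledPoint.
  val training = sqlContext
    .sql("SELECT label, f0, f1 FROM training WHERE label IS NOT NULL")
    .map(row => LabeledPoint(
      row.getDouble(0),
      Vectors.dense(row.getDouble(1), row.getDouble(2))))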
57. Interested in MLlib?
• MLlib official guide - https://spark.apache.org/docs/latest/mllib-guide.html
• Github - https://github.com/apache/spark
• Mailing lists - user@spark.apache.org or dev@spark.apache.org
58. For more information, contact us
1550 Bryant Street
Suite 1000
San Francisco, CA 94103
USA
+1 (877) 542-0062
www.alpinenow.com
Get Started Today!
http://start.alpinenow.com