Hivemall talk@Hadoop summit 2014, San Jose
1. Hivemall: Scalable Machine Learning Library for Apache Hive
Makoto YUI (m.yui@aist.go.jp, @myui)
National Institute of Advanced Industrial Science and Technology (AIST), Japan
Hadoop Summit 2014, San Jose
2. Plan of the talk
• What is Hivemall
• Why Hivemall
• What Hivemall can do
• How to use Hivemall
• How Hivemall works
• How to deal with iterations, compared with Spark
• Experimental Evaluation
• Conclusion
3. What is Hivemall
• A collection of machine learning algorithms implemented as Hive UDFs/UDTFs
  • Classification & Regression
  • Recommendation
  • k-Nearest Neighbor Search
  • ... and more
• An open source project on GitHub
  • Licensed under LGPL
  • github.com/myui/hivemall (bit.ly/hivemall)
  • 4 contributors
6. Motivation – Why a new ML framework?
Machine learning frameworks out there that run with Hadoop:
• Mahout
• Vowpal Wabbit (w/ Hadoop streaming)
• Spark MLlib
• 0xdata H2O
• Cloudera Oryx
Quick poll: How many people in this room are using them?
7. Motivation – Why a new ML framework?
Framework                             User interface
Mahout                                Java API programming
Spark MLlib/MLI                       Scala API programming, Scala shell (REPL)
H2O                                   R programming, GUI
Cloudera Oryx                         HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)   C++ API programming, command line
Existing distributed machine learning frameworks are NOT easy to use
8. Classification with Mahout
org/apache/mahout/classifier/sgd/TrainNewsGroups.java
Find the complete code at bit.ly/news20-mahout
9. Why Hivemall
1. Ease of use
• No programming
• Every machine learning step is done within HiveQL
• No compilation/packaging overhead
• Easy for existing Hive users
• You can evaluate Hivemall within 5 minutes or so
• Installation is just as follows
10. Why Hivemall
2. Scalable to data
• Scalable to the number of training/testing instances
• Scalable to the number of features
  • Built-in support for feature hashing
• Scalable to the size of the prediction model
  • Suppose there are 200 labels * 100 million features ⇒ requires 150 GB
  • Hivemall does not need the prediction model to fit in memory, in either training or prediction
• The feature engineering step is also scalable and parallelized using Hive
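The idea behind feature hashing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `hash_feature` and `hash_features` are hypothetical, and CRC32 is a deterministic stand-in rather than the hash function Hivemall itself uses. The point is that the number of feature dimensions is fixed up front (here 2^24, the dimensionality mentioned later in the talk), no matter how many distinct feature names appear.

```python
import zlib


def hash_feature(name: str, num_buckets: int) -> int:
    """Map an arbitrary feature name to a bucket index in [0, num_buckets)."""
    # CRC32 is a stand-in chosen for determinism; it is not Hivemall's hash.
    return zlib.crc32(name.encode("utf-8")) % num_buckets


def hash_features(names, num_buckets=2 ** 24):
    """Convert feature names to a sorted list of distinct bucket indices."""
    return sorted({hash_feature(n, num_buckets) for n in names})


if __name__ == "__main__":
    # The feature names below are made up for the example.
    print(hash_features(["word:like", "word:dull", "user:42"], num_buckets=1024))
```

Distinct names may collide in the same bucket; that is the price paid for a bounded model size.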
11. Why Hivemall
3. Scalable to computing resources
• Exploits the benefits of Hadoop & Hive
• Provisioning the machine learning service on Amazon Elastic MapReduce
  • Provides an EMR bootstrap for the automated setup
Find an example at bit.ly/hivemall-emr
12. Why Hivemall
4. Supports state-of-the-art online learning algorithms (for classification)
• Fewer configuration parameters (no learning rate, unlike SGD)
• CW, AROW [1], and SCW [2] are not yet supported in the other ML frameworks
• Surprisingly fast convergence (a few iterations are enough)
[1] Adaptive Regularization of Weight Vectors (AROW), Crammer et al., NIPS 2009
[2] Exact Soft Confidence-Weighted Learning (SCW), Wang et al., ICML 2012
13. Why Hivemall
4. Supports state-of-the-art online learning algorithms (for classification)
Classification accuracy on News20.binary (accuracy improves toward the bottom)
Algorithm                                 Accuracy
Perceptron                                0.9460
Passive-Aggressive (a.k.a. Online-SVM)    0.9604
LibLinear                                 0.9636
LibSVM/TinySVM                            0.9643
Confidence Weighted (CW)                  0.9656
AROW [1]                                  0.9660
SCW [2]                                   0.9662
CW variants are very smart online ML algorithms
14. Why are CW variants so good?
Suppose a binary classification setting that classifies sentences as positive or negative
→ learn the weight for each word (each word is a feature)
Label      Feature vector
Positive   I like this author
Negative   I like this author, but found this book dull
A naïve update will reduce both W_like and W_dull at the same rate
CW variants adjust weights at different rates
15. Why are CW variants so good?
[Figure: a conventional update adjusts only a weight, whereas a CW-variant update adjusts a weight and its confidence (covariance). Annotation: "At this confidence, the weight is 0.5"]
16. Why Hivemall
4. Supports state-of-the-art online learning algorithms (for classification)
• Fast convergence properties
  • Perform a small update where confidence is already high
  • Perform a large update where confidence is low (e.g., at the beginning)
• A few iterations are enough
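The "different rates" behaviour can be illustrated with a minimal Python sketch of an AROW-style update using a diagonal covariance, applied to the two example sentences from the earlier slide. This is an assumption-laden illustration, not Hivemall's code: the function name `arow_update`, the regularization value `r=1.0`, and the choice to repeat the positive sentence three times are all made up for the example.

```python
from collections import defaultdict


def arow_update(w, sigma, words, y, r=1.0):
    """One AROW-style step with a diagonal covariance; y is +1 or -1."""
    x = set(words)                        # binary bag-of-words features
    margin = sum(w[f] for f in x)
    variance = sum(sigma[f] for f in x)   # x^T Sigma x for binary x
    loss = max(0.0, 1.0 - y * margin)
    if loss == 0.0:
        return                            # confident and correct: no update
    beta = 1.0 / (variance + r)
    alpha = loss * beta
    for f in x:
        w[f] += alpha * y * sigma[f]      # low-confidence weights move more
        sigma[f] -= beta * sigma[f] ** 2  # variance shrinks: confidence grows


w = defaultdict(float)                    # weights start at 0
sigma = defaultdict(lambda: 1.0)          # variances start at 1 (low confidence)

# "like" is seen several times in positive sentences first.
for _ in range(3):
    arow_update(w, sigma, "I like this author".split(), +1)

before = dict(w)
arow_update(w, sigma, "I like this author but found this book dull".split(), -1)
delta_like = w["like"] - before["like"]
delta_dull = w["dull"] - before.get("dull", 0.0)
print(f"like: {delta_like:+.3f}, dull: {delta_dull:+.3f}")
```

Because "like" has already been observed, its variance is smaller and the negative example moves it less than the previously unseen "dull".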
18. What Hivemall can do
• Classification (both binary and multi-class)
  • Perceptron
  • Passive Aggressive (PA)
  • Confidence Weighted (CW)
  • Adaptive Regularization of Weight Vectors (AROW)
  • Soft Confidence Weighted (SCW)
• Regression
  • Logistic Regression using Stochastic Gradient Descent (SGD)
  • PA Regression
  • AROW Regression
• k-Nearest Neighbor & Recommendation
  • Minhash and b-Bit Minhash (LSH variant)
  • Brute-force search using similarity measures (cosine similarity)
• Feature engineering
  • Feature hashing
  • Feature scaling (normalization, z-score)
19. How to use Hivemall
[Figure: machine learning workflow. Training: feature vectors with labels → machine learning → prediction model. Prediction: feature vectors + prediction model → labels. Highlighted step: Data preparation]
20. How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
21. How to use Hivemall
[Figure: machine learning workflow. Training: feature vectors with labels → machine learning → prediction model. Prediction: feature vectors + prediction model → labels. Highlighted step: Feature Engineering]
22. How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
Transforming a label value to a value between 0.0 and 1.0
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
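The arithmetic behind min-max normalization is small enough to show directly. The Python helper below is a hypothetical stand-in that mirrors what the slide describes (mapping a value into the range 0.0 to 1.0 given a minimum and maximum); returning 0.5 when the range is empty is an assumption made for this sketch, and the label values are made up.

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Min-max scaling to [0, 1]."""
    if max_value == min_value:
        return 0.5  # assumption for this sketch: degenerate range maps to the middle
    return (value - min_value) / (max_value - min_value)


labels = [-3.0, -1.5, 0.0]           # made-up label values
lo, hi = min(labels), max(labels)    # play the role of ${min_label}, ${max_label}
scaled = [rescale(v, lo, hi) for v in labels]
print(scaled)
```

In the HiveQL view, the minimum and maximum are computed beforehand and passed in as variables, so the scaling itself stays a row-wise operation.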
23. How to use Hivemall
[Figure: machine learning workflow. Training: feature vectors with labels → machine learning → prediction model. Prediction: feature vectors + prediction model → labels. Highlighted step: Training]
24. How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..)
    as (feature,weight)
  FROM train
) t
GROUP BY feature
• The inner query is a map-only task that learns a prediction model
• Map outputs are shuffled to reducers by feature
• Reducers perform model averaging in parallel
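What the outer `avg(weight) ... GROUP BY feature` does can be sketched in Python as follows. This is only an illustration: `average_models` is a hypothetical helper, and the per-mapper models are made-up numbers standing in for the (feature, weight) rows each map task would emit.

```python
from collections import defaultdict


def average_models(partial_models):
    """Average weights per feature across the models emitted by each map task."""
    total = defaultdict(float)
    count = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            total[feature] += weight
            count[feature] += 1
    # Like SQL avg(): a feature is averaged only over the tasks that emitted it.
    return {f: total[f] / count[f] for f in total}


mapper_outputs = [
    {"f1": 0.25, "f2": -0.5},   # model learned by map task 1 (made-up values)
    {"f1": 0.75, "f3": 1.0},    # model learned by map task 2 (made-up values)
]
model = average_models(mapper_outputs)
print(model)
```

Since each feature is averaged independently, the work partitions naturally across reducers by feature.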
25. How to use Hivemall - Training
Training of a Confidence Weighted (CW) classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM
  (SELECT
     train_cw(features,label)
       as (feature,weight)
   FROM
     news20b_train
  ) t
GROUP BY feature
voted_avg votes on whether to use the negative or the positive weights for the average, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
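One reading of the voting step, sketched in Python: count the positive and negative weights for a feature, then average only the majority side. The function below is a hypothetical illustration of that reading, not Hivemall's implementation; in particular, the handling of ties and of an empty input is an arbitrary choice made here.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    pos = [w for w in weights if w > 0]
    neg = [w for w in weights if w < 0]
    # Ties go to the negative side: an arbitrary choice for this sketch.
    winners = pos if len(pos) > len(neg) else neg
    return sum(winners) / len(winners) if winners else 0.0


# The weights from the slide: four positive votes, one negative vote.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
print(result)
```

For the slide's values the positive side wins, so the single negative weight is left out of the average.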
26. Ensemble learning for stable prediction performance
create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from
  (select
     train_multiclass_cw(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_arow(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
   union all
   select
     train_multiclass_scw(addBias(features),label)
       as (label,feature,weight)
   from
     news20mc_train_x3
  ) t
group by label, feature;
Just stack prediction models by union all
27. How to use Hivemall
[Figure: machine learning workflow. Training: feature vectors with labels → machine learning → prediction model. Prediction: feature vectors + prediction model → labels. Highlighted step: Prediction]
28. How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
• Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model
• No need to load the entire model into memory
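The logic of the join-based prediction can be sketched in Python. This is an illustration of the computation only: `predict` is a hypothetical helper, the rows and weights are made up, and the model is held in a dictionary purely for brevity, whereas Hive performs the join without requiring the model to fit in memory.

```python
import math
from collections import defaultdict


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature); model: feature -> weight."""
    score = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature missing from the model adds nothing to the sum.
        score[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(s) for rowid, s in score.items()}


lr_model = {"a": 2.0, "b": -2.0}   # made-up weights
rows = [(1, "a"), (1, "c"), (2, "b"), (3, "a"), (3, "b")]
probs = predict(rows, lr_model)
print(probs)
```

One difference from SQL is worth noting: a row whose features are all absent from the model gets a score of 0 here, whereas `sum` over only NULLs yields NULL in Hive.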
30. How Hivemall works in the training
Machine learning algorithms are implemented as User-Defined Table-generating Functions (UDTFs)
[Figure: rows of the training table, tuples of <label, array<features>> such as (+1, <1,2>), (+1, <1,7,9>), (-1, <1,3,9>), (+1, <3,8>), are fed to parallel train UDTF instances; their output tuples <feature, weights> are shuffled by feature and combined by param-mix into the prediction model, a relation of <feature, weights>]
Friendly to the Hive relational query engine
• The resulting prediction model is a relation of features and their weights
Embarrassingly parallel
• The numbers of mappers and reducers are configurable
Bagging-like effect, which helps to reduce the variance of each classifier/partition
31. Why not UDAF (as in MADlib)?
Machine learning as an aggregate function
[Figure: an aggregation tree over the training table. Four train tasks consume tuples <label, array<features>> and emit array<weight> (4 ops in parallel); two merge tasks combine them into array<sum of weight>, array<count> (2 ops in parallel); a single final merge produces the prediction model (no parallelism)]
• Bottleneck in the final merge: throughput is limited by its fan-out
• Memory consumption grows and parallelism decreases toward the final merge
32. How to deal with Iterations
Iterations are mandatory to get a good prediction model
• However, MapReduce is not suited for iterations because the input and output of each MR job go through HDFS
• Spark avoids this by in-memory computation
[Figure: with MapReduce, every iteration (iter. 1, iter. 2, ...) does an HDFS read of its input and an HDFS write of its output; with Spark, the input is read once and the iterations run in memory]
33. Training with Iterations in Spark
Logistic Regression example of Spark
// Each node loads the data into memory once
val data = spark.textFile(...).map(readPoint).cache()
// w is the weight vector, initialized before the loop (not shown on the slide)
for (i <- 1 to ITERATIONS) {
  // Repeated MapReduce steps to do gradient descent
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
This is just a toy example! Why?
The input to the gradient computation should be shuffled for each iteration (without shuffling, more iterations are required)
34. What does MLlib actually do?
Mini-batch Gradient Descent with Sampling
val data = ..
for (i <- 1 to numIterations) {
  val sampled =   // sample a subset of the data (partitioned RDD)
  val gradient =  // average the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}
Iterations are mandatory for convergence because each iteration uses only a small fraction of the data
GradientDescent.scala: bit.ly/spark-gd
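The sample-then-average loop outlined on this slide can be written out as a small single-machine Python sketch. It is not MLlib's code: the function name `minibatch_gd`, the step size, the batch size, and the toy one-dimensional data are all assumptions chosen for illustration. The gradient uses the same logistic-loss expression as the Spark example, with labels in {-1, +1}.

```python
import math
import random


def minibatch_gd(data, num_iterations=50, batch_size=4, step=0.5, seed=0):
    """data: list of (x, y), x a list of floats, y in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(num_iterations):
        batch = rng.sample(data, batch_size)   # sample a subset of the data
        grad = [0.0] * len(w)
        for x, y in batch:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            coeff = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
            for i, xi in enumerate(x):
                grad[i] += coeff * xi
        # Average the per-example gradients over the batch, then take one step.
        w = [wi - step * gi / batch_size for wi, gi in zip(w, grad)]
    return w


# Toy data: positive x is labelled +1, negative x is labelled -1.
data = [([1.0], 1), ([2.0], 1), ([3.0], 1),
        ([-1.0], -1), ([-2.0], -1), ([-3.0], -1)]
w = minibatch_gd(data)
print(w)
```

Each iteration sees only `batch_size` examples, which is why many iterations are needed before the whole dataset has influenced the model.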
35. How to deal with Iterations in Hivemall
Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps
SET hivevar:xtimes=3;
CREATE VIEW training_x3
as
SELECT
  *
FROM (
  SELECT
    amplify(${xtimes}, *) as (rowid, label, features)
  FROM
    training
) t
CLUSTER BY rand();
36. Map-only shuffling and amplifying
The rand_amplify UDTF randomly shuffles the input rows within each map task
CREATE VIEW training_x3
as
SELECT
  rand_amplify(${xtimes}, ${shufflebuffersize}, *)
    as (rowid, label, features)
FROM
  training;
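The two amplifiers can be approximated with Python generators. These are sketches under stated assumptions rather than Hivemall's implementation: `amplify` simply emits every row `xtimes` times, and `rand_amplify` additionally shuffles rows inside a bounded buffer, which is one plausible way to get a map-local shuffle in a streaming fashion. The buffering policy (shuffle and flush when full) is an assumption made for this example.

```python
import random


def amplify(xtimes, rows):
    """Emit every input row xtimes times, preserving input order."""
    for row in rows:
        for _ in range(xtimes):
            yield row


def rand_amplify(xtimes, buffer_size, rows, seed=None):
    """Amplify rows and shuffle them locally within a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for row in amplify(xtimes, rows):
        buffer.append(row)
        if len(buffer) >= buffer_size:   # assumption: flush when the buffer is full
            rng.shuffle(buffer)
            yield from buffer
            buffer.clear()
    rng.shuffle(buffer)                  # flush whatever is left at the end
    yield from buffer


rows = [(1, 1.0, ["a"]), (2, -1.0, ["b"]), (3, 1.0, ["c"])]  # (rowid, label, features)
out = list(rand_amplify(3, 4, rows, seed=1))
print(len(out))
```

The shuffle never leaves the task that scanned the rows, so no extra shuffle phase is needed, at the cost of a less thorough mix than a global `CLUSTER BY`.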
37. Detailed plan w/ map-local shuffle
[Figure: query plan, shown for two map tasks and two reduce tasks]
Map task: Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write
Shuffle (distributed by feature)
Reduce task: Merge → Aggregate → Reduce write
• Scanned entries are amplified and then shuffled
• Note that this is a pipelined operation: the Rand Amplifier operator is interleaved between the table scan and the training operator
38. Performance effects of amplifiers
Method                                             Elapsed time (sec)   AUC
Plain                                              89.718               0.734805
amplifier + clustered by (a.k.a. global shuffle)   479.855              0.746214
rand_amplifier (a.k.a. map-local shuffle)          116.424              0.743392
With the map-local shuffle, prediction accuracy improved with an acceptable overhead
40. Experimental Evaluation
Compared the performance of our batch learning scheme to state-of-the-art machine learning techniques, namely Bismarck and Vowpal Wabbit
• Dataset
  KDD Cup 2012, Track 2 dataset, which is one of the largest publicly available datasets for machine learning, provided by a commercial search engine provider
  • The training data is about 235 million records in 33 GB
  • The number of feature dimensions is about 54 million
• Task
  Predicting click-through rates of search engine ads
• Experimental environment
  33 in-house commodity servers (32 slave nodes for Hadoop), each equipped with 8 processors and 24 GB memory
bit.ly/hivemall-kdd-dataset
41. Performance comparison
Elapsed time (sec) for training (the lower, the better)
System     Elapsed time (sec)
Hivemall   116.4
VW1        596.67
VW32       493.81
Bismarck   755.24
[Figure: bar chart of prediction performance (AUC) for Hivemall, VW1, VW32, and Bismarck; y-axis from 0.64 to 0.76]
Prediction performance (AUC) is good
Throughput: 2.3 million tuples/sec on 32 nodes
Latency: 96 sec for training 235 million records of 23 GB
42. How about Spark 1.0 MLlib?
val training = MLUtils.loadLibSVMFile(sc,
  "hdfs://host:8020/small/training_libsvmfmt", multiclass = false)
val model = LogisticRegressionWithSGD.train(training, numIterations)
..
• Works fine for small data (10k training examples in about 1.5 MB) on 33 nodes, with 5 GB of memory allocated to each worker
• LoC is small and easy to understand
• However, Spark did not work for the large dataset (235 million training examples with 2^24 feature dimensions, in about 33 GB)
• Further investigation is required
43. Conclusion
Hivemall is an open source library that provides a collection of machine learning algorithms as Hive UDFs/UDTFs
• Easy to use
• Scalable to computing resources
• Runs on Amazon EMR
• Supports state-of-the-art classification algorithms
• Plans to support Shark/Spark SQL
Project site: github.com/myui/hivemall or bit.ly/hivemall
Message of this talk: Please evaluate Hivemall by yourself. 5 minutes is enough for a quick start.
Slides available at bit.ly/hivemall-slide