The document provides an overview of using Hivemall, an open-source machine learning library built for Apache Hive, for recommendation tasks. It begins with an introduction to Hivemall and its vision of enabling machine learning in SQL. It then covers recommendation 101, contrasting explicit and implicit feedback. Matrix factorization and Bayesian Personalized Ranking (BPR) algorithms for recommendation from implicit feedback are described. Key aspects covered include data preparation in Hive, model training, and prediction. The document concludes with considerations for building recommendations on large implicit-feedback datasets with Hivemall.
6. How to use Hivemall - Data preparation
Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
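Given that DDL, a line of the training file has tab-separated fields and comma-separated array items, with each Hivemall feature typically encoded as an "index:weight" string. A sketch of one row (the numeric values below are invented for illustration, not taken from the E2006-tfidf data):

```
10001	-3.89	4:0.0782,9:0.1138,13:0.0923
```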
10. How to use Hivemall - Training
Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..)
    as (feature, weight)
  FROM train
) t
GROUP BY feature;

A map-only task learns the prediction model; map outputs are shuffled to reducers by feature, and the reducers perform model averaging in parallel.
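The averaged model can then be applied to held-out data. A minimal prediction sketch, assuming the test set has been exploded into one `(rowid, feature, value)` row per feature (the table name `test_exploded` is a placeholder; `sigmoid` is a Hivemall UDF, but verify it is registered in your installation):

```sql
SELECT
  t.rowid,
  sigmoid(sum(m.weight * t.value)) as prob  -- probability of the positive class
FROM test_exploded t
LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid;
```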
11. How to use Hivemall - Training
Training a Confidence Weighted (CW) classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features, label)
    as (feature, weight)
  FROM news20b_train
) t
GROUP BY feature;

voted_avg votes on whether to average the positive or the negative weights: given, say, +0.7, +0.3, +0.2, -0.1, +0.7, the positive weights win the vote.
15. List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression

SCW is a good first choice; try RandomForest if SCW does not work well. Logistic regression is good for getting the probability of the positive class. Factorization Machines are a good fit when features are sparse and categorical.
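Training the recommended first-choice SCW classifier follows the same pattern as the CW example earlier. A sketch (assuming `train_scw` is the UDF name in your Hivemall version; check the function list of your installation):

```sql
CREATE TABLE news20b_scw_model AS
SELECT
  feature,
  voted_avg(weight) as weight  -- same voted averaging as the CW example
FROM (
  SELECT train_scw(features, label)
    as (feature, weight)
  FROM news20b_train
) t
GROUP BY feature;
```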
21. Explicit Feedback

U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |        |   5    |        |   |   3
User 2 |   2    |        |   1    |   |
…      |        |   3    |        | 4 |
User U |   1    |        |   4    |   |   5
22. Explicit Feedback

U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |   ?    |   5    |   ?    | ? |   3
User 2 |   2    |   ?    |   1    | ? |   ?
…      |   ?    |   3    |   ?    | 4 |   ?
User U |   1    |   ?    |   4    | ? |   5
23. Explicit Feedback

U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |   ?    |   5    |   ?    | ? |   3
User 2 |   2    |   ?    |   1    | ? |   ?
…      |   ?    |   3    |   ?    | 4 |   ?
User U |   1    |   ?    |   4    | ? |   5

• Very sparse dataset
• Number of feedbacks is small
• Unknown data >> training data
• User preference for rated items is clear
• Contains negative feedback
• Evaluation is easy (MAE/RMSE)
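For explicit rating matrices like this, matrix factorization in Hivemall follows the same train-then-average pattern as the classifiers above. A sketch based on the Hivemall MovieLens example (the `train_mf_sgd` UDF, its output columns, and `array_avg` are taken from the Hivemall docs; option strings vary by version):

```sql
CREATE TABLE mf_model AS
SELECT
  idx,
  array_avg(u_rank) as Pu,  -- user latent factors
  array_avg(m_rank) as Qi,  -- item latent factors
  avg(u_bias) as Bu,
  avg(m_bias) as Bi
FROM (
  SELECT train_mf_sgd(userid, itemid, rating, '-factor 10')
    as (idx, u_rank, m_rank, u_bias, m_bias)
  FROM ratings
) t
GROUP BY idx;
```

A predicted rating for (user u, item i) is then the dot product of Pu and Qi plus the biases.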
24. Implicit Feedback

U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |        |   ⭕   |        |    |   ⭕
User 2 |   ⭕   |        |   ⭕   |    |
…      |        |   ⭕   |        | ⭕ |
User U |   ⭕   |        |   ⭕   |    |   ⭕
25. Implicit Feedback

U/I    | Item 1 | Item 2 | Item 3 | … | Item I
User 1 |        |   ⭕   |        |    |   ⭕
User 2 |   ⭕   |        |   ⭕   |    |
…      |        |   ⭕   |        | ⭕ |
User U |   ⭕   |        |   ⭕   |    |   ⭕

• Sparse dataset
• Number of feedbacks is large
• User preference is unclear
• No negative feedback
• Known feedback may actually be negative
• Unknown feedback may actually be positive
• Evaluation is not so easy (NDCG, Prec@K, Recall@K)
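For implicit feedback like this, Hivemall's Bayesian Personalized Ranking matrix factorization learns from (user, positive item, negative item) triples, where the negative item is sampled from the user's unobserved items. A training sketch (the `train_bprmf` UDF name and its output columns follow the Hivemall docs, but verify them against your version; `bpr_train` is a placeholder table of pre-sampled triples):

```sql
-- bpr_train: for each observed (userid, pos_item),
-- neg_item is sampled from items the user has not interacted with
CREATE TABLE bpr_model AS
SELECT
  idx,
  array_avg(Pu) as Pu,  -- user latent factors
  array_avg(Qi) as Qi,  -- item latent factors
  avg(Bi) as Bi         -- item bias
FROM (
  SELECT train_bprmf(userid, pos_item, neg_item, '-factor 10')
    as (idx, Pu, Qi, Bi)
  FROM bpr_train
) t
GROUP BY idx;
```

Items are then ranked per user by the dot product of Pu and Qi plus Bi, and evaluated with the ranking metrics listed above (NDCG, Prec@K, Recall@K).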