Touch the mahout

Touch The Mahout!

Makoto Uehara

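About Me
[Name]
・Makoto Uehara (@pioho07)
Facebook friend requests welcome

[Career]
・Until February 2012: worked on infrastructure at a systems integrator
・March 2012: joined CyberAgent
- Built the Ameba smartphone platform
- In charge of the infrastructure and middleware for the integrated log analysis platform and online databases
- Hadoop, HBase, Flume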

What's Mahout?

A scalable machine learning and data mining library

CyberZ’s Hadoop Cluster
Slave
Server*5

HDD:100TB
CPU:120Core
memory:480GB

Master
Server*3

Hadoop Cluster

We use a management tool

CDH4.4.0
Hive
MR

Hue
HDFS

Cloudera Manager v4.7.3

Mahout’s Algorithm

The help output lists lots of algorithms

[hdfs@svr001 ~]$ mahout
An example program must be given as the first argument.
Valid program names are:
arff.vector: : Generate Vectors from an ARFF file or directory
baumwelch: : Baum-Welch algorithm for unsupervised HMM training
canopy: : Canopy clustering
cat: : Print a file or resource as the logistic regression models would see it
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
clusterpp: : Groups Clustering Output In Clusters
cmdump: : Dump confusion matrix in HTML or text formats
cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
dirichlet: : Dirichlet Clustering
recommendfactorized: : Compute recommendations using the factorization of a rating mat
recommenditembased: : Compute recommendations using item-based collaborative filterin
fkmeans: : Fuzzy K-means clustering
fpg: : Frequent Pattern Growth
：
:
:

Recommendation

Of everything in the help output, recommendation is the one I want to try:
recommendfactorized: : Compute recommendations using the factorization of a rating mat
recommenditembased: : Compute recommendations using item-based collaborative filterin

Recommendation

Given information such as
"User A rated this item 4 points,
User B rated this item 3.5 points",
it produces a result such as
"This item is recommended for User A".

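First, create data describing which user gave which item what rating.
Mahout needs users and items to be passed as IDs,
so the input file ends up being nothing but numbers.

Create the input file like this:
●Input File 1
UserID: 1-5
ItemID: 101-107
Score: 0.0-5.0

●Input File 1 (userID,itemID,score)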
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,102,4.0
3,105,4.5
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,1.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
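
Because Mahout expects numeric IDs, data keyed by names has to be converted before it can be used as input. The following is a minimal sketch of that conversion, assuming hypothetical raw records of the form (user name, item name, score). The function name `to_mahout_csv`, the sample names, and the starting ID values are illustrative and do not come from the talk.

```python
import csv
import io

def to_mahout_csv(raw_ratings, first_user_id=1, first_item_id=101):
    """Convert (user_name, item_name, score) records to userID,itemID,score lines.

    IDs are handed out in order of first appearance (an assumption of this sketch).
    Returns the CSV text and the two name-to-ID lookup tables.
    """
    user_ids, item_ids = {}, {}
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for user, item, score in raw_ratings:
        # setdefault keeps an existing ID, or assigns the next free one.
        uid = user_ids.setdefault(user, first_user_id + len(user_ids))
        iid = item_ids.setdefault(item, first_item_id + len(item_ids))
        writer.writerow([uid, iid, float(score)])
    return out.getvalue(), user_ids, item_ids

# Hypothetical raw data, not from the slides.
raw = [("alice", "book", 5.0), ("alice", "pen", 3.0), ("bob", "book", 2.0)]
csv_text, users, items = to_mahout_csv(raw)
print(csv_text)
```

Keeping the lookup tables around makes it possible to translate the recommended item IDs back into names afterwards.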

[Figure: diagram labeled User2, User3, Score=4.5, Score=4.5]

Mahout Command Line
mahout \
  recommenditembased \
  --input /mahout/recommend_sample1.csv \
  --output /mahout/recome1 \
  --similarityClassname SIMILARITY_PEARSON_CORRELATION

mahout: Mahout
recommenditembased: Algorithm
--input: Input File
--output: Output Dir
--similarityClassname: SIMILARITY

Command Run
[hdfs@svr001 ~]$ mahout recommenditembased --input /mahout/recommend_sample1.csv --output /mahout/recome1 --similarityClassname
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/mahout/mahout-examples-0.7-cdh4.4.0-job.jar
13/12/12 11:26:23 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[/mahout/recommend_sample4.csv], --maxPrefsPerUser=[1000], --minPrefsPerUser=[1], --output=[temp/preparePreferenceMatrix], --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}
13/12/12 11:26:24 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/12/12 11:26:25 INFO input.FileInputFormat: Total input paths to process : 1
13/12/12 11:26:25 INFO mapred.JobClient: Running job: job_201312042139_0225
13/12/12 11:26:26 INFO mapred.JobClient: map 0% reduce 0%

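Command Run

●Processing took quite a while:
2 to 3 minutes for a 21-line input file.
●mahout version 0.7

●Where I got stuck
An extra line break in the input file caused an error, and it took me a while to work out why.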
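
This kind of input problem can be caught before the job is submitted. Below is a rough sketch of a pre-flight check, assuming the extra line break shows up as a blank line and that every record should be `userID,itemID,score`. The function name `find_bad_lines` is hypothetical, and the check is not part of Mahout.

```python
def find_bad_lines(text):
    """Return (line_number, reason) for each line that is not 'int,int,float'."""
    # A single newline at the very end of the file is normal, so ignore it.
    if text.endswith("\n"):
        text = text[:-1]
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line.strip() == "":
            problems.append((number, "blank line"))
            continue
        fields = line.split(",")
        if len(fields) != 3:
            problems.append((number, "expected 3 comma-separated fields"))
            continue
        try:
            int(fields[0])
            int(fields[1])
            float(fields[2])
        except ValueError:
            problems.append((number, "non-numeric field"))
    return problems

sample = "1,101,5.0\n\n1,102,3.0\n"
print(find_bad_lines(sample))
```

Running a check like this locally takes far less time than waiting minutes for a Hadoop job to fail.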

SIMILARITY (correlation)

SIMILARITY in Recommenditembased

SIMILARITY_COOCCURRENCE
SIMILARITY_LOGLIKELIHOOD
SIMILARITY_TANIMOTO_COEFFICIENT
SIMILARITY_CITY_BLOCK
SIMILARITY_COSINE
SIMILARITY_EUCLIDEAN_DISTANCE

I don't really understand these ><

SIMILARITY (correlation)

This time, let's try these two:
PEARSON_CORRELATION and TANIMOTO_COEFFICIENT
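
To get a feel for what these two measures compute, here is a small sketch in plain Python using the ratings from Input File 1. It follows the textbook definitions: Pearson correlation over the users who rated both items, and the Tanimoto coefficient over the sets of users who rated each item. Mahout's distributed implementation may differ in its details, so any numbers are illustrative only, and the function names are hypothetical.

```python
from math import sqrt

# Ratings from Input File 1: (userID, itemID, score)
RATINGS = [
    (1, 101, 5.0), (1, 102, 3.0), (1, 103, 2.5),
    (2, 101, 2.0), (2, 102, 2.5), (2, 103, 5.0), (2, 104, 2.0),
    (3, 101, 2.5), (3, 102, 4.0), (3, 105, 4.5),
    (4, 101, 5.0), (4, 103, 3.0), (4, 104, 4.5), (4, 106, 4.0),
    (5, 101, 4.0), (5, 102, 1.0), (5, 103, 2.0), (5, 104, 4.0),
    (5, 105, 3.5), (5, 106, 4.0),
]

def ratings_by_item(ratings):
    """Group ratings as {itemID: {userID: score}}."""
    by_item = {}
    for user, item, score in ratings:
        by_item.setdefault(item, {})[user] = score
    return by_item

def pearson(a, b):
    """Pearson correlation of two items over the users who rated both."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return None  # not enough overlap to correlate
    xs = [a[u] for u in common]
    ys = [b[u] for u in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one side has no variation
    return cov / sqrt(var_x * var_y)

def tanimoto(a, b):
    """Users who rated both items divided by users who rated either; scores are ignored."""
    either = set(a) | set(b)
    return len(set(a) & set(b)) / len(either) if either else 0.0

items = ratings_by_item(RATINGS)
print(pearson(items[101], items[102]))
print(tanimoto(items[101], items[102]))  # 0.8: four of the five users rated both
```

The key difference is that Pearson looks at how the scores move together, while Tanimoto only looks at who rated what.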

Run Result
●Similarity
PEARSON_CORRELATION (Pearson correlation: how strongly two sets of ratings vary together)

●Run
mahout recommenditembased --input /mahout/recommend_sample1.csv --output
/mahout/recome1 --similarityClassname SIMILARITY_PEARSON_CORRELATION

●Result
[hdfs@svr001 ~]$ hadoop fs -cat /mahout/recome/part-r-0000*
3	[103:4.279442]
1	[105:2.868604]
2	[105:3.1569808]
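
The output is easier to work with once it is parsed. Here is a minimal sketch, assuming each record is a user ID followed by a bracketed, comma-separated list of `item:score` pairs, as in the result above. The function name `parse_recommendations` is hypothetical.

```python
import re

# A user ID, optional whitespace, then a bracketed list of item:score pairs.
RECORD = re.compile(r"(\d+)\s*\[([^\]]*)\]")

def parse_recommendations(text):
    """Map each user ID to a list of (item_id, score) pairs in output order."""
    result = {}
    for user, body in RECORD.findall(text):
        pairs = []
        for entry in body.split(","):
            if not entry:
                continue  # empty recommendation list
            item, score = entry.split(":")
            pairs.append((int(item), float(score)))
        result[int(user)] = pairs
    return result

output = "3\t[103:4.279442]\n1\t[105:2.868604]\n2\t[105:3.1569808]\n"
for user, recs in sorted(parse_recommendations(output).items()):
    print(user, recs)
```

Matching on the brackets rather than on line boundaries means the parser does not care whether the user ID and the list are separated by a tab or a line break.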

Run Result
●Similarity
TANIMOTO_COEFFICIENT (Tanimoto coefficient)

●Run
/mahout/recome2 --similarityClassname SIMILARITY_TANIMOTO_COEFFICIENT

●Result
[hdfs@svr001 ~]$ hadoop fs -cat /mahout/recome2/part-r-0000*
3	[106:3.5357144,104:3.3799999,103:3.3125]
1	[105:3.6363637,106:3.5,104:3.4714286]
4	[105:4.2746477,102:4.2]
2	[106:2.9056604,105:2.6296296]

SIMILARITY (correlation)
With PEARSON_CORRELATION,
Item 103 is recommended to User3 with Score=4.2794

With TANIMOTO_COEFFICIENT,
Item 106 is recommended to User3 with Score=3.5357

Depending on the algorithm,
both the recommended item and the score differ

SIMILARITY (correlation)

Let's tweak the input file
and force Mahout to recommend item 102
to User3!

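[Step 1]
Shape the input data so that
users 1, 2, and 3 have similar purchasing tendencies

[Step 2]
Users 1, 2, and 3 all
bought items 101 and 103 and rated them 5.0
Users 1 and 2 also bought item 102 and rated it highly

So
↓↓↓
[Result!]

It should recommend item 102 to user 3,
right!?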
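
The intuition behind this expectation can be sketched with a common textbook formulation of item-based prediction: a user's predicted score for an item is the similarity-weighted average of that user's existing ratings. This is a simplified illustration, not necessarily what Mahout 0.7 does internally. The function name `predict_rating` is hypothetical, and the similarity values are made-up numbers chosen only to show the mechanics.

```python
def predict_rating(user_ratings, similarity_to_target):
    """Similarity-weighted average of a user's ratings.

    user_ratings: {item_id: score} for items the user already rated.
    similarity_to_target: {item_id: similarity between that item and the target}.
    Only items with positive similarity contribute (a simplification).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for item, score in user_ratings.items():
        sim = similarity_to_target.get(item, 0.0)
        if sim > 0:
            weighted_sum += sim * score
            weight_total += sim
    return weighted_sum / weight_total if weight_total else None

# User 3's ratings in the modified input data (the userID 3 rows).
user3 = {101: 5.0, 103: 5.0, 105: 4.5, 107: 1.0, 108: 4.5}
# HYPOTHETICAL similarities of each item to item 102 (invented for illustration).
sim_to_102 = {101: 0.9, 103: 0.8, 105: 0.1}

print(predict_rating(user3, sim_to_102))
```

If items 101 and 103 are strongly similar to item 102, user 3's 5.0 ratings on them dominate the average, which is why a high predicted score for item 102 is the expected outcome.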

1,101,5.0
1,102,5.0
1,103,5.0
1,108,1.5
2,101,5.0
2,102,4.0
2,103,5.0
2,104,2.0
2,108,5.0
3,101,5.0
3,103,5.0
3,105,4.5
3,107,1.0
3,108,4.5
4,101,1.5
4,103,1.0
4,104,2.5
4,106,6.0
5,101,1.0
5,102,1.5
5,103,2.0
5,104,3.0
5,105,3.5
5,106,5.0

Did it get recommended? The result...
/mahout/recome1 --similarityClassname SIMILARITY_PEARSON_CORRELATION
3	[102:4.8554997,106:4.5731964]
1	[106:5.0,105:4.615747]
4	[105:2.6432545,102:2.6432545,108:1.2666667]
2	[105:4.7233295,106:4.133889]
5	[108:3.5]

It worked! ｷﾀ━(ﾟ∀ﾟ)━!

Did it get recommended? Result

However,
it worked with PEARSON_CORRELATION,
but not with TANIMOTO_COEFFICIENT.
That is probably down to the difference between the algorithms...
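
One plausible contributing factor, offered as an assumption rather than something verified in the talk: by definition, the Tanimoto coefficient only considers which users rated an item, not the scores they gave. Raising scores in the input file therefore leaves the item-to-item similarities unchanged. A minimal sketch using the textbook definition (not Mahout's code; the helper names are hypothetical):

```python
def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient of two sets of user IDs."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0

def users_who_rated(ratings, item):
    """Set of user IDs that rated the given item; the score is ignored."""
    return {user for user, rated_item, _score in ratings if rated_item == item}

# Rows for items 101 and 102 from the modified input file.
modified = [
    (1, 101, 5.0), (1, 102, 5.0), (2, 101, 5.0), (2, 102, 4.0),
    (3, 101, 5.0), (4, 101, 1.5), (5, 101, 1.0), (5, 102, 1.5),
]
# Same users and items, with every score replaced by 1.0.
flattened = [(user, item, 1.0) for user, item, _score in modified]

sim_modified = tanimoto(users_who_rated(modified, 101), users_who_rated(modified, 102))
sim_flattened = tanimoto(users_who_rated(flattened, 101), users_who_rated(flattened, 102))
print(sim_modified == sim_flattened)  # True: the scores play no part in the similarity
```

Under this measure, steering a recommendation would mean changing who rated which items, not how highly they rated them.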

Did it get recommended? Result

mahout recommenditembased --input /mahout/recommend_sample2.csv --output /mahout/recome22 --similarityClassname
3	[104:4.8636365,106:4.852941,102:4.8076925]
1	[104:4.0,105:3.9807692,107:3.9545455,106:3.857143]
4	[107:4.25,108:4.0,102:3.6410255,105:3.6325302]
2	[107:5.0,105:4.354839,106:3.6893203]
5	[107:2.6111112,108:1.872093]
Item 102 is not the top recommendation for user 3; it only comes in third

Thank you for your attention!
