Apache Hadoop India Summit 2011 talk "Framework for a Suite of Co-clustering Algorithms for Predictive Modeling on Hadoop" by Vaijanath Rao and Rohini Uppuluri

Framework for a suite of Co-clustering
algorithms for predictive modeling on Hadoop

Vaijanath N. Rao
(vaijanath.rao@teamaol.com)
Rohini Uppuluri
(rohini.uppuluri@teamaol.com)

Agenda
• Introduction
  • Background
  • Some Approaches

• Co-Clustering
  • Introduction
  • Related Work
  • Why Hadoop?

• Goal
• Our Framework
• Conclusions and Future Work

Presentation for
[CLIENT]

Background
Modeling for Prediction
• Will user A like this movie?

• Will user B like this camera?

• Customer purchase decisions in an e-commerce setting

And tons of other things…

Some Approaches
• Collaborative filtering
  • User Based, Item Based, Model Based, Content Based, Hybrid, etc. (see [1], [2])
• Latent Models
  • Probabilistic Latent Semantic Indexing [3, 6]
  • Matrix Factorization [4, 7, 8]
  • Predictive Discrete Latent Factor models [5]

• Co-clustering
  • Clustering along multiple axes [9, 10]; survey in [16]

Co-clustering
[Figure: a Users × Products matrix with observed (0/1) and missing (?) entries, plus Product Attributes and User Attributes as side information. Repeated rounds of Row Cluster Update, Column Cluster Update and Global Model Update keep reducing the error and turn the matrix into Clustered Users × Clustered Products.]

Input matrix (Users × Products):
1 0 1 1 1
0 ? 1 ? 0
1 1 ? 0 0
? 0 0 ? 0

After co-clustering (Clustered Users × Clustered Products):
0 0 1 1 1
1 ? 1 ? 0
1 1 ? 0 0
? 1 0 ? 0
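
A minimal sketch of how a co-clustered matrix can be used for prediction, assuming the simplest possible model: a missing entry is predicted by the mean of the observed entries in its (user cluster, product cluster) block. The function names, the toy matrix and the cluster assignments are illustrative assumptions, not taken from the talk.

```python
from collections import defaultdict


def block_means(matrix, row_assign, col_assign):
    """Mean of the observed entries in each (row cluster, column cluster) block."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is None:  # missing entry, shown as '?' on the slide
                continue
            block = (row_assign[i], col_assign[j])
            sums[block] += value
            counts[block] += 1
    return {block: sums[block] / counts[block] for block in counts}


def predict(matrix, row_assign, col_assign, i, j, default=0.5):
    """Predict entry (i, j) from the mean of its block (default if block is empty)."""
    means = block_means(matrix, row_assign, col_assign)
    return means.get((row_assign[i], col_assign[j]), default)


# Toy Users x Products matrix; None marks a missing entry (hypothetical data).
ratings = [
    [1, 1, 0, 0],
    [1, None, 0, 0],
    [0, 0, 1, 1],
    [0, 0, None, 1],
]
user_clusters = [0, 0, 1, 1]      # assumed row (user) cluster ids
product_clusters = [0, 0, 1, 1]   # assumed column (product) cluster ids

print(predict(ratings, user_clusters, product_clusters, 1, 1))  # 1.0
print(predict(ratings, user_clusters, product_clusters, 3, 2))  # 1.0
```

Both missing entries fall into blocks whose observed entries are all 1, so the block mean predicts that the user will like the product.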

Some Approaches
• Bregman co-clustering - Framework [11]
• Information theoretic co-clustering [12]
• Min sum squared co-clustering [13]

• Scalable framework based on the Bregman framework [14]
• DisCo [15]

Why Hadoop
• Real-world data is huge
• Large matrix to operate on (millions and millions of rows, millions of columns!)
• Lots of computation

Goal
• There are a number of approaches, so a common framework is needed
• Build a framework on Hadoop that multiple algorithms can fit into
• Make the framework easy for users to choose from and use

Overview
[Diagram of the job pipeline, starting from the Input:]
• Row Cluster Updator Job → Row Clusters
• Column Cluster Updator Job → Column Clusters
• Global Model Updator Job → Global Model

Overview : Core Interfaces
• Input vector (type, id, datavec, attributevec, cost, assignment)
• Cluster (vector, len)
  • Row Cluster
  • Column Cluster
• Distance/Error Function (vector1, vector2)
• Model (matrix)
  • Row Model
  • Column Model
  • Group Model
• Objective Function (Model1, Model2)
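
The slide lists only the interface names and their fields. As a rough illustration, three of them could be mirrored in Python as below. The field names follow the slide; the types, the `add`/`centroid` methods, the `squared_error` function and the `assign` helper are assumptions made for this sketch, and the framework's actual (Hadoop-side) interfaces may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class InputVector:
    """One row or column of the input matrix (fields named as on the slide)."""
    type: str                          # assumed values: "ROW" or "COLUMN"
    id: int
    datavec: List[float]
    attributevec: List[float]
    cost: float = float("inf")         # error against the assigned cluster
    assignment: Optional[int] = None   # index of the assigned cluster


@dataclass
class Cluster:
    """A cluster kept as a running sum (vector) and a member count (len)."""
    vector: List[float]
    len: int = 0

    def add(self, datavec: Sequence[float]) -> None:
        self.vector = [s + x for s, x in zip(self.vector, datavec)]
        self.len += 1

    def centroid(self) -> List[float]:
        return [s / self.len for s in self.vector] if self.len else list(self.vector)


# Distance/Error Function (vector1, vector2)
DistanceFunction = Callable[[Sequence[float], Sequence[float]], float]


def squared_error(vector1, vector2):
    """One possible error function: sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(vector1, vector2))


def assign(item: InputVector, clusters: Sequence[Cluster],
           distance: DistanceFunction) -> InputVector:
    """Set item.assignment and item.cost to the closest cluster."""
    costs = [distance(item.datavec, c.centroid()) for c in clusters]
    item.assignment = min(range(len(costs)), key=costs.__getitem__)
    item.cost = costs[item.assignment]
    return item


clusters = [Cluster([2.0, 2.0, 0.0], len=2), Cluster([0.0, 0.0, 3.0], len=3)]
row = assign(InputVector("ROW", 7, [1.0, 0.0, 0.0], []), clusters, squared_error)
print(row.assignment, row.cost)  # 0 1.0
```

Passing the distance function as a parameter is what would let different algorithms plug in their own error measure without changing the assignment step.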

Currently we have
• Graph Based Bi-clustering
• DisCo [15]

DisCo Algorithm
1. Initialization
   1.1 Initialize row and column clusters
   1.2 Compute global model
2. Repeat until the stopping criterion on the objective function is met
   2.1 For each row in the data, pick the row group which minimizes error
   2.2 Update row clusters
   2.3 Update global model
   2.4 For each column in the data, pick the column group which minimizes error
   2.5 Update column clusters
   2.6 Update global model
3. Return row and column clusters
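
The alternating structure of these steps can be sketched on a single machine as follows. This is an illustration only: it assumes a dense matrix without missing entries, uses block means as the global model and squared error as the objective, and stops when the error no longer decreases. These choices and all function names are assumptions for the sketch and are not necessarily what DisCo [15] or the framework uses.

```python
def global_model(matrix, row_assign, col_assign, k, l):
    """Mean of each (row group, column group) block."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            sums[row_assign[i]][col_assign[j]] += value
            counts[row_assign[i]][col_assign[j]] += 1
    return [[sums[a][b] / counts[a][b] if counts[a][b] else 0.0
             for b in range(l)] for a in range(k)]


def vector_error(vector, group, model, other_assign):
    """Squared error of one row (or column) if it were placed in `group`."""
    return sum((v - model[group][other_assign[j]]) ** 2
               for j, v in enumerate(vector))


def total_error(matrix, model, row_assign, col_assign):
    return sum(vector_error(row, row_assign[i], model, col_assign)
               for i, row in enumerate(matrix))


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def cocluster(matrix, row_assign, col_assign, k, l, max_iters=20):
    row_assign, col_assign = list(row_assign), list(col_assign)    # step 1.1
    model = global_model(matrix, row_assign, col_assign, k, l)      # step 1.2
    error = total_error(matrix, model, row_assign, col_assign)
    for _ in range(max_iters):                                      # step 2
        # Steps 2.1-2.3: reassign rows, then refresh the global model.
        row_assign = [min(range(k),
                          key=lambda g: vector_error(row, g, model, col_assign))
                      for row in matrix]
        model = global_model(matrix, row_assign, col_assign, k, l)
        # Steps 2.4-2.6: the same for columns, working on the transpose.
        model_t = transpose(model)
        col_assign = [min(range(l),
                          key=lambda g: vector_error(col, g, model_t, row_assign))
                      for col in transpose(matrix)]
        model = global_model(matrix, row_assign, col_assign, k, l)
        new_error = total_error(matrix, model, row_assign, col_assign)
        if new_error >= error:   # assumed stopping criterion: no improvement
            break
        error = new_error
    final_error = total_error(matrix, model, row_assign, col_assign)
    return row_assign, col_assign, final_error                      # step 3


matrix = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
rows, cols, err = cocluster(matrix, [0, 1, 0, 0], [0, 0, 0, 1], k=2, l=2)
print(rows, cols, err)  # [0, 1, 0, 1] [0, 0, 1, 1] 0.0
```

On this toy matrix the loop recovers the two row groups and two column groups after one pass and stops on the second, when the error stays at zero.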

Pick the Best Row Group/Cluster

Example

RowCluster Updator Job

BiClustering

Row Updator Job

RowCluster Mapper
• Input: key = lineId; value = rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError
• Pick the best row group/cluster, i.e. the one which minimizes cost or error
• Output (key type DATA): key = 0; value = rowId, clickVector, attributeVector, bestRowClusterId, cost
• Output (key type ROWCLUSTER): key = best row cluster id; value = clickVector

RowCluster Reducer, by key type
• DATA: just emit rowId, clickVector, attributeVector, bestRowClusterId, cost
• ROW CLUSTER: aggregate the row cluster; also write the updated row clusters

Conclusions and Future Work
• Implementing more algorithms
• Easy to use examples and more documentation

References
[1] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR ’99, pages 230–237, 1999.
[2] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In ICML ’04, pages 65–72, 2004.
[3] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. IJCAI ’99, 1999.
[4] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS ’07, 2007.
[5] D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In Proc. KDD ’07, pages 26–35, 2007.
[6] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW ’07, pages 271–280, 2007.
[7] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD ’08, pages 426–434, 2008.
[8] H. Ma, H. Yang, M. Lyu, and I. King. SoRec: social recommendation using probabilistic matrix factorization. In CIKM ’08, pages 931–940, 2008.
[9] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. ISMB ’00, pages 93–103, 2000.
[10] T. George and S. Merugu. A scalable collaborative filtering framework based on co-clustering. In ICDM ’05, pages 625–628, 2005.

References (contd.)
[11] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919–1986, 2007.
[12] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In Proc. KDD ’03, pages 89–98, 2003.
[13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue co-clustering of gene expression data. In Proc. SDM ’04, 2004.
[14] M. Deodhar, G. Gupta, J. Ghosh, H. Cho, and I. Dhillon. A scalable framework for discovering coherent co-clusters in noisy data. In ICML ’08, 2008.
[15] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining. In ICDM ’08, pages 512–521, 2008.
[16] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. on Computational Biology and Bioinformatics, 1(1):24–45, 2004.

Thank you

Row Cluster Updator Job

RowCluster Mapper
• Input: key = rowId; value = clickVector, attributeVector, curRowClusterId, curRowClusterError
• Pick the best row group/cluster, i.e. the one which minimizes cost or error
• Output (value type DATA): rowId, clickVector, attributeVector, bestRowClusterId, cost
• Output (value type ROWCLUSTER): key = rowId; value = updated rowCluster
• Output (value type ROW GLOB MODEL): key = rowId; value = updated partial global model

RowCluster Reducer, by value type
• DATA: just emit rowId, clickVector, attributeVector, bestRowClusterId, cost
• ROW CLUSTER: aggregate the row cluster; also write the updated row clusters
• ROW GLOB CLUSTER: aggregate the partial global model for the given row cluster; output is the updated partial global model
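
The mapper and reducer roles of the row cluster update can be imitated in memory, without Hadoop, to show the data flow. The sketch below is a simplification under stated assumptions: it keys ROWCLUSTER records by the chosen cluster id, aggregates a cluster as the element-wise mean of its members' click vectors, uses squared error as the cost, and leaves out the partial global model records. All function names are hypothetical and none of this is the Hadoop API.

```python
from collections import defaultdict


def squared_error(v1, v2):
    return sum((a - b) ** 2 for a, b in zip(v1, v2))


def row_cluster_mapper(row_id, click_vector, attribute_vector, centroids):
    """Pick the closest row cluster; emit one DATA and one ROWCLUSTER record."""
    costs = [squared_error(click_vector, c) for c in centroids]
    best = min(range(len(centroids)), key=costs.__getitem__)
    yield ("DATA", row_id), (row_id, click_vector, attribute_vector, best, costs[best])
    yield ("ROWCLUSTER", best), click_vector


def row_cluster_reducer(key, values):
    """DATA: just emit. ROWCLUSTER: aggregate members into an updated cluster."""
    value_type, _ = key
    if value_type == "DATA":
        for value in values:
            yield key, value
    else:
        n = len(values)
        yield key, [sum(column) / n for column in zip(*values)]


def run_row_update(rows, centroids):
    """In-memory stand-in for map, shuffle and reduce."""
    shuffled = defaultdict(list)
    for row_id, (click_vector, attribute_vector) in rows.items():
        for key, value in row_cluster_mapper(row_id, click_vector,
                                             attribute_vector, centroids):
            shuffled[key].append(value)
    data, clusters = {}, {}
    for key in sorted(shuffled):
        for (value_type, ident), value in row_cluster_reducer(key, shuffled[key]):
            if value_type == "DATA":
                data[ident] = value
            else:
                clusters[ident] = value
    return data, clusters


# Hypothetical input: rowId -> (clickVector, attributeVector)
rows = {
    0: ([1, 1, 0, 0], []),
    1: ([0, 0, 1, 1], []),
    2: ([1, 0, 0, 0], []),
}
centroids = [[1, 1, 0, 0], [0, 0, 1, 1]]   # current row clusters (assumed)
data, clusters = run_row_update(rows, centroids)
print(clusters)   # {0: [1.0, 0.5, 0.0, 0.0], 1: [0.0, 0.0, 1.0, 1.0]}
print(data[2])    # (2, [1, 0, 0, 0], [], 0, 1)
```

Tagging each record with its type lets one reducer handle both pass-through data and cluster aggregation, which is the pattern the slide describes.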

Column Cluster Updator Job

ColCluster Mapper
• Input: key = colId; value = clickVector, attributeVector, curColClusterId, curColClusterError
• Pick the best column group/cluster, i.e. the one which minimizes cost or error
• Output (value type DATA): colId, clickVector, attributeVector, bestColClusterId, cost
• Output (value type COLCLUSTER): key = colId; value = updated colCluster
• Output (value type COL GLOB MODEL): key = colId; value = updated partial global model

ColCluster Reducer, by value type
• DATA: just emit colId, clickVector, attributeVector, bestColClusterId, cost
• COL CLUSTER: aggregate the column cluster; also write the updated column clusters
• COL GLOB CLUSTER: aggregate the partial global model for the given column cluster; output is the updated partial global model
