Apache Hadoop India Summit 2011 talk "Framework for a Suite of Co-clustering Algorithms for Predictive Modeling on Hadoop" by Vaijanath Rao and Rohini Uppuluri
1. Framework for a suite of Co-clustering
algorithms for predictive modeling on Hadoop
Vaijanath N. Rao
(vaijanath.rao@teamaol.com)
Rohini Uppuluri
(rohini.uppuluri@teamaol.com)
2. Agenda
• Introduction
• Background
• Some Approaches
• Co-Clustering
• Introduction
• Related Work
• Why Hadoop?
• Goal
• Our Framework
• Conclusions and Future Work
Presentation for
[CLIENT]
3. Background
Modeling for Prediction
• Will user A like this movie?
• Will user B like this camera?
• Customer purchase decisions in an e-commerce setting
And many other prediction tasks…
4. Some Approaches
• Collaborative filtering
• User Based, Item Based, Model Based, Content Based, Hybrid, etc. (see [1], [2])
• Latent Models
• Probabilistic Latent Semantic Indexing [3,6]
• Matrix Factorization [4,7,8]
• Probabilistic Discrete Latent Factor [5]
• Co-clustering
• Clustering along multiple axes: [9,10] etc; survey in [16]
5. Co-clustering
[Figure: a Users × Products matrix, annotated with User Attributes and Product Attributes, is co-clustered by repeating three steps (Row Cluster Updation, Column Cluster Updation, Global Model Updation). Each pass is labeled "Reducing Error"; the result is Clustered Users × Clustered Products.]
Input matrix (rows = Users, columns = Products, ? = unknown):
1 0 1 1 1
0 ? 1 ? 0
1 1 ? 0 0
? 0 0 ? 0
Output matrix (rows = Clustered Users, columns = Clustered Products):
0 0 1 1 1
1 ? 1 ? 0
1 1 ? 0 0
? 1 0 ? 0
6. Some Approaches
• Bregman co-clustering - Framework [11]
• Information theoretic co-clustering [12]
• Min sum squared co-clustering [13]
• Scalable framework based on the Bregman framework [14]
• DisCo [15]
7. Why Hadoop
• Real-world data is huge
• Large matrix to operate on (millions and millions of rows, millions of columns!)
• A lot of computation
8. Goal
• There are a number of approaches, hence the need for a common framework
• Build a framework on Hadoop that the multiple algorithms can fit into
• Make the framework easy for users to choose from and use
9. Overview
[Diagram: the Input feeds three jobs, each producing one output]
• Row Cluster Updator Job → Row Clusters
• Column Cluster Updator Job → Column Clusters
• Global Model Updator Job → Global Model
10. Overview : Core Interfaces
• Input vector (type, id, datavec, attributevec, cost, assignment)
• Cluster (vector, len)
• Row Cluster
• Column Cluster
• Distance/Error Function (vector1, vector2)
• Model (matrix)
• Row Model
• Column Model
• Group Model
• Objective Function (Model1, Model2)
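The interfaces listed above could be sketched in Python roughly as follows. This is only an illustration, not the framework's actual API: the field types, the "ROW" type value, the choice of squared error as the distance, and the helper names `centroid`, `model_change` and `assign` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

Vector = List[float]


@dataclass
class InputVector:
    type: str                       # e.g. "ROW" or "COL" (assumed values)
    id: int
    datavec: Vector
    attributevec: Vector = field(default_factory=list)
    cost: float = float("inf")      # error against the assigned cluster
    assignment: Optional[int] = None


@dataclass
class Cluster:
    vector: Vector                  # running sum of member vectors (assumption)
    len: int = 0                    # number of members

    def add(self, v: Vector) -> None:
        self.vector = [a + b for a, b in zip(self.vector, v)]
        self.len += 1

    def centroid(self) -> Vector:
        return [a / self.len for a in self.vector] if self.len else list(self.vector)


@dataclass
class Model:
    matrix: List[Vector]


def squared_error(vector1: Vector, vector2: Vector) -> float:
    """Distance/Error Function (vector1, vector2); squared error is an assumption."""
    return sum((a - b) ** 2 for a, b in zip(vector1, vector2))


def model_change(model1: Model, model2: Model) -> float:
    """Objective Function (Model1, Model2): how far two models are apart."""
    return sum(squared_error(r1, r2) for r1, r2 in zip(model1.matrix, model2.matrix))


def assign(item: InputVector, clusters: List[Cluster], distance=squared_error) -> InputVector:
    """Record the closest cluster and its cost on the input vector."""
    for cid, cluster in enumerate(clusters):
        d = distance(item.datavec, cluster.centroid())
        if d < item.cost:
            item.cost, item.assignment = d, cid
    return item
```

Keeping the distance and objective as plain functions is one way to let different co-clustering algorithms swap in their own error measure without touching the data types.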
11. Currently we have
• Graph Based Bi-clustering
• Disco
12. Disco Algorithm
1. Initialization
1.1 Initialize row and column clusters
1.2 Compute global model
2. While the stopping criterion on the objective function is not met
2.1 For each row in the data, pick the row group which minimizes error
2.2 Update row clusters
2.3 Update global model
2.4 For each column in the data, pick the column group which minimizes error
2.5 Update column clusters
2.6 Update global model
3. Return row and column clusters
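The alternating loop above can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the DisCo implementation from [15]: the global model is taken to be the mean of the observed cells in each (row group, column group) block, the error is squared error against that mean, the initial assignment is round-robin, and all function names are hypothetical. Missing cells ("?") are represented as `None`.

```python
from typing import List, Optional

Matrix = List[List[Optional[float]]]


def global_model(data: Matrix, row_of, col_of, k: int, l: int):
    """Block means over observed cells; an empty block defaults to 0.0."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(data):
        for j, x in enumerate(row):
            if x is not None:
                sums[row_of[i]][col_of[j]] += x
                counts[row_of[i]][col_of[j]] += 1
    return [[sums[g][h] / counts[g][h] if counts[g][h] else 0.0
             for h in range(l)] for g in range(k)]


def row_error(data, i, g, col_of, model):
    """Squared error of row i if it were placed in row group g."""
    return sum((x - model[g][col_of[j]]) ** 2
               for j, x in enumerate(data[i]) if x is not None)


def col_error(data, j, h, row_of, model):
    """Squared error of column j if it were placed in column group h."""
    return sum((data[i][j] - model[row_of[i]][h]) ** 2
               for i in range(len(data)) if data[i][j] is not None)


def objective(data, row_of, col_of, model):
    return sum(row_error(data, i, row_of[i], col_of, model)
               for i in range(len(data)))


def co_cluster(data: Matrix, k: int, l: int, max_iter: int = 50, tol: float = 1e-9):
    n, m = len(data), len(data[0])
    row_of = [i % k for i in range(n)]                      # step 1.1
    col_of = [j % l for j in range(m)]
    model = global_model(data, row_of, col_of, k, l)        # step 1.2
    prev = objective(data, row_of, col_of, model)
    for _ in range(max_iter):                               # step 2
        row_of = [min(range(k), key=lambda g: row_error(data, i, g, col_of, model))
                  for i in range(n)]                        # steps 2.1-2.2
        model = global_model(data, row_of, col_of, k, l)    # step 2.3
        col_of = [min(range(l), key=lambda h: col_error(data, j, h, row_of, model))
                  for j in range(m)]                        # steps 2.4-2.5
        model = global_model(data, row_of, col_of, k, l)    # step 2.6
        cur = objective(data, row_of, col_of, model)
        if prev - cur <= tol:                               # no further improvement
            break
        prev = cur
    return row_of, col_of, model                            # step 3
```

Once the loop stops, an unknown cell (i, j) can be predicted as `model[row_of[i]][col_of[j]]`. Like other alternating schemes, this converges to a local optimum, so the result depends on the initial assignment.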
13. Pick the Best Row Group/Cluster
20. Row Updator Job
Mapper input
• key: lineId
• value: rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError
RowCluster Mapper
• Pick the best row group/cluster, the one which minimizes cost or error
Mapper output
• KeyType DATA: key 0; value: rowId clickVector attributeVector bestRowClusterId cost
• Key type ROWCLUSTER: key Best Row Cluster Id; value: clickVector
RowCluster Reducer, by keyType
• DATA: just emit rowId clickVector attributeVector bestRowClusterId cost
• ROW CLUSTER: aggregate the row cluster
Also write
• Updated Row Clusters
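The mapper/reducer split on this slide can be imitated in plain Python to show the data flow. This is a rough sketch under assumptions, not the framework's Hadoop code: the cost is taken to be the squared error between a row's clickVector and a cluster centroid, the DATA record is keyed by rowId, the current cluster id and error in the input are omitted, and `run_job` is a hypothetical in-memory stand-in for the shuffle phase.

```python
from collections import defaultdict

DATA, ROWCLUSTER = "DATA", "ROWCLUSTER"


def sq_error(v1, v2):
    return sum((a - b) ** 2 for a, b in zip(v1, v2))


def row_cluster_mapper(row_id, click_vector, attribute_vector, centroids):
    """Pick the row cluster with the lowest cost and emit two typed records."""
    best_id, cost = min(
        ((cid, sq_error(click_vector, c)) for cid, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    # DATA record: the row together with its new assignment and cost.
    yield (DATA, row_id), (row_id, click_vector, attribute_vector, best_id, cost)
    # ROWCLUSTER record: the row's contribution to its best cluster.
    yield (ROWCLUSTER, best_id), click_vector


def row_cluster_reducer(key, values):
    key_type, _ = key
    if key_type == DATA:
        for value in values:            # DATA: just emit
            yield key, value
    else:                               # ROWCLUSTER: aggregate into a new centroid
        values = list(values)
        n = len(values)
        yield key, ([sum(col) / n for col in zip(*values)], n)


def run_job(rows, centroids):
    """In-memory stand-in for map, shuffle (group by key) and reduce."""
    grouped = defaultdict(list)
    for row_id, clicks, attrs in rows:
        for key, value in row_cluster_mapper(row_id, clicks, attrs, centroids):
            grouped[key].append(value)
    output = []
    for key in sorted(grouped):
        output.extend(row_cluster_reducer(key, grouped[key]))
    return output
```

Tagging each record with a type lets a single job carry both the reassigned rows and the per-cluster aggregates, so the data does not need a second pass to rebuild the row clusters.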
22. Conclusions and Future Work
• Implementing more algorithms
• Easy to use examples and more documentation
23. References
[1] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR, pages 230–237, 1999.
[2] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In ICML ’04, pages 65–72, 2004.
[3] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. IJCAI ’99, 1999.
[4] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS ’07, 2007.
[5] D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In Proc. KDD ’07, pages 26–35, 2007.
[6] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW ’07, pages 271–280, 2007.
[7] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD ’08, pages 426–434, 2008.
[8] H. Ma, H. Yang, M. Lyu, and I. King. SoRec: social recommendation using probabilistic matrix factorization. In CIKM ’08, pages 931–940, 2008.
[9] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. ISMB ’00, pages 93–103, 2000.
[10] T. George and S. Merugu. A scalable collaborative filtering framework based on co-clustering. In ICDM, pages 625–628, 2005.
24. References (contd.)
[11] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919–1986, 2007.
[12] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In Proc. KDD ’03, pages 89–98, 2003.
[13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue co-clustering of gene expression data. In Proc. SDM ’04, 2004.
[14] M. Deodhar, G. Gupta, J. Ghosh, H. Cho, and I. Dhillon. A scalable framework for discovering coherent co-clusters in noisy data. In ICML ’08, 2008.
[15] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining. In ICDM ’08, pages 512–521, 2008.
[16] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE Trans. on Computational Biology and Bioinformatics, 1(1):24–45, 2004.
26. Row Cluster Updator Job
Mapper input
• key: rowId
• value: clickVector, attributeVector, curRowClusterId, curRowClusterError
RowCluster Mapper
• Pick the best row group/cluster, the one which minimizes cost or error
Mapper output (key, value, value type)
• rowId; clickVector attributeVector bestRowClusterId cost; DATA
• rowId; Updated rowCluster; ROWCLUSTER
• rowId; Updated Partial GlobalModel; ROW GLOB MODEL
RowCluster Reducer, by ValueType
• DATA: just emit rowId clickVector attributeVector bestRowClusterId cost
• ROW CLUSTER: aggregate the row cluster
• ROW GLOB CLUSTER: aggregate the partial global model for the given row cluster
Also write
• Updated Row Clusters
• Updated Partial Global Model
27. Column Cluster Updator Job
Mapper input
• key: colId
• value: clickVector, attributeVector, curColClusterId, curColClusterError
ColCluster Mapper
• Pick the best col group/cluster, the one which minimizes cost or error
Mapper output (key, value, value type)
• colId; clickVector attributeVector bestColClusterId cost; DATA
• colId; Updated colCluster; COLCLUSTER
• colId; Updated Partial GlobalModel; COL GLOB MODEL
ColCluster Reducer, by ValueType
• DATA: just emit colId clickVector attributeVector bestColClusterId cost
• COL CLUSTER: aggregate the col cluster
• COL GLOB CLUSTER: aggregate the partial global model for the given col cluster
Also write
• Updated Col Clusters
• Updated Partial Global Model