Framework for a suite of Co-clustering
algorithms for predictive modeling on Hadoop




 Vaijanath N. Rao
 (vaijanath.rao@teamaol.com)
 Rohini Uppuluri
 (rohini.uppuluri@teamaol.com)
Agenda
• Introduction
   • Background
   • Some Approaches

• Co-Clustering
   • Introduction
   • Related Work
   • Why Hadoop?

• Goal
• Our Framework
• Conclusions and Future Work

                                Presentation for
                                       [CLIENT]
Background
Modeling for Prediction
   • Will user A like this movie?

   • Will user B like this camera?

   • What will a customer decide to purchase in an e-commerce setting?

   And many other prediction problems…




Some Approaches
• Collaborative filtering
   • User based, item based, model based, content based, hybrid, etc. (see [1], [2])
• Latent models
   • Probabilistic Latent Semantic Indexing [3,6]
   • Matrix Factorization [4,7,8]
   • Predictive Discrete Latent Factor models [5]

• Co-clustering
   • Clustering along multiple axes [9,10], etc.; survey in [16]




Co-clustering
[Figure: a Users by Products matrix with entries 1, 0, or ? (unknown), with user attributes and product attributes alongside. Repeated rounds of row cluster update, column cluster update, and global model update reduce the error, giving a matrix arranged by clustered users and clustered products.]
Some Approaches
• Bregman co-clustering framework [11]
   • Information-theoretic co-clustering [12]
   • Minimum sum-squared residue co-clustering [13]

• Scalable framework based on the Bregman framework [14]
• DisCo [15]




Why Hadoop?
• Real-world data is huge
• Large matrix to operate on (millions of rows, millions of columns)
• A lot of computation




Goal
• There are a number of approaches, hence the need for a common framework
• Build a framework on Hadoop into which multiple algorithms fit
• Make it easy for users to choose and use an algorithm




Overview

• Input → Row Cluster Updator Job → Row Clusters
• Input → Column Cluster Updator Job → Column Clusters
• Global Model Updator Job → Global Model



Overview : Core Interfaces
• Input vector (type, id, datavec, attributevec, cost, assignment)
• Cluster (vector, len)
   • Row Cluster
   • Column Cluster
• Distance/Error Function (vector1, vector2)
• Model (matrix)
   • Row Model
   • Column Model
   • Group Model
• Objective Function (Model1, Model2)
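
A minimal sketch of how some of the interfaces listed above might look, written in Python for brevity rather than as Hadoop code. All class, field, and function names are hypothetical and only mirror the slide's wording. The squared error function and the idea that a cluster stores a running sum are assumptions made for illustration. Model and Objective Function are left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Vector = List[float]


@dataclass
class InputVector:
    """One row or column of the data matrix (field names follow the slide)."""
    type: str                          # "ROW" or "COLUMN"
    id: int
    datavec: Vector
    attributevec: Vector = field(default_factory=list)
    cost: float = float("inf")         # error under the current assignment
    assignment: Optional[int] = None   # current cluster id, if any


@dataclass
class Cluster:
    """Assumed representation: running sum of member vectors plus a count."""
    vector: Vector
    len: int = 0

    def add(self, v: Vector) -> None:
        self.vector = [a + b for a, b in zip(self.vector, v)]
        self.len += 1

    def centroid(self) -> Vector:
        if self.len == 0:
            return list(self.vector)
        return [a / self.len for a in self.vector]


# Distance/Error Function (vector1, vector2)
ErrorFunction = Callable[[Vector, Vector], float]


def squared_error(v1: Vector, v2: Vector) -> float:
    """Assumed example of an error function."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2))


def assign(item: InputVector, clusters: List[Cluster],
           error: ErrorFunction) -> InputVector:
    """Set item.assignment and item.cost to the lowest-error cluster."""
    costs = [error(item.datavec, c.centroid()) for c in clusters]
    item.assignment = min(range(len(costs)), key=costs.__getitem__)
    item.cost = costs[item.assignment]
    return item
```

Keeping the error function as a plain callable means a different algorithm can be tried by passing a different function, without touching the vector or cluster types.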




Currently we have
• Graph Based Bi-clustering
• DisCo




DisCo Algorithm
 1. Initialization
           1.1 Initialize row and column clusters
           1.2 Compute the global model
 2. Repeat until the objective function is met
           2.1 For each row in the data, pick the row group
                which minimizes error
           2.2 Update row clusters
           2.3 Update global model
           2.4 For each column in the data, pick the column
                group which minimizes error
           2.5 Update column clusters
           2.6 Update global model
 3. Return row and column clusters
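
The alternating loop above can be sketched on a single machine as follows. This is an illustration under stated assumptions, not the DisCo implementation: the global model is taken to be the mean of each (row group, column group) block, the error is a plain squared error against those means, the matrix is dense with no unknown entries, and the round-robin initialization and all function names are hypothetical.

```python
def block_means(data, rows, cols, k, l):
    """Global model (assumed form): mean of each (row group, column group) block."""
    total = [[0.0] * l for _ in range(k)]
    count = [[0] * l for _ in range(k)]
    for i, row in enumerate(data):
        for j, x in enumerate(row):
            total[rows[i]][cols[j]] += x
            count[rows[i]][cols[j]] += 1
    return [[total[g][h] / count[g][h] if count[g][h] else 0.0
             for h in range(l)] for g in range(k)]


def objective(data, rows, cols, model):
    """Total squared error of the data against the block means."""
    return sum((x - model[rows[i]][cols[j]]) ** 2
               for i, row in enumerate(data) for j, x in enumerate(row))


def cocluster(data, k, l, max_iter=20, tol=1e-12):
    n, m = len(data), len(data[0])
    rows = [i % k for i in range(n)]              # 1.1 initial row clusters
    cols = [j % l for j in range(m)]              # 1.1 initial column clusters
    model = block_means(data, rows, cols, k, l)   # 1.2 global model
    prev = objective(data, rows, cols, model)
    for _ in range(max_iter):
        # 2.1-2.3: move each row to its lowest-error group, refresh the model
        for i in range(n):
            rows[i] = min(range(k), key=lambda g: sum(
                (data[i][j] - model[g][cols[j]]) ** 2 for j in range(m)))
        model = block_means(data, rows, cols, k, l)
        # 2.4-2.6: same for columns
        for j in range(m):
            cols[j] = min(range(l), key=lambda h: sum(
                (data[i][j] - model[rows[i]][h]) ** 2 for i in range(n)))
        model = block_means(data, rows, cols, k, l)
        cur = objective(data, rows, cols, model)
        if prev - cur <= tol:                     # stop once the error stops improving
            break
        prev = cur
    return rows, cols, model                      # 3. clusters (plus the model)
```

Like other alternating schemes, this can settle in a poor local minimum when the starting assignment is uninformative, so the choice of initialization matters in practice.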

Pick the Best Row Group/Cluster




Example




RowCluster Updator Job




Example




BiClustering




Pick the Best Row Group/Cluster




Example




Row Updator Job
• RowCluster Mapper
   • Input key: lineId
   • Input value: 0, rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError
   • Picks the best row group/cluster, the one which minimizes cost or error
   • Output (key type DATA): key rowId; value clickVector, attributeVector, bestRowClusterId, cost
   • Output (key type ROWCLUSTER): key best row cluster id; value clickVector
• RowCluster Reducer, by key type
   • DATA: just emit rowId, clickVector, attributeVector, bestRowClusterId, cost
   • ROWCLUSTER: aggregate the row cluster
   • Also writes the updated row clusters
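
The mapper and reducer roles in this job can be simulated in memory as below. This is a sketch under assumptions, not the framework's code: the shuffle is a plain dictionary, "aggregating" a row cluster is assumed to mean averaging its click vectors, the error is squared error, the leading 0 field of the input value is dropped, and every function name is hypothetical.

```python
from collections import defaultdict


def squared_error(v1, v2):
    return sum((a - b) ** 2 for a, b in zip(v1, v2))


def row_cluster_mapper(record, centroids):
    """record = (rowId, clickVector, attributeVector, curClusterId, curError)."""
    row_id, click, attrs, _cur_id, _cur_err = record
    costs = [squared_error(click, c) for c in centroids]
    best = min(range(len(costs)), key=costs.__getitem__)
    # DATA record: the row itself, tagged with its best cluster and cost
    yield ("DATA", row_id), (click, attrs, best, costs[best])
    # ROWCLUSTER record: the row's contribution to its best cluster
    yield ("ROWCLUSTER", best), click


def row_cluster_reducer(key, values):
    key_type, _key_id = key
    if key_type == "DATA":
        for v in values:              # just emit
            yield key, v
    else:                             # aggregate the row cluster (assumed: mean)
        n = len(values)
        yield key, [sum(col) / n for col in zip(*values)]


def run_row_job(records, centroids):
    """Tiny stand-in for map, shuffle, and reduce."""
    shuffled = defaultdict(list)
    for rec in records:
        for k, v in row_cluster_mapper(rec, centroids):
            shuffled[k].append(v)
    out = {}
    for k in sorted(shuffled):
        for out_k, out_v in row_cluster_reducer(k, shuffled[k]):
            out[out_k] = out_v
    return out
```

Tagging each key with its type lets a single job carry both the re-labelled rows and the cluster updates, which is the design the slide describes.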




Example




Conclusions and Future Work
• Implementing more algorithms
• Easy-to-use examples and more documentation




References
[1] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR ’99, pages 230–237, 1999.
[2] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In ICML ’04, pages 65–72, 2004.
[3] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. IJCAI ’99, 1999.
[4] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS ’07, 2007.
[5] D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In Proc. KDD ’07, pages 26–35, 2007.
[6] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW ’07, pages 271–280, 2007.
[7] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD ’08, pages 426–434, 2008.
[8] H. Ma, H. Yang, M. Lyu, and I. King. SoRec: social recommendation using probabilistic matrix factorization. In CIKM ’08, pages 931–940, 2008.
[9] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. ISMB ’00, pages 93–103, 2000.
[10] T. George and S. Merugu. A scalable collaborative filtering framework based on co-clustering. In ICDM ’05, pages 625–628, 2005.




References (contd.)
[11] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919–1986, 2007.
[12] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In Proc. KDD ’03, pages 89–98, 2003.
[13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue co-clustering of gene expression data. In Proc. SDM ’04, 2004.
[14] M. Deodhar, G. Gupta, J. Ghosh, H. Cho, and I. Dhillon. A scalable framework for discovering coherent co-clusters in noisy data. In ICML ’08, 2008.
[15] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with MapReduce: A case study towards petabyte-scale end-to-end mining. In ICDM ’08, pages 512–521, 2008.
[16] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. on Computational Biology and Bioinformatics, 1(1):24–45, 2004.




Thank you




Row Cluster Updator Job
• RowCluster Mapper
   • Input key: rowId
   • Input value: clickVector, attributeVector, curRowClusterId, curRowClusterError
   • Picks the best row group/cluster, the one which minimizes cost or error
   • Output (value type DATA): key rowId; value clickVector, attributeVector, bestRowClusterId, cost
   • Output (value type ROWCLUSTER): key rowId; value updated rowCluster
   • Output (value type ROW GLOB MODEL): key rowId; value updated partial GlobalModel
• RowCluster Reducer, by value type
   • DATA: just emit rowId, clickVector, attributeVector, bestRowClusterId, cost
   • ROWCLUSTER: aggregate the row cluster
   • ROW GLOB MODEL: aggregate the partial global model for the given row cluster
   • Also writes the updated row clusters and the updated partial global model
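
One way the partial global model could be built and aggregated is sketched below. It rests on assumptions not stated on the slide: a row's partial model is taken to be the per-column-cluster sums and counts of its click vector, and the reducer is taken to combine them into block means for one row cluster. Function names are hypothetical.

```python
def partial_global_model(click_vector, col_assignment, num_col_clusters):
    """Mapper side: one row's (sums, counts) per column cluster."""
    sums = [0.0] * num_col_clusters
    counts = [0] * num_col_clusters
    for x, h in zip(click_vector, col_assignment):
        sums[h] += x
        counts[h] += 1
    return sums, counts


def aggregate_partials(partials):
    """Reducer side: combine the partials of one row cluster into block means."""
    partials = list(partials)
    total = [sum(s) for s in zip(*(p[0] for p in partials))]
    count = [sum(c) for c in zip(*(p[1] for p in partials))]
    return [t / c if c else 0.0 for t, c in zip(total, count)]
```

Because sums and counts add up associatively, the same aggregation could also run in a combiner before the reducer, which reduces the data shuffled per row cluster.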




Column Cluster Updator Job
• ColCluster Mapper
   • Input key: colId
   • Input value: clickVector, attributeVector, curColClusterId, curColClusterError
   • Picks the best column group/cluster, the one which minimizes cost or error
   • Output (value type DATA): key colId; value clickVector, attributeVector, bestColClusterId, cost
   • Output (value type COLCLUSTER): key colId; value updated colCluster
   • Output (value type COL GLOB MODEL): key colId; value updated partial GlobalModel
• ColCluster Reducer, by value type
   • DATA: just emit colId, clickVector, attributeVector, bestColClusterId, cost
   • COLCLUSTER: aggregate the column cluster
   • COL GLOB MODEL: aggregate the partial global model for the given column cluster
   • Also writes the updated column clusters and the updated partial global model





Mais conteúdo relacionado

Semelhante a Apache Hadoop India Summit 2011 talk "Framework for a Suite of Co-clustering Algorithms for Predictive Modeling on Hadoop" by Vaijanath Rao and Rohini Uppuluri

Semantic SOA Governance
Semantic SOA GovernanceSemantic SOA Governance
Semantic SOA Governancearivolit
 
How Do Developers React to API Deprecation? The Case of a Smalltalk Ecosystem
How Do Developers React to API Deprecation? The Case of a Smalltalk EcosystemHow Do Developers React to API Deprecation? The Case of a Smalltalk Ecosystem
How Do Developers React to API Deprecation? The Case of a Smalltalk Ecosystemmircea.lungu
 
Introduction to Performance Testing Part 1
Introduction to Performance Testing Part 1Introduction to Performance Testing Part 1
Introduction to Performance Testing Part 1C.T.Co
 
Using OSGi R4 Service Platform in Vehicle Embedded Systems - Miguel Lopez, So...
Using OSGi R4 Service Platform in Vehicle Embedded Systems - Miguel Lopez, So...Using OSGi R4 Service Platform in Vehicle Embedded Systems - Miguel Lopez, So...
Using OSGi R4 Service Platform in Vehicle Embedded Systems - Miguel Lopez, So...mfrancis
 
OpenTravel XML Object Suite - Component Model
OpenTravel XML Object Suite - Component ModelOpenTravel XML Object Suite - Component Model
OpenTravel XML Object Suite - Component ModelOpenTravel Alliance
 
Re the status_quo_and_what_lies_ahead
Re the status_quo_and_what_lies_aheadRe the status_quo_and_what_lies_ahead
Re the status_quo_and_what_lies_aheadEdward John Crain
 
Using Evolution Patterns to Evolve Software Architectures
Using Evolution Patterns to Evolve Software ArchitecturesUsing Evolution Patterns to Evolve Software Architectures
Using Evolution Patterns to Evolve Software ArchitecturesTom Mens
 
Using Composite Feature Models to Support Agile Software Product Line Evoluti...
Using Composite Feature Models to Support Agile Software Product Line Evoluti...Using Composite Feature Models to Support Agile Software Product Line Evoluti...
Using Composite Feature Models to Support Agile Software Product Line Evoluti...Simon Urli
 
Framework Engineering_Final
Framework Engineering_FinalFramework Engineering_Final
Framework Engineering_FinalYoungSu Son
 
EclipseCon Europe 2015 - liferay modularity patterns using OSGi -Rafik Harabi
EclipseCon Europe 2015 - liferay modularity patterns using OSGi -Rafik HarabiEclipseCon Europe 2015 - liferay modularity patterns using OSGi -Rafik Harabi
EclipseCon Europe 2015 - liferay modularity patterns using OSGi -Rafik HarabiRafik HARABI
 
Ontology-Based Systems Federation
Ontology-Based Systems FederationOntology-Based Systems Federation
Ontology-Based Systems FederationAnatoly Levenchuk
 
The SENSORIA Development Environment
The SENSORIA Development EnvironmentThe SENSORIA Development Environment
The SENSORIA Development EnvironmentIstvan Rath
 
Telecom Transformation Using SOA_2
Telecom Transformation Using SOA_2Telecom Transformation Using SOA_2
Telecom Transformation Using SOA_2didemtopuz
 
Telecom Transformation Using SOA
Telecom Transformation Using SOATelecom Transformation Using SOA
Telecom Transformation Using SOAdidemtopuz
 
AddQ Testautomatiseringserfarenheter
AddQ TestautomatiseringserfarenheterAddQ Testautomatiseringserfarenheter
AddQ TestautomatiseringserfarenheterAddQ Consulting
 
SCRUM + CMMI = SCRUMMI?
SCRUM + CMMI = SCRUMMI?SCRUM + CMMI = SCRUMMI?
SCRUM + CMMI = SCRUMMI?mharbolt
 
Software Frameworks for Music Information Retrieval
Software Frameworks for Music Information RetrievalSoftware Frameworks for Music Information Retrieval
Software Frameworks for Music Information RetrievalXavier Amatriain
 

Semelhante a Apache Hadoop India Summit 2011 talk "Framework for a Suite of Co-clustering Algorithms for Predictive Modeling on Hadoop" by Vaijanath Rao and Rohini Uppuluri (20)

Semantic SOA Governance
Semantic SOA GovernanceSemantic SOA Governance
Semantic SOA Governance
 
How Do Developers React to API Deprecation? The Case of a Smalltalk Ecosystem
How Do Developers React to API Deprecation? The Case of a Smalltalk EcosystemHow Do Developers React to API Deprecation? The Case of a Smalltalk Ecosystem
How Do Developers React to API Deprecation? The Case of a Smalltalk Ecosystem
 
Introduction to Performance Testing Part 1
Introduction to Performance Testing Part 1Introduction to Performance Testing Part 1
Introduction to Performance Testing Part 1
 
Using OSGi R4 Service Platform in Vehicle Embedded Systems - Miguel Lopez, So...
Using OSGi R4 Service Platform in Vehicle Embedded Systems - Miguel Lopez, So...Using OSGi R4 Service Platform in Vehicle Embedded Systems - Miguel Lopez, So...
Using OSGi R4 Service Platform in Vehicle Embedded Systems - Miguel Lopez, So...
 
OpenTravel XML Object Suite - Component Model
OpenTravel XML Object Suite - Component ModelOpenTravel XML Object Suite - Component Model
OpenTravel XML Object Suite - Component Model
 
Re the status_quo_and_what_lies_ahead
Re the status_quo_and_what_lies_aheadRe the status_quo_and_what_lies_ahead
Re the status_quo_and_what_lies_ahead
 
Using Evolution Patterns to Evolve Software Architectures
Using Evolution Patterns to Evolve Software ArchitecturesUsing Evolution Patterns to Evolve Software Architectures
Using Evolution Patterns to Evolve Software Architectures
 
Using Composite Feature Models to Support Agile Software Product Line Evoluti...
Using Composite Feature Models to Support Agile Software Product Line Evoluti...Using Composite Feature Models to Support Agile Software Product Line Evoluti...
Using Composite Feature Models to Support Agile Software Product Line Evoluti...
 
Framework Engineering_Final
Framework Engineering_FinalFramework Engineering_Final
Framework Engineering_Final
 
EclipseCon Europe 2015 - liferay modularity patterns using OSGi -Rafik Harabi
EclipseCon Europe 2015 - liferay modularity patterns using OSGi -Rafik HarabiEclipseCon Europe 2015 - liferay modularity patterns using OSGi -Rafik Harabi
EclipseCon Europe 2015 - liferay modularity patterns using OSGi -Rafik Harabi
 
Ontology-Based Systems Federation
Ontology-Based Systems FederationOntology-Based Systems Federation
Ontology-Based Systems Federation
 
The SENSORIA Development Environment
The SENSORIA Development EnvironmentThe SENSORIA Development Environment
The SENSORIA Development Environment
 
ICSM05.ppt
ICSM05.pptICSM05.ppt
ICSM05.ppt
 
Telecom Transformation Using SOA_2
Telecom Transformation Using SOA_2Telecom Transformation Using SOA_2
Telecom Transformation Using SOA_2
 
Telecom Transformation Using SOA
Telecom Transformation Using SOATelecom Transformation Using SOA
Telecom Transformation Using SOA
 
AddQ Testautomatiseringserfarenheter
AddQ TestautomatiseringserfarenheterAddQ Testautomatiseringserfarenheter
AddQ Testautomatiseringserfarenheter
 
Jin Hai
Jin HaiJin Hai
Jin Hai
 
SCRUM + CMMI = SCRUMMI?
SCRUM + CMMI = SCRUMMI?SCRUM + CMMI = SCRUMMI?
SCRUM + CMMI = SCRUMMI?
 
Software Frameworks for Music Information Retrieval
Software Frameworks for Music Information RetrievalSoftware Frameworks for Music Information Retrieval
Software Frameworks for Music Information Retrieval
 
Service-Oriented Modeling Language
Service-Oriented Modeling LanguageService-Oriented Modeling Language
Service-Oriented Modeling Language
 

Mais de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

Apache Hadoop India Summit 2011 talk "Framework for a Suite of Co-clustering Algorithms for Predictive Modeling on Hadoop" by Vaijanath Rao and Rohini Uppuluri

  • 1. Framework for a suite of Co-clustering algorithms for predictive modeling on Hadoop
      Vaijanath N. Rao (vaijanath.rao@teamaol.com)
      Rohini Uppuluri (rohini.uppuluri@teamaol.com)
  • 2. Agenda
      • Introduction
          • Background
          • Some Approaches
      • Co-Clustering
          • Introduction
          • Related Work
          • Why Hadoop?
      • Goal
      • Our Framework
      • Conclusions and Future Work
      Presentation for [CLIENT]
  • 3. Background
      Modeling for Prediction
      • Will user A like this movie?
      • Will user B like this camera?
      • Customer purchase decisions in an e-commerce setting
      And tons of other things…
  • 4. Some Approaches
      • Collaborative filtering
          • User-based, item-based, model-based, content-based, hybrid, etc. (see [1], [2])
      • Latent models
          • Probabilistic Latent Semantic Indexing [3,6]
          • Matrix Factorization [4,7,8]
          • Predictive Discrete Latent Factor models [5]
      • Co-clustering
          • Clustering along multiple axes [9,10], etc.; survey in [16]
  • 5. Co-clustering
      [Figure: a Users × Products matrix with observed (0/1) and missing (?) entries, with Product Attributes and User Attributes as side information. Repeated rounds of Row Cluster Updation, Column Cluster Updation and Global Model Updation (arrow labelled "Reducing Error") lead to a matrix of Clustered Users × Clustered Products.]
  • 6. Some Approaches
      • Bregman co-clustering framework [11]
      • Information-theoretic co-clustering [12]
      • Minimum sum-squared residue co-clustering [13]
      • Scalable framework based on the Bregman framework [14]
      • DisCo [15]
  • 7. Why Hadoop
      • Real-world data is huge
      • Large matrix to operate on (millions and millions of rows, millions of columns!)
      • Lots of computation
  • 8. Goal
      • Many approaches exist, so there is a need for a common framework
      • Build a framework on Hadoop into which multiple algorithms can fit
      • Make it easy for users to choose an algorithm and use it
  • 9. Overview
      [Diagram: the Input and three jobs, each shown with its output: Row Cluster Updator Job → Row Clusters; Column Cluster Updator Job → Column Clusters; Global Model Updator Job → Global Model.]
  • 10. Overview: Core Interfaces
      • Input vector (type, id, datavec, attributevec, cost, assignment)
      • Cluster (vector, len)
          • Row Cluster
          • Column Cluster
      • Distance/Error Function (vector1, vector2)
      • Model (matrix)
          • Row Model
          • Column Model
          • Group Model
      • Objective Function (Model1, Model2)
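As a rough illustration of how the core interfaces on slide 10 could fit together, here is a minimal Python sketch. Only the field names (type, id, datavec, attributevec, cost, assignment, vector, len) come from the slide. The class layout, the `length` attribute standing in for `len`, the running-mean update, the squared-error distance and the `assign` helper are all assumptions made for illustration, not the framework's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Vector = List[float]


@dataclass
class InputVector:
    """One row or column of the input matrix (fields named as on the slide)."""
    type: str                          # assumed to be "row" or "column"
    id: int
    datavec: Vector
    attributevec: Vector = field(default_factory=list)
    cost: float = 0.0
    assignment: Optional[int] = None   # index of the assigned cluster


@dataclass
class Cluster:
    """A row or column cluster: a representative vector plus a member count."""
    vector: Vector
    length: int = 0                    # corresponds to `len` on the slide

    def add(self, member: Vector) -> None:
        # Assumption: the representative is the running mean of the members.
        self.length += 1
        self.vector = [c + (x - c) / self.length
                       for c, x in zip(self.vector, member)]


# Distance/Error Function (vector1, vector2)
DistanceFn = Callable[[Vector, Vector], float]


def squared_error(v1: Vector, v2: Vector) -> float:
    """Assumed error function: sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2))


def assign(item: InputVector, clusters: List[Cluster],
           distance: DistanceFn = squared_error) -> InputVector:
    """Record the cluster with the smallest error in the item."""
    costs = [distance(item.datavec, c.vector) for c in clusters]
    best = min(range(len(costs)), key=costs.__getitem__)
    item.assignment, item.cost = best, costs[best]
    return item
```

Passing the distance function as a parameter is one way to let different algorithms reuse the same assignment step, which is the kind of pluggability the Goal slide asks for.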
  • 11. Currently we have
      • Graph-based bi-clustering
      • DisCo
  • 12. DisCo Algorithm
      1. Initialization
          1.1 Initialize row and column clusters
          1.2 Compute global model
      2. While the objective function is still improving
          2.1 For each row in the data, pick the row group which minimizes error
          2.2 Update row clusters
          2.3 Update global model
          2.4 For each column in the data, pick the column group which minimizes error
          2.5 Update column clusters
          2.6 Update global model
      3. Return row and column clusters
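The alternating loop on slide 12 can be sketched on a single machine as follows. This is a simplified illustration, not the distributed DisCo implementation. It assumes a small dense matrix with no missing entries, takes the global model to be the mean of each (row cluster, column cluster) block, uses squared error as the cost, and starts from a round-robin assignment. The function names are hypothetical.

```python
def block_means(A, rows, cols, k, l):
    """Global model (assumed): mean of each (row cluster, column cluster) block."""
    total = [[0.0] * l for _ in range(k)]
    count = [[0] * l for _ in range(k)]
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            total[r][c] += A[i][j]
            count[r][c] += 1
    return [[total[r][c] / count[r][c] if count[r][c] else 0.0
             for c in range(l)] for r in range(k)]


def objective(A, rows, cols, M):
    """Total squared error between the matrix and its block approximation."""
    return sum((A[i][j] - M[rows[i]][cols[j]]) ** 2
               for i in range(len(A)) for j in range(len(A[0])))


def co_cluster(A, k, l, max_iter=20):
    n, m = len(A), len(A[0])
    # Step 1: initialise clusters (round-robin, an assumption) and the model.
    rows = [i % k for i in range(n)]
    cols = [j % l for j in range(m)]
    M = block_means(A, rows, cols, k, l)
    cost = objective(A, rows, cols, M)
    for _ in range(max_iter):
        # Steps 2.1-2.3: best row group for every row, then refresh the model.
        rows = [min(range(k),
                    key=lambda r: sum((A[i][j] - M[r][cols[j]]) ** 2
                                      for j in range(m)))
                for i in range(n)]
        M = block_means(A, rows, cols, k, l)
        # Steps 2.4-2.6: best column group for every column, then refresh.
        cols = [min(range(l),
                    key=lambda c: sum((A[i][j] - M[rows[i]][c]) ** 2
                                      for i in range(n)))
                for j in range(m)]
        M = block_means(A, rows, cols, k, l)
        new_cost = objective(A, rows, cols, M)
        improved = new_cost < cost
        cost = new_cost
        if not improved:   # step 2: stop once the objective stops improving
            break
    # Step 3: return the row and column clusters (plus the final cost).
    return rows, cols, cost
```

Like other alternating schemes, this only reaches a local optimum, so the result depends on the initial assignment.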
  • 13. Pick the Best Row Group/Cluster
  • 14. Example
  • 15. RowCluster Updator Job
  • 16. Example
  • 17. BiClustering
  • 18. Pick the Best Row Group/Cluster
  • 19. Example
  • 20. Row Updator Job
      [Diagram, reconstructed from the slide labels]
      • RowCluster Mapper
          • Input: key = lineId; value = rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError
          • Action: pick the best row group/cluster, i.e. the one which minimizes cost or error
          • Output, key type DATA: key = rowId; value = rowId, clickVector, attributeVector, bestRowClusterId, cost
          • Output, key type ROWCLUSTER: key = best row cluster id; value = clickVector
      • RowCluster Reducer, by keyType
          • DATA: just emit rowId, clickVector, attributeVector, bestRowClusterId, cost
          • ROWCLUSTER: aggregate the row cluster; also write the updated row clusters
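The mapper and reducer roles on slide 20 can be imitated in plain Python to show the data flow. This is a sketch under assumptions, not Hadoop API code: the shuffle is simulated with a dictionary, squared error stands in for the cost, "aggregate" is taken to mean averaging the click vectors of a cluster, and the function names are hypothetical. The record fields follow the slide.

```python
from collections import defaultdict


def squared_error(a, b):
    """Assumed cost: sum of squared differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def row_cluster_mapper(record, row_clusters):
    """Pick the best row cluster for one row and emit two keyed records."""
    row_id, click_vec, attr_vec, _cur_cluster_id, _cur_cluster_error = record
    costs = {cid: squared_error(click_vec, centre)
             for cid, centre in row_clusters.items()}
    best_id = min(costs, key=costs.get)
    # DATA record: the row itself, tagged with its best cluster and cost.
    yield ("DATA", row_id), (row_id, click_vec, attr_vec, best_id, costs[best_id])
    # ROWCLUSTER record: the row's contribution to the chosen cluster.
    yield ("ROWCLUSTER", best_id), click_vec


def row_cluster_reducer(key, values):
    """DATA: just emit. ROWCLUSTER: aggregate into an updated cluster."""
    key_type, _ = key
    if key_type == "DATA":
        for value in values:
            yield key, value
    else:
        n = len(values)
        yield key, [sum(column) / n for column in zip(*values)]


def run_row_updator_job(records, row_clusters):
    """Simulate map, shuffle (group by key) and reduce in memory."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in row_cluster_mapper(record, row_clusters):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):
        output.extend(row_cluster_reducer(key, shuffled[key]))
    return output
```

Tagging each key with its type lets a single job carry both the re-assigned rows and the per-cluster aggregates, which appears to be the purpose of the DATA and ROWCLUSTER key types on the slide.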
  • 21. Example
  • 22. Conclusions and Future Work
      • Implementing more algorithms
      • Easy-to-use examples and more documentation
  • 23. References
      [1] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR, pages 230–237, 1999.
      [2] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In ICML ’04, pages 65–72, 2004.
      [3] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. IJCAI ’99, 1999.
      [4] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS ’07, 2007.
      [5] D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In Proc. KDD ’07, pages 26–35, 2007.
      [6] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW ’07, pages 271–280, 2007.
      [7] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD ’08, pages 426–434, 2008.
      [8] H. Ma, H. Yang, M. Lyu, and I. King. SoRec: social recommendation using probabilistic matrix factorization. In CIKM ’08, pages 931–940, 2008.
      [9] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. ISMB ’00, pages 93–103, 2000.
      [10] T. George and S. Merugu. A scalable collaborative filtering framework based on co-clustering. In ICDM, pages 625–628, 2005.
  • 24. References (contd.)
      [11] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919–1986, 2007.
      [12] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In Proc. KDD ’03, pages 89–98, 2003.
      [13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue co-clustering of gene expression data. In Proc. SDM ’04, 2004.
      [14] M. Deodhar, G. Gupta, J. Ghosh, H. Cho, and I. Dhillon. A scalable framework for discovering coherent co-clusters in noisy data. In ICML ’08, 2008.
      [15] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining. In ICDM ’08, pages 512–521, 2008.
      [16] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. on Computational Biology and Bioinformatics, 1(1):24–45, 2004.
  • 25. Thank you
  • 26. Row Cluster Updator Job
      [Diagram, reconstructed from the slide labels]
      • RowCluster Mapper
          • Input value: rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError
          • Action: pick the best row group/cluster, i.e. the one which minimizes cost or error
          • Output, value type DATA: key = value type, rowId; value = rowId, clickVector, attributeVector, bestRowClusterId, cost
          • Output, value type ROWCLUSTER: key = value type, rowId; value = updated rowCluster
          • Output, value type ROW GLOB MODEL: key = value type, rowId; value = updated partial GlobalModel
      • RowCluster Reducer, by ValueType
          • DATA: just emit rowId, clickVector, attributeVector, bestRowClusterId, cost
          • ROW CLUSTER: aggregate the row cluster; also write the updated row clusters
          • ROW GLOB CLUSTER: aggregate the partial global model for the given row cluster; also write the updated partial global model
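Slide 26 adds a partial global model to the row job's output. One plausible reading, sketched below, is that each row contributes per-column-cluster sums and counts, and the reducer adds these partials up for a given row cluster. The slide does not spell out the representation, so the (sums, counts) pairs, the block-mean model and both function names are assumptions.

```python
def partial_global_model(click_vec, col_assignment, num_col_clusters):
    """Map side: one row's contribution, as per-column-cluster sums and counts."""
    sums = [0.0] * num_col_clusters
    counts = [0] * num_col_clusters
    for value, col_cluster in zip(click_vec, col_assignment):
        sums[col_cluster] += value
        counts[col_cluster] += 1
    return sums, counts


def aggregate_partials(partials):
    """Reduce side: merge the partials of one row cluster into block means."""
    partials = list(partials)
    num_col_clusters = len(partials[0][0])
    sums = [0.0] * num_col_clusters
    counts = [0] * num_col_clusters
    for part_sums, part_counts in partials:
        for c in range(num_col_clusters):
            sums[c] += part_sums[c]
            counts[c] += part_counts[c]
    # One entry per column cluster: this row cluster's slice of the model.
    return [sums[c] / counts[c] if counts[c] else 0.0
            for c in range(num_col_clusters)]
```

Because sums and counts are additive, partials can be merged in any order, so the same aggregation could also run as a combiner before the reduce step.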
  • 27. Column Cluster Updator Job
      [Diagram, reconstructed from the slide labels]
      • ColCluster Mapper
          • Input value: colId, clickVector, attributeVector, curColClusterId, curColClusterError
          • Action: pick the best col group/cluster, i.e. the one which minimizes cost or error
          • Output, value type DATA: key = value type, colId; value = colId, clickVector, attributeVector, bestColClusterId, cost
          • Output, value type COLCLUSTER: key = value type, colId; value = updated colCluster
          • Output, value type COL GLOB MODEL: key = value type, colId; value = updated partial GlobalModel
      • ColCluster Reducer, by ValueType
          • DATA: just emit colId, clickVector, attributeVector, bestColClusterId, cost
          • COL CLUSTER: aggregate the col cluster; also write the updated col clusters
          • COL GLOB CLUSTER: aggregate the partial global model for the given col cluster; also write the updated partial global model