+
Large-scale Parallel Collaborative Filtering and Clustering
using MapReduce for Recommender Engines
Varad Meru
Software Development Engineer,
Orzota, Inc.
© Varad Meru, 2013
+
Outline
• Introduction
• Introduction to Recommendation Engines
• Algorithms for Recommendation Engines
• Challenges in Recommendation Engines
• What is Hadoop MapReduce?
• What is the Netflix Prize?
• Block diagram
• System requirements
• Conclusion
+
Recommender Systems
Introduction and Project Scope
+
Introduction
• The scope of our project is to build a recommender engine using clustering.
• Recommender engines are used in e-commerce and other settings to recommend items to end users.
• They are widely used by companies such as Amazon, Netflix, Flipkart, Google News, and many others.
• Collaborative filtering algorithms, clustering, and matrix decomposition are used to find recommendations.
+
Recommender
System Example
+
Some Other Recommender Systems
Here are some snapshots of widely used recommendation engines at Amazon.
+
Collaborative Filtering in Action
{"Algorithms" : "Recommender Systems", "id" : "Example"}
Binary Values Recommendation (rows: users; columns: the four movies, labeled A to D)

         A  B  C  D
Alice    0  1  1  1
Bob      1  0  1  1
John     0  1  0  0
Jane     1  0  1  1
Bill     1  1  1  1
Steve    1  0  1  1
Larry    1  0  0  0
Don      1  1  1  0
Jack     1  1  0  1

• Assume every person named above has seen some of the movies shown.
• Let 1 denote seen.
• Let 0 denote not seen.
+
Collaborative Filtering in Action
{"Algorithms" : "Recommender Systems", "Similarity" : "Tanimoto"}
Item-item similarity (rows and columns: the four products, A to D)

       A              B              C              D
A      1              1/3 (0.33)     5/8 (0.625)    5/8 (0.625)
B      1/3 (0.33)     1              3/8 (0.375)    3/8 (0.375)
C      5/8 (0.625)    3/8 (0.375)    1              5/7 (0.714)
D      5/8 (0.625)    3/8 (0.375)    5/7 (0.714)    1

Tanimoto Coefficient: T(A, B) = Nc / (NA + NB - Nc)
NA: number of customers who bought Product A
NB: number of customers who bought Product B
Nc: number of customers who bought both Product A and Product B
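The Tanimoto values can be recomputed from the binary seen/not-seen data. The following is a minimal sketch, not the project's implementation: the column labels A to D, the `RATINGS` layout, and the helper names are assumptions made for illustration.

```python
from fractions import Fraction

# Binary seen/not-seen matrix from the example: one row per user,
# one column per movie. The labels A-D are illustrative assumptions.
ITEMS = ["A", "B", "C", "D"]
RATINGS = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]


def users_who_saw(col):
    """Return the set of row indices (users) that have a 1 in this column."""
    return {u for u, row in enumerate(RATINGS) if row[col] == 1}


def tanimoto(a, b):
    """Tanimoto coefficient Nc / (NA + NB - Nc) for two sets of users."""
    n_c = len(a & b)
    union = len(a) + len(b) - n_c
    return Fraction(n_c, union) if union else Fraction(0)


def tanimoto_matrix():
    """Item-item similarity matrix, kept as exact fractions."""
    sets = [users_who_saw(c) for c in range(len(ITEMS))]
    return [[tanimoto(x, y) for y in sets] for x in sets]


if __name__ == "__main__":
    for label, row in zip(ITEMS, tanimoto_matrix()):
        print(label, [f"{v} ({float(v):.3f})" for v in row])
```

Exact fractions are used so the output can be compared directly with fractional entries such as 5/8; switching to floats would be the natural choice at scale.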
+
Collaborative Filtering in Action
{"Algorithms" : "Recommender Systems", "Similarity" : "Cosine"}
Item-item similarity (rows and columns: the four products, A to D)

       A        B        C        D
A      1        0.507    0.772    0.772
B      0.507    1        0.548    0.548
C      0.772    0.548    1        0.833
D      0.772    0.548    0.833    1

Cosine Coefficient: cos(A, B) = Nc / sqrt(NA × NB)
NA: number of customers who bought Product A
NB: number of customers who bought Product B
Nc: number of customers who bought both Product A and Product B
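The cosine coefficient can be checked the same way. This is a hedged sketch that assumes the binary seen/not-seen matrix from the example, with hypothetical column labels A to D and helper names chosen for illustration.

```python
import math

# Binary seen/not-seen matrix from the example: one row per user,
# one column per movie. The labels A-D are illustrative assumptions.
ITEMS = ["A", "B", "C", "D"]
RATINGS = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]


def column(col):
    """Return one item's column as a vector over all users."""
    return [row[col] for row in RATINGS]


def cosine(x, y):
    """Cosine similarity: dot(x, y) / (|x| * |y|).

    For 0/1 vectors this equals Nc / sqrt(NA * NB).
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def cosine_matrix():
    """Item-item cosine similarity matrix."""
    cols = [column(c) for c in range(len(ITEMS))]
    return [[cosine(x, y) for y in cols] for x in cols]


if __name__ == "__main__":
    for label, row in zip(ITEMS, cosine_matrix()):
        print(label, [f"{v:.3f}" for v in row])
```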
+
MinHash Clustering in Action
• We will implement a variation of this algorithm for our project.
• MinHash is a technique for estimating how similar two sets are.
• The scheme was invented by Andrei Broder (1997).¹
• The simplest version of the MinHash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of h_min(S), the minimum hash value over the members of S, one for each of these k functions.
• Google is known to have used this method in Google News to cluster users by their click history and recommend news matching their tastes.²
¹ Broder, Andrei Z. (1997), "On the resemblance and containment of documents".
² Das, Abhinandan S.; Datar, Mayur; et al. (2007), "Google News Personalization: Scalable Online Collaborative Filtering".
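The k-hash-function scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the variation implemented in the project: the salted SHA-1 hash family, the function names, and the sample item identifiers are all hypothetical.

```python
import hashlib


def make_hash(seed):
    """Return a deterministic hash function for strings, salted with `seed`.

    Salting one base hash is an assumed, simple way to obtain k different
    hash functions.
    """
    def h(item):
        digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return h


def minhash_signature(items, hash_funcs):
    """Represent a non-empty set by h_min(S) under each of the k functions."""
    return [min(h(x) for x in items) for h in hash_funcs]


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree.

    This estimates the Jaccard similarity |A & B| / |A | B| of the sets.
    """
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    k = 256
    hash_funcs = [make_hash(seed) for seed in range(k)]
    user_1 = {"m1", "m2", "m3", "m4"}  # made-up item ids
    user_2 = {"m2", "m3", "m4", "m5"}
    sig_1 = minhash_signature(user_1, hash_funcs)
    sig_2 = minhash_signature(user_2, hash_funcs)
    print("exact:", jaccard(user_1, user_2))
    print("estimate:", estimate_similarity(sig_1, sig_2))
```

A larger k tightens the estimate at the cost of longer signatures; the signatures can be compared without revisiting the original sets.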
+
MinHash Clustering Flow
1. Start.
2. Get a random permutation, R, of the product catalog.
3. Define a hash function h such that h(Ui) = the minimum-ranked product in R.
   Ui: all the interactions performed by the user. An interaction can be a click, purchase, like, etc.
4. Pass each user through the hash function to get the cluster number.
5. After the clusters have been formed, use covisitation to find recommendations.
6. Cache the recommendations in memory.
7. Stop.
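The flow above can be sketched in a few functions. This is one possible reading, offered under assumptions: a single permutation is used, "covisitation" is interpreted as counting the items that cluster-mates interacted with, and the cache is a plain dictionary. All names and sample identifiers are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_permutation_rank(catalog, seed=None):
    """Shuffle the catalog and return R as a product -> rank mapping."""
    order = list(catalog)
    random.Random(seed).shuffle(order)
    return {product: rank for rank, product in enumerate(order)}


def minhash_cluster(interactions, rank):
    """h(Ui): the product in Ui with the minimum rank in R."""
    return min(interactions, key=rank.__getitem__)


def cluster_users(user_interactions, rank):
    """Pass each user through h; users with the same value share a cluster."""
    clusters = defaultdict(list)
    for user, items in user_interactions.items():
        clusters[minhash_cluster(items, rank)].append(user)
    return dict(clusters)


def recommend(user, user_interactions, clusters, rank, top_n=3):
    """Rank unseen items by how many cluster-mates interacted with them."""
    own = user_interactions[user]
    mates = clusters[minhash_cluster(own, rank)]
    counts = Counter()
    for other in mates:
        if other != user:
            counts.update(user_interactions[other] - own)
    # Sort by count, then by item id, so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


def build_cache(user_interactions, rank):
    """Precompute recommendations for every user and keep them in memory."""
    clusters = cluster_users(user_interactions, rank)
    return {u: recommend(u, user_interactions, clusters, rank)
            for u in user_interactions}


if __name__ == "__main__":
    catalog = ["p1", "p2", "p3", "p4", "p5"]  # made-up product ids
    users = {"u1": {"p2", "p3"}, "u2": {"p2", "p4"}, "u3": {"p1", "p5"}}
    rank = random_permutation_rank(catalog, seed=7)
    print(build_cache(users, rank))
```

A single permutation gives coarse clusters; combining several permutations is a common refinement, but it is not shown here.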
+
Some Recommender Systems Available
• Apache Mahout¹
• Easyrec²
• University of Minnesota's SUGGEST³
• Other research implementations, such as UniRecSys and Taste
¹ http://mahout.apache.org
² http://easyrec.org/
³ http://www-users.cs.umn.edu/~karypis/suggest/
+
MapReduce Paradigm
MapReduce and Hadoop
+
MapReduce Programming Paradigm
• The core idea behind MapReduce is to map your data set into a collection of key-value pairs, and then reduce over all pairs with the same key.
• Hadoop MapReduce is an open-source implementation of the MapReduce framework, modeled on Google's MapReduce software framework.
• It is used to write applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
• A Hadoop MapReduce job mainly consists of two user-defined functions: map and reduce.
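The key-value idea can be illustrated with a single-process simulation. This sketch does not use the Hadoop API; `run_mapreduce`, `map_purchase`, and `reduce_count` are hypothetical names, and the purchase records are made up.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group by key (the shuffle), then reduce, in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def map_purchase(record):
    """Map a (user, product) record to the pair (product, 1)."""
    _user, product = record
    yield product, 1


def reduce_count(_product, counts):
    """Reduce all values for one product to the number of buyers."""
    return sum(counts)


if __name__ == "__main__":
    purchases = [("alice", "A"), ("bob", "A"), ("bob", "B")]
    print(run_mapreduce(purchases, map_purchase, reduce_count))
```

Only the two user-defined functions change from job to job; the grouping step in the middle is what the framework provides.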
+
map() function
• A list of data elements is passed, one element at a time, to map() functions, which transform each data element into an individual output data element.
• A map() produces one or more intermediate <key, value> pair(s) from the input list.

Input list:                <k1, V1> <k2, V2> <k3, V3> <k4, V4> <k5, V5> <k6, V6> ...
                           MAP      MAP      MAP      MAP
Intermediate output list:  <k'1, V'1> <k'2, V'2> <k'3, V'3> <k'4, V'4> <k'5, V'5> <k'6, V'6> ...
+
reduce() function
• After the map phase finishes, the intermediate values with the same output key are reduced into one or more final values.

Intermediate map output:  <k'1, V'1> <k'2, V'2> <k'3, V'3> <k'4, V'4> <k'5, V'5> <k'6, V'6> ...
                          Reduce     Reduce     Reduce
Final result:             <F1, R1> <F2, R2> <F3, R3> ...
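As one example of reducing values that share a key, the sketch below counts how many customers bought each pair of products together (the Nc term used by the similarity measures). It is an illustration under assumptions: the sort-then-group step stands in for the framework's shuffle, and the function names and baskets are hypothetical.

```python
from itertools import combinations, groupby
from operator import itemgetter


def map_pairs(_user, products):
    """Emit ((a, b), 1) for every pair of products one user bought."""
    for a, b in combinations(sorted(products), 2):
        yield (a, b), 1


def reduce_sum(pair, ones):
    """Sum the 1s emitted for one product pair."""
    return pair, sum(ones)


def cooccurrence_counts(baskets):
    """Map every basket, sort by key, then reduce each group of equal keys."""
    intermediate = [kv for user, products in baskets.items()
                    for kv in map_pairs(user, products)]
    intermediate.sort(key=itemgetter(0))
    return dict(
        reduce_sum(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    )


if __name__ == "__main__":
    baskets = {"u1": {"A", "B"}, "u2": {"A", "B", "C"}, "u3": {"C"}}
    print(cooccurrence_counts(baskets))
```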
+
Parallelism
• map() functions run in parallel, creating different intermediate values from different input data elements.
• reduce() functions also run in parallel, each working on its assigned output key.
• All values are processed independently.
• The reduce phase can't start until the map phase has completely finished.
• It is, in effect, a data-parallel implementation, and thus works with very large amounts of data.
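The two parallel phases and the barrier between them can be sketched with a thread pool. This only illustrates the structure: threads in CPython do not speed up CPU-bound work, and a real Hadoop job distributes tasks across machines. The function names and input splits are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map task: emit (product, 1) for each (user, product) in one split."""
    return [(product, 1) for _user, product in chunk]


def reduce_key(item):
    """Reduce task: sum the values collected for one key."""
    key, values = item
    return key, sum(values)


def parallel_count(chunks, workers=4):
    """Run map tasks in parallel, wait for all of them, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() blocks until every map task is done: the barrier that
        # keeps the reduce phase from starting early.
        mapped = list(pool.map(map_chunk, chunks))
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)
        # Each key is reduced independently of the others.
        reduced = list(pool.map(reduce_key, groups.items()))
    return dict(reduced)


if __name__ == "__main__":
    splits = [
        [("u1", "A"), ("u2", "A")],
        [("u3", "B"), ("u1", "B"), ("u4", "A")],
    ]
    print(parallel_count(splits))
```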
+
Hadoop
• Started by Doug Cutting, and then carried forward by enterprises such as Yahoo! and Facebook.
• It is a collection of three frameworks: Hadoop Common, MapReduce, and the Hadoop Distributed File System (HDFS).
• Free and open source under the Apache License.
• Largest cluster at the time of writing: 4,000 nodes (at Yahoo!).
• A whole ecosystem has been built around it to process large amounts of data (GBs, TBs, PBs).
+
Evaluation of
Recommendation Engine
Netflix and Comparison with other frameworks
+
Netflix Dataset
• This dataset was released by Netflix on October 2, 2006 for the Netflix Prize, a challenge to build the world's best recommender for Netflix.
• Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies.
• Each training rating is a quadruplet of the form <user, movie, date of grade, grade>.
• Used heavily in research on recommender engines.¹
• Used in our project to compare our implementation of the algorithm with other implementations, e.g. Mahout.
¹ Google Scholar: about 3,190 results for the search term "netflix prize".
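The quadruplet can be modeled as a small record type. The sketch below is an assumption-laden illustration: the comma-separated line layout is hypothetical and is not claimed to match the files Netflix distributed, and the field and function names are invented here.

```python
from datetime import date
from typing import NamedTuple


class Rating(NamedTuple):
    """One training rating: <user, movie, date of grade, grade>."""
    user: int
    movie: int
    graded_on: date
    grade: int


def parse_rating(line):
    """Parse a hypothetical 'user,movie,YYYY-MM-DD,grade' line."""
    user, movie, day, grade = line.strip().split(",")
    return Rating(int(user), int(movie), date.fromisoformat(day), int(grade))


def seen_sets(ratings):
    """Collapse grades to seen/not-seen form: movie -> set of users."""
    seen = {}
    for r in ratings:
        seen.setdefault(r.movie, set()).add(r.user)
    return seen


if __name__ == "__main__":
    lines = ["7,42,2005-09-06,4", "8,42,2005-09-07,2", "7,43,2005-10-01,5"]
    ratings = [parse_rating(line) for line in lines]  # made-up sample rows
    print(seen_sets(ratings))
```

Collapsing grades to sets of users gives the binary form that set-based measures such as the Tanimoto coefficient and MinHash operate on.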
+
High-level Architecture
• MapReduce implementation of clustering algorithms such as K-Means and MinHash clustering.
• Comparative analysis with existing frameworks such as Apache Mahout (see References 1, 2, and 3).
+
Requisites
• 2 Linux machines (required; preferred OS: Ubuntu)
• Pentium 4 or better (recommended: Core 2 Duo 2.53 GHz or better)
• 1 GB RAM per machine (recommended: 4 GB per machine)
• Apache Hadoop (from http://hadoop.apache.org)
• Apache Mahout (from http://mahout.apache.org)
• Java IDE (Eclipse preferred)
• Java SDK 1.6+
+
References
1. Sebastian Schelter, Christoph Boden, and Volker Markl. "Scalable Similarity-Based Neighborhood Methods with MapReduce". RecSys 2012.
2. Carlos E. Seminario and David C. Wilson. "Case Study Evaluation of Mahout as a Recommender Platform". Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012).
3. Apache Mahout Project Page. http://mahout.apache.org/
4. "Introducing Apache Mahout". http://www.ibm.com/developerworks/java/library/j-mahout/
5. [VIDEO] Sean Owen. "Collaborative filtering at scale".
6. [BOOK] Owen et al. "Mahout in Action". Manning Publications.
+
Thank You
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Recently uploaded (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

MapReduce Recommender Systems

  • 1. Large-scale Parallel Collaborative Filtering and Clustering using MapReduce for Recommender Engines. Varad Meru, Software Development Engineer, Orzota, Inc. © Varad Meru, 2013
  • 2. Outline
      - Introduction
      - Introduction to Recommendation Engines
      - Algorithms for Recommendation Engines
      - Challenges in Recommendation Engines
      - What is Hadoop MapReduce?
      - What is the Netflix Prize?
      - Block diagram
      - System requirements
      - Conclusion
  • 3. Recommender Systems: Introduction and Project Scope
  • 4. Introduction
      - The scope of our project is to build a recommender engine using clustering.
      - Recommender engines are used in e-commerce and other settings to recommend items to end users.
      - They are widely used by companies such as Amazon, Netflix, Flipkart, Google News, and many others.
      - Collaborative filtering algorithms, clustering, and matrix decomposition are used to find recommendations.
  • 6. Some Other Recommender Systems: snapshots of widely used recommendation engines at Amazon.
  • 7. Collaborative Filtering in Action: Binary Values Recommendation
      - Assume each of the people named below has seen some of the four movies (one column per movie).
      - Let 1 denote seen.
      - Let 0 denote not seen.
      User     Item 1  Item 2  Item 3  Item 4
      Alice    0       1       1       1
      Bob      1       0       1       1
      John     0       1       0       0
      Jane     1       0       1       1
      Bill     1       1       1       1
      Steve    1       0       1       1
      Larry    1       0       0       0
      Don      1       1       1       0
      Jack     1       1       0       1
  • 8. Collaborative Filtering in Action: Tanimoto Coefficient
      - NA: number of customers who bought Product A
      - NB: number of customers who bought Product B
      - Nc: number of customers who bought both Product A and Product B
      Pairwise Tanimoto similarity between the four items:
               Item 1        Item 2        Item 3        Item 4
      Item 1   1             1/3 (0.33)    5/8 (0.625)   5/8 (0.625)
      Item 2   1/3 (0.33)    1             3/8 (0.375)   3/8 (0.375)
      Item 3   5/8 (0.625)   3/8 (0.375)   1             5/7 (0.714)
      Item 4   5/8 (0.625)   3/8 (0.375)   5/7 (0.714)   1
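The slide names the quantities NA, NB, and Nc but the formula itself did not survive. The Tanimoto coefficient is conventionally defined as T(A, B) = Nc / (NA + NB - Nc). A minimal Python sketch under that definition, using the binary matrix from slide 7 (rows are users, columns are items); the names `RATINGS` and `tanimoto` are illustrative and not from the slides:

```python
from fractions import Fraction

# Binary seen / not-seen matrix from slide 7: one row per user, one column per item.
RATINGS = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]


def tanimoto(ratings, a, b):
    """Tanimoto coefficient between item columns a and b (0-based indices)."""
    n_a = sum(row[a] for row in ratings)             # users who have item a
    n_b = sum(row[b] for row in ratings)             # users who have item b
    n_c = sum(row[a] and row[b] for row in ratings)  # users who have both
    union = n_a + n_b - n_c
    return Fraction(n_c, union) if union else Fraction(0)


if __name__ == "__main__":
    n_items = len(RATINGS[0])
    for a in range(n_items):
        print([str(tanimoto(RATINGS, a, b)) for b in range(n_items)])
```

Exact fractions are used so the output can be compared directly with the 1/3, 3/8, 5/8, and 5/7 entries on the slide.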
  • 9. Collaborative Filtering in Action: Cosine Coefficient
      - NA: number of customers who bought Product A
      - NB: number of customers who bought Product B
      - Nc: number of customers who bought both Product A and Product B
      Pairwise cosine similarity between the four items:
               Item 1   Item 2   Item 3   Item 4
      Item 1   1        0.507    0.772    0.772
      Item 2   0.507    1        0.548    0.548
      Item 3   0.772    0.548    1        0.833
      Item 4   0.772    0.548    0.833    1
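For binary data, cosine similarity between two items reduces to Nc / sqrt(NA * NB). The sketch below assumes that form and reuses the slide 7 matrix; `RATINGS` and `cosine` are illustrative names, not from the slides.

```python
import math

# Binary seen / not-seen matrix from slide 7: one row per user, one column per item.
RATINGS = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]


def cosine(ratings, a, b):
    """Cosine similarity between binary item columns a and b (0-based indices)."""
    n_a = sum(row[a] for row in ratings)             # users who have item a
    n_b = sum(row[b] for row in ratings)             # users who have item b
    n_c = sum(row[a] and row[b] for row in ratings)  # users who have both
    denom = math.sqrt(n_a * n_b)
    return n_c / denom if denom else 0.0


if __name__ == "__main__":
    n_items = len(RATINGS[0])
    for a in range(n_items):
        print([round(cosine(RATINGS, a, b), 3) for b in range(n_items)])
```

Unlike Tanimoto, the denominator here is the geometric mean of the two item counts, so the score is penalized less when one item is much more popular than the other.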
  • 10. MinHash Clustering in Action
      - We will be implementing a variation of this algorithm for our project.
      - It is a technique for finding out how similar two sets are.
      - The scheme was invented by Andrei Broder (1997)¹.
      - The simplest version of the minhash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of h_min(S) for these k functions.
      - Google is known to have used this method to cluster news articles in order to recommend news that matches each user's tastes².
      ¹ Broder, Andrei Z. (1997), "On the resemblance and containment of documents".
      ² Mayur Datar et al. (2007), "Google News Personalization: Scalable Online Collaborative Filtering".
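The k-hash-function scheme described above can be sketched as follows. This is an illustration, not the project's implementation: the salted SHA-256 hash family and the names `make_hash`, `minhash_signature`, and `estimate_similarity` are assumptions chosen so the example is deterministic.

```python
import hashlib


def make_hash(seed):
    """Return one member of a hash family, selected by an integer seed."""
    def h(item):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return h


def minhash_signature(items, hash_funcs):
    """Represent a non-empty set by the minimum hash value under each function."""
    return [min(h(x) for x in items) for h in hash_funcs]


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree (estimates Jaccard similarity)."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    k = 128  # number of hash functions, a fixed integer parameter
    hash_funcs = [make_hash(seed) for seed in range(k)]
    a = {"m1", "m2", "m3", "m4"}
    b = {"m3", "m4", "m5", "m6"}
    sig_a = minhash_signature(a, hash_funcs)
    sig_b = minhash_signature(b, hash_funcs)
    print("estimated:", estimate_similarity(sig_a, sig_b))
    print("exact:    ", len(a & b) / len(a | b))
```

A larger k gives a tighter estimate at the cost of a longer signature per set.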
  • 11. MinHash Clustering Flow
      1. Start.
      2. Get a random permutation R of the product catalog.
      3. Define a hash function h such that h(Ui) is the minimum-ranked product in R among the products in Ui, where Ui is the set of all interactions performed by user i. An interaction can be a click, purchase, like, etc.
      4. Pass each user through the hash function to get the cluster number.
      5. After the clusters have been formed, use covisitation to find recommendations.
      6. Cache the recommendations in memory.
      7. Stop.
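The flow above can be sketched in a few lines of Python. The slide does not say how covisitation is scored, so this sketch assumes the simplest reading: within a cluster, rank items by how many members interacted with them and suggest the ones a user has not touched yet. The function names and the `top_n` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict


def minhash_clusters(interactions, catalog, seed=0):
    """Assign each user to the cluster named by their minimum-ranked product.

    interactions maps user -> non-empty set of products drawn from catalog.
    """
    rng = random.Random(seed)
    permuted = list(catalog)
    rng.shuffle(permuted)                          # random permutation R
    rank = {product: i for i, product in enumerate(permuted)}
    clusters = defaultdict(list)
    for user, items in interactions.items():
        cluster_id = min(items, key=rank.__getitem__)  # h(Ui)
        clusters[cluster_id].append(user)
    return dict(clusters)


def recommend(interactions, clusters, top_n=3):
    """Build an in-memory cache of per-user recommendations (assumed scoring)."""
    cache = {}
    for members in clusters.values():
        # Covisitation counts: how many cluster members touched each item.
        counts = Counter(item for u in members for item in interactions[u])
        for u in members:
            unseen = [item for item, _ in counts.most_common()
                      if item not in interactions[u]]
            cache[u] = unseen[:top_n]
    return cache


if __name__ == "__main__":
    data = {"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"d"}}
    clusters = minhash_clusters(data, ["a", "b", "c", "d"])
    print(clusters)
    print(recommend(data, clusters))
```

A single permutation produces coarse clusters; repeating the procedure with several permutations and combining the hash values is a common refinement.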
  • 12. Some Recommender Systems Available
      - Apache Mahout¹
      - Easyrec²
      - University of Minnesota's SUGGEST³
      - Other research implementations, such as UniRecSys and Taste
      ¹ http://mahout.apache.org
      ² http://easyrec.org/
      ³ http://www-users.cs.umn.edu/~karypis/suggest/
  • 13. MapReduce Paradigm: MapReduce and Hadoop
  • 14. MapReduce Programming Paradigm
      - A core idea behind MapReduce is mapping your data set into a collection of key-value pairs, and then reducing over all pairs with the same key.
      - Hadoop MapReduce is an open-source implementation of the MapReduce framework, modeled on Google's MapReduce software framework.
      - It is used for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
      - A Hadoop MapReduce job mainly consists of two user-defined functions: map and reduce.
  • 15. map() function
      - A list of data elements is passed, one element at a time, to map() functions, which transform each data element into an individual output data element.
      - A map() produces one or more intermediate <key, value> pairs from the input list.
      Diagram: input list (k1, V1) ... (k6, V6) → MAP → intermediate output list (k'1, V'1) ... (k'6, V'6)
  • 16. reduce() function
      - After the map phase finishes, the intermediate values with the same output key are reduced into one or more final values.
      Diagram: intermediate map output (k'1, V'1) ... (k'6, V'6) → Reduce → final result (F1, R1), (F2, R2), (F3, R3), ...
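To make the map and reduce roles concrete, here is an in-memory Python simulation, not Hadoop's API. It counts how many users interacted with each pair of items, which is the Nc term used by the similarity measures earlier in the deck. `map_fn`, `reduce_fn`, and `run_job` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations


def map_fn(user, items):
    """Emit one ((item_a, item_b), 1) record per pair of items the user touched."""
    for a, b in combinations(sorted(items), 2):
        yield (a, b), 1


def reduce_fn(key, values):
    """Collapse all values that share a key into a single final value."""
    yield key, sum(values)


def run_job(records, mapper, reducer):
    """Run map, group intermediate values by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = {}
    for out_key in sorted(groups):
        for final_key, final_value in reducer(out_key, groups[out_key]):
            results[final_key] = final_value
    return results


if __name__ == "__main__":
    records = [("alice", ["B", "C", "D"]), ("bob", ["A", "C", "D"])]
    print(run_job(records, map_fn, reduce_fn))
```

In a real Hadoop job the grouping step is performed by the framework between the two phases; only the two user-defined functions are written by hand.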
  • 17. Parallelism
      - map() functions run in parallel, creating different intermediate values from different input data elements.
      - reduce() functions also run in parallel, each working on its assigned output key.
      - All values are processed independently.
      - The reduce phase cannot start until the map phase is completely finished.
      - It is, in a way, a data-parallel implementation and thus works with very large amounts of data.
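The ordering constraint above (all map tasks finish before any reduce task starts) can be illustrated with a thread pool standing in for cluster nodes. This is a sketch only: threads in one Python process are an assumption made for brevity, and `map_split`, `reduce_key`, and `word_count` are hypothetical names.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_split(lines):
    """Map task: count words in one input split, independently of other splits."""
    return Counter(word for line in lines for word in line.split())


def reduce_key(item):
    """Reduce task: sum the partial counts collected for one key."""
    key, partial_counts = item
    return key, sum(partial_counts)


def word_count(splits, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every map task, so no reduce starts early (the barrier).
        map_outputs = list(pool.map(map_split, splits))

        # Shuffle: gather the partial counts for each key.
        grouped = defaultdict(list)
        for partial in map_outputs:
            for word, count in partial.items():
                grouped[word].append(count)

        # Reduce tasks run in parallel, one per key.
        return dict(pool.map(reduce_key, grouped.items()))


if __name__ == "__main__":
    print(word_count([["a b a"], ["b c"]]))
```

Because each split and each key is handled independently, adding workers does not require any change to the map or reduce logic.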
  • 18. Hadoop
      - Started by Doug Cutting, and then carried forward by enterprises such as Yahoo! and Facebook.
      - It is a collection of three frameworks: Common, MapReduce, and DFS (HDFS).
      - Free and open source under the Apache Software License.
      - Current largest cluster size: 4000 nodes (at Yahoo!).
      - A whole ecosystem is built around it to process large amounts of data (on the order of GBs, TBs, and PBs).
  • 19. Evaluation of Recommendation Engine: Netflix and Comparison with Other Frameworks
  • 20. Netflix Dataset
      - This dataset was released by Netflix on October 2, 2006 for a SIGKDD challenge to build the world's best recommender for Netflix.
      - Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies.
      - Each training rating is a quadruplet of the form <user, movie, date of grade, grade>.
      - Used heavily in research on recommender engines¹.
      - Used in our project to compare the implementation of our algorithm with other implementations, e.g. Mahout.
      ¹ Google Scholar: about 3,190 results for the search term "netflix prize".
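The three counts above are enough to see how sparse the user-by-movie matrix is, which is part of why clustering and MapReduce are attractive here. A small worked calculation in Python (the function name `dataset_stats` is illustrative):

```python
def dataset_stats(n_ratings, n_users, n_movies):
    """Return density and average ratings per user and per movie."""
    cells = n_users * n_movies  # size of the full user-by-movie matrix
    return {
        "density": n_ratings / cells,
        "ratings_per_user": n_ratings / n_users,
        "ratings_per_movie": n_ratings / n_movies,
    }


if __name__ == "__main__":
    stats = dataset_stats(100_480_507, 480_189, 17_770)
    print(f"density:           {stats['density']:.4%}")
    print(f"ratings per user:  {stats['ratings_per_user']:.1f}")
    print(f"ratings per movie: {stats['ratings_per_movie']:.1f}")
```

Only a little over 1% of the possible user-movie cells carry a rating, so any similarity computation works mostly with missing entries.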
  • 21. High-level Architecture
      - MapReduce implementation of clustering algorithms such as K-Means and MinHash clustering.
      - Comparative analysis with existing frameworks such as Apache Mahout (see References 1, 2, and 3).
  • 22. Requisites
      - 2 Linux machines (required; preferred OS: Ubuntu)
      - Pentium 4 or better machines (recommended: Core 2 Duo 2.53 GHz or better)
      - 1 GB RAM per machine (recommended: 4 GB per machine)
      - Apache Hadoop (from http://hadoop.apache.org)
      - Apache Mahout (from http://mahout.apache.org)
      - Java IDE (Eclipse preferred)
      - Java SDK 1.6+
  • 23. References
      1. Sebastian Schelter, Christoph Boden, and Volker Markl, "Scalable Similarity-Based Neighborhood Methods with MapReduce", RecSys 2012.
      2. Carlos E. Seminario and David C. Wilson, "Case Study Evaluation of Mahout as a Recommender Platform", Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012).
      3. Apache Mahout project page, http://mahout.apache.org/
      4. "Introducing Apache Mahout", http://www.ibm.com/developerworks/java/library/j-mahout/
      5. [VIDEO] Sean Owen, "Collaborative filtering at scale".
      6. [BOOK] Owen et al., "Mahout in Action", Manning Pub.
  • 24. Thank You