How to Build a Recommendation
Engine Using Apache Mahout
Viraj Paripatyadar
GS Lab
Contents
• A recommendation problem
• What is a recommender
• Building a recommender using Mahout
 • Tips and tweaks
• Recommender considerations




A book store
• Sells books:
  •   By various authors
  •   Of various categories
  •   On different subjects
  •   From various publishers
• Readers/buyers are asked to rate
• Readers/buyers can provide reviews

               You walk into the store
             (buy something for a friend)
The store owner
• Asks you what:
  • your friend reads (already owns)
  • your friend usually likes more
• Has data on what:
  • his customers buy
  • his customers rate and review
• Uses a few strategies
1 - Find similar books
Depending on which books your friend has, pick
books:
• by the same author
• on the same/similar subject/s
• in the same category
• from the same publication

        (those with highest sales numbers)
2 - Find books with similar readership
• Define some similarity
    •   e.g. two books are as similar as the number of readers
        rating both of them
• Define some limit of relevance
    •   e.g. only consider pairs of books that more than 4 readers
        have both rated
• Look for all books which are similar to books
  your friend owns

  Pick books from this set that your friend doesn’t
                       own
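Strategy 2 can be sketched in a few lines of plain Python (not Mahout's API). The sketch uses the ratings from the "Example data" slide and treats the friend as user 1. The `min_shared` cutoff of 2 is an assumption chosen for this tiny data set, where no pair of books shares more than 4 readers.

```python
from collections import Counter
from itertools import combinations

# Ratings from the "Example data" slide: user -> {book: rating}.
prefs = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}

def co_rating_counts(prefs):
    """Similarity of two books = number of readers who rated both."""
    counts = Counter()
    for ratings in prefs.values():
        for a, b in combinations(sorted(ratings), 2):
            counts[(a, b)] += 1
    return counts

def similar_unowned(prefs, user, min_shared=2):
    """Score books the user does not own by their similarity to owned books."""
    owned = set(prefs[user])
    scores = Counter()
    for (a, b), shared in co_rating_counts(prefs).items():
        if shared < min_shared:
            continue  # below the limit of relevance
        if a in owned and b not in owned:
            scores[b] += shared
        elif b in owned and a not in owned:
            scores[a] += shared
    return scores.most_common()

recommended = similar_unowned(prefs, user=1)
print(recommended)  # [(104, 9), (106, 4), (105, 2)]
```

Book 107 drops out because it shares only one reader with anything the friend owns.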
3 - Find people with similar tastes
• Define some similarity
    •   e.g. two people are as similar as the number of books
        they like from the same category
• Define some limit of relevance
    •   e.g. only consider the 3 top people when ordered
        according to how similar they are to your friend
• Look for users similar to your friend and see
  what they read

   Pick books which these people like and your
               friend doesn’t own
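A minimal sketch of strategy 3 in plain Python, again with the "Example data" ratings and the friend as user 1. The data has no categories, so the sketch swaps in a simpler similarity: the number of books both people like. Treating 3 stars or more as "likes" and dropping people with zero overlap are assumptions of the sketch.

```python
# Ratings from the "Example data" slide: user -> {book: rating}.
prefs = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}
LIKE = 3.0  # assumed threshold: 3 stars or more counts as "likes"

def liked(ratings):
    return {book for book, stars in ratings.items() if stars >= LIKE}

def top_similar_users(prefs, user, n=3):
    """Rank other users by how many liked books they share with `user`."""
    mine = liked(prefs[user])
    sims = {u: len(mine & liked(r)) for u, r in prefs.items() if u != user}
    ranked = sorted((u for u in sims if sims[u] > 0),
                    key=lambda u: (-sims[u], u))
    return [(u, sims[u]) for u in ranked[:n]]

def recommend(prefs, user, n=3):
    """Books the similar users like and `user` does not own."""
    owned = set(prefs[user])
    scores = {}
    for other, sim in top_similar_users(prefs, user, n):
        for book in liked(prefs[other]) - owned:
            scores[book] = scores.get(book, 0) + sim
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print(top_similar_users(prefs, 1))  # [(5, 2), (4, 1)]
print(recommend(prefs, 1))          # [(104, 3), (106, 3), (105, 2)]
```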
Example data
                    1,101,5.0        3,101,2.5        4,106,4.0
                    1,102,3.0        3,104,4.0        5,101,4.0
                    1,103,2.5        3,105,4.5        5,102,3.0
                    2,101,2.0        3,107,5.0        5,103,2.0
                    2,102,2.5        4,101,5.0        5,104,4.0
                    2,103,5.0        4,103,3.0        5,105,3.5
                    2,104,2.0        4,104,4.5        5,106,4.0


• Your friend owns three books:
   • Gave 5 stars to book 101 (likes hugely and talks about it all the time)
   • Gave 3 stars to book 102 (has shown some liking to it)
   • Gave 2.5 stars to book 103 (has read it, but didn’t say bad things about it)


 Now we need to recommend books your friend hasn’t seen
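A small sketch of loading this data in plain Python. It reads each row as user, book, rating, which is an assumption, though it matches the friend's three ratings if the friend is user 1. The function name `load_ratings` is illustrative.

```python
from collections import defaultdict

# The 21 rows from the slide, one "user,book,rating" triple per line.
RATINGS = """\
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
"""

def load_ratings(text):
    """Parse comma-separated triples into user -> {book: rating}."""
    prefs = defaultdict(dict)
    for line in text.strip().splitlines():
        user, book, rating = line.split(",")
        prefs[int(user)][int(book)] = float(rating)
    return dict(prefs)

prefs = load_ratings(RATINGS)
friend = 1
all_books = {book for ratings in prefs.values() for book in ratings}
unseen = sorted(all_books - set(prefs[friend]))
print(unseen)  # [104, 105, 106, 107]
```

The recommendation problem is then to rank the four unseen books for user 1.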
A pictorial representation
[Figure: graph of users 1-5 and books 101-107]
Visualize
[Figure: the same graph of users 1-5 and books 101-107]
A (slightly) bigger example
            1,101,5.0   3,111,2.5   6,103,2.0
            1,102,3.0   4,101,5.0   6,106,4.0
            1,103,2.5   4,103,3.0   6,113,3.0
            1,109,3.5   4,104,4.5   6,115,5.0
            1,112,4.0   4,106,4.0   7,103,4.5
            2,101,2.0   4,109,2.0   7,104,2.5
            2,102,2.5   4,111,2.5   7,108,4.0
            2,103,5.0   5,101,4.0   7,109,3.5
            2,104,2.0   5,102,3.0   7,110,3.5
            2,107,4.5   5,103,2.0   7,112,2.5
            2,113,3.5   5,104,4.0   8,101,2.0
            3,101,2.5   5,105,3.5   8,105,4.0
            3,104,4.0   5,106,4.0   8,106,4.5
            3,105,4.5   5,109,3.0   8,110,3.0
            3,107,5.0   5,112,4.0   8,114,5.0
            3,115,4.0   6,101,4.5   8,115,3.5
A pictorial representation
[Figure: graph of users 1-8 and books 101-115]

                            Clearly, not a viable option
Mahout to the rescue
What is Apache Mahout
• Apache Mahout
  • A machine learning library
  • Works with Apache Hadoop
• Use cases:
  • Recommenders
  • Clustering
  • Classification
Recommenders in Mahout
• Recommenders use data culled from user
  behavior
• Recommending using Mahout
  • Similarity between users or items
    •   Expressed as a number, typically between 0 and 1
        (correlation measures can be negative)
  • Neighborhood of users/items
  • Recommendation using this info and an algorithm
    •   Generic
    •   Specialized
Similarity
• Various algorithms:
  •   Euclidean distance
  •   Pearson correlation
  •   Cosine measure
  •   Spearman correlation
  •   Tanimoto coefficient
  •   Log-likelihood
• Effectiveness dependent on the input data
• Influences running time and memory
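To make three of these measures concrete, here is a plain-Python sketch (not Mahout's implementations) over the example ratings. The `1 / (1 + distance)` form of Euclidean similarity is one common convention and an assumption here. Pearson correlation ranges from -1 to 1 and is undefined when two users share fewer than two books.

```python
import math

# Ratings from the "Example data" slide: user -> {book: rating}.
prefs = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}

def co_rated(a, b):
    return sorted(set(prefs[a]) & set(prefs[b]))

def pearson(a, b):
    """Correlation of the two users' ratings over co-rated books."""
    books = co_rated(a, b)
    if len(books) < 2:
        return None  # not enough overlap to correlate
    xs = [prefs[a][i] for i in books]
    ys = [prefs[b][i] for i in books]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var) if var > 0 else None

def euclidean_similarity(a, b):
    """1 / (1 + distance) over co-rated books; 1.0 means identical ratings."""
    books = co_rated(a, b)
    if not books:
        return None
    d = math.sqrt(sum((prefs[a][i] - prefs[b][i]) ** 2 for i in books))
    return 1.0 / (1.0 + d)

def tanimoto(a, b):
    """Overlap of rated books, ignoring the rating values."""
    sa, sb = set(prefs[a]), set(prefs[b])
    return len(sa & sb) / len(sa | sb)

for other in (2, 3, 4, 5):
    print(other, pearson(1, other), euclidean_similarity(1, other),
          tanimoto(1, other))
```

Users 1 and 3 share only book 101, so their Pearson correlation is undefined while their Tanimoto coefficient is still computable. This is one way in which effectiveness depends on the input data.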
Neighborhood
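• Nearest N neighborhood (say, 4):
  [Figure: user U surrounded by users 1-5; the 4 nearest are selected]
• Threshold neighborhood (say, > 0.8):
  [Figure: user U surrounded by users 1-5; those with similarity above 0.8 are selected]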
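The two neighborhood styles differ only in how they cut the list of similar users. A minimal sketch in plain Python, where the similarity values for users 1-5 relative to U are hypothetical numbers made up for illustration:

```python
# Hypothetical similarities of users 1-5 to the target user U.
similarity_to_u = {1: 0.90, 2: 0.85, 3: 0.40, 4: 0.95, 5: 0.70}

def nearest_n(sims, n):
    """Fixed-size neighborhood: the n most similar users, best first."""
    return sorted(sims, key=lambda u: sims[u], reverse=True)[:n]

def threshold(sims, minimum):
    """Variable-size neighborhood: every user above the similarity cutoff."""
    return sorted(u for u, s in sims.items() if s > minimum)

print(nearest_n(similarity_to_u, 4))    # [4, 1, 2, 5]
print(threshold(similarity_to_u, 0.8))  # [1, 2, 4]
```

Nearest-N always returns N users even if some are only weakly similar; a threshold can return very few users, or none, when the data is sparse.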
Recommender
• Recommenders
  • Generic recommender
    •   User based
    •   Item based
  • Slope-one recommender
  • Singular Value Decomposition based
  • Linear interpolation based
  • Cluster-based
• Recommender rescorer
• Recommender evaluator
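To show how similarity and neighborhood combine in a user-based recommender, here is a plain-Python sketch of the idea (not Mahout's Java API). It uses Pearson similarity, a nearest-2 neighborhood, and a similarity-weighted average of the neighbors' ratings. Keeping only positively correlated neighbors is an assumption of the sketch.

```python
import math

# Ratings from the "Example data" slide: user -> {book: rating}.
prefs = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}

def pearson(a, b):
    books = sorted(set(prefs[a]) & set(prefs[b]))
    if len(books) < 2:
        return None
    xs = [prefs[a][i] for i in books]
    ys = [prefs[b][i] for i in books]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var) if var > 0 else None

def neighbors(user, n):
    """The n most similar users with a defined, positive similarity."""
    sims = {}
    for other in prefs:
        if other == user:
            continue
        s = pearson(user, other)
        if s is not None and s > 0:
            sims[other] = s
    nearest = sorted(sims, key=sims.get, reverse=True)[:n]
    return {u: sims[u] for u in nearest}

def recommend(user, n_neighbors=2):
    """Estimate ratings for unowned books from the neighbors' ratings."""
    hood = neighbors(user, n_neighbors)
    owned = set(prefs[user])
    candidates = {book for u in hood for book in prefs[u]} - owned
    estimates = {}
    for book in candidates:
        raters = [(s, prefs[u][book]) for u, s in hood.items()
                  if book in prefs[u]]
        estimates[book] = (sum(s * r for s, r in raters)
                           / sum(s for s, _ in raters))
    return sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))

for book, estimate in recommend(1):
    print(book, round(estimate, 2))
```

For user 1 the neighbors are users 4 and 5, and book 104 comes out on top with an estimated rating of about 4.26.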
A real-life Web application
• News aggregator-cum-reader
  •   Fetches news from a news service
  •   Shows the news in a uniform UI
  •   Lets readers read, like/dislike and comment on news
  •   Lets readers link social networks and share
• Make this a personalized newspaper
  •   Track user actions
  •   Derive and store preferences
  •   Generate recommendations
  •   Leverage social accounts, etc.
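One way to derive preferences from tracked actions, sketched in plain Python. The action names, the values assigned to them, and the rule that a plain read never overwrites a stronger signal are all hypothetical choices for illustration, not the application's actual logic.

```python
# Hypothetical mapping from a tracked action to a preference value (1-5).
ACTION_VALUES = {"read": 3.0, "comment": 4.0, "like": 5.0, "dislike": 1.0}

def derive_preferences(log):
    """Collapse an action log into one preference per (user, article).

    log: iterable of (user, article, action) tuples in time order.
    The latest action wins, except that a plain read never overwrites
    a value that is already stored.
    """
    prefs = {}
    for user, article, action in log:
        key = (user, article)
        if action == "read" and key in prefs:
            continue
        prefs[key] = ACTION_VALUES[action]
    return prefs

log = [
    ("u1", "a1", "read"),
    ("u1", "a1", "like"),
    ("u1", "a2", "read"),
    ("u2", "a1", "dislike"),
    ("u2", "a1", "read"),
    ("u2", "a3", "comment"),
]
prefs = derive_preferences(log)

# Emit "user,item,value" rows, the same shape as the book store example.
rows = [f"{user},{article},{value}" for (user, article), value in prefs.items()]
print("\n".join(rows))
```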
Overall design
• Clients
  •   Third party applications
  •   Phone/tablet applications
  •   Web application
• Controller API (REST)
• Services, each reached over REST
  •   User, application data (MySQL)
  •   News aggregation, storage (HBase)
  •   Preferences, recommender (Mahout)
Recommender
• REST (Grizzly, Tomcat)
• REST service
  •   Fetch recommendations
  •   Input user actions
• Database (MySQL)
• Input table dump
• Recommender (offline, run periodically)
How to extract data – one dimension
[Chart: news article readership (log scale) against number of news articles, 1 to 9. Plotted values fall steeply: 4299, 511, 128, 51, 13, then single digits.]
How to extract data – add dimensions
[Chart: news article readership and topic readership (log scale) against number of news articles / topics, 1 to 57.]
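A toy sketch of why adding a dimension helps, in plain Python. Readers rarely read the same article, but they often read articles on the same topic, so rolling articles up to topics gives more intersections between readers. The readers, articles, and topic labels here are invented for illustration.

```python
from itertools import combinations

# Hypothetical data: which articles each reader read, and each article's topic.
reads = {"a": [1, 2], "b": [3], "c": [4, 5], "d": [2]}
topic_of = {1: "sports", 2: "politics", 3: "sports", 4: "politics", 5: "tech"}

def overlapping_pairs(items_by_reader):
    """Count pairs of readers who have at least one item in common."""
    return sum(
        1
        for x, y in combinations(items_by_reader, 2)
        if set(items_by_reader[x]) & set(items_by_reader[y])
    )

by_article = reads
by_topic = {reader: {topic_of[a] for a in articles}
            for reader, articles in reads.items()}

print(overlapping_pairs(by_article))  # 1
print(overlapping_pairs(by_topic))    # 4
```

With only article-level data a single pair of readers can be compared; at topic level four pairs can.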
How more data helps
[Charts: number of readers with x articles each and number of readers with x topics each, against number of news articles/topics, in three panels covering 0-800, 5-85, and 95-395.]
Learnings
• Know thy user
  • Frequency of visits
  • Preference logic with respect to the user
• Know thy items
  •   Should have enough items per user
  •   Maximize items per action
  •   Should have enough intersections
  •   Should not be transient
• Use tweaking abilities
• Sharpen the saw
Questions?
Thank you
        viraj@gslab.com
viraj.paripatyadar@gmail.com
