Mahout becomes a researcher

Kris Jack, PhD
Senior Data Mining Engineer

Overview

➔ What's Mendeley?
➔ Applications of Mahout's Recommender
➔ Under Mahout's Bonnet
➔ Mahout's Research Career so Far
➔ Conclusions

What's Mendeley?

➔ Mendeley is a data platform for researchers
    ➔ We're bringing together researchers and the research that they produce from all over the world
    ➔ We're structuring this data in a machine readable format
    ➔ We're opening this data up for you to build applications on top of it using our API
    ➔ These applications help researchers to do even better research and become more productive
➔ How are we building our community?

Mendeley provides tools to help users...

...organise their research
    ➔ Reference management
    ➔ Cite-as-you-write
    ➔ Full-text article search
    ➔ Digitalised annotations

...collaborate with one another
    ➔ Research network
    ➔ Professional research groups

...discover new research
    ➔ Mendeley Suggest
    ➔ Personalised article recommendations
    ➔ Weekly batch of 10 recommended articles
    ➔ Collaborative Filtering
    ➔ The more data, the better

1.5 million+ users; 50m research articles

The 20 largest user bases:
    University of Cambridge
    Stanford University
    MIT
    University of Michigan
    Harvard University
    University of Oxford
    Sao Paulo University
    Imperial College London
    University of Edinburgh
    Cornell University
    University of California at Berkeley
    RWTH Aachen
    Columbia University
    Georgia Tech
    University of Wisconsin
    UC San Diego
    University of California at LA
    University of Florida
    University of North Carolina

We need a recommender that scales up, coping with our data and future growth

Applications of Mahout's Recommender

Mahout use cases:

➔ Retrieve related items in large collections
    http://www.slideshare.net/kryton/the-data-layer
➔ Discover relevant items that you may have overlooked
    http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
➔ Find love!
    ➔ Mahout implements collaborative filtering, a surprisingly powerful algorithm
    http://www.speeddate.com/apps/site/views/mp/technology.php
➔ Mendeley Suggest
    ➔ Discover new research
    ➔ Fill in gaps in your library
    ➔ Your personal advisor
    http://krisjack.blogspot.co.uk/2012/02/your-very-own-personalised-research.html

Under Mahout's Bonnet

Generating recommendations through matrix multiplication

This is item-based recommendation, as similarity is computed between items, not users

Not convinced? Try reading these...
    Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749. Piscataway, NJ, USA.
    http://www.slideshare.net/srowen/collaborative-filtering-at-scale-2
    http://krisjack.blogspot.co.uk/2012/04/under-bonnet-of-mahouts-item-based.html

[Figure: Input (all user preferences). A matrix of Research Articles (rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2) by Researchers (columns: Turing, Babbage, Einstein, Newton). At full scale: 1.5M researchers, 50M research articles, 300M prefs.]

item.RecommenderJob
    1. Prepare preference matrix (steps 1-3)
    2. Generate similarity matrix (steps 4-6)
    3. Multiply matrices (steps 7-10)

[Figure: All User Preferences (item x user), Research Articles by Researchers; A User's Preferences (item x user), the column for Turing; Item Similarity (item x item), Research Articles by Research Articles.]

[Figure: Item Similarity (Research Articles x Research Articles), alongside the Input (all user preferences), Research Articles by Researchers.]

                  Comp Sci 1   Comp Sci 2   Physics 1   Physics 2
    Comp Sci 1        2            1            0           0
    Comp Sci 2        1            1            0           0
    Physics 1         0            0            2           2
    Physics 2         0            0            2           2

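One way to arrive at a similarity matrix like the one above is to count, for each pair of articles, how many researchers have both. The sketch below does that in plain Java. The preference matrix in it is hypothetical: the slides do not give the individual preferences, so these values were chosen only because they reproduce the 4 x 4 matrix shown. It is not Mahout's code, and Mahout supports other similarity measures as well.

```java
import java.util.Arrays;

/** Sketch: item-item similarity as co-occurrence counts (preference matrix times its transpose). */
final class ItemCooccurrence {

    /** prefs[i][u] is 1 if user u has item i in their library, else 0. */
    static int[][] similarity(int[][] prefs) {
        int items = prefs.length;
        int[][] sim = new int[items][items];
        for (int a = 0; a < items; a++) {
            for (int b = 0; b < items; b++) {
                int shared = 0;
                // Count the users who have both item a and item b.
                for (int u = 0; u < prefs[a].length; u++) {
                    shared += prefs[a][u] * prefs[b][u];
                }
                sim[a][b] = shared;
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Hypothetical libraries; columns are Turing, Babbage, Einstein, Newton.
        int[][] prefs = {
            {1, 1, 0, 0}, // Comp Sci 1
            {1, 0, 0, 0}, // Comp Sci 2
            {0, 0, 1, 1}, // Physics 1
            {0, 0, 1, 1}, // Physics 2
        };
        for (int[] row : similarity(prefs)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With these assumed preferences, the diagonal holds each article's reader count and the off-diagonal entries hold the number of shared readers, which is why the computer science and physics articles form two separate blocks.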
[Figure: Item Similarity (item x item) X A User's Preferences (item x user, Turing) = Recommendations (item x user, Turing).]

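The multiplication step can be sketched in a few lines of plain Java. The similarity matrix is the one from the slides; the user's preference vector is an assumption (a hypothetical researcher who has only Comp Sci 1), since the slides do not give Turing's values. Filtering out articles the user already has and ranking the rest is also an illustrative choice, not a description of Mahout's internals.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: recommendations = item similarity (item x item) times one user's preferences. */
final class ItemBasedScores {

    /** Multiplies the similarity matrix by a single user's preference column. */
    static int[] score(int[][] sim, int[] userPrefs) {
        int[] scores = new int[sim.length];
        for (int i = 0; i < sim.length; i++) {
            for (int j = 0; j < userPrefs.length; j++) {
                scores[i] += sim[i][j] * userPrefs[j];
            }
        }
        return scores;
    }

    /** Returns indices of items the user lacks, highest score first, dropping zero scores. */
    static List<Integer> recommend(int[] scores, int[] userPrefs) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (userPrefs[i] == 0 && scores[i] > 0) {
                candidates.add(i);
            }
        }
        candidates.sort((a, b) -> Integer.compare(scores[b], scores[a]));
        return candidates;
    }

    public static void main(String[] args) {
        // Item similarity from the slides (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2).
        int[][] sim = {
            {2, 1, 0, 0},
            {1, 1, 0, 0},
            {0, 0, 2, 2},
            {0, 0, 2, 2},
        };
        int[] userPrefs = {1, 0, 0, 0}; // hypothetical user: has Comp Sci 1 only
        int[] scores = score(sim, userPrefs);
        System.out.println(recommend(scores, userPrefs)); // prints [1], i.e. Comp Sci 2
    }
}
```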
Running on Amazon's Elastic MapReduce

On demand use and easy to cost

Mahout's Research Career so Far

Mendeley Suggest
Mahout's Performance

[Figure: scatter plot. x-axis: No. Good Recommendations/10 (0 to 3); y-axis: Normalised Amazon Hours (0 to 7K). Corners: Costly & Bad (top left), Costly & Good (top right), Cheap & Bad (bottom left), Cheap & Good (bottom right).]

    Orig. item-based: 6.5K hours, 1.5 good recommendations
    Cust. item-based: 2.4K hours, 1.5 good recommendations
    Orig. item-based to cust. item-based: -4.1K hours (63%)

Reducing processing time and cost

➔ Mahout's recommender is already efficient
    ➔ but your data may have unusual properties
➔ We got improvements by:
    ➔ tuning Hadoop's mapper and reducer allocation over the 10 steps in the RecommenderJob
    ➔ using an appropriate partitioner

Task Allocation: 37 hours to complete

1 reducer allocated, despite having 48 available...

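Task Allocation

Allocating more reducers on a per job basis

    // numReducers: how many reduce tasks this job should run
    job.getConfiguration().setInt(
        "mapred.reduce.tasks",
        numReducers);

Allocating more mappers on a per job basis

    // a smaller maximum split size yields more input splits, and so more mappers
    job.getConfiguration().set(
        "mapred.max.split.size",
        String.valueOf(splitSize));
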
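To see why lowering the maximum split size allocates more mappers, it helps to work through the arithmetic. The sketch below uses the split-size formula from Hadoop's new-API `FileInputFormat`, `max(minSize, min(maxSize, blockSize))`, and then estimates the number of splits for a single file. The file and block sizes are hypothetical, and the estimate is a simplification: it ignores the small slack Hadoop allows on the final split and assumes one splittable file.

```java
/** Sketch: how a maximum split size translates into a number of map tasks. */
final class SplitSizing {

    /** Split size as computed by new-API FileInputFormat: max(minSize, min(maxSize, blockSize)). */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Approximate number of splits (and so mappers) for one splittable file. */
    static long approxSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long fileSize = 1024 * mib; // hypothetical 1 GiB input file
        long blockSize = 64 * mib;  // hypothetical HDFS block size

        // No effective maximum: splits follow the block size.
        long defaultSplit = splitSize(blockSize, 1, Long.MAX_VALUE);
        // Maximum split size capped at 16 MiB: four times as many splits.
        long cappedSplit = splitSize(blockSize, 1, 16 * mib);

        System.out.println(approxSplits(fileSize, defaultSplit)); // 16
        System.out.println(approxSplits(fileSize, cappedSplit));  // 64
    }
}
```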
Task Allocation: 37 hours → 14 hours to complete

From 1 → 40 reducers

Partitioners: 14 hours to complete

[Figure labels: ~50KB, ~500MB]

    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(...);
    InputSampler.writePartitionFile(conf, sampler);
    conf.setPartitionerClass(TotalOrderPartitioner.class);

http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/

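The idea behind sampling keys and then partitioning by range can be illustrated without Hadoop. The sketch below is a conceptual stand-in, not Hadoop's `TotalOrderPartitioner`: it sorts a sample of keys, picks cut points at even quantiles, and routes each key to the range it falls in. The class name, method names, and sample keys are all hypothetical.

```java
import java.util.Arrays;

/** Conceptual sketch of sample-based range partitioning (not Hadoop's implementation). */
final class RangePartitionSketch {

    /** Picks partitions - 1 cut points at even quantiles of the sampled keys. */
    static int[] cutPoints(int[] sample, int partitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] cuts = new int[partitions - 1];
        for (int p = 1; p < partitions; p++) {
            cuts[p - 1] = sorted[p * sorted.length / partitions];
        }
        return cuts;
    }

    /** Partition index for a key: the number of cut points less than or equal to it. */
    static int partitionFor(int key, int[] cuts) {
        int pos = Arrays.binarySearch(cuts, key);
        // Found: the key equals cuts[pos], so it starts partition pos + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    public static void main(String[] args) {
        // Hypothetical skewed keys: many small values and a few large ones.
        int[] keys = {1, 2, 3, 4, 5, 6, 7, 8, 100, 200, 300, 400};
        int[] cuts = cutPoints(keys, 4);
        int[] load = new int[4];
        for (int key : keys) {
            load[partitionFor(key, cuts)]++;
        }
        System.out.println(Arrays.toString(cuts)); // [4, 7, 200]
        System.out.println(Arrays.toString(load)); // [3, 3, 3, 3]
    }
}
```

Because the cut points follow the observed key distribution, each partition receives a similar number of keys even when the key values themselves are unevenly spread.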
Partitioners: 14 hours → 2 hours to complete

Evenly distributed

[Figure: the matrix multiplication diagram again (Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user)), now with a User Similarity (user x user) matrix, Researchers by Researchers, added.]

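The slides do not spell out the user-based computation, so the following is only a generic sketch of user-based collaborative filtering, not Mendeley's implementation. It counts shared articles between pairs of researchers to build a user x user similarity matrix, then scores each article for one researcher by summing the similarities of the other researchers who have it. The preference matrix is hypothetical.

```java
import java.util.Arrays;

/** Generic sketch of user-based scoring; an assumption, not Mendeley's implementation. */
final class UserBasedScores {

    /** User-user similarity as the number of items two users share; prefs is item x user. */
    static int[][] userSimilarity(int[][] prefs) {
        int users = prefs[0].length;
        int[][] sim = new int[users][users];
        for (int u = 0; u < users; u++) {
            for (int v = 0; v < users; v++) {
                for (int[] itemRow : prefs) {
                    sim[u][v] += itemRow[u] * itemRow[v];
                }
            }
        }
        return sim;
    }

    /** Scores every item for one user from the libraries of the other users. */
    static int[] score(int[][] prefs, int[][] userSim, int user) {
        int[] scores = new int[prefs.length];
        for (int i = 0; i < prefs.length; i++) {
            for (int v = 0; v < userSim.length; v++) {
                if (v != user) {
                    scores[i] += prefs[i][v] * userSim[user][v];
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical libraries; columns are Turing, Babbage, Einstein, Newton.
        int[][] prefs = {
            {1, 1, 0, 0}, // Comp Sci 1
            {1, 0, 0, 0}, // Comp Sci 2
            {0, 0, 1, 1}, // Physics 1
            {0, 0, 1, 1}, // Physics 2
        };
        int[][] userSim = userSimilarity(prefs);
        // Scores for Babbage (column 1): Comp Sci 2 is the only unseen article with a score.
        System.out.println(Arrays.toString(score(prefs, userSim, 1))); // [1, 1, 0, 0]
    }
}
```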
Mahout's Performance

[Figure: scatter plot with user-based results added. x-axis: No. Good Recommendations/10 (0 to 3); y-axis: Normalised Amazon Hours (0 to 7K). Corners: Costly & Bad (top left), Costly & Good (top right), Cheap & Bad (bottom left), Cheap & Good (bottom right).]

    Orig. item-based: 6.5K hours, 1.5 good recommendations
    Cust. item-based: 2.4K hours, 1.5 good recommendations
    Orig. user-based: 1K hours, 2.5 good recommendations
    Cust. user-based: 0.3K hours, 2.5 good recommendations

    Orig. item-based to cust. item-based: -4.1K hours (63%)
    Cust. item-based to orig. user-based: -1.4K hours (58%), +1 good recommendation (67%)
    Orig. user-based to cust. user-based: -0.7K hours (70%)
    Orig. item-based to cust. user-based: -6.2K hours (95%), +1 good recommendation (67%)

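The percentages on the chart are relative changes against the starting value, which a few lines of Java can confirm. The helper name below is made up for illustration; the inputs are the figures quoted on the slides.

```java
/** Sketch: reproduces the relative changes quoted on the performance chart. */
final class RelativeChange {

    /** Size of the change from before to after, as a whole percentage of before. */
    static long percentChange(double before, double after) {
        return Math.round(100.0 * Math.abs(after - before) / before);
    }

    public static void main(String[] args) {
        // Normalised Amazon hours, in thousands.
        System.out.println(percentChange(6.5, 2.4)); // orig. item-based to cust. item-based: 63
        System.out.println(percentChange(2.4, 1.0)); // cust. item-based to orig. user-based: 58
        System.out.println(percentChange(1.0, 0.3)); // orig. user-based to cust. user-based: 70
        System.out.println(percentChange(6.5, 0.3)); // orig. item-based to cust. user-based: 95
        // Good recommendations out of 10.
        System.out.println(percentChange(1.5, 2.5)); // item-based to user-based: 67
    }
}
```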
Conclusions

➔ Mahout is doing a great job of powering Mendeley Suggest
    ➔ Large scale data set
    ➔ Excellent for batch processing requirements
➔ We'll soon be feeding our user-based implementation into Mahout
    ➔ User-based can outperform item-based
    ➔ Makes Mahout's offering more rounded
➔ Save resources and money by understanding your data
    ➔ Help Hadoop with task allocation if necessary
    ➔ Partition your data appropriately

We're Hiring!

➔ Hadoop Data Architect
    ➔ design a coherent data model across the company
    ➔ take ownership of our data
    ➔ hands on Hadoop administration
➔ Marie Curie Senior Research Fellow
    ➔ ensure that Mendeley's research catalogue is of high quality
    ➔ research and development opportunity
➔ £500 Finder's Fee if you find someone who we hire
➔ http://www.mendeley.com/careers/

www.mendeley.com

More Related Content

Similar to Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

Lynch & Dirks - Platforms for Open Research - Charleston Conference 2011
Lynch & Dirks  - Platforms for Open Research - Charleston Conference 2011Lynch & Dirks  - Platforms for Open Research - Charleston Conference 2011
Lynch & Dirks - Platforms for Open Research - Charleston Conference 2011Lee Dirks
ย 
DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...
DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...
DataScience Meeting I - Cloud Elephants and Witches: A Big Data Tale from Men...datascience_at
ย 
Using Linked Data as the basis for Learning Resource Recommendation
Using Linked Data as the basis for Learning Resource RecommendationUsing Linked Data as the basis for Learning Resource Recommendation
Using Linked Data as the basis for Learning Resource RecommendationChris Clarke
ย 
Cloud Elephants and Witches: A Big Data Tale from Mendeley
Cloud Elephants and Witches: A Big Data Tale from MendeleyCloud Elephants and Witches: A Big Data Tale from Mendeley
Cloud Elephants and Witches: A Big Data Tale from MendeleyKris Jack
ย 
Teaching with Technology Institute Training
Teaching with Technology Institute TrainingTeaching with Technology Institute Training
Teaching with Technology Institute TrainingEmily Puckett Rodgers
ย 
Wiser Pku Lecture@Life Science School Pku
Wiser Pku Lecture@Life Science School PkuWiser Pku Lecture@Life Science School Pku
Wiser Pku Lecture@Life Science School Pkuguest8ed46d
ย 
Wiserpku Lecture@Life Science School Pku
Wiserpku Lecture@Life Science School PkuWiserpku Lecture@Life Science School Pku
Wiserpku Lecture@Life Science School Pkuwiser pku
ย 
Mendeley: crowdsourcing and recommending research on a large scale
Mendeley: crowdsourcing and recommending research on a large scaleMendeley: crowdsourcing and recommending research on a large scale
Mendeley: crowdsourcing and recommending research on a large scaleKris Jack
ย 
Libraries meet research 2.0
Libraries meet research 2.0Libraries meet research 2.0
Libraries meet research 2.0Guus van den Brekel
ย 
Effective Literature Searching 2011
Effective Literature Searching 2011Effective Literature Searching 2011
Effective Literature Searching 2011Middlesex University
ย 
Prizing Open and Enhancing Research Corpora for Language Teaching
Prizing Open and Enhancing Research Corpora for Language TeachingPrizing Open and Enhancing Research Corpora for Language Teaching
Prizing Open and Enhancing Research Corpora for Language TeachingAlannah Fitzgerald
ย 
Towards a Cloud Library
Towards a Cloud LibraryTowards a Cloud Library
Towards a Cloud LibraryRachel Frick
ย 
Virtual Research Networks : Towards Research 2.0
Virtual Research Networks : Towards Research 2.0Virtual Research Networks : Towards Research 2.0
Virtual Research Networks : Towards Research 2.0Guus van den Brekel
ย 
P2Pvalue Directory: A collaborative resource to map common-based peer produc...
P2Pvalue Directory:  A collaborative resource to map common-based peer produc...P2Pvalue Directory:  A collaborative resource to map common-based peer produc...
P2Pvalue Directory: A collaborative resource to map common-based peer produc...P2Pvalue
ย 
Learning Registry Overview Aug 2 2012
Learning Registry Overview Aug 2 2012Learning Registry Overview Aug 2 2012
Learning Registry Overview Aug 2 2012Jeanne Kitchens
ย 
21stcenturye learningslideshare
21stcenturye learningslideshare21stcenturye learningslideshare
21stcenturye learningslidesharetsimatsima
ย 
Let Your Conscience Be Your Guide: Taming Online Research Guides at the NCSU ...
Let Your Conscience Be Your Guide: Taming Online Research Guides at the NCSU ...Let Your Conscience Be Your Guide: Taming Online Research Guides at the NCSU ...
Let Your Conscience Be Your Guide: Taming Online Research Guides at the NCSU ...Lillian Rigling
ย 
2collab London Online web2.0 after the buzz
2collab London Online web2.0 after the buzz2collab London Online web2.0 after the buzz
2collab London Online web2.0 after the buzzf kersten
ย 

Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley

  • 1. Mahout becomes a researcher Kris Jack, PhD Senior Data Mining Engineer
  • 2. Overview ➔ What's Mendeley? ➔ Applications of Mahout's Recommender ➔ Under Mahout's Bonnet ➔ Mahout's Research Career so Far ➔ Conclusions
  • 4. ➔ Mendeley is a data platform for researchers ➔ We're bringing together researchers and the research that they produce from all over the world ➔ We're structuring this data in a machine-readable format ➔ We're opening this data up for you to build applications on top of it using our API ➔ These applications help researchers to do even better research and become more productive ➔ How are we building our community?
  • 5. Mendeley provides tools to help users... ...organise their research ➔ Reference management ➔ Cite-as-you-write ➔ Full-text article search ➔ Digitalised annotations
  • 6. Mendeley provides tools to help users... ...organise their research ...collaborate with one another ➔ Research network ➔ Professional research groups
  • 7. Mendeley provides tools to help users... ...organise their research ...collaborate with one another ...discover new research ➔ Mendeley Suggest ➔ Personalised article recommendations ➔ Weekly batch of 10 recommended articles ➔ Collaborative Filtering ➔ The more data, the better
  • 8. 1.5 million+ users and 50m research articles; the 20 largest user bases: University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina
  • 9. Mendeley provides tools to help users... ...organise their research ...collaborate with one another ...discover new research. We need a recommender that scales up, coping with our data and future growth
  • 13. Mahout use cases: ➔ Retrieve related items in large collections http://www.slideshare.net/kryton/the-data-layer
  • 14. Mahout use cases: ➔ Retrieve related items in large collections ➔ Discover relevant items that you may have overlooked http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
  • 15. Mahout use cases: ➔ Retrieve related items in large collections ➔ Discover relevant items that you may have overlooked ➔ Find love! ➔ Mahout implements collaborative filtering, a surprisingly powerful algorithm http://www.speeddate.com/apps/site/views/mp/technology.php
  • 16. Mahout use cases: ➔ Retrieve related items in large collections ➔ Discover relevant items that you may have overlooked ➔ Find love! ➔ Mahout implements collaborative filtering, a surprisingly powerful algorithm ➔ Mendeley Suggest ➔ Discover new research ➔ Fill in gaps in your library ➔ Your personal advisor http://krisjack.blogspot.co.uk/2012/02/your-very-own-personalised-research.html
  • 17. Under Mahout's Bonnet
  • 18. Generating recommendations through matrix multiplication. This is item-based recommendation, as similarity is computed between items, not users. Not convinced? Try reading these... Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749. Piscataway, NJ, USA. http://www.slideshare.net/srowen/collaborative-filtering-at-scale-2 http://krisjack.blogspot.co.uk/2012/04/under-bonnet-of-mahouts-item-based.html
  • 19. Input (all user preferences) [diagram: a matrix of Research Articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2) against Researchers (Turing, Babbage, Einstein, Newton)]
  • 20. Input (all user preferences) [diagram: the same matrix at Mendeley's scale: 1.5M researchers, 50M research articles, 300M prefs]
  • 21. item.RecommenderJob: 1. Prep. pref. matrix (steps 1-3); 2. Gen. sim. matrix (steps 4-6); 3. Multiply matrices (steps 7-10). [diagram: All User Preferences (item x user), Research Articles against Researchers]
  • 22. item.RecommenderJob: 1. Prep. pref. matrix (steps 1-3); 2. Gen. sim. matrix (steps 4-6); 3. Multiply matrices (steps 7-10). [diagram: All User Preferences (item x user); A User's Preferences (item x user) for Turing]
  • 23. item.RecommenderJob: 1. Prep. pref. matrix (steps 1-3); 2. Gen. sim. matrix (steps 4-6); 3. Multiply matrices (steps 7-10). [diagram: All User Preferences (item x user); Item Similarity (item x item) with rows 2 1 0 0; 1 1 0 0; 0 0 2 2; 0 0 2 2; A User's Preferences (item x user) for Turing]
  • 24. Item Similarity, derived from the input (all user preferences), Research Articles against Research Articles, with columns in the order Comp Sci 1, Comp Sci 2, Physics 1, Physics 2: Comp Sci 1: 2 1 0 0; Comp Sci 2: 1 1 0 0; Physics 1: 0 0 2 2; Physics 2: 0 0 2 2
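The similarity matrix on slide 24 can be read as co-occurrence counts: each cell is the number of researchers who have both articles. The sketch below is a minimal plain-Java illustration of that reading, not Mahout's RecommenderJob. The class name and the researcher libraries are assumptions; the libraries are simply one assignment that reproduces the numbers on the slide.

```java
import java.util.Arrays;

public class CooccurrenceSketch {
    // Rows are researchers (Turing, Babbage, Einstein, Newton); columns are
    // articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2).
    // ASSUMPTION: these libraries are hypothetical. They are one assignment
    // that yields the item similarity matrix shown on the slide.
    static final int[][] LIBRARIES = {
        {1, 0, 0, 0}, // Turing
        {1, 1, 0, 0}, // Babbage
        {0, 0, 1, 1}, // Einstein
        {0, 0, 1, 1}, // Newton
    };

    // For every pair of articles, count how many researchers have both.
    static int[][] cooccurrence(int[][] libraries) {
        int items = libraries[0].length;
        int[][] sim = new int[items][items];
        for (int[] library : libraries) {
            for (int i = 0; i < items; i++) {
                for (int j = 0; j < items; j++) {
                    sim[i][j] += library[i] * library[j];
                }
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Print the item x item matrix, one row per article.
        for (int[] row : cooccurrence(LIBRARIES)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Co-occurrence is only one possible similarity measure; it is used here because it matches the integer counts in the slide.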
  • 25. item.RecommenderJob: 1. Prep. pref. matrix (steps 1-3); 2. Gen. sim. matrix (steps 4-6); 3. Multiply matrices (steps 7-10). [diagram: from All User Preferences (item x user), Item Similarity (item x item, rows 2 1 0 0; 1 1 0 0; 0 0 2 2; 0 0 2 2) X A User's Preferences (item x user, Turing) = Recommendations (item x user, Turing)]
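The multiplication on slide 25 can be sketched in a few lines of plain Java. This is an illustration of the idea only, not Mahout's distributed implementation. The similarity matrix is copied from the slides; the class name, the helper names, and Turing's preference vector (a library containing only Comp Sci 1) are assumptions.

```java
import java.util.Arrays;

public class ItemBasedSketch {
    // Item similarity (item x item) as shown on the slides, in the order
    // Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
    static final int[][] SIMILARITY = {
        {2, 1, 0, 0},
        {1, 1, 0, 0},
        {0, 0, 2, 2},
        {0, 0, 2, 2},
    };

    // Multiply the similarity matrix by one user's preference vector.
    static int[] scores(int[][] similarity, int[] prefs) {
        int[] out = new int[similarity.length];
        for (int i = 0; i < similarity.length; i++) {
            for (int j = 0; j < prefs.length; j++) {
                out[i] += similarity[i][j] * prefs[j];
            }
        }
        return out;
    }

    // Index of the highest-scoring article the user does not already have,
    // or -1 when no unseen article has a positive score.
    static int bestUnseen(int[] scores, int[] prefs) {
        int best = -1;
        for (int i = 0; i < scores.length; i++) {
            if (prefs[i] == 0 && scores[i] > 0
                    && (best == -1 || scores[i] > scores[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // ASSUMPTION: hypothetical library for Turing, holding Comp Sci 1 only.
        int[] turing = {1, 0, 0, 0};
        int[] result = scores(SIMILARITY, turing);
        System.out.println(Arrays.toString(result));
        System.out.println(bestUnseen(result, turing));
    }
}
```

With these assumed preferences, the only unseen article with a positive score is Comp Sci 2, which is the behaviour one would expect from an item-based recommender on this toy data.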
  • 26. Running on Amazon's Elastic MapReduce: on-demand use and easy to cost
  • 27. Mahout's Research Career so Far
  • 29. Mahout's Performance [chart: y-axis Normalised Amazon Hours; x-axis No. Good Recommendations/10]
  • 30. Mahout's Performance [chart quadrants: Costly & Bad (top left), Costly & Good (top right), Cheap & Bad (bottom left), Cheap & Good (bottom right)]
  • 33. Mahout's Performance [chart: y-axis Normalised Amazon Hours, 0 to 7K; x-axis No. Good Recommendations/10, 0 to 3]
  • 34. Mahout's Performance [chart point: Orig. item-based at 6.5K hours, 1.5 good recommendations]
  • 35. Mahout's Performance [chart points: Orig. item-based at 6.5K, 1.5; Cust. item-based at 2.4K, 1.5]
  • 36. Mahout's Performance [chart points: Orig. item-based at 6.5K, 1.5; Cust. item-based at 2.4K, 1.5; change: -4.1K hours (63%)]
  • 37. Reducing processing time and cost ➔ Mahout's recommender is already efficient ➔ but your data may have unusual properties ➔ We got improvements by: ➔ tuning Hadoop's mapper and reducer allocation over the 10 steps in the RecommenderJob ➔ using an appropriate partitioner
  • 38. Task Allocation: 37 hours to complete; 1 reducer allocated, despite having 48 available...
  • 39. Task Allocation. Allocating more reducers on a per job basis: job.getConfiguration().setInt("mapred.reduce.tasks", numReducers); Allocating more mappers on a per job basis: job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));
  • 40. Task Allocation: 37 hours to complete → 14 hours, going from 1 → 40 reducers
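To see why lowering `mapred.max.split.size` on slide 39 allocates more mappers, it helps to work through the split arithmetic. The sketch below is plain Java with no Hadoop dependency. It assumes the split-size rule of Hadoop's file-based input formats, max(minSize, min(maxSize, blockSize)), and ignores the small slack Hadoop allows on the last split. The class name, file size and block size are hypothetical.

```java
public class SplitSizeSketch {
    // Split size rule assumed here: max(minSize, min(maxSize, blockSize)).
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of map tasks for a single input file:
    // the file size divided by the split size, rounded up.
    static long approxMappers(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long file = 1024 * mb;  // hypothetical 1 GB input file
        long block = 64 * mb;   // hypothetical block size

        // No effective maximum: one split per block.
        long defaultSplit = splitSize(1, Long.MAX_VALUE, block);
        System.out.println(approxMappers(file, defaultSplit));

        // Capping the split size at 16 MB gives four times as many splits.
        long cappedSplit = splitSize(1, 16 * mb, block);
        System.out.println(approxMappers(file, cappedSplit));
    }
}
```

The number of reducers has no such derivation; it is whatever value is set in `mapred.reduce.tasks`.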
  • 41. Partitioners: 14 hours to complete
  • 42. Partitioners: 14 hours to complete [figure labels: ~50KB, ~500MB]
  • 43. InputSampler.Sampler<IntWritable, Text> sampler = new InputSampler.RandomSampler<IntWritable, Text>(...); InputSampler.writePartitionFile(conf, sampler); conf.setPartitionerClass(TotalOrderPartitioner.class); http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
  • 44. Partitioners: 14 hours to complete → 2 hours, evenly distributed
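The idea behind sampling keys and partitioning by total order (slide 43) can be illustrated without Hadoop. The sketch below picks split points from a sorted sample so that each partition receives a similar number of keys, then assigns a key to a partition by binary search over those split points. It is a simplified stand-in, not the Hadoop `InputSampler` or `TotalOrderPartitioner` code; the class name, method names and sample keys are hypothetical.

```java
import java.util.Arrays;

public class RangePartitionSketch {
    // Choose numPartitions - 1 split points at evenly spaced positions
    // in the sorted sample.
    static int[] splitPoints(int[] sampleKeys, int numPartitions) {
        int[] sorted = sampleKeys.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return points;
    }

    // Keys below the first split point go to partition 0; a key equal to
    // split point k goes to partition k + 1.
    static int partitionFor(int key, int[] points) {
        int pos = Arrays.binarySearch(points, key);
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Count how many keys land in each partition.
    static int[] partitionCounts(int[] keys, int[] points) {
        int[] counts = new int[points.length + 1];
        for (int key : keys) {
            counts[partitionFor(key, points)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // ASSUMPTION: hypothetical keys with an uneven spread of values.
        int[] keys = {1, 2, 3, 4, 100, 200, 300, 400};
        int[] points = splitPoints(keys, 4);
        System.out.println(Arrays.toString(points));
        System.out.println(Arrays.toString(partitionCounts(keys, points)));
    }
}
```

Because the split points follow the observed key distribution rather than the key values themselves, each of the four partitions in this toy example ends up with the same number of keys.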
  • 45. Mahout's Performance [chart repeated from slide 36: Orig. item-based at 6.5K, 1.5; Cust. item-based at 2.4K, 1.5; change: -4.1K hours (63%)]
  • 46. item.RecommenderJob: 1. Prep. pref. matrix (steps 1-3); 2. Gen. sim. matrix (steps 4-6); 3. Multiply matrices (steps 7-10). [diagram repeated from slide 25: Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user)]
  • 47. [diagram: the slide 25 pipeline relabelled from item to user, with Researchers in place of Research Articles and User Similarity (user x user) alongside Item Similarity (item x item)]
  • 48. Mahout's Performance [chart points: Orig. item-based at 6.5K, 1.5; Cust. item-based at 2.4K, 1.5; Orig. user-based at 1K, 2.5]
  • 49. Mahout's Performance [chart: from Cust. item-based (2.4K, 1.5) to Orig. user-based (1K, 2.5): +1 good recommendation (67%), -1.4K hours (58%)]
  • 50. Mahout's Performance [chart points: Orig. item-based at 6.5K, 1.5; Cust. item-based at 2.4K, 1.5; Orig. user-based at 1K, 2.5; Cust. user-based at 0.3K, 2.5]
  • 51. Mahout's Performance [chart: item-based customisation: -4.1K hours (63%); user-based customisation, from 1K to 0.3K: -0.7K hours (70%)]
  • 52. Mahout's Performance [chart: from Orig. item-based (6.5K, 1.5) to Cust. user-based (0.3K, 2.5): +1 good recommendation (67%), -6.2K hours (95%)]
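The percentages quoted on the performance charts follow directly from the plotted points. The short Java check below recomputes them from the figures on the slides (Normalised Amazon Hours in thousands, and good recommendations out of 10); the class and method names are illustrative only.

```java
public class SavingsSketch {
    // Percentage change from 'before' to 'after', rounded to the nearest integer.
    static long percentChange(double before, double after) {
        return Math.round((after - before) / before * 100.0);
    }

    public static void main(String[] args) {
        // Cost: original vs customised item-based (6.5K to 2.4K hours).
        System.out.println(percentChange(6.5, 2.4));
        // Cost: customised item-based vs original user-based (2.4K to 1K hours).
        System.out.println(percentChange(2.4, 1.0));
        // Cost: original vs customised user-based (1K to 0.3K hours).
        System.out.println(percentChange(1.0, 0.3));
        // Cost: original item-based vs customised user-based (6.5K to 0.3K hours).
        System.out.println(percentChange(6.5, 0.3));
        // Quality: 1.5 to 2.5 good recommendations out of 10.
        System.out.println(percentChange(1.5, 2.5));
    }
}
```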
  • 54. Conclusions ➔ Mahout is doing a great job of powering Mendeley Suggest ➔ Large scale data set ➔ Excellent for batch processing requirements ➔ We'll soon be feeding our user-based implementation into Mahout ➔ User-based can outperform item-based ➔ Makes Mahout's offering more rounded ➔ Save resources and money by understanding your data ➔ Help Hadoop with task allocation if necessary ➔ Partition your data appropriately
  • 55. We're Hiring! ➔ Hadoop Data Architect ➔ design a coherent data model across the company ➔ take ownership of our data ➔ hands on Hadoop administration ➔ Marie Curie Senior Research Fellow ➔ ensure that Mendeley's research catalogue is of high quality ➔ research and development opportunity ➔ £500 Finder's Fee if you find someone who we hire ➔ http://www.mendeley.com/careers/