Scientific Article Recommendation with Mahout

Kris Jack, PhD
Senior Data Mining Engineer
Use Case
➔ Good researchers are on top of their game
➔ Large amount of research produced
➔ Takes time to get at what you need

➔ Help researchers by recommending relevant research
1.5 million+ users; the 20 largest user bases:
    University of Cambridge
    Stanford University
    MIT
    University of Michigan
    Harvard University
    University of Oxford
    Sao Paulo University
    Imperial College London
    University of Edinburgh
    Cornell University
    University of California at Berkeley
    RWTH Aachen
    Columbia University
    Georgia Tech
    University of Wisconsin
    UC San Diego
    University of California at LA
    University of Florida
    University of North Carolina

50m research articles
We need a recommender that scales up, coping with our data and future growth
Questions
➔ How does Mahout's recommender work?
➔ How well does it perform out of the box?
➔ How well does it perform after some tuning?
Mahout's Recommender
Generating recommendations through matrix multiplication

This is item-based recommendation, as similarity is computed between items, not users

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
[Figure: Input (all user preferences) matrix. Columns are researchers (Turing, Babbage, Einstein, Newton, ...), 1.5M in total; rows are research articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2, ...), 50M in total; the matrix holds 300M prefs]
item.RecommenderJob
 1. Prepare preference matrix (1-3)
 2. Generate similarity matrix (4-6)
 3. Multiply matrices (7-10)

[Figure: All User Preferences matrix (item x user), with researchers as columns and research articles as rows]
[Figure: Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user), shown for the researcher Turing, whose column is taken from All User Preferences (item x user)]

Item Similarity (item x item) over the four example research articles:

    2   1   0   0
    1   1   0   0
    0   0   2   2
    0   0   2   2
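The multiplication on this slide can be sketched in plain Java. This is a toy illustration, not Mahout's RecommenderJob code: it assumes co-occurrence counts as the similarity measure and uses hypothetical preferences, chosen so that the resulting similarity matrix reproduces the numbers on the slide. Which researcher holds which article, and the row order of the articles, are assumptions.

```java
import java.util.Arrays;

/** Toy item-based recommender: co-occurrence similarity times one user's preferences. */
final class ItemBasedSketch {

    /** Item x item co-occurrence: sim[i][j] = number of users who have both item i and item j. */
    static int[][] itemSimilarity(int[][] prefs) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[][] sim = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                for (int u = 0; u < users; u++) {
                    sim[i][j] += prefs[i][u] * prefs[j][u];
                }
            }
        }
        return sim;
    }

    /** Scores for one user: similarity matrix times that user's preference column. */
    static int[] scores(int[][] sim, int[][] prefs, int user) {
        int items = prefs.length;
        int[] result = new int[items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                result[i] += sim[i][j] * prefs[j][user];
            }
        }
        return result;
    }

    /** Hypothetical preferences (item x user); assumed rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2. */
    static int[][] examplePrefs() {
        return new int[][] {
            // Turing, Babbage, Einstein, Newton (assumed assignment)
            {1, 1, 0, 0},
            {0, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 1},
        };
    }

    public static void main(String[] args) {
        int[][] prefs = examplePrefs();
        int[][] sim = itemSimilarity(prefs);
        System.out.println("Item similarity: " + Arrays.deepToString(sim));
        System.out.println("Scores for Turing: " + Arrays.toString(scores(sim, prefs, 0)));
    }
}
```

Under these assumed preferences, Turing already has the first article, so the second article would be the only new candidate with a non-zero score.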
How well does it work?

Mendeley Suggest
Running on Amazon's Elastic MapReduce
On-demand use and easy to cost
[Figure: Mahout's Performance. y-axis: Normalised Amazon Hours (0 to 7K); x-axis: No. Good Recommendations/10 (0 to 3). Quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good. One point so far: Orig. item-based at 6.5K hours, 1.5 good recommendations]
Let's tune it!
1. Reduce processing time
2. Improve quality

1. Reduce processing time
➔ Mahout's recommender is already efficient
➔ But your data may have unusual properties
➔ Hadoop may need a helping hand
➔ Let's see what's going on...
Task Allocation
37 hours to complete
1 reducer allocated, despite having 48 available...
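Task Allocation

Allocating more reducers on a per job basis

    // Request numReducers reduce tasks for this job.
    // (Job.setNumReduceTasks(numReducers) has the same effect.)
    job.getConfiguration().setInt("mapred.reduce.tasks", numReducers);

Allocating more mappers on a per job basis

    // A smaller maximum split size (in bytes) yields more input splits,
    // and so more map tasks.
    job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));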
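To see why capping the split size allocates more mappers, here is a rough model in plain Java. It uses the split-size rule from Hadoop's FileInputFormat, max(minSize, min(maxSize, blockSize)), and approximates the number of splits per file by ceiling division. The file size, block size and cap below are hypothetical, and the model ignores the slack Hadoop allows on the last split.

```java
/** Rough model of how a maximum split size drives the number of map tasks. */
final class SplitCountSketch {

    /** Split-size rule: max(minSize, min(maxSize, blockSize)). */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Approximate number of splits (map tasks) for one file, ignoring slack on the last split. */
    static long approxSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 1024 * mb; // hypothetical 1 GB input file
        long blockSize = 64 * mb;  // hypothetical HDFS block size

        // No cap: the split size falls back to the block size.
        long before = approxSplits(fileSize, splitSize(blockSize, 1, Long.MAX_VALUE));
        // 16 MB cap: four times as many splits, so four times as many mappers.
        long after = approxSplits(fileSize, splitSize(blockSize, 1, 16 * mb));

        System.out.println(before + " splits without a cap, " + after + " with a 16 MB cap");
    }
}
```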
Task Allocation
37 hours to complete → 14 hours
From 1 → 40 reducers
Partitioners
14 hours to complete

[Figure: partition sizes, ranging from ~50KB to ~500MB]
    // Sample the input keys to choose split points, write them to a partition
    // file, and let TotalOrderPartitioner route keys by those split points.
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(...);
    InputSampler.writePartitionFile(conf, sampler);
    conf.setPartitionerClass(TotalOrderPartitioner.class);

http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
Partitioners
14 hours to complete → 2 hours
Evenly distributed
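The idea behind sampling-based partitioning can be sketched without Hadoop. The class below is a hypothetical stand-in, not the InputSampler or TotalOrderPartitioner source: it sorts a sample of keys, takes evenly spaced split points from it, and assigns each key to a partition by binary search. Because the split points follow the sampled key distribution, skewed keys end up spread across partitions instead of piling into one.

```java
import java.util.Arrays;

/** Toy range partitioner: split points come from a sorted sample of the keys. */
final class RangePartitionSketch {

    /** Picks numPartitions - 1 evenly spaced split points from a sample of keys. */
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return points;
    }

    /** Partition index for a key; assumes the split points are sorted and distinct. */
    static int partition(int key, int[] points) {
        int pos = Arrays.binarySearch(points, key);
        // Found at index pos: the key belongs just after that split point.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        int[] sample = {1, 2, 2, 3, 3, 3, 50, 900}; // hypothetical, heavily skewed keys
        int[] points = splitPoints(sample, 4);
        System.out.println("Split points: " + Arrays.toString(points));
        for (int key : sample) {
            System.out.println(key + " -> partition " + partition(key, points));
        }
    }
}
```

By contrast, cutting the key range 1 to 900 into four equal-width ranges would send seven of these eight keys to the first partition.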
[Figure: Mahout's Performance, Normalised Amazon Hours against No. Good Recommendations/10]

    Orig. item-based    6.5K hours, 1.5 good recommendations
    Cust. item-based    2.4K hours, 1.5 good recommendations    (-4.1K hours, 63%)
2. Improve quality
➔ Mahout provides item-based collaborative filtering (CF)
➔ We have many more items than users
➔ In that case, user-based CF is typically more appropriate
    ➔ So let's make one!
[Figure: the same multiplication slide with "user" written over "item" in item.RecommenderJob and "Researchers" written over "Research Articles": the Item Similarity (item x item) matrix is replaced by User Similarity (user x user)]
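A user-based variant of the same multiplication can be sketched as follows. This is an illustration under stated assumptions, not the MAHOUT-1004 implementation: similarity is again taken to be co-occurrence, now counted between researchers, and the preferences are hypothetical. With 1.5M users against 50M articles, the user x user matrix has far fewer rows and columns than the item x item one.

```java
import java.util.Arrays;

/** Toy user-based recommender: user x user co-occurrence drives the scores. */
final class UserBasedSketch {

    /** User x user co-occurrence: sim[u][v] = number of items that users u and v share. */
    static int[][] userSimilarity(int[][] prefs) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[][] sim = new int[users][users];
        for (int u = 0; u < users; u++) {
            for (int v = 0; v < users; v++) {
                for (int i = 0; i < items; i++) {
                    sim[u][v] += prefs[i][u] * prefs[i][v];
                }
            }
        }
        return sim;
    }

    /** Item scores for one user: preferences (item x user) times that user's similarity column. */
    static int[] scores(int[][] prefs, int[][] sim, int user) {
        int[] result = new int[prefs.length];
        for (int i = 0; i < prefs.length; i++) {
            for (int v = 0; v < sim.length; v++) {
                result[i] += prefs[i][v] * sim[v][user];
            }
        }
        return result;
    }

    /** Hypothetical preferences (item x user); assumed columns: Turing, Babbage, Einstein, Newton. */
    static int[][] examplePrefs() {
        return new int[][] {
            {1, 1, 0, 0}, // Comp Sci 1 (assumed row order)
            {0, 1, 0, 0}, // Comp Sci 2
            {0, 0, 1, 1}, // Physics 1
            {0, 0, 1, 1}, // Physics 2
        };
    }

    public static void main(String[] args) {
        int[][] prefs = examplePrefs();
        int[][] sim = userSimilarity(prefs);
        System.out.println("User similarity: " + Arrays.deepToString(sim));
        System.out.println("Scores for Turing: " + Arrays.toString(scores(prefs, sim, 0)));
    }
}
```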
[Figure: Mahout's Performance. y-axis: Normalised Amazon Hours (0 to 7K); x-axis: No. Good Recommendations/10 (0 to 3). Quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good]

    Orig. item-based    6.5K hours, 1.5 good recommendations
    Cust. item-based    2.4K hours, 1.5 good recommendations    (-4.1K hours, 63%)
    Orig. user-based    1K hours, 2.5 good recommendations      (-1.4K hours, 58%; +1 good recommendation, 67%)
    Cust. user-based    0.3K hours, 2.5 good recommendations    (-0.7K hours, 70%)

    Orig. item-based to Cust. user-based overall: -6.2K hours (95%), +1 good recommendation (67%)
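The percentages on the chart follow from the plotted points. A small sketch that recomputes them, with each change expressed relative to the earlier value and rounded to a whole percent (the helper name is hypothetical):

```java
/** Recomputes the relative changes quoted on the performance chart. */
final class SavingsSketch {

    /** Size of the change from before to after, as a rounded percentage of before. */
    static long percentChange(double before, double after) {
        return Math.round(100.0 * Math.abs(after - before) / before);
    }

    public static void main(String[] args) {
        // Normalised Amazon hours, in thousands.
        System.out.println("Orig. item -> cust. item: " + percentChange(6.5, 2.4) + "% fewer hours");
        System.out.println("Cust. item -> orig. user: " + percentChange(2.4, 1.0) + "% fewer hours");
        System.out.println("Orig. user -> cust. user: " + percentChange(1.0, 0.3) + "% fewer hours");
        System.out.println("Orig. item -> cust. user: " + percentChange(6.5, 0.3) + "% fewer hours");
        // Good recommendations out of 10.
        System.out.println("Item-based -> user-based: " + percentChange(1.5, 2.5) + "% more good recommendations");
    }
}
```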
Conclusions
➔ Mahout is doing a great job of powering Mendeley Suggest
    ➔ Large scale data set
    ➔ Good quality recommendations
➔ Tuning helps
    ➔ Help Hadoop with task allocation if necessary
    ➔ Partition your data appropriately
    ➔ We save 95% of resources
➔ Use an appropriate algorithm
    ➔ Item- vs user-based (MAHOUT-1004)
    ➔ We increase precision by 66.6%
                                   http://www.mendeley.com/profiles/kris-jack/

 
Machine Learning @ Mendeley
Machine Learning @ MendeleyMachine Learning @ Mendeley
Machine Learning @ Mendeley
 
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
 
Mendeley Suggest: What will you read next?
Mendeley Suggest: What will you read next?Mendeley Suggest: What will you read next?
Mendeley Suggest: What will you read next?
 
Mendeley Suggest: Engineering a Personalised Article Recommender System
Mendeley Suggest: Engineering a Personalised Article Recommender SystemMendeley Suggest: Engineering a Personalised Article Recommender System
Mendeley Suggest: Engineering a Personalised Article Recommender System
 
Mendeley's Data and Perspectives on Data Challenges
Mendeley's Data and Perspectives on Data ChallengesMendeley's Data and Perspectives on Data Challenges
Mendeley's Data and Perspectives on Data Challenges
 
Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley
Mahout Becomes a Researcher: Large Scale Recommendations at MendeleyMahout Becomes a Researcher: Large Scale Recommendations at Mendeley
Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley
 
improving explicit preference entry by visualising data similarities
improving explicit preference entry by visualising data similaritiesimproving explicit preference entry by visualising data similarities
improving explicit preference entry by visualising data similarities
 
Etude de la pertinence de critères de recherche en recherche d'informations s...
Etude de la pertinence de critères de recherche en recherche d'informations s...Etude de la pertinence de critères de recherche en recherche d'informations s...
Etude de la pertinence de critères de recherche en recherche d'informations s...
 
A Computational Model of Staged Language Acquisition
A Computational Model of Staged Language AcquisitionA Computational Model of Staged Language Acquisition
A Computational Model of Staged Language Acquisition
 
From Syllables to Syntax: Investigating Staged Linguistic Development through...
From Syllables to Syntax: Investigating Staged Linguistic Development through...From Syllables to Syntax: Investigating Staged Linguistic Development through...
From Syllables to Syntax: Investigating Staged Linguistic Development through...
 
A Collaborative Tool for the Computational Modelling of Child Language Acquis...
A Collaborative Tool for the Computational Modelling of Child Language Acquis...A Collaborative Tool for the Computational Modelling of Child Language Acquis...
A Collaborative Tool for the Computational Modelling of Child Language Acquis...
 
Mendeley: crowdsourcing and recommending research on a large scale
Mendeley: crowdsourcing and recommending research on a large scaleMendeley: crowdsourcing and recommending research on a large scale
Mendeley: crowdsourcing and recommending research on a large scale
 
Recommendation Engines for Scientific Literature
Recommendation Engines for Scientific LiteratureRecommendation Engines for Scientific Literature
Recommendation Engines for Scientific Literature
 
Cloud Elephants and Witches: A Big Data Tale from Mendeley
Cloud Elephants and Witches: A Big Data Tale from MendeleyCloud Elephants and Witches: A Big Data Tale from Mendeley
Cloud Elephants and Witches: A Big Data Tale from Mendeley
 

Último

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Scientific Article Recommendation with Mahout

  • 1. Scientific Article Recommendation with Mahout Kris Jack, PhD Senior Data Mining Engineer
  • 2. Use Case ➔ Good researchers are on top of their game ➔ Large amount of research produced ➔ Takes time to get at what you need ➔ Help researchers by recommending relevant research
  • 3. 1.5 million+ users; 50m research articles. The 20 largest user bases: University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina
  • 4. 1.5 million+ users; 50m research articles. We need a recommender that scales up, coping with our data and future growth
  • 5.
  • 6. Questions ➔ How does Mahout's recommender work? ➔ How well does it perform out of the box? ➔ How well does it perform after some tuning?
  • 8. Generating recommendations through matrix multiplication. This is item-based recommendation, as similarity is computed between items, not users: org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
  • 9. Input (all user preferences): a matrix of Research Articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2) by Researchers (Turing, Babbage, Einstein, Newton)
  • 10. Input (all user preferences) at scale: 50M research articles by 1.5M researchers, 300M prefs
  • 11. item.RecommenderJob: 1. Prep. pref. matrix (1-3); 2. Gen. sim. matrix (4-6); 3. Multiply matrices (7-10). Input: All User Preferences (item x user), Research Articles by Researchers
  • 12. item.RecommenderJob, steps as on slide 11. From All User Preferences (item x user), take A User's Preferences (item x user): Turing's Research Articles
  • 13. item.RecommenderJob, steps as on slide 11. Item Similarity (item x item), Research Articles by Research Articles, shown next to A User's Preferences (item x user) for Turing
  • 14. item.RecommenderJob, steps as on slide 11. Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user) for Turing
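
The multiplication on slide 14 can be sketched in plain Java. This is only an illustration of the idea, not Mahout's RecommenderJob: the class name, the method name, and every matrix value below are hypothetical, and the real job runs as a series of Hadoop steps over sparse data.

```java
import java.util.Arrays;

public class ItemBasedSketch {
    // Multiply an item x item similarity matrix by one user's
    // item preference vector to get one score per item.
    static double[] recommend(double[][] itemSimilarity, double[] userPrefs) {
        int n = itemSimilarity.length;
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                scores[i] += itemSimilarity[i][j] * userPrefs[j];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical similarities. Rows and columns:
        // Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
        double[][] sim = {
            {2, 1, 0, 0},
            {1, 2, 0, 0},
            {0, 0, 2, 1},
            {0, 0, 1, 2},
        };
        // Hypothetical preferences: Turing has only Comp Sci 1.
        double[] turing = {1, 0, 0, 0};
        // Prints [2.0, 1.0, 0.0, 0.0]
        System.out.println(Arrays.toString(recommend(sim, turing)));
    }
}
```

In this toy data the highest score goes to the article Turing already has, so a practical recommender would also drop known items before ranking.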
  • 15. How well does it work?
  • 17. Running on Amazon's Elastic MapReduce: on-demand use and easy to cost
  • 18-23. Mahout's Performance [chart, built up over six slides]: y-axis Normalised Amazon Hours (0 to 7K), x-axis No. Good Recommendations/10 (0 to 3); quadrants Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good. Data point: Orig. item-based at 6.5K hours, 1.5 good recommendations
  • 25. 1. Reduce processing time 2. Improve quality
  • 26. 1. Reduce processing time ➔ Mahout's recommender is already efficient ➔ But your data may have unusual properties ➔ Hadoop may need a helping hand ➔ Let's see what's going on...
  • 27. Task Allocation: 37 hours to complete; 1 reducer allocated, despite having 48 available...
  • 28. Task Allocation. Allocating more reducers on a per-job basis: job.getConfiguration().setInt("mapred.reduce.tasks", numReducers); Allocating more mappers on a per-job basis: job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));
  • 29. Task Allocation: 37 hours to complete → 14 hours, from 1 → 40 reducers
  • 30. Partitioners 14 hours to complete
  • 31. Partitioners 14 hours to complete ~50KB ~500MB
  • 32. InputSampler.Sampler<IntWritable, Text> sampler = new InputSampler.RandomSampler<IntWritable, Text>(...); InputSampler.writePartitionFile(conf, sampler); conf.setPartitionerClass(TotalOrderPartitioner.class); See http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
  • 33. Partitioners: 14 hours to complete → 2 hours; evenly distributed
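
The idea behind sampling keys and partitioning by range can be sketched without Hadoop. The code below is a stand-alone illustration under stated assumptions, not the InputSampler or TotalOrderPartitioner implementation: the class and method names are hypothetical, the keys are synthetic and deliberately skewed, and the equal-width split points are just a naive baseline for comparison.

```java
import java.util.Arrays;
import java.util.Random;

public class RangePartitionSketch {
    // Draw a random sample of the keys, sort it, and take evenly spaced
    // sample values as the (numPartitions - 1) split points.
    static int[] splitPoints(int[] keys, int sampleSize, int numPartitions, long seed) {
        Random rnd = new Random(seed);
        int[] sample = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            sample[i] = keys[rnd.nextInt(keys.length)];
        }
        Arrays.sort(sample);
        int[] splits = new int[numPartitions - 1];
        for (int p = 1; p < numPartitions; p++) {
            splits[p - 1] = sample[p * sampleSize / numPartitions];
        }
        return splits;
    }

    // Index of the key range a key falls into, given sorted split points.
    static int partitionFor(int key, int[] splits) {
        int idx = Arrays.binarySearch(splits, key);
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    // Number of keys that land in each partition.
    static int[] counts(int[] keys, int[] splits) {
        int[] c = new int[splits.length + 1];
        for (int k : keys) {
            c[partitionFor(k, splits)]++;
        }
        return c;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] keys = new int[100_000];
        for (int i = 0; i < keys.length; i++) {
            double u = rnd.nextDouble();
            keys[i] = (int) (u * u * u * 1_000_000); // synthetic keys, skewed toward small values
        }
        int[] equalWidth = {250_000, 500_000, 750_000}; // naive baseline
        int[] sampled = splitPoints(keys, 1_000, 4, 7L);
        System.out.println("equal-width: " + Arrays.toString(counts(keys, equalWidth)));
        System.out.println("sampled:     " + Arrays.toString(counts(keys, sampled)));
    }
}
```

With skewed keys, the equal-width ranges send most records to the first partition, while split points taken from a sample track the actual key distribution, so each partition receives a similar share of the work.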
  • 34-36. Mahout's Performance [chart: Normalised Amazon Hours vs No. Good Recommendations/10]. Orig. item-based: 6.5K hours, 1.5 good recommendations. Cust. item-based: 2.4K hours, 1.5 good recommendations, a change of -4.1K hours (63%)
  • 37. 2. Improve quality ➔ Mahout provides item-based CF ➔ We have many more items than users ➔ Typically, user-based is more appropriate ➔ So let's make one!
  • 38. item.RecommenderJob (recap): 1. Prep. pref. matrix (1-3); 2. Gen. sim. matrix (4-6); 3. Multiply matrices (7-10). All User Preferences (item x user); Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user) for Turing
  • 39. item.RecommenderJob with "item" relabelled "user": the same three steps, with User Similarity (user x user), Researchers by Researchers, in place of Item Similarity (item x item); Recommendations (item x user) for Turing
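
A user-based variant can be sketched the same way. With the figures quoted earlier (1.5M researchers, 50M articles), a user x user similarity matrix has far fewer rows and columns than an item x item one. The code below is a minimal sketch under assumptions, not the talk's implementation or the MAHOUT-1004 patch: names and data are hypothetical, and similarity is taken to be a plain count of shared articles.

```java
import java.util.Arrays;

public class UserBasedSketch {
    // prefs is item x user. sim[u][v] counts the articles users u and v share.
    static double[][] userSimilarity(double[][] prefs) {
        int items = prefs.length, users = prefs[0].length;
        double[][] sim = new double[users][users];
        for (int u = 0; u < users; u++) {
            for (int v = 0; v < users; v++) {
                for (int i = 0; i < items; i++) {
                    sim[u][v] += prefs[i][u] * prefs[i][v];
                }
            }
        }
        return sim;
    }

    // Score each article by the similarity-weighted preferences of the other
    // users, skipping articles the target user already has.
    static double[] recommend(double[][] prefs, double[][] userSim, int user) {
        int items = prefs.length, users = prefs[0].length;
        double[] scores = new double[items];
        for (int i = 0; i < items; i++) {
            if (prefs[i][user] > 0) {
                continue; // already in the user's library
            }
            for (int v = 0; v < users; v++) {
                if (v != user) {
                    scores[i] += prefs[i][v] * userSim[v][user];
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical data. Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
        // Columns: Turing, Babbage, Einstein, Newton.
        double[][] prefs = {
            {1, 1, 0, 0},
            {0, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 1, 0},
        };
        double[][] sim = userSimilarity(prefs);
        // Prints [0.0, 1.0, 0.0, 0.0]: Comp Sci 2 is suggested to Turing via Babbage.
        System.out.println(Arrays.toString(recommend(prefs, sim, 0)));
    }
}
```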
  • 40-45. Mahout's Performance [chart, built up over six slides: Normalised Amazon Hours vs No. Good Recommendations/10]. Orig. item-based: 6.5K hours, 1.5 good recommendations. Cust. item-based: 2.4K hours, 1.5 (-4.1K hours, 63%). Orig. user-based: 1K hours, 2.5 (-1.4K hours, 58%, and +1 good recommendation, 67%, relative to cust. item-based). Cust. user-based: 0.3K hours, 2.5 (-0.7K hours, 70%). Overall, from orig. item-based to cust. user-based: -6.2K hours (95%) and +1 good recommendation (67%)
  • 47. Conclusions ➔ Mahout is doing a great job of powering Mendeley Suggest (large scale data set; good quality recommendations) ➔ Tuning helps (help Hadoop with task allocation if necessary; partition your data appropriately; we save 95% resources) ➔ Use an appropriate algorithm (item- vs user-based, MAHOUT-1004; we increase precision by 66.6%)
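
The percentages quoted in the conclusions and on the performance chart follow from the chart's data points. The sketch below only redoes that arithmetic; the class and method names are hypothetical, and the inputs are the rounded values read off the slides.

```java
import java.util.Locale;

public class SavingsSketch {
    // Relative change from 'before' to 'after', as a percentage of 'before'.
    static double percentChange(double before, double after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        // Normalised Amazon hours, as read from the chart.
        double origItem = 6500, custItem = 2400, origUser = 1000, custUser = 300;
        print("orig. item-based -> cust. item-based", percentChange(origItem, custItem)); // about -63%
        print("cust. item-based -> orig. user-based", percentChange(custItem, origUser)); // about -58%
        print("orig. user-based -> cust. user-based", percentChange(origUser, custUser)); // -70%
        print("orig. item-based -> cust. user-based", percentChange(origItem, custUser)); // about -95%
        // Good recommendations out of 10: 1.5 for item-based, 2.5 for user-based.
        print("good recommendations, item -> user", percentChange(1.5, 2.5)); // about +67%
    }

    static void print(String label, double pct) {
        System.out.println(String.format(Locale.ROOT, "%-40s %+.1f%%", label, pct));
    }
}
```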
  • 48. Mahout's Performance [chart, as on slide 45]: from orig. item-based (6.5K hours, 1.5 good recommendations) to cust. user-based (0.3K hours, 2.5 good recommendations), -6.2K hours (95%) and +1 good recommendation (67%). http://www.mendeley.com/profiles/kris-jack/