The Problem
                Strategies
   Some Funny New Science




        The Netflix Prize:
yet another million dollar problem

                   David Bessis


 École Normale Supérieure, 27/01/2010




              David Bessis   The Netflix Prize: yet another million dollar problem
7 + 1 Million Dollar Problems
   Millennium Prize Problems:
       Funded in 2000 by the Clay Mathematics Institute.
       Seven classical open problems in Mathematics.
       Fuzzy rules. Solutions must
           “be published in a refereed mathematics publication of
           worldwide repute”
           “have general acceptance in the mathematics community two
           years after”
       The Poincaré conjecture was solved by Perelman in 2003.
       No award yet.
   Netflix Prize:
       Funded in 2006 by the DVD rental company Netflix.
       A problem in Applied Mathematics? Computer Science?
       Psychology? (Do we really care?)
       A problem in Some Funny New Science.
       Reasonably clear rules.
       Prize awarded in September 2009.

The Problem
                                           Rules
                              Strategies
                                           Competition
                 Some Funny New Science



Context


      Netflix has an “all-you-can-eat” pricing model.
      They need their users to watch a lot of movies.
      Beyond a few obvious choices, people don’t know what they
      want to watch.
      Collaborative filtering: recommending products based on prior
      evaluations by other users (just like Amazon does).
      The Netflix Prize is a collaborative filtering competition:
          Based on a huge dataset of actual ratings by Netflix users.
          Open to almost everyone.
          Endowed with a $1,000,000 prize.






The Dataset

      The user space U consists of 480 189 users
      (identified by a meaningless non-sequential integral id).
      The movie space M consists of 17 770 movies
      (identified by integers 1, …, 17 770, and the associated list of titles
      and release years is provided – this data is meaningful and minable).
      The date space D spans the period Oct. 1998 – Dec. 2005
      (extremely meaningful data; no time of day is provided).
      The rating space R is {1, 2, 3, 4, 5} (“stars”).
      The training dataset T contains 100 480 507 quadruples
      (u, m, d, r ) ∈ U × M × D × R.
      The qualifying dataset Q contains 2 817 131 triples
      (u, m, d) ∈ U × M × D.
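
The four spaces above can be written down as plain record types. The following is a minimal sketch, not the format Netflix actually shipped: the class names `Rating` and `Query`, the helper `is_valid`, and the exact day bounds (1 October 1998 to 31 December 2005, since the slide only gives months) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

N_USERS = 480_189         # |U|, from the slide
N_MOVIES = 17_770         # |M|, from the slide
N_TRAINING = 100_480_507  # |T|, from the slide

# Assumed day bounds: the slide only says "Oct. 1998 - Dec. 2005".
FIRST_DAY = date(1998, 10, 1)
LAST_DAY = date(2005, 12, 31)


@dataclass(frozen=True)
class Rating:
    """One training quadruple (u, m, d, r)."""
    user: int   # opaque, non-sequential integer id
    movie: int  # 1 .. 17 770
    day: date   # date only, no time of day
    stars: int  # 1 .. 5


@dataclass(frozen=True)
class Query:
    """One qualifying triple (u, m, d); the rating is withheld."""
    user: int
    movie: int
    day: date


def is_valid(r: Rating) -> bool:
    """Check that a quadruple lies in U x M x D x R as described above."""
    return (
        1 <= r.movie <= N_MOVIES
        and FIRST_DAY <= r.day <= LAST_DAY
        and r.stars in {1, 2, 3, 4, 5}
    )


def fill_ratio() -> float:
    """Fraction of all (user, movie) pairs that carry a training rating."""
    return N_TRAINING / (N_USERS * N_MOVIES)
```

With these counts, `fill_ratio()` comes out a little above 1%: the user-by-movie rating matrix is almost entirely empty, which is what makes the prediction problem hard.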




The Challenge

      Open to everyone except Netflix employees and their relatives and
      residents of Cuba, Iran, Syria, North Korea, Myanmar and Sudan.
      Participants can join forces in teams.
      They can upload their predictions up to once a day.
      Predictions are maps from the qualifying set Q to the interval
      [1, 5].
      The metric used to benchmark predictions is the RMSE (“root
      of mean square error”):

      \[ \mathrm{RMSE} = \sqrt{\frac{1}{|Q|} \sum_{q \in Q} \left|\text{predicted rating for } q - \text{actual rating for } q\right|^2} \]
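
As a small illustration of the metric, here is a sketch of the RMSE in Python. The function name `rmse` and the choice of two parallel sequences as input are assumptions for this example; the slides do not describe how Netflix computed the score internally.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root of the mean of the squared prediction errors."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = ((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(sum(squared_errors) / len(predicted))
```

For instance, always predicting 3 stars against actual ratings of 1 and 5 gives an error of 2 on each entry, hence an RMSE of 2.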







Typical RMSEs

      Theoretically, the RMSE need not be greater than 2: always
      predicting 3 stars is never off by more than 2.
      Users tend to view and rate movies they like, so they typically
      give 3, 4 or 5 stars rather than 1 or 2 (the above upper bound
      is unrealistically pessimistic).
      A basic prediction consists of mapping a triple (u, m, d) to the
      average rating obtained by the movie m. It achieves 1.0540.
      At the beginning of the Challenge, Netflix’s in-house
      prediction system Cinematch achieved 0.9514
      (roughly a 10% improvement).
      Netflix set the following target: obtain a further 10%
      improvement over Cinematch.
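
The movie-average baseline and the "roughly 10%" figure can be sketched as follows. The function names `movie_means` and `relative_improvement`, and the representation of training data as `(user, movie, stars)` tuples, are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def movie_means(training: Iterable[Tuple[int, int, int]]) -> Dict[int, float]:
    """Average rating per movie, computed from (user, movie, stars) tuples."""
    totals = defaultdict(lambda: [0, 0])  # movie -> [sum of stars, count]
    for _user, movie, stars in training:
        totals[movie][0] += stars
        totals[movie][1] += 1
    return {movie: s / n for movie, (s, n) in totals.items()}


def relative_improvement(baseline_rmse: float, new_rmse: float) -> float:
    """Fraction by which new_rmse improves on baseline_rmse."""
    return (baseline_rmse - new_rmse) / baseline_rmse
```

The baseline predicts `movie_means(...)[m]` for every triple (u, m, d), ignoring the user and the date. Plugging in the two scores quoted above, `relative_improvement(1.0540, 0.9514)` is about 0.097, which is the "roughly 10%" gain of Cinematch over the movie average.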





Very Smart Rules 1: a Cryptographic Trick


      Netflix has secretly partitioned the qualifying set

                                     Q = Q1 ⊔ Q2

      into two subsets of (approximately) equal sizes.
      The RMSE achieved on Q1 is revealed to participants
      (there is a public leaderboard).
      The RMSE achieved on Q2 is used to determine the winner.
      This prevented participants from “learning from the oracle”.
      The goal was to achieve 0.8572.
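
The mechanism can be sketched in a few lines. This is only an illustration: the slides do not say how Netflix drew the partition, so the seeded random shuffle, the function names `split_qualifying` and `score`, and the `truth` dictionary of withheld ratings are all assumptions.

```python
import math
import random
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def split_qualifying(queries: Iterable[Hashable],
                     seed: int = 0) -> Tuple[List, List]:
    """Split Q into two disjoint halves Q1 (public) and Q2 (secret)."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)  # assumed: a uniform random split
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def score(predict: Callable[[Hashable], float],
          subset: List,
          truth: Dict[Hashable, int]) -> float:
    """RMSE of a prediction map restricted to one half of Q."""
    squared = sum((predict(q) - truth[q]) ** 2 for q in subset)
    return math.sqrt(squared / len(subset))
```

Participants submit one prediction for all of Q without knowing which half each triple belongs to. The organizer publishes `score(predict, q1, truth)` on the leaderboard and keeps `score(predict, q2, truth)` for the final ranking, so tuning a submission against leaderboard feedback only fits Q1.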






Very Smart Rules 2: Crowd Psychology Tricks

      The Challenged opened on October 2, 2006.
      Annual $50.000 prizes were offered to current leaders provided
      they made their current methodology public.
      The Challenge was to last for 30 more days after the goal was
      achieved.
      The winner would be the team with the best RMSE after this
      30 days period




                            David Bessis   The Netflix Prize: yet another million dollar problem
The Problem
                                            Rules
                               Strategies
                                            Competition
                  Some Funny New Science



Very Smart Rules 2: Crowd Psychology Tricks

      The Challenged opened on October 2, 2006.
      Annual $50.000 prizes were offered to current leaders provided
      they made their current methodology public.
      The Challenge was to last for 30 more days after the goal was
      achieved.
      The winner would be the team with the best RMSE after this
      30 days period (no backstabbing arXiv-style “I posted first” effect).




                             David Bessis   The Netflix Prize: yet another million dollar problem
The Problem
                                            Rules
                               Strategies
                                            Competition
                  Some Funny New Science



Very Smart Rules 2: Crowd Psychology Tricks

      The Challenged opened on October 2, 2006.
      Annual $50.000 prizes were offered to current leaders provided
      they made their current methodology public.
      The Challenge was to last for 30 more days after the goal was
      achieved.
      The winner would be the team with the best RMSE after this
      30 days period (no backstabbing arXiv-style “I posted first” effect).
      Every detail was carefully anticipated (even the possibility of a
      tie).




                             David Bessis   The Netflix Prize: yet another million dollar problem
The Problem
                                            Rules
                               Strategies
                                            Competition
                  Some Funny New Science



Very Smart Rules 2: Crowd Psychology Tricks

      The Challenge opened on October 2, 2006.
      Annual $50,000 prizes were offered to the current leaders, provided
      they made their current methodology public.
      The Challenge was to last for 30 more days after the goal was
      achieved.
      The winner would be the team with the best RMSE at the end of this
      30-day period (no backstabbing arXiv-style “I posted first” effect).
      Every detail was carefully anticipated (even the possibility of a
      tie).
      These smart rules, together with the $1,000,000 prize,
      attracted thousands of participants.


Timeline


      October 2006: Cinematch RMSE = 0.9514.
      October 2007: team KorBell leads with 0.8712 (8.43%
      improvement).
      October 2008: team “BellKor in BigChaos” (two teams
      merging efforts) leads with 0.8616 (9.44% improvement).
      June 26, 2009: the goal is achieved.
      July 26, 2009: Netflix stops gathering solutions.
      The winner is announced on September 18, 2009.
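
The improvement percentages quoted for 2007 and 2008 can be checked with a few lines, assuming "improvement" means the relative reduction in RMSE with respect to the Cinematch score of 0.9514 (the function name is illustrative):

```python
def improvement_percent(baseline_rmse, new_rmse):
    """Relative RMSE reduction with respect to a baseline, in percent."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


CINEMATCH_RMSE = 0.9514

for label, score in [("October 2007", 0.8712), ("October 2008", 0.8616)]:
    print(f"{label}: {improvement_percent(CINEMATCH_RMSE, score):.2f}%")
# prints:
# October 2007: 8.43%
# October 2008: 9.44%
```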




The winning team
  Three teams combined their results to win the competition:
      BellKor
           Bob Bell (AT&T)
           Yehuda Koren (Yahoo)
           Chris Volinsky (AT&T)
      BigChaos
           Michael Jahrer (Commendo research and consulting)
           Andreas Töscher (Commendo research and consulting)
      Pragmatic Theory
           Martin Chabbert (Pragmatic Theory)
           Martin Piotte (Pragmatic Theory)
  Their winning submission achieved an RMSE of 0.8567 (a 10.06%
  improvement over Cinematch).
  Another team, The Ensemble, achieved the same RMSE...
  ...and lost because their submission was posted 24 minutes later!
Practical issues
                           The Problem
                                          Regressions
                             Strategies
                                          Latent factors
                Some Funny New Science
                                          Tuning and Blending


Computer implementation


  Memory requirements:
      Movies can be encoded on 2 bytes (17770 < 256^2).
      Viewers can be encoded on 3 bytes (480189 < 256^3).
      Dates can be encoded on 2 bytes.
      A triple (m, v, d) can be encoded on 7 bytes.
      700 MB suffice to store the dataset.
      It is possible (necessary) to work in RAM.
      Commodity hardware is sufficient.
  (I have some Ruby code to interactively play with the dataset.)
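
A possible 7-byte layout for a triple (m, v, d), sketched in Python rather than the Ruby mentioned above. This is not the speaker's code: the field order, the big-endian byte order, and the reference day `EPOCH` are assumptions made for illustration, and dates are stored as a day count from that reference day.

```python
from datetime import date, timedelta

# Hypothetical reference day; any fixed date before the earliest rating works.
EPOCH = date(1998, 1, 1)


def pack_triple(movie, viewer, day):
    """Encode (m, v, d) on 2 + 3 + 2 = 7 bytes."""
    days = (day - EPOCH).days
    # int.to_bytes raises OverflowError if a value does not fit its field.
    return (
        movie.to_bytes(2, "big")
        + viewer.to_bytes(3, "big")
        + days.to_bytes(2, "big")
    )


def unpack_triple(blob):
    """Decode a 7-byte record back into (m, v, d)."""
    movie = int.from_bytes(blob[0:2], "big")
    viewer = int.from_bytes(blob[2:5], "big")
    days = int.from_bytes(blob[5:7], "big")
    return movie, viewer, EPOCH + timedelta(days=days)


record = pack_triple(17770, 480189, date(2005, 12, 31))
```

With records of this size, 100,000,000 triples occupy 700,000,000 bytes, which matches the 700 MB estimate.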


Remarks
     About 200 ratings per user on average.
          This is likely caused by Cinematch’s data gathering procedure:
          users sometimes rate tens of movies on a single day.
          This causes an insanely huge bias within the dataset (movies
          are perceived differently when rated individually or within a
          rating spree), not fully exploited by the winners.
          Netflix, do you read me?
     Some movies were rated by hundreds of thousands of viewers,
     some by just a few (long-tail distribution).
     Similarly, one user rated all the movies, and many rated just a few.
     Let F be the set of the final 9 ratings of each individual user.
          Then F = Q ⊔ P, with P ⊂ T publicly tagged by Netflix.
          Q is a random draw of 2/3 of F.
          Q resembles P but is very dissimilar from T.
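
The construction of F, Q and P described above can be sketched on toy data. The tuple layout `(user, movie, day)`, the function names, and the choice of a single uniform draw over the whole of F (rather than a draw per user) are assumptions for illustration only.

```python
import random
from collections import defaultdict


def final_ratings(ratings, k=9):
    """Keep the last k ratings (by day) of each user.

    `ratings` is an iterable of (user, movie, day) tuples.
    """
    by_user = defaultdict(list)
    for user, movie, day in ratings:
        by_user[user].append((day, movie))
    final = []
    for user, items in by_user.items():
        items.sort()  # chronological order
        final.extend((user, movie, day) for day, movie in items[-k:])
    return final


def split_qualifying_probe(final, seed=0):
    """Split F into Q (about 2/3, kept secret) and P (about 1/3, public)."""
    shuffled = list(final)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Toy data: 3 users, each rating movies 0..11 on days 0..11.
toy = [(u, m, m) for u in range(3) for m in range(12)]
F = final_ratings(toy)
Q, P = split_qualifying_probe(F)
```

Since F only contains each user's most recent ratings, both Q and P are skewed towards late dates, which is one way to see why Q can resemble P while differing from the full training set T.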

Algorithms


   The machine learning toolbox consists of many methods:
       Clustering methods.
       Regressions.
       Latent parameter methods (SVD).
       Neural networks.
       SVM.
       ...





Beginner’s mistakes


       Underestimate the volume effect.
       Think conceptually and discretely rather than globally and
       continuously.
       Put users and movies into categories (clustering introduces
       unwanted discontinuities).
       Learn from the probe.
   Dealing with 100,000,000 data points isn’t a logic puzzle.
   It resembles thermodynamics.



Linear regression
   Suppose all viewers in $X$ have rated all movies in $Y$: the rating
   matrix is
                            $(r_{x,y})_{(x,y) \in X \times Y}$.
   Suppose you want to model the ratings given to a particular movie
   $y_0$ based on the ratings given to the movies in $Y' = Y - \{y_0\}$.
   A linear regression is a natural way to do that.
   Write $(r_{x,y}) = (C_y)_{y \in Y}$, where the $C_y$ are the column vectors.
   Performing the linear regression consists of approximating $C_{y_0}$ by
   its orthogonal projection $\hat{C}_{y_0}$ onto the subspace spanned by the
   $(C_y)_{y \in Y'}$.
       Clearly, this projection exists and is unique.
       It minimizes the RMSE.
       Write the formula!
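
   One way to write the formula, as a sketch: the matrix $A$ and the coefficient vector $\beta$ are notation introduced here, not taken from the slides, and the columns $C_y$ for $y \neq y_0$ are assumed to be linearly independent so that $A^{\mathsf T} A$ is invertible.

```latex
% A is the matrix whose columns are the C_y for all movies y other than y_0.
% The residual C_{y_0} - A\beta must be orthogonal to every column of A:
A^{\mathsf T}\,(C_{y_0} - A\beta) = 0
\quad\Longrightarrow\quad
\beta = (A^{\mathsf T} A)^{-1} A^{\mathsf T} C_{y_0},
\qquad
\hat{C}_{y_0} = A\beta = A\,(A^{\mathsf T} A)^{-1} A^{\mathsf T} C_{y_0}.
```

   Without the independence assumption the projection $\hat{C}_{y_0}$ is still unique, but the coefficients $\beta$ are not.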
Real life problems 1: missing data

       Not all viewers have seen all movies.
       Worse, there are virtually no complete rectangular blocks
       within the dataset.
       Regression by viewers or by movies?
   It is better to do regression by movies.
   Normalize ratings:
       Replace the rating $r_{v,m}$ by the meaningful signal, i.e., the difference
       $\bar{r}_{v,m}$ between $r_{v,m}$ and the average rating for $m$.
       Then it becomes natural to set $\bar{r}_{v,m}$ to 0 when $v$ hasn’t rated $m$.
       Actually, whether or not $v$ has rated $m$ is meaningful information!
       Add normalized bit columns to account for that.
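
   The normalization above can be sketched in a few lines of plain Python. Everything here is illustrative: the toy ratings, the names `centered_column` and `indicator_column`, and the reading of "normalized bit columns" as mean-centered 0/1 indicators are assumptions, not details given in the talk.

```python
# Toy data, invented for illustration: (viewer, movie) -> rating on a 1-5 scale.
ratings = {
    ("v1", "m1"): 5, ("v1", "m2"): 3,
    ("v2", "m1"): 3,
    ("v3", "m2"): 1, ("v3", "m3"): 4,
}
viewers = ["v1", "v2", "v3"]


def centered_column(ratings, viewers, movie):
    """Ratings of `movie` minus its average rating; 0.0 where the viewer gave none."""
    given = [r for (v, m), r in ratings.items() if m == movie]
    mean = sum(given) / len(given)  # assumes the movie has at least one rating
    return [ratings[(v, movie)] - mean if (v, movie) in ratings else 0.0
            for v in viewers]


def indicator_column(ratings, viewers, movie):
    """Mean-centered 'has rated' bits (assumption: 'normalized' means centered)."""
    bits = [1.0 if (v, movie) in ratings else 0.0 for v in viewers]
    mean = sum(bits) / len(bits)
    return [b - mean for b in bits]


col_m1 = centered_column(ratings, viewers, "m1")
bit_m1 = indicator_column(ratings, viewers, "m1")
print(col_m1)  # centered ratings for m1, one entry per viewer
print(bit_m1)  # centered indicator of who rated m1
```

   Both columns have zero mean over the viewers concerned, so a missing rating and a missing bit each contribute "no signal" rather than a spurious low value.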

Real life problems 2: the curse of dimensionality

   We all know that Lagrange interpolation should not be used on
   noisy data. Rather, one should look for the best-fitting polynomial of a
   given low degree.
   Similarly, the curse of dimensionality asserts that:
       With high-dimensional datasets, one will always find stupid
       predictors that make perfect predictions on the dataset but
       fail to generalize.
   By looking at my audience today, what should I be able to infer?
       That having long hair is a reasonably good gender predictor?
       That wearing a grey sweater is a reasonably good gender
       predictor?
   Dilemma: overlearning (overfitting) vs. underlearning (underfitting).
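
   The Lagrange remark can be checked on a toy example. The sketch below is an illustration under stated assumptions: the data (the line y = x plus a fixed alternating error of 0.3) and the helper names `lagrange_eval` and `fit_line` are made up for this purpose.

```python
def lagrange_eval(xs, ys, t):
    """Value at t of the unique polynomial of degree < len(xs) through all (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (t - xj) / (xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Least-squares line: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Noisy observations of the true signal y = x (the noise is a fixed +/-0.3).
xs = list(range(8))
ys = [x + (0.3 if x % 2 == 0 else -0.3) for x in xs]

held_out, truth = 0.5, 0.5           # a point not in the dataset
slope, intercept = fit_line(xs, ys)

err_lagrange = abs(lagrange_eval(xs, ys, held_out) - truth)
err_line = abs(slope * held_out + intercept - truth)
print(err_lagrange, err_line)
```

   The interpolating polynomial is perfect on the eight observed points and clearly worse than the straight line on the held-out one, which is the overlearning side of the dilemma.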

Ridge regression (aka Tikhonov regularization)

   Linear regression: given vectors $x, y_1, \dots, y_n \in \mathbb{R}^m$, find $\lambda_1, \dots, \lambda_n$
   that minimize
                             $\| x - \sum_{i=1}^{n} \lambda_i y_i \|^2$.
   When $n$ is large (with respect to $m$), the linear system is
   underdetermined. Overfitting occurs.
   A telltale sign of overfitting is the presence of $\lambda_i$'s with huge absolute values
   compensating each other.
   Ridge regression (Tikhonov regularization): find $\lambda_1, \dots, \lambda_n$ that
   minimize
                         $\| x - \sum_{i=1}^{n} \lambda_i y_i \|^2 + \varepsilon \sum_{i=1}^{n} |\lambda_i|^2$,
   where $\varepsilon$ is a well-adjusted (small) penalty coefficient.
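
   A minimal sketch of the effect of the penalty, in plain Python. It solves the regularized normal equations $(G + \varepsilon I)\lambda = b$ with $G_{ij} = \langle y_i, y_j \rangle$ and $b_i = \langle y_i, x \rangle$. The toy vectors, the value of the penalty and the helper names are assumptions chosen for illustration; to keep the example tiny, the huge coefficients come from two nearly collinear vectors rather than from a large $n$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def ridge(x, ys, eps):
    """Coefficients minimizing ||x - sum l_i y_i||^2 + eps * sum l_i^2."""
    n = len(ys)
    G = [[dot(ys[i], ys[j]) + (eps if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    b = [dot(y, x) for y in ys]
    return solve(G, b)


# Two nearly collinear regressors (toy data).
y1 = [1.0, 0.0, 0.0]
y2 = [1.0, 0.001, 0.0]
x = [1.0, 1.0, 0.0]

lam_plain = ridge(x, [y1, y2], eps=0.0)   # ordinary least squares
lam_ridge = ridge(x, [y1, y2], eps=0.01)  # with a small penalty
print(lam_plain)  # roughly -999 and 1000: huge, compensating each other
print(lam_ridge)  # both coefficients stay below 1 in absolute value
```

   The unpenalized fit is exact on this data but relies on two enormous coefficients cancelling out; the penalized fit gives up a little accuracy for coefficients of ordinary size.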

Assigning attributes to movies

   Assume that movies differ by their amount of certain qualities:
       Violence.
       Sex.
       Anything else? Maybe not.




                              David Bessis   The Netflix Prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem
The Netflix prize: yet another million dollar problem

  • 1. The Netflix Prize: yet another million dollar problem. David Bessis, Ecole Normale Supérieure, 27/01/2010. Outline: The Problem; Strategies; Some Funny New Science.
  • 2–20. 7 + 1 Million Dollar Problems
      Millennium Prize Problems:
          Funded in 2000 by the Clay Mathematics Institute.
          Seven classical open problems in Mathematics.
          Fuzzy rules: solutions must “be published in a refereed mathematics publication of worldwide repute” and “have general acceptance in the mathematics community two years after”.
          The Poincaré conjecture was solved by Perelman in 2003.
          No award yet.
      Netflix Prize:
          Funded in 2006 by the DVD rental company Netflix.
          A problem in Applied Mathematics, Computer Science, Psychology (do we really care?), that is, in Some Funny New Science.
          Reasonably clear rules.
          Prize awarded in September 2009.
  • 21–28. Context
      Netflix has an “all-you-can-eat” pricing model.
      They need their users to watch a lot of movies.
      Beyond a few obvious choices, people don’t know what they want to watch.
      Collaborative filtering: recommending products based on prior evaluations by other users (just like Amazon does).
      The Netflix Prize is a collaborative filtering competition:
          Based on a huge dataset of actual ratings by Netflix users.
          Open to almost everyone.
          Endowed with a $1,000,000 prize.
  • 29–34. The Dataset
      The user space U consists of 480 189 users (identified by a meaningless non-sequential integer id).
      The movie space M consists of 17 770 movies (identified by integers 1, ..., 17 770; the associated list of titles and release years is provided, and this data is meaningful and minable).
      The date space D spans the period Oct. 1998 – Dec. 2005 (extremely meaningful data; no time of day is provided).
      The rating space R is {1, 2, 3, 4, 5} (“stars”).
      The training dataset T contains 100 480 507 quadruples (u, m, d, r) ∈ U × M × D × R.
      The qualifying dataset Q contains 2 817 131 triples (u, m, d) ∈ U × M × D.
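As a rough illustration of the dataset's shape, the triples and quadruples described on this slide could be modeled in Python as follows. The class names `Query` and `Rating`, the field names, and the validation rules are assumptions made for illustration; this is not the file format Netflix distributed.

```python
from dataclasses import dataclass
from datetime import date

# Sizes quoted on the slide.
NUM_USERS = 480_189
NUM_MOVIES = 17_770
NUM_TRAINING = 100_480_507
NUM_QUALIFYING = 2_817_131


@dataclass(frozen=True)
class Query:
    """A triple (u, m, d) as found in the qualifying set Q (hypothetical layout)."""
    user: int    # non-sequential integer id
    movie: int   # integer in 1..17770
    day: date    # calendar date only, no time of day


@dataclass(frozen=True)
class Rating(Query):
    """A quadruple (u, m, d, r) as found in the training set T."""
    stars: int   # element of the rating space {1, 2, 3, 4, 5}

    def __post_init__(self):
        if not 1 <= self.movie <= NUM_MOVIES:
            raise ValueError("movie id out of range")
        if self.stars not in (1, 2, 3, 4, 5):
            raise ValueError("rating must be 1..5 stars")


# Fraction of the user x movie grid that carries a training rating.
density = NUM_TRAINING / (NUM_USERS * NUM_MOVIES)
```

With the sizes above, `density` comes out a little under 1.2%: the training set fills only a small fraction of the user × movie grid, and a prediction amounts to guessing one of the missing entries.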
  • 41. The Challenge
      Open to everyone except Netflix employees and their relatives, and residents of Cuba, Iran, Syria, North Korea, Myanmar and Sudan.
      Participants can join efforts in teams.
      They can upload their predictions up to once a day.
      Predictions are maps from the qualifying set Q to the interval [1, 5].
      The metric used to benchmark predictions is the RMSE ("root mean square error"):
      $$\mathrm{RMSE} = \sqrt{\frac{1}{|Q|} \sum_{q \in Q} \left|\text{predicted rating for } q - \text{actual rating for } q\right|^2}$$
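The RMSE metric defined on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `rmse` and the toy numbers are assumptions made for the example, not Netflix data or the organisers' scoring code.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(predicted))


# Toy example with made-up numbers (not Netflix data).
toy_predictions = [3.5, 4.0, 2.0, 5.0]
toy_actuals = [4, 4, 1, 5]
toy_score = rmse(toy_predictions, toy_actuals)
```

Because errors are squared before averaging, a single prediction that is off by 2 stars costs as much as four predictions that are each off by 1 star.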
  • 47. Typical RMSEs
      Theoretically, the RMSE need never exceed 2: predicting 3 stars everywhere is never off by more than 2.
      Users tend to view and rate movies they like, so they typically give 3, 4 or 5 stars rather than 1 or 2 (the above upper bound is unrealistically pessimistic).
      A basic prediction consists of mapping a triple (u, m, d) to the average rating obtained by the movie m. It achieves 1.0540.
      At the beginning of the Challenge, Netflix's in-house prediction system Cinematch achieved 0.9514 (roughly a 10% improvement).
      Netflix set the following target: obtain a further 10% improvement over Cinematch.
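The "basic prediction" described on this slide (predict each movie's average rating) might be sketched as follows. The helper names, the toy data and the fallback to the global mean for a movie absent from the training data are all assumptions made for illustration.

```python
from collections import defaultdict


def movie_means(training):
    """Map each movie id to the mean of its ratings.

    `training` is an iterable of (user, movie, date, rating) quadruples.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _user, movie, _date, rating in training:
        sums[movie] += rating
        counts[movie] += 1
    return {movie: sums[movie] / counts[movie] for movie in sums}


def predict(means, fallback, movie):
    """Predict the movie's average rating; use `fallback` for unseen movies
    (the fallback is an assumption of this sketch)."""
    return means.get(movie, fallback)


# Toy training data (made up): two movies, four users.
toy_training = [
    (1, 10, "2005-01-01", 4),
    (2, 10, "2005-01-02", 5),
    (1, 20, "2005-01-03", 2),
    (3, 20, "2005-01-03", 3),
    (4, 20, "2005-01-04", 1),
]
means = movie_means(toy_training)
global_mean = sum(r for _u, _m, _d, r in toy_training) / len(toy_training)
```

Such a predictor ignores the user and the date entirely, which is why it serves as a floor against which Cinematch and the contest entries are compared.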
  • 52. Very Smart Rules 1: a Cryptographic Trick
      Netflix has secretly partitioned the qualifying set Q = Q1 ⊔ Q2 into two subsets of (approximately) equal sizes.
      The RMSE achieved on Q1 is revealed to participants (there is a public leaderboard).
      The RMSE achieved on Q2 is used to determine the winner.
      This prevented participants from "learning from the oracle".
      The goal was to achieve 0.8572.
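The hidden split can be illustrated with a small sketch. It assumes a seeded shuffle as the secret partition and uses made-up data; how Netflix actually drew Q1 and Q2 is not stated on the slide, and the function names are hypothetical.

```python
import math
import random


def split_qualifying(qualifying, seed):
    """Secretly partition the qualifying items into two halves (Q1, Q2)."""
    rng = random.Random(seed)  # the seed plays the role of the organiser's secret
    shuffled = list(qualifying)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])


def rmse_on(subset, predictions, truth):
    """RMSE restricted to the items in `subset`."""
    total = sum((predictions[q] - truth[q]) ** 2 for q in subset)
    return math.sqrt(total / len(subset))


# Toy qualifying set of 10 items with made-up true ratings.
toy_truth = {q: 1 + q % 5 for q in range(10)}
toy_predictions = {q: 3.0 for q in range(10)}

quiz, test = split_qualifying(toy_truth.keys(), seed=2006)
public_score = rmse_on(quiz, toy_predictions, toy_truth)   # shown on the leaderboard
private_score = rmse_on(test, toy_predictions, toy_truth)  # decides the winner
```

A participant only ever sees `public_score`, so repeatedly tuning a submission against the leaderboard fits Q1 and gives no direct information about Q2.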
  • 60. Very Smart Rules 2: Crowd Psychology Tricks
      The Challenge opened on October 2, 2006.
      Annual $50,000 prizes were offered to current leaders provided they made their current methodology public.
      The Challenge was to last for 30 more days after the goal was achieved.
      The winner would be the team with the best RMSE after this 30-day period (no backstabbing arXiv-style "I posted first" effect).
      Every detail was carefully anticipated (even the possibility of a tie).
      These smart rules, together with the $1,000,000 prize, attracted thousands of participants.
  • 66. Timeline
      October 2006: Cinematch RMSE = 0.9514.
      October 2007: team KorBell leads with 0.8712 (8.43% improvement).
      October 2008: team "BellKor in BigChaos" (two teams merging efforts) leads with 0.8616 (9.44% improvement).
      June 26, 2009: the goal is achieved.
      July 26, 2009: Netflix stops gathering solutions.
      The winner is announced on September 18, 2009.
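The improvement percentages quoted in the timeline are relative reductions of the RMSE with respect to Cinematch's 0.9514. A short sketch of that arithmetic (the function name is an assumption):

```python
def improvement_over(baseline_rmse, rmse):
    """Relative RMSE reduction with respect to a baseline, in percent."""
    return 100.0 * (baseline_rmse - rmse) / baseline_rmse


CINEMATCH = 0.9514  # Cinematch RMSE quoted for October 2006

korbell_2007 = round(improvement_over(CINEMATCH, 0.8712), 2)
bellkor_in_bigchaos_2008 = round(improvement_over(CINEMATCH, 0.8616), 2)
```

With these inputs the two annual leaders come out at 8.43% and 9.44%, matching the figures on the slide.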
  • 70. The winning team
      Three teams combined their results to win the competition:
      BellKor: Bob Bell (AT&T), Yehuda Koren (Yahoo), Chris Volinsky (AT&T)
      BigChaos: Michael Jahrer (Commendo research and consulting), Andreas Töscher (Commendo research and consulting)
      Pragmatic Theory: Martin Chabbert (Pragmatic Theory), Martin Piotte (Pragmatic Theory)
      Their winning submission achieved an RMSE of 0.8567 (10.06% improvement over Cinematch).
      Another team, The Ensemble, achieved the same RMSE...
      ...and lost because their submission was posted 24 minutes later!
  • 78. Computer implementation
      Memory requirements:
      Movies can be encoded on 2 bytes (17 770 < 256^2).
      Viewers can be encoded on 3 bytes (480 189 < 256^3).
      Dates can be encoded on 2 bytes.
      A triple (m, v, d) can be encoded on 7 bytes.
      700 MB suffice to store the dataset.
      It is possible (necessary) to work in RAM.
      Commodity hardware is sufficient.
      (I have some Ruby code to interactively play with the dataset.)
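The 7-byte encoding can be sketched in Python as below. Several details are assumptions of this sketch rather than anything stated in the talk: user ids are first re-indexed to dense integers 0..480 188, dates are stored as days since 1 October 1998, and the rating is tucked into the spare high bits of the 3-byte viewer field (480 189 < 2^19, so 5 bits are free).

```python
from datetime import date

EPOCH = date(1998, 10, 1)   # assumption: day 0 is the start of the date range
RECORD_SIZE = 7             # 2 bytes movie + 3 bytes viewer (+ rating) + 2 bytes date
VIEWER_BITS = 19            # 480 189 < 2**19


def pack_record(movie, viewer_index, day, rating):
    """Pack one rating into 7 bytes.

    `viewer_index` is a dense re-indexing of the user ids (hypothetical),
    `day` is a datetime.date, `rating` is an integer from 1 to 5.
    """
    days = (day - EPOCH).days
    if not (0 < movie < 1 << 16 and 0 <= viewer_index < 1 << VIEWER_BITS
            and 0 <= days < 1 << 16 and 1 <= rating <= 5):
        raise ValueError("field out of range")
    viewer_field = (rating << VIEWER_BITS) | viewer_index  # fits in 24 bits
    return (movie.to_bytes(2, "big")
            + viewer_field.to_bytes(3, "big")
            + days.to_bytes(2, "big"))


def unpack_record(blob):
    """Inverse of pack_record: returns (movie, viewer_index, day, rating)."""
    movie = int.from_bytes(blob[0:2], "big")
    viewer_field = int.from_bytes(blob[2:5], "big")
    days = int.from_bytes(blob[5:7], "big")
    viewer_index = viewer_field & ((1 << VIEWER_BITS) - 1)
    rating = viewer_field >> VIEWER_BITS
    return movie, viewer_index, date.fromordinal(EPOCH.toordinal() + days), rating


toy_record = pack_record(17770, 480188, date(2005, 12, 31), 5)
dataset_bytes = 100_480_507 * RECORD_SIZE
```

At 7 bytes per rating, the 100 480 507 training quadruples take 703 363 549 bytes, in line with the 700 MB quoted on the slide.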
  • 89. Remarks
      About 200 ratings per user.
      This is likely caused by Cinematch's data gathering procedure: users sometimes rate tens of movies on a single day.
      This causes an insanely huge bias within the dataset (movies are perceived differently when rated individually or within a rating spree), not fully exploited by the winners. Netflix, do you read me?
      Some movies were rated by hundreds of thousands of viewers, some by just a few (long-tail distribution).
      Similarly, one user rated all the movies, while many rated just a few.
      Let F be the set of all final 9 ratings for all individual users.
      Then F = Q ⊔ P, with P ⊂ T publicly tagged by Netflix.
      Q is a random draw of 2/3 of F.
      Q resembles P but is very dissimilar from T.
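The construction of F, Q and P described on this slide might look roughly like the sketch below. It is a guess at the mechanics under stated assumptions: dates are plain day numbers, ties are broken arbitrarily, users with fewer than 9 ratings contribute all of them, and the 2/3 draw is a seeded shuffle. None of these details come from the talk.

```python
import random
from collections import defaultdict


def final_ratings(ratings, k=9):
    """Return F: the last k ratings (by date) of every user.

    `ratings` is an iterable of (user, movie, day, rating) quadruples.
    """
    by_user = defaultdict(list)
    for user, movie, day, rating in ratings:
        by_user[user].append((day, movie, rating))
    final = []
    for user, items in by_user.items():
        items.sort()  # chronological order
        final.extend((user, movie, day, rating) for day, movie, rating in items[-k:])
    return final


def split_final(final, seed):
    """Draw 2/3 of F at random as Q; the remaining third is P."""
    rng = random.Random(seed)
    shuffled = list(final)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Toy data: user 1 has 12 ratings (days 1..12), user 2 has 3 ratings.
toy_ratings = ([(1, 100 + d, d, 3) for d in range(1, 13)]
               + [(2, 200 + d, d, 4) for d in (5, 6, 7)])
toy_final = final_ratings(toy_ratings)
toy_q, toy_p = split_final(toy_final, seed=1)
```

Since F only contains each user's most recent ratings, every user weighs about the same in Q and P regardless of how many movies they rated, whereas T is dominated by heavy raters; this is one way to read the remark that Q resembles P but not T.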
  • 95. Algorithms
      The machine learning toolbox consists of many methods:
      Clustering methods.
      Regressions.
      Latent parameters methods (SVD).
      Neural networks.
      SVM
      ...
  • 101. Beginner's mistakes
      Underestimate the volume effect.
      Think conceptually and discretely rather than globally and continuously.
      Put users and movies into categories (clustering introduces unwanted discontinuities).
      Learn from the probe.
      Dealing with 100 000 000 data points isn't a logic puzzle. It resembles thermodynamics.
  • 109. Linear regression. Suppose all viewers in $X$ have rated all movies in $Y$: the rating matrix is $(r_{x,y})_{(x,y) \in X \times Y}$. Suppose you want to model the ratings given to a particular movie $y_0$ based on the ratings given to the movies in $Y' = Y - \{y_0\}$. A linear regression is a natural way to do that. Write $(r_{x,y}) = (C_y)_{y \in Y}$, where the $C_y$ are the column vectors. Performing the linear regression consists of approximating $C_{y_0}$ by its orthogonal projection $\hat{C}_{y_0}$ onto the subspace spanned by the $(C_y)_{y \in Y'}$. Clearly, there exists a unique solution. It optimizes RMSE. Write the formula!
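One way to write the formula asked for, under an assumption the slide does not state: the columns $C_y$, $y \in Y'$, are linearly independent. Here $A$ is notation introduced for this sketch (the matrix whose columns are those $C_y$) and $\hat{\beta}$ is the vector of regression coefficients. Minimizing the squared norm is equivalent to minimizing RMSE, since RMSE is the square root of that quantity divided by $|X|$.

```latex
\hat{\beta} = (A^{\top}A)^{-1}A^{\top}C_{y_0}
            = \arg\min_{\beta}\ \lVert C_{y_0} - A\beta \rVert^{2},
\qquad
\hat{C}_{y_0} = A\hat{\beta} = A\,(A^{\top}A)^{-1}A^{\top}\,C_{y_0}.
```

Without the independence assumption the projection $\hat{C}_{y_0}$ is still unique, but the coefficients $\hat{\beta}$ are not.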
  • 118. Real life problems 1: missing data. Not all viewers have seen all movies. Worse, there are virtually no complete rectangular blocks within the dataset. Regression by viewers or by movies? It is better to do regression by movies. Normalize ratings: replace the rating $r_{v,m}$ by the meaningful signal, i.e., the difference $\tilde{r}_{v,m}$ between $r_{v,m}$ and the average rating for $m$. Then it becomes natural to set $\tilde{r}_{v,m}$ to 0 when $v$ hasn't rated $m$. Actually, whether or not $v$ has rated $m$ is meaningful information! Add normalized bit columns to account for that.
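A minimal sketch of the normalization step, in plain Python. The function name `normalize`, the dictionary layout and the toy data are all hypothetical. Reading "normalized bit columns" as "indicator columns with their column mean subtracted" is an assumption; the slide does not say which normalization is meant.

```python
def normalize(ratings, viewers, movies):
    """ratings maps (viewer, movie) -> rating for the pairs that exist.

    Returns two dicts keyed by (viewer, movie):
      centered: rating minus the movie's average, or 0.0 when unrated
      rated:    1/0 indicator of "has rated", minus the movie's rated fraction
    """
    centered, rated = {}, {}
    for m in movies:
        seen = [ratings[(v, m)] for v in viewers if (v, m) in ratings]
        mean = sum(seen) / len(seen) if seen else 0.0
        frac = len(seen) / len(viewers)
        for v in viewers:
            has = (v, m) in ratings
            # Unrated entries sit at 0.0, i.e. exactly at the movie's average.
            centered[(v, m)] = ratings[(v, m)] - mean if has else 0.0
            # Centered indicator bit (assumed meaning of "normalized").
            rated[(v, m)] = (1.0 if has else 0.0) - frac
    return centered, rated


# Toy data: three viewers, two movies, three known ratings.
viewers = ["a", "b", "c"]
movies = ["m1", "m2"]
ratings = {("a", "m1"): 5, ("b", "m1"): 3, ("a", "m2"): 4}
centered, rated = normalize(ratings, viewers, movies)
```

Both families of columns can then be fed to the regression side by side, so that the fact of having rated a movie contributes to the prediction separately from the rating itself.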
  • 124. Real life problems 2: the curse of dimensionality. We all know that Lagrange interpolators are not to be used on noisy data. Rather, one should look at best-fitting polynomials of a given low degree. Similarly, the curse of dimensionality asserts that: with high-dimensional datasets, one will always find stupid predictors, making perfect predictions on the dataset and failing to generalize. By looking at my audience today, what should I be able to infer? That having long hair is a reasonably good gender predictor? That wearing a grey sweater is a reasonably good gender predictor? Dilemma: overlearning vs underlearning.
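The Lagrange remark can be illustrated with a small sketch in plain Python. The data are invented for the illustration: the true signal is $y = x$ sampled at $x = 0, \dots, 10$, with a deterministic alternating perturbation of $\pm 0.1$ standing in for noise. The helper names are hypothetical.

```python
def lagrange_eval(xs, ys, t):
    """Evaluate at t the Lagrange interpolating polynomial through (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (t - xj) / (xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept (degree-1 best fit)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Invented data: true signal y = x, plus an alternating +/-0.1 perturbation.
xs = [float(i) for i in range(11)]
ys = [x + 0.1 * (-1) ** i for i, x in enumerate(xs)]

t = 0.5                                   # a point between two samples
slope, intercept = fit_line(xs, ys)
interp_error = abs(lagrange_eval(xs, ys, t) - t)
line_error = abs(slope * t + intercept - t)
train_error = max(abs(lagrange_eval(xs, ys, x) - y) for x, y in zip(xs, ys))
```

The interpolant reproduces every sample exactly (`train_error` is zero), yet at `t = 0.5` it misses the true signal by far more than the size of the perturbation, while the fitted line stays close. This is the overlearning side of the dilemma.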
  • 128. Ridge regression (aka Tikhonov regularization). Linear regression: given vectors $x, y_1, \dots, y_n \in \mathbb{R}^m$, find $\lambda_1, \dots, \lambda_n$ that minimize $\| x - \sum_i \lambda_i y_i \|^2$. When $n$ is large (with respect to $m$), the linear system is underdetermined. Overfitting occurs. A telltale sign of overfitting is the presence of $\lambda_i$'s with huge absolute values compensating each other. Ridge regression (Tikhonov regularization): find $\lambda_1, \dots, \lambda_n$ that minimize $\| x - \sum_i \lambda_i y_i \|^2 + \varepsilon \sum_i |\lambda_i|^2$, where $\varepsilon$ is a well-adjusted (small) penalty parameter.
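A minimal sketch of the ridge step in plain Python, solving the regularized normal equations $(G + \varepsilon I)\lambda = b$ with $G_{ij} = \langle y_i, y_j \rangle$ and $b_i = \langle y_i, x \rangle$. The function names, the toy vectors and the value $\varepsilon = 0.1$ are assumptions made for the illustration; the slides do not give a tuned value. The two predictors are chosen nearly collinear so that the unregularized fit shows the compensating coefficients described above.

```python
def solve(a, b):
    """Solve the square system a @ sol = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        s = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - s) / m[r][r]
    return sol


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def ridge(x, ys, eps):
    """Coefficients minimizing ||x - sum_i l_i y_i||^2 + eps * sum_i l_i^2."""
    n = len(ys)
    gram = [[dot(ys[i], ys[j]) + (eps if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    rhs = [dot(y, x) for y in ys]
    return solve(gram, rhs)


# Toy example: y1 and y2 are almost collinear.
y1 = [1.0, 1.0, 1.00]
y2 = [1.0, 1.0, 1.01]
x = [1.0, 1.0, 2.00]

plain = ridge(x, [y1, y2], eps=0.0)    # ordinary least squares
damped = ridge(x, [y1, y2], eps=0.1)   # ridge with an assumed penalty
```

With `eps=0.0` the fit is exact but the coefficients are about -99 and 100, cancelling each other almost entirely. With the penalty switched on, both coefficients stay below 1 in absolute value, at the cost of a nonzero residual.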
  • 132. Assigning attributes to movies. Assume that movies differ by their amount of certain qualities: Violence. Sex. Anything else?
  • 133. Assigning attributes to movies. Assume that movies differ by their amount of certain qualities: Violence. Sex. Maybe not.