
Machine Learning for Q&A Sites: The Quora Example

Talk I gave at the Question Answering Workshop at WWW2016 Conference in Montreal, Canada

Published in: Technology

  1. 1. Machine Learning for Q&A Sites: The Quora Example Xavier Amatriain (@xamat) 04/11/2016
  2. 2. “To share and grow the world’s knowledge” • Millions of questions & answers • Millions of users • Thousands of topics • ...
  3. 3. Demand, Quality, Relevance
  4. 4. Data
  5. 5. Machine Learning Applications for Q&A Sites
  6. 6. Answer Ranking
  7. 7. Goal • Given a question and n answers, come up with the ideal ranking of those n answers
  8. 8. What is a good Quora answer? • truthful • reusable • provides explanation • well formatted • ...
  9. 9. How are those dimensions translated into features? • Features that relate to the text quality itself • Interaction features (upvotes/downvotes, clicks, comments…) • User features (e.g. expertise in topic)
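The three feature families above can be combined into a single ranking score. The following is a minimal sketch only: the `Answer` fields, the smoothing, and the hand-picked weights are all hypothetical and are not taken from the talk, which does not describe the actual model.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    answer_id: str
    text_quality: float      # hypothetical text-quality score in [0, 1]
    upvotes: int             # interaction feature
    downvotes: int           # interaction feature
    author_expertise: float  # hypothetical author expertise in the topic, in [0, 1]


def score(answer: Answer) -> float:
    """Combine text, interaction, and user features into one score.

    The weights below are illustrative assumptions; a real system would
    learn them from labelled data.
    """
    total_votes = answer.upvotes + answer.downvotes
    # Smoothed upvote ratio so answers with few votes are not over-rewarded.
    vote_ratio = (answer.upvotes + 1) / (total_votes + 2)
    return (0.4 * answer.text_quality
            + 0.4 * vote_ratio
            + 0.2 * answer.author_expertise)


def rank_answers(answers: list[Answer]) -> list[Answer]:
    """Return the answers ordered from highest to lowest score."""
    return sorted(answers, key=score, reverse=True)
```

In practice the score function would be a trained model; the point of the sketch is only that ranking reduces to sorting the n answers by a feature-based score.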
  10. 10. Feed Ranking
  11. 11. • Goal: Present most interesting stories for a user at a given time • Interesting = topical relevance + social relevance + timeliness • Stories = questions + answers • ML: Personalized learning-to-rank approach • Relevance-ordered vs time-ordered = big gains in engagement • Challenges: • potentially many candidate stories • real-time ranking • optimize for relevance
  12. 12. Feed dataset: impression logs with user actions (click, upvote, downvote, expand, share, answer, pass, follow)
  13. 13. What is relevance? ● Value of showing a story to a user, e.g. a weighted sum of actions: $v = \sum_a v_a \, \mathbb{1}\{y_a = 1\}$ ● Goal: predict this value for new stories. Two possible approaches: ○ Predict the value directly: $v_{pred} = f(x)$ ■ pros: single regression model ■ cons: can be ambiguous, coupled ○ Predict probabilities for each action, then compute the expected value: $v_{pred} = E[V \mid x] = \sum_a v_a \, p(a \mid x)$ ■ pros: better use of supervised signal, decouples action models from action values ■ cons: more costly, one classifier per action
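The two value formulas on this slide can be written out directly. This is a sketch under stated assumptions: the action names and the numeric action values in `ACTION_VALUES` are made up for illustration, and the per-action probabilities would come from one classifier per action, which is not shown here.

```python
# Hypothetical action values v_a; the talk does not give actual numbers.
ACTION_VALUES = {
    "click": 1.0,
    "upvote": 5.0,
    "share": 10.0,
    "downvote": -5.0,
}


def observed_value(actions_taken: set[str]) -> float:
    """v = sum_a v_a * 1{y_a = 1}: value of an impression given the actions observed."""
    return sum(ACTION_VALUES[a] for a in actions_taken)


def expected_value(action_probs: dict[str, float]) -> float:
    """v_pred = sum_a v_a * p(a | x): expected value from per-action probabilities."""
    return sum(ACTION_VALUES[a] * p for a, p in action_probs.items())
```

Because the action values live in a separate table, they can be re-tuned without retraining the per-action models, which is the decoupling the slide lists as an advantage of the second approach.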
  14. 14. Feature engineering ● Essential for getting good rankings ● Better if updated in real-time (more reactive) ● Main sets of features: ○ user (e.g. age, country, recent activity) ○ story (e.g. popularity, trendiness, quality) ○ interactions between the two (e.g. topic or author affinity)
  15. 15. Models ● Linear ○ simple, fast to train ○ manual, non-linear transforms for richer representation (buckets, ngrams) ● Decision trees ○ learn non-linear representations ● Tree ensembles ○ Random forests ○ Gradient boosted decision trees ● In-house C++ training code, third-party libraries for prototyping new models
  16. 16. Ask2Answer
  17. 17. A2A ● Given a question and a viewer, rank all other users based on how “well-suited” they are. ○ “Well-suited” = likelihood of the viewer sending a request + likelihood of the candidate adding a good answer. ● A2A = extension of CTR prediction ○ We care not only about the viewer’s probability of sending a request, but also about the recipient’s probability of writing a good answer
  18. 18. A2A ● Example labels: ○ Binary label: 0 if no request was sent or no answer was added, and 1 if a request was sent and yielded an answer with a goodness score above some threshold. ○ Continuous label: $w_1 \cdot \text{had\_request} + w_2 \cdot \text{had\_answer} + w_3 \cdot \text{answer\_goodness} + \cdots$
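The two example labels can be sketched as small functions. The threshold and the weights `w1`, `w2`, `w3` below are placeholder values chosen for illustration; the talk gives no numbers. The slide does not say how to label a request that yields an answer below the threshold, so treating that case as 0 is an assumption.

```python
def binary_label(had_request: bool, had_answer: bool,
                 answer_goodness: float, threshold: float = 0.5) -> int:
    """1 only if a request was sent and it yielded an answer above the threshold.

    Assumption: an answer at or below the threshold is labelled 0.
    """
    return int(had_request and had_answer and answer_goodness > threshold)


def continuous_label(had_request: bool, had_answer: bool, answer_goodness: float,
                     w1: float = 0.2, w2: float = 0.3, w3: float = 0.5) -> float:
    """w1 * had_request + w2 * had_answer + w3 * answer_goodness.

    The weights are hypothetical; further terms from the slide's trailing
    ellipsis are not modelled.
    """
    return w1 * float(had_request) + w2 * float(had_answer) + w3 * answer_goodness
```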
  19. 19. A2A ● Features ○ Based on what the viewer or candidate has done in the past. ○ Historical features that encapsulate the relationship of the viewer to the candidate. ○ In addition to historical features, other features can be devised (e.g. a binary feature saying whether the viewer follows the candidate) ● Many more features are possible. Feature engineering is a crucial component of any ML system.
  20. 20. Topics & Users Recommendations
  21. 21. Goal: Recommend new topics for the user to follow ● Based on ○ Other topics followed ○ Users followed ○ User interactions ○ Topic-related features ○ ...
  22. 22. Goal: Recommend new users to follow ● Based on: ○ Other users followed ○ Topics followed ○ User interactions ○ User-related features ○ ...
  23. 23. Related Questions
  24. 24. ● Given interest in question A (the source), what other questions will be interesting? ● Not only about similarity, but also “interestingness” ● Features such as: ○ Textual ○ Co-visit ○ Topics ○ … ● Important for logged-out use case
  25. 25. Duplicate Questions
  26. 26. ● Important issue for Q&A sites ○ Want to make sure we don’t disperse knowledge across multiple copies of the same question ● Solution: binary classifier trained with labelled data ● Features ○ Textual vector space models ○ Usage-based features ○ ...
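As an illustration of a textual vector-space feature, the sketch below computes the cosine similarity between the term-count vectors of two questions. It is an assumption about what such a feature could look like, not the feature set used at Quora; the tokenizer is deliberately naive, and the resulting number would be one input to the binary classifier rather than a duplicate decision on its own.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens (naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(question_a: str, question_b: str) -> float:
    """Cosine similarity between the bag-of-words count vectors of two questions."""
    vec_a = Counter(tokenize(question_a))
    vec_b = Counter(tokenize(question_b))
    # Only terms present in both questions contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty question has no similarity to anything
    return dot / (norm_a * norm_b)
```

Questions that differ only in case or punctuation score close to 1, while questions with no shared terms score 0; paraphrases with different wording need richer features than this one.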
  27. 27. User Trust
  28. 28. Goal: Infer user’s trustworthiness in relation to a given topic ● We take into account: ○ Answers written on topic ○ Upvotes/downvotes received ○ Endorsements ○ ... ● Trust/expertise propagates through the network ● Must be taken into account by other algorithms
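The slide does not say how trust propagates through the network. One common family of techniques for this kind of propagation is personalized-PageRank-style iteration, sketched below as an assumption rather than as the method used in the talk. The endorsement graph, the `base_trust` prior, and the damping factor are all hypothetical inputs.

```python
def propagate_trust(endorsements: dict[str, list[str]],
                    base_trust: dict[str, float],
                    damping: float = 0.85,
                    iterations: int = 50) -> dict[str, float]:
    """Personalized-PageRank-style trust propagation on one topic.

    endorsements: user -> users they endorse (every user must be a key of base_trust).
    base_trust:   prior trust per user from their own activity; should sum to 1.
    """
    users = list(base_trust)
    trust = dict(base_trust)
    for _ in range(iterations):
        # Every user keeps a share of their prior trust.
        new_trust = {u: (1.0 - damping) * base_trust[u] for u in users}
        for u in users:
            targets = endorsements.get(u, [])
            if targets:
                # Split this user's propagated trust evenly among endorsed users.
                share = damping * trust[u] / len(targets)
                for t in targets:
                    new_trust[t] += share
            else:
                # Users who endorse nobody return their trust to the prior.
                for t in users:
                    new_trust[t] += damping * trust[u] * base_trust[t]
        trust = new_trust
    return trust
```

With this scheme a user endorsed by trusted users ends up with more trust than their own activity alone would give them, which matches the slide's statement that trust and expertise flow through the network.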
  29. 29. Trending Topics
  30. 30. Goal: Highlight current events that are interesting for the user ● We take into account: ○ Global “Trendiness” ○ Social “Trendiness” ○ User’s interest ○ ... ● Trending topics are a great discovery mechanism
  31. 31. Moderation
  32. 32. ● Very important for Quora to maintain content quality ● Purely manual approaches do not scale ● Hard to get algorithms 100% right ● ML algorithms detect content/user issues ○ The output of the algorithms feeds manually curated moderation queues
  33. 33. Content Creation Prediction
  34. 34. ● Quora’s algorithms optimize for more than the probability of reading ● Important to predict the probability of a user answering a question ● Parts of our system rely completely on that prediction ○ E.g. A2A (ask to answer) suggestions
  35. 35. Models
  36. 36. ● Logistic Regression ● Elastic Nets ● Gradient Boosted Decision Trees ● Random Forests ● (Deep) Neural Networks ● LambdaMART ● Matrix Factorization ● LDA ● ...
  37. 37. Experimentation
  38. 38. ⚫ Extensive A/B testing, data-driven decision-making ⚫ Separate, orthogonal “layers” for different parts of the system ⚫ Experiment framework showing comparisons for various metrics
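One common way to obtain orthogonal experiment layers is to hash the user ID together with a per-layer name, so that a user's assignment in one layer is statistically independent of their assignment in another. The sketch below shows that general technique; it is an assumption about how layers might be implemented, not a description of Quora's framework, and the layer names are invented.

```python
import hashlib


def bucket(user_id: str, layer: str, num_buckets: int = 100) -> int:
    """Deterministically map a user to a bucket within a given layer."""
    # Salting the hash with the layer name decorrelates assignments across layers.
    digest = hashlib.sha256(f"{layer}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def variant(user_id: str, layer: str, treatment_fraction: float = 0.5) -> str:
    """Assign the user to 'treatment' or 'control' in the given layer."""
    num_buckets = 100
    cutoff = treatment_fraction * num_buckets
    if bucket(user_id, layer, num_buckets) < cutoff:
        return "treatment"
    return "control"
```

For example, `variant("user42", "feed_ranking")` and `variant("user42", "related_questions")` are computed from different hashes, so an experiment in one hypothetical layer does not bias the split in the other.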
  39. 39. Conclusions
  40. 40. • Q&A sites have not only Big, but also “rich” data • Algorithms need to understand and optimize complex aspects such as quality, interestingness, or user expertise • ML is one of the keys to success • Many interesting problems, and many unsolved challenges
  41. 41. Questions?