Maintaining high quality user generated content through machine learning

Maintaining High Quality User-Generated Content Through Machine Learning
Nikhil Dandekar
Quora: Nikhil-Dandekar
Twitter: @nikhilbd
Paula Griffin
Quora: Paula-Griffin-1
Twitter: @paulajgriffin

What is Quora?
Quora is a platform to ask questions, get useful answers, and share what you know with the world.

Incredible answers from credible sources

Not everyone is Peter Norvig.
● Biggest challenges of any user-generated-content site are quality and moderation
● Two (mostly distinct) sets of users to deal with
○ Bad actors trying to cause harm
○ Well-meaning users who miss the mark

Growing challenges
● Millions of questions, answers, users, and topics
○ More incentives for bad actors
○ More users who aren’t familiar with Quora norms
● Without active effort, quality gets worse as we scale
● We need solutions that get better as our content grows

Solving these problems together

Writing the rulebook
● First step: deciding what you want on your platform
● “Be Nice, Be Respectful” policy since before our public launch in 2010
○ No hate speech
○ No harassment
○ No retaliation
● Almost all other policies flow from “being helpful” to someone viewing the page
○ Don’t write joke answers
○ Tag content with appropriate topics

Enforcing the rules
● Users can report content and users for violating Quora’s policies
● Starting out: manual review of all reports
● Problems:
○ Many man-hours needed to review all reports
○ Low reporting rates
○ The worst part: someone actually has to see the bad content

Enforcing the rules at scale
● Heuristics and machine learning help us reduce the burden of handling user reports, and can proactively identify bad content
○ Deal with reported content faster and more cheaply
○ Catch spam, harassment, and other problems before other users see it
○ Automatically fix formatting and grammar in some cases
● Benefits of scale:
○ More content → more choice of good content
○ Ongoing feedback from human review systems
○ More data to train our models

Maintaining high content quality using Machine Learning

Machine Learning for quality: Overview
ML models for quality
● Questions: Adult detection, Question quality classification, Duplicate questions detector, Overly personal question detector, Question autocorrection, etc.
● Answers + Comments: Adult detection, Answer ranking for questions, Answer collapsing, BNBR classifier, Harassment classifier, Spam classifier, etc.
● Topics: Duplicate Topics detector, Bad Topic classifier, etc.
● Users: Bad actor detection, Bad user-credentials classifier, Fake name detection, User-topic bio classifier, etc.
● Classifiers on other content types, e.g. answer wikis

Machine Learning for quality: Overview
Algorithms
● RNNs (LSTMs/GRUs) and other deep networks, Gradient Boosted Decision Trees, Random Forests, Logistic Regression, LambdaMART, k-means and other clustering techniques, k-NN, PageRank, etc.
Libraries
● TensorFlow, Keras, scikit-learn, XGBoost, LightGBM, fastText, RankLib, NLTK, spaCy, etc.

Machine Learning model decision flow
Content → ML model → High-confidence decision?
○ Yes: Take automatic action
○ No: Ask a human to verify the action

ML decision flow examples
● Some examples of this decision flow:
○ Spam detection
○ BNBR violation detection
○ Question quality classifier
○ Duplicate question detection
○ ...and more
● The more nuanced and sensitive the decision, the greater the need for human verification
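
The decision flow can be sketched in a few lines of Python. This is an illustration only, not Quora's implementation: the `Decision` type, the `decide` function, and the 0.95 threshold are all assumptions made for the example, and the model is assumed to output a probability that the content violates a policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    label: str  # the model's predicted label: "violation" or "ok"
    route: str  # "automatic" or "human_review"


def decide(p_violation: float, threshold: float = 0.95) -> Decision:
    """Route one model score through the high-confidence check.

    `threshold` is a hypothetical value; in practice it would be tuned
    per classifier against the cost of a wrong automatic action.
    """
    if not 0.0 <= p_violation <= 1.0:
        raise ValueError("p_violation must be a probability in [0, 1]")
    label = "violation" if p_violation >= 0.5 else "ok"
    # Confidence is the probability assigned to the predicted label.
    confidence = max(p_violation, 1.0 - p_violation)
    route = "automatic" if confidence >= threshold else "human_review"
    return Decision(label=label, route=route)


if __name__ == "__main__":
    for score in (0.99, 0.70, 0.01):
        print(score, decide(score))
```

Under this sketch, a more nuanced or sensitive classifier would simply use a higher `threshold`, which sends a larger share of its decisions to human review.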

Machine Learning data feedback loop
Training data → Train models → Run model on content → User actions and human reviews → Training data
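
One round of this loop might look like the sketch below. Everything in it is a stand-in chosen for illustration: `train` is a toy word-rate scorer rather than a real model, `human_label` plays the role of a reviewer, and the 0.2/0.8 uncertainty band is an assumed value.

```python
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label), where 1 = bad content, 0 = good
Model = Callable[[str], float]


def train(examples: List[Example]) -> Model:
    """Toy stand-in for model training: per-word rate of 'bad' labels."""
    bad, total = Counter(), Counter()
    for text, label in examples:
        for word in set(text.lower().split()):
            total[word] += 1
            bad[word] += label

    def score(text: str) -> float:
        words = [w for w in set(text.lower().split()) if w in total]
        if not words:
            return 0.5  # no evidence either way
        return sum(bad[w] / total[w] for w in words) / len(words)

    return score


def feedback_round(
    examples: List[Example],
    new_content: List[str],
    human_label: Callable[[str], int],
) -> Tuple[List[Example], Model]:
    """Run the model on new content, collect human labels, then retrain."""
    model = train(examples)
    for text in new_content:
        p = model(text)
        if 0.2 < p < 0.8:  # uncertain score: ask a reviewer (band is assumed)
            examples = examples + [(text, human_label(text))]
    return examples, train(examples)
```

Each round grows the training set with exactly the items the current model was unsure about, which is one way the models can improve as content volume grows.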

Case study: Question quality and automatic question correction

“Bad” questions on Quora
● Users often ask questions with grammatical and spelling errors
● Example:
○ Original: Which coin/token is next big thing in crypto currencies? And why?
○ Corrected: Which coin/token is the next big thing in cryptocurrencies? Why?
● These are good questions, but the lack of correct phrasing hurts them
○ Less likely to be answered by experts
○ Harder to catch duplicate questions
○ Can hurt the perception of “quality” of Quora

“Bad” questions on Quora
● Types of errors in questions
○ Grammatical errors, e.g., “How I can ...”
○ Spelling mistakes
○ Missing preposition or article
○ Wrong/missing punctuation
○ Wrong capitalization
○ etc.
● Can we use Machine Learning to automatically correct these questions?
● Started off as an “offroad” hack-week project
● Has since shipped

Automatic question correction: research

Automatic question correction: Model
● Frame the problem as one similar to machine translation
● Final model:
○ Sequence-to-sequence, character-level RNN (GRU) with attention

Automatic question correction: Model
● Model Details:
○ Sequence to sequence (encoder-decoder) model
○ Character-level
○ GRUs (Gated Recurrent Units)
○ Attention-based
○ Bidirectional
○ Beam search for decoding
● Tried solving the subproblems individually, but that didn’t work as well
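
Of the model details listed above, beam search is the easiest to show in isolation. The sketch below is a generic, simplified beam search, not the decoder used at Quora: `step_fn` is a hypothetical stand-in for the trained decoder (it returns next-token probabilities for a prefix), and the toy probability table exists only to make the example runnable. In a character-level model the tokens would be single characters.

```python
import math
from typing import Callable, Dict, List, Tuple

StepFn = Callable[[List[str]], Dict[str, float]]


def beam_search(step_fn: StepFn, start: str, end: str,
                beam_width: int = 3, max_len: int = 20) -> List[str]:
    """Return the highest-scoring sequence found with a fixed-width beam."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [start])]  # (log-prob, tokens)
    finished: List[Tuple[float, List[str]]] = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            for tok, p in step_fn(seq).items():
                if p > 0.0:
                    candidates.append((logp + math.log(p), seq + [tok]))
        if not candidates:
            break
        # Keep only the best `beam_width` extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_width]:
            (finished if seq[-1] == end else beams).append((logp, seq))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


# Toy next-token table (made up): the greedy first choice "a" leads to a
# lower-probability sequence than starting with "b".
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
}


def toy_step(seq: List[str]) -> Dict[str, float]:
    return TABLE.get(seq[-1], {"</s>": 1.0})
```

With `beam_width=1` this reduces to greedy decoding and commits to "a" at the first step; a wider beam keeps "b" alive long enough to find the more probable sequence overall.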

Automatic question correction: System Details
● Training:
○ Training data: Pairs of [bad question, corrected question]
○ TensorFlow, on a single box with GPUs
○ Training time: 2-3 hours
● Serving:
○ TensorFlow, GPU-based serving
○ Latency: <500 ms p99
● Run on new questions added to Quora
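
As a rough illustration of what character-level training pairs look like before they reach the model, the sketch below builds a character vocabulary and encodes one [bad question, corrected question] pair as integer ids. The special tokens, function names, and the example pair are all assumptions for this example; the slides do not describe the actual preprocessing.

```python
from typing import Dict, List, Tuple

PAD, START, END = "<pad>", "<s>", "</s>"  # hypothetical special tokens


def build_vocab(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Map every character seen in the pairs to an integer id."""
    chars = sorted({ch for bad, fixed in pairs for ch in bad + fixed})
    return {tok: i for i, tok in enumerate([PAD, START, END] + chars)}


def encode(text: str, vocab: Dict[str, int], markers: bool = False) -> List[int]:
    """Encode text as character ids, optionally wrapped in START/END."""
    ids = [vocab[ch] for ch in text]
    return [vocab[START]] + ids + [vocab[END]] if markers else ids


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Invert `encode`, dropping the special tokens."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in (PAD, START, END))


# Made-up pair in the spirit of the "How I can ..." error type.
pairs = [("How I can learn Python?", "How can I learn Python?")]
vocab = build_vocab(pairs)
source_ids = encode(pairs[0][0], vocab)                 # encoder input
target_ids = encode(pairs[0][1], vocab, markers=True)   # decoder target
```

Working at the character level keeps the vocabulary tiny and lets a single model handle spelling, punctuation, and capitalization fixes that a word-level vocabulary would struggle to represent.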

Automatic question correction: Results

BNBR classification
● Checks for BNBR (“Be Nice, Be Respectful”) violations on questions, answers, and comments
● Binary classifier
● Training data:
○ Positive: Confirmed BNBR violations
○ Negative: False BNBR reports, other good content
● Model: NN with 1 hidden layer (fastText)
● Same ML decision flow as before
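
For readers who have not used fastText, its supervised mode reads one example per line, with each label prefixed by `__label__`. The sketch below shows how positive and negative examples could be turned into that format. The label names `violation` and `ok`, the helper names, and the sample texts are assumptions for illustration; the slides do not say how the training file was built.

```python
from typing import Iterable, List


def to_fasttext_line(text: str, is_violation: bool) -> str:
    """Format one example as a fastText supervised-training line."""
    label = "__label__violation" if is_violation else "__label__ok"
    # fastText expects one example per line, so collapse newlines and
    # repeated whitespace inside the text.
    return f"{label} {' '.join(text.split())}"


def build_training_lines(positives: Iterable[str],
                         negatives: Iterable[str]) -> List[str]:
    """Positives: confirmed violations. Negatives: false reports, good content."""
    lines = [to_fasttext_line(t, True) for t in positives]
    lines += [to_fasttext_line(t, False) for t in negatives]
    return lines


lines = build_training_lines(
    positives=["nobody asked for your\nworthless opinion"],
    negatives=["Thanks, this answer was really helpful."],
)
# Written to a file, these lines could be passed to the fasttext Python
# package, e.g. fasttext.train_supervised(input="bnbr.train"), where the
# file name is hypothetical.
```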

In conclusion
● Quality is one of the most important problems we face at Quora
● There are various systems to maintain quality, and we need to use all of them in order to keep up
● Machine Learning solutions help us maintain quality at scale
○ ...but you can’t totally bypass human efforts

Thank you!
Nikhil Dandekar
Quora: Nikhil-Dandekar
Twitter: @nikhilbd
Paula Griffin
Quora: Paula-Griffin-1
Twitter: @paulajgriffin
