The advent of Web 2.0 gave birth to a new kind of application in which content is generated through the collaborative contributions of many different users. This form of content generation is believed to produce data of higher quality, since the “wisdom of the crowds” makes its way into the data. However, as is generally the case in real life, there are many issues on which there is no generally accepted opinion. These issues are characterised as controversial. Knowing these issues when reading user-generated content is of major importance in understanding the quality of the data and the trust that should be given to it. In this work we describe a technique that finds these controversial issues by analyzing the edits that have been performed on the data over time. We apply our technique to Wikipedia, the world’s largest known collaboratively generated database, and we report our findings.
Fine-Grained Controversy Detection on Wikipedia
1. Fine-Grained Controversy Detection in Wikipedia
Siarhei Bykau (Purdue University)
Flip Korn (Google Research)
Divesh Srivastava (AT&T Labs-Research)
Yannis Velegrakis (University of Trento)
2. Siarhei Bykau 2
Wikipedia: The Wisdom of Crowds
● Collaborative Content Creation [Giles 2005]
– Up-to-date
– Pluralistic
– Neutral point of view
● Data Quality Problems:
– Reputation & Trust [Adler and Alfaro 2007, Adler et al. 2008]
– Vandalism [Chin et al. 2010, Potthast et al. 2008, Smeth et al. 2008]
– Stability [Druck et al. 2008]
– Controversy
3. Siarhei Bykau 3
Controversy
● A prolonged dispute by a number of people on the same topic *
● Should be distinguished from:
– regular edits
– vandalism
● Detecting it helps in
– preserving the neutral point of view (NPOV)
– requesting supporting evidence
* http://en.wikipedia.org/wiki/Controversy
4. Siarhei Bykau 4
Arab-Israeli Conflict
● Sensitive page, rife with controversial content
– Number of casualties, Israeli per-capita GDP, etc.
5. Siarhei Bykau 5
The Beatles
● Non-sensitive page, with controversial content
– Should it be “The Beatles” or “the Beatles”?
7. Siarhei Bykau 7
Controversy Detection:
Related Work
● machine learning [Kittur et al. 2007]
– # of revisions, # of unique authors, page length
● mutual reinforcement principle [Vuong et al. 2008]
– content is more controversial if page’s controversy is low
● bipolarities in the edit graph [Sepehri Rad and Barbosa 2011]
– nodes = authors
– edges = one author deletes/reverts content written by another
● revert statistics [Yasseri et al. 2012]
– number of authors who revert an article back to a previous version
8. Siarhei Bykau 8
Controversy Detection:
Related Work
● None of these methods detects fine-grained controversies
– WHERE a controversy is located
– WHO is involved in a controversy
– WHEN a controversy occurred
– WHAT the arguments of a controversy are
9. Siarhei Bykau 9
Caesar salad
● Previous work only detects that the Caesar salad page is controversial
The history of this popular salad is a controversial issue, even in the spelling of the name. There is a widely held misconception that it is named after [[Julius Caesar]], but the salad's creation is generally attributed to restaurateur '''[[Cesar Cardini]]''' (an [[Italy|Italian]]-born Mexican). As his daughter Rosa (1928–2003) reported,[2] her father invented the dish when a Fourth of July 1924 rush depleted the kitchen's supplies. Cardini made do with what he had, adding the dramatic flair of the table-side tossing "by the chef".
The history of this popular salad is a controversial issue, even in the spelling of the name. There is a widely held misconception that it is named after '''[[Cesar Cardini]]''', but the salad's creation is generally attributed to [[Julius Caesar]] (an [[Italy|Italian]]-born emperor). As his daughter Rosa (1928–2003) reported,[2] her father invented the dish when a Fourth of July 1924 rush depleted the kitchen's supplies. Cardini made do with what he had, adding the dramatic flair of the table-side tossing "by the chef".
- What are the different alternatives?
- When did the controversy occur?
- Who created the salad?
- After whom is it named?
10. Siarhei Bykau 10
Challenge: Fine-grained Controversies
● Controversies are typically expressed via substitutions
– Not Insertions/Deletions
– Alternating content
...There is a widely held misconception that it is named after [[Julius Caesar]], but the salad's creation is generally attributed to restaurateur '''[[Cesar Cardini]]''' (an [[Italy|Italian]]-born Mexican). As his daughter Rosa (1928–2003) reported,...
...There is a widely held misconception that it is named after '''[[Cesar Cardini]]''', but the salad's creation is generally attributed to [[Julius Caesar]] (an [[Italy|Italian]]-born emperor). As his daughter Rosa (1928–2003) reported,...
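A minimal sketch of how substitutions could be pulled out of two revisions, using the Caesar salad example above. It uses Python's `difflib.SequenceMatcher` as a stand-in for the diff step, not Myers' algorithm; the function name `substitutions`, the whitespace tokenization, and the `max_tokens` cut-off are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def substitutions(old_text, new_text, max_tokens=2):
    """Return (position, old_tokens, new_tokens) for each short substitution.

    Hypothetical helper: tokens are whitespace-separated words, and only
    'replace' operations of at most `max_tokens` tokens on each side count.
    """
    old, new = old_text.split(), new_text.split()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    found = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'replace' means old[i1:i2] was swapped for new[j1:j2];
        # pure insertions and deletions are skipped on purpose.
        if tag == "replace" and i2 - i1 <= max_tokens and j2 - j1 <= max_tokens:
            found.append((i1, old[i1:i2], new[j1:j2]))
    return found


rev_a = "named after Julius Caesar but attributed to Cesar Cardini"
rev_b = "named after Cesar Cardini but attributed to Julius Caesar"
for position, before, after in substitutions(rev_a, rev_b):
    print(position, before, "->", after)
```

On this pair of revisions the sketch reports two substitutions, one for each swapped name, while the unchanged text between them is treated as common context.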
11. Siarhei Bykau 11
Challenge: Track Topic across Revisions
● Positions of edits change significantly across revisions
● Text is ambiguous
● Surrounding context of edit clarifies semantics
– Edits with same or similar context likely refer to the same topic
...There is a widely held misconception that it is named after [[Julius Caesar]], but the salad's creation is generally attributed to restaurateur '''[[Cesar Cardini]]''' (an [[Italy|Italian]]-born Mexican). As his daughter Rosa (1928–2003) reported,...
...There is a widely held misconception that it is named after '''[[Cesar Cardini]]''', but the salad's creation is generally attributed to [[Julius Caesar]] (an [[Italy|Italian]]-born emperor). As his daughter Rosa (1928–2003) reported,...
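One way to make "same or similar context" concrete is sketched below. The slides do not say which similarity measure is used, so Jaccard similarity over the surrounding tokens is an assumption here, as are the helper names `context`, `jaccard`, and `same_topic`. The defaults (radius 8, threshold 0.75) mirror the default parameter values reported in the evaluation setup.

```python
def context(tokens, start, end, radius=8):
    """Tokens within `radius` positions before and after tokens[start:end]."""
    return tokens[max(0, start - radius):start] + tokens[end:end + radius]


def jaccard(a, b):
    """Jaccard similarity of two token collections (assumed measure)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def same_topic(context_a, context_b, threshold=0.75):
    """Treat two edits as the same topic if their contexts are similar enough."""
    return jaccard(context_a, context_b) >= threshold


tokens = "it is named after Julius Caesar but the salad".split()
# Context of the edited span tokens[4:6] ("Julius Caesar") with radius 2.
print(context(tokens, 4, 6, radius=2))
```

Because the context is taken relative to the edit rather than as an absolute offset, two edits can be matched even when their positions in the page shift between revisions.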
12. Siarhei Bykau 12
Challenge: Distinguish from Other Edits
● Cardinality
– # of edits
● Duration
– Lifespan of a controversy
● Plurality
– # of distinct authors
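The three properties above can be computed directly from a cluster of edits. The sketch below is illustrative only: the `Edit` record and the function names are hypothetical, and how the scores are combined into a final ranking is not specified on the slide, so each one is simply usable as a sort key.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Edit:
    """Hypothetical edit record: who made the edit and when."""
    author: str
    timestamp: datetime


def cardinality(cluster):
    """Number of edits in the cluster."""
    return len(cluster)


def duration(cluster):
    """Time between the first and the last edit in the cluster."""
    if not cluster:
        return timedelta(0)
    times = [edit.timestamp for edit in cluster]
    return max(times) - min(times)


def plurality(cluster):
    """Number of distinct authors in the cluster."""
    return len({edit.author for edit in cluster})


def rank_clusters(clusters, score=cardinality):
    """Order clusters from highest to lowest score."""
    return sorted(clusters, key=score, reverse=True)
```

A cluster with many edits, a long lifespan, and several authors scores high on all three functions, which is what separates a dispute from a one-off correction or an act of vandalism.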
13. Siarhei Bykau 13
Challenge: Variability of Text Content
● sequence of wiki links, not words
– Link -> semantic concept
– Wikipedia encourages a high density of wiki links
olive oil, Worcestershire sauce, Julius Caesar, Cesar Cardini, Italy, Mexican, Hollywood
olive oil, Worcestershire sauce, Caesar Cadini, Julius Caesar, Caesar Cadini, Italy, Hollywood
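As a rough illustration of the link model, a revision's wikitext can be reduced to the sequence of its link targets. The evaluation used the JWPL parser for this; the regular expression below is only a simplified stand-in that handles plain `[[Target]]` and `[[Target|label]]` links, and the name `link_sequence` is hypothetical.

```python
import re

# Matches [[Target]] and [[Target|label]]; group 1 is the link target.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def link_sequence(wikitext):
    """Return the link targets of `wikitext` in order of appearance."""
    return [target.strip() for target in WIKI_LINK.findall(wikitext)]


revision = (
    "it is named after [[Julius Caesar]], but the salad's creation is "
    "attributed to '''[[Cesar Cardini]]''' (an [[Italy|Italian]]-born Mexican)"
)
print(link_sequence(revision))
```

Diffing these sequences instead of raw words means a substitution swaps one semantic concept for another, regardless of how the link is worded in the running text.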
14. Siarhei Bykau 14
Challenge: Large Number of Revisions
● 4.5 million content pages, about 100 million revs, 7 TB of data
● scalable controversy detection algorithm (CDA)
● Input: a Wikipedia page with its revision history
– Edit extraction // use Myers' algorithm, find substitutions
– Eliminate edits with low user support
– Cluster edits based on context // use DBSCAN for efficiency
– Cluster and merge the sets of edits based on the subject
● Output: ranked clusters of edits which represent controversies
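The "eliminate edits with low user support" step could look roughly like the sketch below. The slide does not define user support, so this assumes it means the number of distinct authors who made the same substitution in either direction; the function name `filter_by_support` and the tuple layout are hypothetical, and `min_authors=2` mirrors the default "number of authors" parameter from the evaluation setup.

```python
from collections import defaultdict


def filter_by_support(edits, min_authors=2):
    """Keep substitutions backed by at least `min_authors` distinct authors.

    `edits` is an iterable of (author, old, new) tuples. A swap A -> B and
    its reverse B -> A are counted as the same disputed pair (assumption).
    """
    edits = list(edits)
    supporters = defaultdict(set)
    for author, old, new in edits:
        supporters[frozenset((old, new))].add(author)
    return [
        (author, old, new)
        for author, old, new in edits
        if len(supporters[frozenset((old, new))]) >= min_authors
    ]


history = [
    ("alice", "Julius Caesar", "Cesar Cardini"),
    ("bob", "Cesar Cardini", "Julius Caesar"),
    ("carol", "1924", "1925"),
]
print(filter_by_support(history))
```

In this toy history the name swap survives because two authors took part in it, while the single-author date change is dropped before clustering.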
15. Siarhei Bykau 15
Experimental Evaluation Setup
● Dataset: English-language Wikipedia dump from December 2013
– 4.5 million content pages, about 100 million revisions, 7 TB of data
● Implemented CDA in Java, used JWPL parser to discover links
– Baseline identifies controversies based on the number of revisions
Parameter | Range | Default Value
model | link, text | link
radius of context | 2, 4, 6, 8 | 8
max tokens in substitution | 1, 2, 3, 4, 5 | 2
context similarity | [0...1] | 0.75
number of authors | 1, 2, 3, 4, 5 | 2
16. Siarhei Bykau 16
Sources of Controversy
● Wikipedia Provided Controversies (WPC)
– Metrics:
● Recall
● User surveys
– Metrics:
● noise/signal ratio
● Top1 Precision
● # of distinct controversies
17. Siarhei Bykau 17
Recall on selected WPC
● Baseline: adapted from [Kittur et al. 2007]
● Text model has higher recall than link model; baseline is worst
18. Siarhei Bykau 18
Recall on full WPC using Text Model
● Text model can retrieve 117 out of 263 WPCs in the top-10 results
– Clean controversies do not have irrelevant substitutions
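For reference, 117 of 263 corresponds to a top-10 recall of roughly 44%. The sketch below shows one way such a figure could be computed; the data layout (one ranked list of detected topics per page, one known controversy per page) and the name `recall_at_k` are assumptions for illustration, not the evaluation code used in the study.

```python
def recall_at_k(results, known, k=10):
    """Fraction of known controversies found in the top-k results of their page.

    results: page title -> ranked list of detected topics (hypothetical layout)
    known:   page title -> known controversial topic
    """
    if not known:
        return 0.0
    hits = sum(
        1 for page, topic in known.items()
        if topic in results.get(page, [])[:k]
    )
    return hits / len(known)


results = {
    "Chopin": ["birthday", "nationality", "photo"],
    "Bolzano": ["language"],
}
known = {"Chopin": "nationality", "Bolzano": "name spelling"}
print(recall_at_k(results, known))
```

In this toy example one of the two known controversies is recovered, giving a recall of 0.5.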
19. Siarhei Bykau 19
New Previously Unknown Controversies
page | WPC | New controversies
Chopin | nationality | birthday, photo, name
Avril Lavigne | song spelling | music genre, birthplace, religion
Bolzano | name spelling | language
Futurama | verb spelling | TV, seasons, channel
Freddie Mercury | origin | name spelling, image
20. Siarhei Bykau 20
Precision
● Link model has considerably higher precision than text model
– For many (cardinality, duration, plurality) ranking functions
[Figure panels: Link Model, Text Model]
21. Siarhei Bykau 21
Substitutions vs Insertions/Deletions
metric | link | text | link ins/del | text ins/del | baseline
noise/signal | 0.19 | 0.25 | 0.64 | 0.57 | 0.75
# of dist contr | 65 | 80 | 29 | 25 | 17
● Link model with substitutions has the lowest noise/signal ratio
● Models with insertions/deletions have a very high noise/signal ratio
● Text model with substitutions finds the highest # of controversies
● Models with insertions/deletions find a low number of controversies
22. Siarhei Bykau 22
Experiment Takeaways
● Text model with substitutions has a higher recall
– Able to retrieve 23% more controversies among WPC
● Link model with substitutions has a much higher precision
– Use of semantic concepts in wiki links doubles the precision
● Cardinality, duration, plurality: good ranking functions
– Validates the definition of controversy
23. Siarhei Bykau 23
Conclusions
● Detection of fine-grained controversies in Wikipedia
– answer Where, What, Who and When questions
● Link model generates more semantically meaningful controversies than text model
● Experimental evaluation shows the efficiency and effectiveness of the proposed solutions