1. WWW 2014
Seoul, April 8th
SNOW 2014 Data Challenge
Symeon Papadopoulos (CERTH)
David Corney (RGU)
Luca Aiello (Yahoo! Labs)
2. Overview of Challenge
• Goal: Detection of newsworthy topics in a large and
noisy set of tweets
• Topic: a news story represented by a headline + tags
+ representative tweets + representative images
(optional) – see the sketch after this list
• Newsworthy: a topic that is eventually covered by
at least some major online news sources
• Topics are detected per timeslot (short, equally
sized time intervals)
• At most a fixed maximum number of topics may be
submitted per timeslot
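As a concrete illustration, a submitted topic could be modelled as below. This is a minimal sketch; the field names are ours, not the official challenge submission format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Topic:
    """One detected topic for a single timeslot (illustrative fields only,
    not the official SNOW 2014 submission format)."""
    headline: str                    # short news-style title
    tags: List[str]                  # descriptive keywords
    tweets: List[str]                # representative tweets (IDs or texts)
    image_urls: List[str] = field(default_factory=list)  # optional images

# Example of a (hypothetical) topic detected in one timeslot
topic = Topic(
    headline="Parliament votes on emergency budget measures",
    tags=["parliament", "budget"],
    tweets=["Live now: MPs debating the emergency budget bill ..."],
)
```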
4. Some statistics
• Registered participants: 25
– India: 4, Belgium: 3, Germany: 3, UK: 3, Greece: 3,
Ireland: 2, USA: 2, France: 2, Italy: 1, Spain: 1, Russia: 1
• Participants that signed the Challenge agreement: 19
• Participants that submitted results: 11
• Participants that submitted papers: 9
5. Evaluation Protocol
• Defined several evaluation criteria:
– Newsworthiness: precision/recall, F-score
– Readability: 1-5 scale
– Coherence: 1-5 scale
– Diversity: 1-5 scale
• Compiled a list of reference topics
• Set up precise evaluation guidelines
• Blind evaluation (i.e. the evaluator is not aware of
which method a topic comes from) based on a Web UI
• Participants submitted topics for 96 timeslots, but
manual evaluation was performed on 5 sample timeslots
• Result validation and analysis
6. Teams key
Key Team
A UKON
B IBCN
C ITI
D math-dyn
E Insight
F FUB-TORV
G PILOTS
H RGU
I UoGMIR
J EURECOM
K SNOWBITS
References to the submitted papers will be
included in the overview paper in the
workshop proceedings.
7. Results – Reference topic recall
Team Recall Rank
A 0.44 5
B 0.58 4
C 0.32 7
D 0.63 2
E 0.66 1
F 0.39 6
G 0.24 8
H 0.60 3
I 0.17 10
J 0.24 8
K 0.14 11
Recall computed with respect
to 59 reference topics. These
were partitioned into three
groups (20, 20, 19), and each
of the three evaluators
manually matched the
participants' topics against
the reference topics assigned
to them.
Eval. Pair Correlation
Eval. 1 – Eval. 2 0.894913
Eval. 1 – Eval. 3 0.930247
Eval. 2 – Eval. 3 0.811976
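The pairwise agreement above could be computed, for instance, as Pearson correlations between the evaluators' per-team scores. A minimal sketch with toy numbers (the real per-evaluator scores are not reproduced here):

```python
import numpy as np

# Per-team scores assigned by each of the three evaluators
# (toy values; one row per team, one column per evaluator).
scores = np.array([
    [0.45, 0.43, 0.44],
    [0.60, 0.55, 0.59],
    [0.30, 0.35, 0.31],
    [0.20, 0.24, 0.22],
])

# Pairwise Pearson correlations between evaluators (columns as variables)
corr = np.corrcoef(scores, rowvar=False)
print(corr[0, 1], corr[0, 2], corr[1, 2])  # Eval.1-2, Eval.1-3, Eval.2-3
```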
8. Results – Pooled topic recall (1/2)
• Each evaluator independently evaluated the topics
of each participant as newsworthy or not
• Selected all topics that were marked as newsworthy
by at least two evaluators
• Manually extracted the unique topics (70 in total,
partially overlapping with reference topic list)
• Manually matched the correct topics of each
participant to the list of newsworthy topics
• Computed precision, recall and F-score, as in the
sketch below
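A minimal sketch of the metric arithmetic, assuming (consistently with the table on the next slide) that precision = matched topics / total submitted and recall = unique pooled topics covered / 70; the newsworthiness judgments and the matching itself were manual:

```python
def pooled_scores(matched: int, unique: int, total: int, n_pooled: int = 70):
    """Precision/recall/F-score for one team's submitted topics.
    precision: fraction of submitted topics matched to a pooled topic
    recall:    fraction of the 70 pooled newsworthy topics covered"""
    precision = matched / total
    recall = unique / n_pooled
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return precision, recall, f_score

# Team A from the next slide: 13 matched, 13 unique, 27 submitted
print(tuple(round(x, 3) for x in pooled_scores(13, 13, 27)))
# -> (0.481, 0.186, 0.268), as in the table
```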
9. Results – Pooled topic recall (2/2)
Team Matched Unique Total Prec Rec F-score Rank
A 13 13 27 0.481 0.186 0.268 6
B 12 12 23 0.522 0.171 0.258 7
C 22 15 50 0.44 0.214 0.288 4
D 18 14 39 0.462 0.2 0.279 5
E 28 25 50 0.56 0.357 0.436 1
F 4 2 15 0.267 0.029 0.052 10
G 4 4 10 0.4 0.057 0.099 9
H 19 17 49 0.388 0.243 0.299 3
I 36 15 45 0.8 0.214 0.338 2
J 1 1 8 0.125 0.014 0.027 11
K 8 7 10 0.8 0.1 0.178 8
10. Results – Readability
Team Readability Rank
A 4.29 9
B 4.92 2
C 4.49 7
D 4.59 6
E 4.74 4
F 4.18 10
G 4.93 1
H 4.71 5
I 4.8 3
J 3.38 11
K 4.32 8
Eval. Pair Correlation
Eval. 1 – Eval. 2 0.902124
Eval. 1 – Eval. 3 0.357733
Eval. 2 – Eval. 3 0.278632
11. Results – Coherence
Team Coherence Rank
A 4.4 6
B 4.08 9
C 4.68 5
D 4.91 2
E 4.97 1
F 4.78 4
G 4.83 3
H 4.22 8
I 3.95 10
J 3.75 11
K 4.36 7
Eval. Pair Correlation
Eval. 1 – Eval. 2 0.549512
Eval. 1 – Eval. 3 0.730684
Eval. 2 – Eval. 3 0.684426
12. Results – Diversity
Team Diversity Rank
A 2.12 7
B 2.36 4
C 2.31 6
D 2.11 8
E 2.11 8
F 2 10
G 1.92 11
H 3.27 2
I 2.36 4
J 2.5 3
K 3.47 1
Eval. Pair Correlation
Eval. 1 – Eval. 2 0.873365
Eval. 1 – Eval. 3 0.890415
Eval. 2 – Eval. 3 0.905915
13. Results – Image Relevance
Team Precision (%) Rank
A 54.19 3
B 31.75 5
C 58.09 2
D 52.04 4
E 27.39 6
F 0 8
G 0 8
H 58.82 1
I 0 8
J 0 8
K 18.45 7
Eval. Pair Correlation
Eval. 1 – Eval. 2 0.944946
Eval. 1 – Eval. 3 0.919469
Eval. 2 – Eval. 3 0.79596
14. Results – Aggregate (1/2)
• For each criterion Ci, we computed the score of each
team relative to the best team for this criterion:
Ci*(team) = Ci(team) / maxj Ci(teamj)
• We then aggregated the normalized scores:
Ctot = 0.25*Cref*Cpool + 0.25*Cread + 0.25*Ccoh + 0.25*Cdiv
where Cref is computed from the recall of reference
topics, Cpool from the F-score of the pooled topics,
and Cread, Ccoh and Cdiv from readability, coherence
and diversity respectively.
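Plugging the scores from the earlier result slides into these formulas reproduces the aggregate scores on the next slide. A minimal sketch for two of the teams, with the per-criterion maxima hard-coded from the same tables:

```python
# Criterion scores from slides 7 and 9-12 for two teams, plus the
# best score per criterion over all teams (needed for normalization).
raw = {
    "E": {"ref": 0.66, "pool": 0.436, "read": 4.74, "coh": 4.97, "div": 2.11},
    "H": {"ref": 0.60, "pool": 0.299, "read": 4.71, "coh": 4.22, "div": 3.27},
}
best = {"ref": 0.66, "pool": 0.436, "read": 4.93, "coh": 4.97, "div": 3.47}

def aggregate(scores: dict) -> float:
    """Normalize each criterion by the best team's value, then combine:
    Ctot = 0.25*Cref*Cpool + 0.25*Cread + 0.25*Ccoh + 0.25*Cdiv"""
    n = {c: scores[c] / best[c] for c in best}
    return (0.25 * n["ref"] * n["pool"]
            + 0.25 * n["read"] + 0.25 * n["coh"] + 0.25 * n["div"])

for team, scores in raw.items():
    print(team, round(aggregate(scores), 3))
# E 0.892, H 0.843 (slide reports 0.842; difference due to rounded inputs)
```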
15. Results – Aggregate (2/2)
Team Score Rank
A 0.694 7
B 0.755 4
C 0.710 5
D 0.785 3
E 0.892 1
F 0.614 10
G 0.652 9
H 0.842 2
I 0.662 8
J 0.546 11
K 0.70987 6
We tried several alternative
aggregation scores. The top
three teams were the same!
16. Program
15:20-15:30: Carlos Martin-Dancausa and Ayse Goker: Real-time topic detection with
bursty n-grams.
16:00-16:20: Gopi Chand Nutakki, Olfa Nasraoui, Behnoush Abdollahi, Mahsa Badami,
Wenlong Sun: Distributed LDA-based topic modelling and topic agglomeration in a
latent space.
16:20-16:40: Steven van Canneyt, Matthias Feys, Steven Schockaert, Thomas
Demeester, Chris Develder, Bart Dhoedt: Detecting newsworthy topics in Twitter.
16:40-17:00: Georgiana Ifrim, Bichen Shi, Igor Brigadir: Event detection in Twitter
using aggressive filtering and hierarchical tweet clustering.
17:00-17:20: Gerard Burnside, Dimitrios Milioris, Philippe Jacquet: One day in Twitter:
Topic detection via joint complexity.
17:20-17:30: Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris: Two-level
message clustering for topic detection in Twitter.
17:30-17:40: Winners’ announcement!
17. Limitations – Lessons Learned
• Time was not taken into account
– however, methods that produce a newsworthy topic earlier
should be rewarded
• Image relevance was not included in the aggregate score
– since we considered images an optional field
• Coherence and diversity had extreme values in
numerous cases
– e.g. when a single relevant tweet was provided as
representative
• Evaluation turned out to be a very complex task!
• Assessing only five slots (out of the 96) is definitely a
compromise: (a) consider using more evaluators/AMT,
(b) consider simpler evaluation tasks
18. Plan
• Release evaluation resources
– list of reference topics
– list of pooled newsworthy topics
– evaluation scores
• Papers
– SNOW Data Challenge paper
– Resubmission of participants' papers in CEUR style
– Submission to CEUR-ws.org
• Open-source implementations?
• Further plans?