SED2012 Dataset
1. The 2012 Social Event Detection Dataset
Symeon Papadopoulos¹, Emmanouil Schinas¹, Vasileios Mezaris¹,
Raphaël Troncy², Yiannis Kompatsiaris¹
¹ CERTH-ITI, Thessaloniki, Greece
² EURECOM, Sophia Antipolis, France
Oslo, 28 Feb - 1 Mar 2013
2. SED2012 Overview
• Large collection (>160K) of CC-licensed Flickr
photos and some of their metadata
• Event annotations for 149 target events (of
specific categories and locations of interest)
• Primary use: Social event detection
– Used in the context of MediaEval 2012 (SED task)
• Secondary uses: image geotagging,
distractors in CBIR, city summarization
3. Dataset Overview
Flickr photo collection
• 167,332 photos
• 4,422 unique contributors
• Creative Commons licenses
Event Annotations
• Challenge 1: Technical events in Germany
• Challenge 2: Soccer events in Hamburg and Madrid
• Challenge 3: Indignados movement events in Madrid
4. Data Collection Process
• Flickr API: http://www.flickr.com/services/api/
• Used method flickr.photos.search with five
geographical centres:
Barcelona, Cologne, Hamburg, Hannover, Madrid
• Time period: Jan 2009 – Dec 2011
• All photos CC licensed
• 403 photos from the EventMedia collection
R. Troncy, B. Malocha, and A. Fialho. Linking Events with Media. In 6th Intern.
Conference on Semantic Systems (I-SEMANTICS), Graz, Austria, 2010
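The harvesting step above can be sketched as a parameterised call to the Flickr REST API. This is a minimal illustration, not the dataset's actual collection script: the API key is a placeholder, and the search radius and `extras` fields are assumptions.

```python
# Sketch of one flickr.photos.search request per geographical centre,
# restricted to the dataset's time window and CC licenses.
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"

def build_search_request(lat, lon, radius_km=10, page=1):
    """Build a flickr.photos.search query around one geographical centre."""
    params = {
        "method": "flickr.photos.search",
        "api_key": "YOUR_API_KEY",        # hypothetical placeholder
        "lat": lat,
        "lon": lon,
        "radius": radius_km,              # assumed radius, in km
        "min_taken_date": "2009-01-01",   # Jan 2009 ...
        "max_taken_date": "2011-12-31",   # ... Dec 2011
        "license": "1,2,3,4,5,6",         # Creative Commons license ids
        "extras": "date_taken,geo,tags,license",
        "format": "json",
        "nojsoncallback": 1,
        "page": page,
    }
    return FLICKR_REST + "?" + urlencode(params)

# One such request (paged) per centre:
# Barcelona, Cologne, Hamburg, Hannover, Madrid
url = build_search_request(40.4168, -3.7038)  # Madrid
```

Paging through the results of five such queries, keeping only geotagged, CC-licensed photos, yields a collection of the kind described above.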
6. Dataset Collection Motivation
Selection of five cities (three German, two Spanish):
• Include a large number of photos with non-English text
metadata (cf. language distribution table)
• Ensure existence of numerous events for the target types
• Include distractor images:
– Challenge 2: Cologne and Hannover as distractors for Hamburg;
Barcelona as a distractor for Madrid
– Challenge 3: Barcelona as a distractor for Madrid
Selection of only geotagged photos:
• Ease of annotation
Selection of only CC-licensed photos:
• Reuse of collection for research
7. Tag Statistics (1/2)
[Figure: distribution of the number of users using each tag; 51,611 unique tags; location-specific and event-specific tags are the most prevalent]
8. Tag Statistics (2/2)
[Figure: tags-per-photo and tag-frequency distributions; the most frequent tags include barcelona, spain, madrid]
• >20K photos have no tags; 83.9% of photos have at most 10 tags
• >57% of tags appear once or twice; >40K tags appear fewer than 10 times
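The long-tail statistics above are simple counts over the per-photo tag lists. A minimal sketch on toy data (the real numbers come from the full metadata, and the photo ids and tags below are invented for illustration):

```python
# Compute unique-tag count, rare tags (used once or twice), and
# untagged photos from a photo-id -> tag-list mapping.
from collections import Counter

photo_tags = {                      # toy example, not real dataset entries
    "p1": ["madrid", "soccer"],
    "p2": ["madrid"],
    "p3": [],                       # an untagged photo
    "p4": ["barcelona", "soccer", "concert"],
}

tag_freq = Counter(t for tags in photo_tags.values() for t in tags)

unique_tags = len(tag_freq)
rare = sum(1 for c in tag_freq.values() if c <= 2)   # tags used once or twice
untagged = sum(1 for tags in photo_tags.values() if not tags)

print(unique_tags, rare, untagged)  # → 4 4 1
```

Run over the full collection, the same counters yield the 51,611 unique tags and the long-tail percentages reported above.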
9. User Statistics
• 60% of users contributed fewer than 10 photos
• The 30 most active users contribute ~30% of the dataset
10. Ground Truth Creation
• Manual annotation using CrEve
– web-based annotation tool
– two-round annotation by five annotators (three in the
first round, two in the second)
– interactive annotation (search & annotate)
– each round terminated as soon as no new event-related
photos were discovered
– approximate effort: 100 person-hours
C. Zigkolis, S. Papadopoulos, G. Filippou, Y. Kompatsiaris, A. Vakali. Collaborative Event
Annotation in Tagged Photo Collections. Multimedia Tools & Applications, 2012
• Annotations for Challenge 1 enriched by EventMedia
(403 photos featuring technical events in Germany)
11. Ground Truth Statistics (1/3)
• 10 events associated with >100 photos
• ~27% of events associated with only 1 or 2 photos
12. Ground Truth Statistics (2/3)
• 106 events are captured by a single user; 9 events are
captured by more than 10 people
• The majority of events last less than a day (typical for soccer)
• Some photos carry erroneous timestamps
13. Ground Truth Statistics (3/3)
[Map: Madrid events, concentrated around the Santiago Bernabéu stadium, Puerta del Sol, the Stadium of Butarque, and the Vicente Calderón stadium]
17. Evaluation
• F-measure (macro), Precision, Recall
– measure the goodness of the retrieved photos, but not
how well they are clustered into events
• Normalized Mutual Information (NMI)
– compares automatically extracted clustering of
photos into events with the ground truth
• An evaluation script is made available together
with the dataset.
• Implementation of event detection available:
http://mklab.iti.gr/project/sed2012_certh
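To make the NMI criterion concrete, here is a small self-contained sketch. It assumes the common geometric-mean normalisation MI / √(H(pred)·H(truth)); the official evaluation script shipped with the dataset may normalise differently.

```python
# Normalized mutual information between a predicted clustering of photos
# into events and the ground-truth event assignment.
import math
from collections import Counter

def nmi(pred, truth):
    """NMI of two clusterings, given as equal-length label lists
    (one label per photo); labels themselves need not match."""
    n = len(pred)
    p_c = Counter(pred)              # predicted cluster sizes
    t_c = Counter(truth)             # ground-truth event sizes
    joint = Counter(zip(pred, truth))
    # Mutual information of the two partitions
    mi = sum(c / n * math.log((c * n) / (p_c[a] * t_c[b]))
             for (a, b), c in joint.items())
    # Entropy of a partition from its cluster sizes
    h = lambda counts: -sum(c / n * math.log(c / n) for c in counts.values())
    denom = math.sqrt(h(p_c) * h(t_c))
    return mi / denom if denom else 1.0

# Identical partitions (up to label renaming) score 1.0;
# independent partitions score 0.0.
assert abs(nmi([0, 0, 1, 1], ["a", "a", "b", "b"]) - 1.0) < 1e-9
assert abs(nmi([0, 1, 0, 1], ["a", "a", "b", "b"])) < 1e-9
```

Unlike precision/recall on the retrieved set, this score is invariant to cluster labels and directly rewards grouping the right photos into the same event.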