Social Data and Multimedia Analytics for News and Events Applications lecture given at 2015 IEEE SPS Italy Chapter Summer School on Signal Processing (S3P)
Processing Large Complex Data
1. Processing Large Complex Data
Social Data and Multimedia Analytics for News and Events Applications
Dr. Yiannis Kompatsiaris, ikom@iti.gr
Multimedia, Knowledge and Social Media Analytics Lab, Head
CERTH-ITI
2015 IEEE SPS Italy Chapter Summer School on Signal Processing (S3P)
2. S3P 2015, Garda Lake, Italy: Processing Large Complex Data
Overview
• Introduction
  – Motivation
  – Challenges
• Example Use Cases
• Research Approaches
  – Large-Scale visual search
  – Graphs - Community Detection - Clustering
  – Social Event Detection
  – Verification
• Demos – Applications
  – MM News Demo
  – ClusTTour
  – Thessfest
• Evaluation - Benchmarking
• Conclusions
3. Introduction
Motivation
Example Applications
Conceptual Architecture
Challenges
4. Pope Francis, Pope Benedict
2007: iPhone release
2008: Android release
2010: iPad release
http://petapixel.com/2013/03/14/a-starry-sea-of-cameras-at-the-unveiling-of-pope-francis/
5. http://www.puzzlemarketer.com/digital-social-brands-in-60-seconds/ (Apr, 2012)
6. rise of the networks
7. Social Networks as Graphs
social web as a graph
nodes = twitter users
edges = retweets on #jan25 hashtag
announcement of Mubarak’s resignation
http://gephi.org/2011/the-egyptian-revolution-on-twitter/
8. Social Networks as Graphs
“Social networks have emergent properties. Emergent properties are new attributes of a whole that arise from the interaction and interconnection of the parts”
• Emotions, Health, Sexual relationships do not depend just on our connections (e.g. number of them) but on our position - structure in the social graph
  – Central
  – Hub
  – Outlier
  – Transitivity (connections between friends)
9. Social Networks as Real-Life Sensors
• Social networks are a data source with an extremely dynamic nature that reflects events and the evolution of community focus (users’ interests)
• The huge penetration of smartphones and mobile devices provides real-time and location-based user feedback
• Transform individually rare but collectively frequent media into meaningful topics, events, points of interest, emotional states and social connections
• Present them in an efficient way for a variety of applications (news, marketing, science, health, entertainment)
10. Social Media aspects
[Figure labels: Caption, Time, User Profile, Favs, Comms, Tags]
11. Examples - Science
Xin Jin, Andrew Gallagher, Liangliang Cao, Jiebo Luo, and Jiawei Han. The wisdom of social multimedia: using flickr for prediction and forecast, International conference on Multimedia (MM '10). ACM.
“…if you're more than 100 km away from the epicenter [of an earthquake] you can read about the quake on Twitter before it hits you…”
Many Twitter examples at: What can Twitter tell us about the real world? Twitter and the Real World, CIKM'13 Tutorial, https://sites.google.com/site/twitterandtherealworld/home
12. Examples - Science
13. Examples - Science
Be careful of correlation diagrams
14. Example – News (Boston bombing)
“Following the Boston Marathon bombings, one quarter of Americans reportedly looked to Facebook, Twitter and other social networking sites for information, according to The Pew Research Center. When the Boston Police Department posted its final “CAPTURED!!!” tweet of the manhunt, more than 140,000 people retweeted it.”
“Authorities have recognized that one [of] the first places people go in events like this is to social media, to see what the crowd is saying about what to do next”
"I have been following my friend's Facebook [account] who is near the scene and she is updating everyone before it even gets to the news”
15. Example – Crisis – Humanitarian (Syria)
Syria Tracker offers a crisis mapping system that uses crowdsourced text, photo and video reports and data mining techniques, forming a live map of the Syrian conflict since March 2011
…stream of content-filtered media from news, social media (Twitter and Facebook) and official sources
16. Events - Festivals
http://www.eventmanagerblog.com/uploads/2012/12/event-technology-infographic.jpg
17. Many other examples: smellymaps
Smell-related words in geo-located social media
http://researchswinger.org/smellymaps/
18. Conceptual Architecture
[Diagram. Stages: CRAWLING, INDEXING, MINING, RANKING, VISUALIZATION. Component labels: API Wrapper, Website Wrapper, Scheduler, Crawling Specs, Sources, Media Fetcher, Visual Indexing, Near-duplicates, Text Indexing, SNA, Sentiment - Influence, Trends - Topics, Model Building, Concepts, Relevance, Diversity, Popularity, Veracity, Interaction, Responsiveness, Aggregation, Aesthetics]
19. Challenges – Content (Mining)
• Multi-modality: e.g. image + tags, video, audio
• Rich social context: spatio-temporal, social connections, relations and social graph
• Specific messages: short, conversations, errors, no context
• Inconsistent quality: noise, spam, fake, propaganda
• Huge volume: massively produced and disseminated
• Multi-source: may be generated by different applications and user communities
• Dynamic: fast updates, real-time
20. Policy – Licensing – Legal challenges
• Fragmented access to data
  – Separate wrappers/APIs for each source (Twitter, Facebook, etc.)
  – Different data collection/crawling policies
• Limitations imposed by API providers (“Walled Gardens”)
  – Full access to data impossible or extremely expensive (e.g. see data licensing plans for GNIP and DataSift)
  – Non-transparent data access practices (e.g. access is provided to an organization/person if they have a contact in Twitter)
• Constant change of model and ToS of social APIs
  – No backwards compatibility, additional development costs
• Ephemeral nature of content
  – Social search results often lead to removed content → inconsistent and unreliable referencing
• User Privacy & Purpose of use
  – Fuzzy regulatory framework regarding mining user-contributed data
21. Example Use Cases
Events and News
22. SocialSensor Project Objective
SocialSensor quickly surfaces trusted and relevant material from social media – with context.
[Diagram labels: DySCO (behaviour, location, time, content, usage, social context); Massive social media and unstructured web; Social media mining; Aggregation & indexing; News - Infotainment; Personalised access; Ad-hoc P2P networks]
23. “It has changed the way we do news” (MSN)
“Social media is the key place for emerging stories – internationally, nationally, locally” (BBC)
“Social media is transforming the way we do journalism” (New York Times)
Source: picture alliance / dpa
24. “It’s really hard to find the nuggets of useful stuff in an ocean of content” (BBC)
“Things that aren’t relevant crowd out the content you are looking for” (MSN)
“The filters aren’t configurable enough” (CNN)
Source: Getty Images
25. Verification was simpler in the past...
Source: Frank Grätz
26. News Use Case Requirements
Quickly surface trusted and relevant material from social media – with context.
• “quickly”: in real time
• “surfaces”: automatically discovers, clusters and searches
• “trusted”: automatic support in verification process
• “relevant”: to the specific event
• “material”: any material (text, image, audio, video = multimedia), aggregated with other sources (e.g. web)
• “social media”: across all relevant social media platforms
• “with context”: location, time, sentiment, influence
27. Infotainment
• Events with large numbers of visitors
• Thessaloniki International Film Festival
  – 80,000 viewers / 100,000 visitors in 10 days
  – 150 films, 350 screenings
• Discovery and presentation of relevant aggregated social media
  – Trending Topics
  – Sentiment
  – Tweet – film matching
  – Visualization (Social Walls)
28. Conceptual Architecture and Main components
[Diagram layers: SEMANTIC MIDDLEWARE; Public Data; SEARCH & RECOMMENDATION; USER MODELLING & PRESENTATION; INDEXING; MINING; STORAGE; DATA COLLECTION / CRAWLING]
• Real-time dynamic topic and event clustering
• Trend, popularity and sentiment analysis
• Calculate trust/influence scores around people
• Personalized search, access & presentation based on social network interactions
• Semantic enrichment and discovery of services
29. Research Approaches
Large-Scale Visual Search
Graphs – Clustering/Community Detection
Visual Event Summarization
Social Media Verification
30. Scalable visual feature aggregation & indexing
• Problem: Example-based image search
  – Find images that represent the same or a similar object or scene as a given query image
  – Viewed from different viewpoints, with occlusions and clutter
• Challenge: Large-scale
  – Searching databases with tens of millions of images
  – Objectives to be fulfilled:
    • Sufficient discriminative power
    • Fast response times
    • Efficient memory usage
31. Large-scale visual search
[Pipeline diagram: image collection from social media/Web → image local feature extraction → feature aggregation → feature indexing → kNN visual similarity search. Applications: concept-based image annotation, image clustering, image (geo)tagging, concept-based search/filtering, duplicate detection]
32. Framework
• Implementation and evaluation of the effectiveness of VLAD in combination with SURF
• Scalable image indexing
[Pipeline diagram: image → local descriptor extraction (SIFT / SURF) → set of local descriptors → descriptor aggregation (BOW / VLAD) → fixed size vector → dimensionality reduction (PCA) → low dimensional vector → encoding & indexing (PQ + ADC/IVFADC)]
E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, G. Tsoumakas, I. Vlahavas, "A Comprehensive Study over VLAD and Product Quantization in Large-scale Image Retrieval", IEEE Transactions on Multimedia 16(6), pp. 1713-1728, October 2014.
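The descriptor aggregation stage (VLAD) can be illustrated with a minimal sketch in plain Python. It assumes a toy 2-dimensional codebook and hand-made descriptors (real SURF descriptors have 64 dimensions); the function name `vlad` and all data are hypothetical and this is not the authors' implementation.

```python
import math

def vlad(descriptors, codebook):
    """Aggregate local descriptors into one fixed-size vector.

    Each descriptor is assigned to its nearest codebook centroid and the
    residuals (descriptor - centroid) are summed per centroid. The k * d
    sums are concatenated and L2-normalized.
    """
    k, d = len(codebook), len(codebook[0])
    sums = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # nearest centroid by squared Euclidean distance
        nearest = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, codebook[c])),
        )
        for j in range(d):
            sums[nearest][j] += x[j] - codebook[nearest][j]
    flat = [v for row in sums for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Toy data (assumption): 2 centroids, 3 descriptors, 2 dimensions each
codebook = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
vector = vlad(descriptors, codebook)  # length k * d = 4
```

The output length depends only on the codebook size and descriptor dimension, not on how many local descriptors the image has, which is what makes the vector suitable for PCA and indexing afterwards.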
33. Scalable indexing of features
• ADC 16x8 requires 16 bytes per image
  – ~67M images per GB
• IVFADC requires 4 additional bytes per image
  – ~53.6M images per GB
• In the current implementation we achieve only half of the above numbers due to using short int[] instead of byte[], but this can be improved.
• Ideally, 1 billion images could be indexed on a server with 20GB of RAM (projection).
• Query time (for 1M vectors):
  – Exhaustive search of VLAD vectors (d’=128): 0.50 sec
  – Product Quantization with ADC 16x8: 0.10 sec (x5 faster)
  – Product Quantization with IVFADC 16x8: 0.02 sec (x25 faster)
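The memory figures above follow from simple arithmetic, sketched below. The sketch assumes that "16x8" means 16 sub-quantizer codes of 8 bits each (16 bytes per image) and that "GB" in the per-GB figures means 2^30 bytes, which is what the ~67M figure implies; the helper name is hypothetical.

```python
GIB = 2 ** 30  # bytes per GB as implied by the ~67M figure (assumption)

def images_per_gb(bytes_per_image):
    """Number of whole image codes that fit in one GB."""
    return GIB // bytes_per_image

adc = images_per_gb(16)          # ADC 16x8: 16 codes x 8 bits = 16 bytes
ivfadc = images_per_gb(16 + 4)   # IVFADC: 4 additional bytes per image

# Projection for 1 billion images at 20 bytes each (decimal gigabytes)
bytes_for_billion = 10 ** 9 * 20
```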
34. VLAD+SIFT vs. VLAD+SURF
Accuracy vs. dimensionality
• VLAD+SURF improves on VLAD+SIFT and FV+SIFT across all dimensions in both the Holidays and Oxford datasets
Results in rows starting with * are taken from Jégou et al., 2011, hence the missing values for some entries. SIFT corresponds to PCA-reduced SIFT, which yielded better results than standard SIFT in Jégou et al., 2011
35. Clustering – Community Detection
36. graph
G = (V, E): V = nodes, E = edges
An abstract data type representing relationships or connections
37. Some Examples
Webpage www.x.com: href=“www.y.com”, href=“www.z.com”
Webpage www.y.com: href=“www.x.com”, href=“www.a.com”, href=“www.b.com”
Webpage www.z.com: href=“www.a.com”
[Graph with nodes x, y, z, a, b]
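The webpage example can be turned into a directed graph with a few lines of Python. This is a sketch only: the short node names stand for the www.*.com pages on the slide, and the helper `build_graph` is a hypothetical name.

```python
# Outgoing hyperlinks per page, as listed in the example
pages = {
    "x": ["y", "z"],
    "y": ["x", "a", "b"],
    "z": ["a"],
}

def build_graph(links):
    """Return (nodes, edges) of the directed hyperlink graph."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)  # pages that are only linked to, e.g. a and b
    edges = {(src, dst) for src, targets in links.items() for dst in targets}
    return nodes, edges

nodes, edges = build_graph(pages)
# Number of incoming links per page
in_degree = {n: sum(1 for _, dst in edges if dst == n) for n in nodes}
```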
38. Biology example
Nodes – Proteins
Edges – Interactions
Visualization plays an important role
39. blogosphere as a graph
nodes = blogs
edges = hyperlinks
[Figure labels: technical - gadgets; society - politics]
http://datamining.typepad.com/gallery/blog-map-gallery.html
40. social web as a graph
nodes = twitter users
edges = retweets on #jan25 hashtag
announcement of Mubarak’s resignation
http://gephi.org/2011/the-egyptian-revolution-on-twitter/
41. emerging structures
• graphs on the web present certain structural characteristics
• groups of nodes interacting with each other → dense inter-connections → functional/topical associations
• what can we gain by studying them?
  – topic analysis
  – photo clustering
  – improved recommendation methods
  – detect influencers
42. Community and graphs
Communities correspond to groups of nodes on a graph that share common properties or have a common role in the organization/operation of the system.
S. Fortunato, C. Castellano. Community structure in graphs. arXiv:0712.2716v1, Dec 2007.
43. Community types
• Pairs of nodes are more likely to be connected if they are both members of the same community, and less likely to be connected if they do not share communities.
• explicit
  – the result of conscious human decision
• implicit
  – emerging from the interactions & activities of users
  – need special methods to be discovered
  – Community detection, partition, clustering
44. communities and graphs
• Often communities are defined with respect to a graph, G = (V,E), representing a set of objects (V) and their relations (E).
• Even if such a graph is not explicit in the raw data, it is usually possible to construct one, e.g. feature vectors → distances → thresholding → graph
• Given a graph, a community is defined as a set of nodes that are more densely connected to each other than to the rest of the network nodes.
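The construction "feature vectors → distances → thresholding → graph" can be sketched as follows. The Euclidean distance, the threshold value and the toy points are assumptions chosen for illustration; any distance and threshold suited to the data could be substituted.

```python
import math
from itertools import combinations

def threshold_graph(vectors, max_distance):
    """Connect every pair of items whose Euclidean distance is at most max_distance."""
    edges = set()
    for i, j in combinations(range(len(vectors)), 2):
        if math.dist(vectors[i], vectors[j]) <= max_distance:
            edges.add((i, j))
    return edges

# Toy feature vectors (assumption): two well-separated groups of points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
edges = threshold_graph(points, max_distance=1.5)
```

With this threshold the first three points form one densely connected group and the last two another, with no edges between the groups.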
45. communities and graphs - example
[Figure labels: inter-community edge, intra-community edge]
46. community attributes
overlap, weighted participation, roles, hierarchy, evolution
47. graph cuts
• Given nodes u and v of graph G = (V,E), a cut is a set of edges C ⊂ E, such that the two nodes are unconnected on the graph G΄ = (V,E-C).
• Using s to denote a “source” node and t to denote a “terminal” node, a cut (S,T) of G = (V,E) is a partition of V in sets S and Τ = V-S, such that s ∈ S and t ∈ T.
[Figure: source s in set S, terminal t in set T]
48. Modularity maximization
• A graph can be split into communities in numerous ways, i.e. for each graph there are many possible community structures. In the simple case, a community structure is defined as a graph partition into a set of node sets C = {Ci}
• To provide a measure of the quality of a community structure, we make use of modularity.
• The modularity maximization method detects communities by searching over possible divisions of a network for one or more that have particularly high modularity.
• Modularity quantifies the extent to which a given graph partition into communities presents a systematic tendency to have more intra-community links than the same community structure would present if the links were rewired under the ER (Erdős-Rényi) graph model.
49. graph degrees
node degree: deg(vi) = ki = number of neighbors
In directed graphs, we differentiate between in- and out-degree.
adjacency matrix: Αij = link between nodes i and j
0 → no link
1 → link
α → link with weight equal to α
50. Degrees & Adjacency
[Graph with nodes v1, v2, v3, v4, v5]
Adjacency matrix on an undirected graph: A(i,j), i,j <= n
degree of a vertex v (number of edges incident upon it):
$$k_v = \sum_{w} A(v,w)$$
51. modularity computation
• Modularity is computed as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$$
  – Αij: adjacency matrix
  – ki: degree of node i
  – ci: community of node i
  – δ(ci,cj) = 1 if i, j belong to the same community
  – m: number of edges on the graph
The $A_{ij}$ term gives the observed number of intra-community edges. The $k_i k_j / 2m$ term is the expected number of edges between i and j, if edges are placed randomly.
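A direct, unoptimized evaluation of the modularity formula is sketched below for an undirected, unweighted graph. The example graph (two triangles joined by one edge) and the helper names are assumptions made for illustration.

```python
def adjacency_from_edges(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph with n nodes."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        a[j][i] = 1
    return a

def modularity(adjacency, community):
    """Q = 1/(2m) * sum_ij (A_ij - k_i k_j / 2m) * delta(c_i, c_j)."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    two_m = sum(degree)  # each undirected edge contributes 2 to the degree sum
    q = 0.0
    for i in range(n):
        for j in range(n):
            if community[i] == community[j]:  # delta(c_i, c_j) = 1
                q += adjacency[i][j] - degree[i] * degree[j] / two_m
    return q / two_m

# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = adjacency_from_edges(6, edges)
q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one community
```

Splitting the graph into the two triangles gives Q = 5/14, while placing all nodes in a single community gives Q = 0, since observed and expected edge counts then cancel.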
52. modularity - example
• In a random graph (ER model), we expect that any possible partition would lead to Q = 0.
• Typically, in non-random graphs modularity takes values between 0.3 and 0.7.
[Figure: Q = 0.60, clear community structure; Q = 0.37, fuzzy communities]
53. Modularity maximization approaches
• Exhaustive search over all possible divisions is usually intractable
• Algorithms based on approximate optimization
  – greedy algorithms
  – simulated annealing
  – spectral optimization
  – local-based optimization
• Balance between speed and accuracy
54. other means to define communities
• other community-ness measures:
  – conductance
  – density
• definitions to satisfy
  – each member should be connected to more nodes within the community than to nodes outside it
  – each member should be connected to all other members (k-clique)
• result of a process
  – if I start removing edges with a certain order, the graph will break into pieces → communities
55. graph partition
• Given a graph G=(V,E), find a partition of V in k disjoint subsets, such that the number of edges in Ε of which the endpoints belong to different subsets is minimized.
• Various solutions: Kernighan-Lin algorithm [Kernighan70], spectral bisection [Pothen90].
• Multi-level partition (METIS) [Karypis99]: Repeated application of bisection until the graph is partitioned into k parts under constraints on the sizes of the subsets.
• Not a satisfactory solution, since the number of communities needs to be provided as input to the algorithm. Sometimes even the community sizes need to be provided as inputs.
B. W. Kernighan, S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal, Vol. 49, No. 2, pp. 291-307, February 1970.
A. Pothen, H.D. Simon and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11: 430-452, 1990.
G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J. Sci. Comput. 20 (1): 359–392, 1999.
56. taxonomy
S. Papadopoulos, Y. Kompatsiaris, A. Vakali, P. Spyridonos. “Community detection in Social Media”. In Data Mining and Knowledge Discovery, Springer, 2011
58. subgraph discovery 2
• (μ,ε)-core:
  – based on the concept of structural similarity
[Figure: edges labelled with the percentage of common neighbors; a (μ,ε)-core with μ = 5, ε = 0.72; a (μ,ε)-core with μ = 6, ε = 0.675; a hub; an outlier]
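A small sketch of how (μ,ε)-cores could be found is given below. It assumes the structural similarity used by the SCAN algorithm, where the neighborhood of a node includes the node itself and the similarity of two nodes is the size of their shared neighborhood divided by the geometric mean of the two neighborhood sizes. The toy graph and the parameter values are hypothetical.

```python
import math

def structural_similarity(adj, u, v):
    """Shared closed neighborhood, normalized by the geometric mean of sizes."""
    gu = adj[u] | {u}
    gv = adj[v] | {v}
    return len(gu & gv) / math.sqrt(len(gu) * len(gv))

def cores(adj, mu, eps):
    """Nodes having at least mu neighbors (themselves included) with similarity >= eps."""
    result = set()
    for u in adj:
        eps_neighborhood = {
            v for v in adj[u] | {u}
            if structural_similarity(adj, u, v) >= eps
        }
        if len(eps_neighborhood) >= mu:
            result.add(u)
    return result

# Toy graph (assumption): a 4-clique {0,1,2,3} plus node 4 attached only to node 0
adj = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0}}
core_nodes = cores(adj, mu=3, eps=0.7)
```

Node 4 shares too little of its neighborhood with node 0 to pass the ε threshold, so it is left out of the core set, which matches the intuition of an outlier attached to a dense group.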
59. centrality
• Betweenness centrality
  – Being in many shortest paths
• Closeness
  – Being close to many nodes
• Eigenvector centrality
  – End of many paths
• Degree centrality
  – High degree
https://commons.wikimedia.org/wiki/File:6_centrality_measures.png#/media/File:6_centrality_measures.png
Carlos Castillo, Social Media Mining and Retrieval, http://www.slideshare.net/ChaToX/social-media-mining-and-retrieval
60. divisive - use of edge centrality
• Find edges that stand between communities.
• Progressively remove more “central” edges until the graph breaks into separate communities.
• As the graph splitting progresses, new communities emerge that are assigned to a hierarchical structure.
• Edge centrality is defined similarly to node centrality:
$$bc(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}$$
$\sigma_{s,t}(v)$: number of shortest paths from node s to t that include node v
$\sigma_{s,t}$: total number of shortest paths from s to t
Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.
[Figure: depiction of node centrality, red (min) → blue (max)]
61. Girvan - Newman algorithm
• The GN algorithm is one of the most important algorithms, stimulating a whole wave of community detection methods.
• Basic principle:
  – Compute betweenness centrality for each edge.
  – Remove edge with highest score.
  – Re-compute all scores.
  – Repeat 2nd step.
• Complexity: $O(n^3)$
• Many variations have been presented to improve precision by use of different betweenness measures or reduce complexity, e.g. by sampling or local computations.
Girvan, M., Newman, M.E.J. “Community structure in social and biological networks”. In Proceedings of National Academy of Science, U.S.A. 99(12), 7821–7826, 2002
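The basic principle can be sketched in plain Python as below. This is a minimal illustration, not a reference implementation: edge betweenness is computed with a breadth-first, Brandes-style accumulation, the graph is assumed to be undirected, unweighted and free of self-loops, and the loop stops after the first split rather than building the full hierarchy. All function names and the toy graph are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness score of every undirected edge."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path shares back from the farthest nodes
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Each node pair is visited from both ends, so scores are doubled;
    # the ranking of edges is unaffected.
    return score

def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts

def girvan_newman_split(adj):
    """Remove highest-betweenness edges until one more component appears."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        score = edge_betweenness(adj)  # re-computed after every removal
        if not score:
            break  # no edges left to remove
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Toy graph (assumption): two triangles joined by the bridge 2-3
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
parts = girvan_newman_split(graph)
```

On this toy graph the bridge carries every shortest path between the two triangles, so it is removed first and the two triangles are returned as communities.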
62. Girvan - Newman (example)
Social network in Zachary karate club
Hierarchical community structure detected by the algorithm.
63. Visual Event Summarization on Social Media using Topic Modelling and Graph-based Ranking Algorithms
64. Large-scale real world events (1)
• Long-running events → consist of several sub-events, e.g. the 10 days of the Sundance Film Festival include opening and awards ceremonies, screenings etc.
• A lot of involved persons that use social media → huge amount of event-related micro-blogging messages
• A growing number of these messages carry multimedia content
• The existence of an image in a micro-post can convey a much better impression of the specific moment of the ongoing event
65. Large-scale real world events (2)
#nbafinals → 2.6M tweets in one month
#BaltimoreRiots, 29 April-2 May 2015 → 1.3M tweets in 5 days
E3 conference 2015, 16-18 June: >5M tweets before conference, 2M tweets during conference
new game releases → multimedia content
66. Large-scale real world events (3)
But…
• the huge number of messages makes it very challenging for interested users to monitor the evolution of the event
• many messages can be considered spam or non-informative
• In the case of multimedia: internet memes, screenshots, images of low quality…
• Redundancy due to near-duplicate messages and images
67. Large-scale real world events (4)
[Example #nbafinals images labelled: irrelevant; duplicates with no explicit association; non-informative]
68. Visual Event Summarization
Visual Event Summarization is the problem of selecting a concise set of images that are highly relevant to the event and visually capture the key aspects of the event.
[Diagram: list of all event images (an event-related collection is available) → Event-based Visual Summarizer → set of selected representative and diverse images]
69. Existing Approaches: Text-based
Radev et al. (2004)
• summary consists of messages that are closest to their tf·idf centroid
Erkan et al. (2004), LexRank & Mihalcea et al. (2004), TextRank
• finding salient sentences by using the centrality of each sentence in a similarity graph
• adapted for multi-document summarization using each message as a sentence.
• outperforms naïve centroid-based approach.
Shen et al. (2013)
• mixture model to detect sub-events at participant level
• tf·idf centroid to find a summary of each sub-event
Chakrabarti and Punera (2011)
• Hidden Markov Model to obtain a time-based segmentation of tweets
• tf·idf centroid to find a summary of each time segment
70. Existing Approaches: Multimedia
Bian et al. (2013)
• multimodal extension of LDA
• textual and visual features
Lin et al. (2012)
• multi-graph of objects capturing visual, textual and temporal proximity
• time-ordered sequence of important objects via graph optimization
McParlane et al. (2014) – state-of-the-art baseline
• visual features + SVM to discard irrelevant images
• clustering in subtopics and selection of popular images for each subtopic based on popularity and specificity
71. MGraph: Framework Overview
1. create message multi-graph using textual, visual and temporal proximity
2. find underlying topics using SCAN algorithm
3. calculate prior scores of images based on topics and popularity (relevance)
4. diversify using DivRank
72. Pre-processing / Filtering
Text-based filtering
• heuristic rules for spam filtering → discard very short messages & messages with many mentions, URLs or hashtags.
• filtering of unstructured messages using POS tagging
Accept → (determiner? adjective* noun+ verb)+
Visual-based filtering
• discard small images
• detect and discard memes, screenshots and images containing heavy text
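The heuristic text rules could look roughly like the sketch below. The thresholds (`MIN_WORDS`, `MAX_ENTITIES`) and the function name are hypothetical, since the slides do not give the actual values, and the POS-tagging rule is not covered here.

```python
import re

# Hypothetical thresholds; the slides do not state the values actually used
MIN_WORDS = 4
MAX_ENTITIES = 3

def is_probable_spam(text, min_words=MIN_WORDS, max_entities=MAX_ENTITIES):
    """Flag very short messages and messages with many mentions, URLs or hashtags."""
    tokens = text.split()
    mentions = sum(t.startswith("@") for t in tokens)
    hashtags = sum(t.startswith("#") for t in tokens)
    urls = len(re.findall(r"https?://\S+", text))
    # Plain words: everything that is not a mention, hashtag or URL
    words = [t for t in tokens if not t.startswith(("@", "#", "http"))]
    if len(words) < min_words:
        return True
    return mentions > max_entities or hashtags > max_entities or urls > max_entities

ok = is_probable_spam("Great game tonight at the arena #nbafinals")
short = is_probable_spam("#win #free #iphone #now click http://spam.example")
crowded = is_probable_spam("@a @b @c @d check this amazing offer today")
```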
73. Pre-processing / Filtering
[Examples. Text-based filtering: tweet length, POS tagging filtering. Visual-based filtering]
74. Multi-graph Generation (1)
Given a set of (original) messages M = {m1, m2, ..., mn} we construct a multi-graph GM = {V, Etextual, Evisual, Esocial, Etime}
• vertex vi ∈ V corresponds to message mi
• Etextual → undirected edges expressing the textual similarity (cosine similarity) between nodes (tf·idf vector vm)
• Evisual → undirected edges that represent the visual similarity (L2 distance) between nodes with images (VLAD+SURF vectors)
Thresholding: add an edge in Etextual or Evisual only if the textual or visual similarity between the corresponding nodes is higher than thtextual or thvisual respectively
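The construction of the textual edge set can be sketched as below: messages are turned into tf·idf vectors, compared with cosine similarity, and connected when the similarity exceeds a threshold. The whitespace tokenizer, the idf variant log(n/df), the threshold value and the example messages are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(messages):
    """Sparse tf-idf vectors (dict term -> weight), with idf = log(n / df)."""
    docs = [Counter(m.lower().split()) for m in messages]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in d.items()} for d in docs]

def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def textual_edges(messages, threshold):
    """Undirected edges (i, j) between messages more similar than threshold."""
    vecs = tfidf_vectors(messages)
    return {
        (i, j)
        for i, j in combinations(range(len(vecs)), 2)
        if cosine(vecs[i], vecs[j]) > threshold
    }

# Hypothetical messages and threshold
messages = [
    "lebron scores late winner",
    "lebron scores late winner again",
    "protest march downtown today",
]
edges = textual_edges(messages, threshold=0.5)
```

The visual edge set could be built the same way by swapping the vectors and the similarity function, for instance VLAD vectors with a distance-based similarity.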
75. Multi-graph Generation (2)
76. Example multi-modal sub-graph
77. Visual deduplication
• Visual duplicates for which there is no explicit connection → apply Clique Percolation Method (CPM) on sub-graph Gvisual = {V, Evisual}
• Represent detected cliques as single messages:
  – VLAD aggregation on SURF descriptors of all images in the clique
  – mean value of publication time
  – aggregated value of reposts of each message.
  – merged tf·idf vector
• Replace clustered messages in GM with cliques and re-calculate the corresponding edges
78. Visual deduplication
[Figure: GM and Gvisual]
79. Topic Detection
• Apply Structural Clustering Algorithm for Networks (SCAN) → identify dense sub-graphs of messages in GM
• Sub-graphs represent the topics that exist in the stream of messages
• Each topic i contains messages {Mi} and is represented as a merged tf·idf vector Vi
• A substantial amount of messages is kept outside of the detected clusters
  – Hubs & Outliers most probably are non-informative
  – May include valuable information → also considered in summarization process as single-item clusters
80. Message Selection
[Score formula; annotated terms: reposts, relevance x cluster size x specificity]
81. Specificity
High specificity: rare across all topics of the event
Low specificity: common across topics
82. Image Ranking & Diversification
variant of PageRank aiming at diversity
2015,
Garda
Lake,
Italy
Processing
Large
Complex
Data
Dataset
and
Event
Descrip2on
• dataset
of
McMinn
et
al.
having
more
than
500
events
from
different
domains
• we
used
the
50
largest
events
in
terms
of
tweets
• sports
events
(e.g.,
the
Sochi
winter
Olympics),
poli8cal
events
(Ukraine
crisis,
Venezuelan
protests),
disasters,
etc.
• 364,005
tweets,
on
average
4,730
tweets/event
• 296,160
remaining
tweets,
due
to
suspended
accounts
and
deleted
messages
• about
3,51%
of
these,
i.e.
12,772
tweets,
contain
an
embedded
image
84. Relevance Judgments
Each image is shown to 3 participants (20 img-20 part) without ranking information
Task Description: You are presented with an image and an event title describing a trending topic in Twitter. For each image and event title, you are asked to answer the following question:
Is this image relevant to the event?
1. The image is clearly not relevant to the event.
2. The image is probably not relevant to the event, but I am not entirely sure.
3. The image is somewhat relevant to the event, but I have my doubts on whether I would like to see it in a photo coverage of the event.
4. The image is clearly relevant to the event, and I would like to see it in a photo coverage of the event.
85. Experimental Setting
• VLAD+SURF extraction
  – 64-dimensional SURF descriptors
  – four codebooks of 128 visual words (in total 512) to quantize each descriptor
  – aggregate SURF descriptors into a single vector of 64*512 = 32,768 dimensions using VLAD scheme
  – PCA to create a 1024-dimensional L2-normalized reduced vector that represents the visual content of the image
• Multi-graph generation
  – k = 500 nearest neighbors
  – visual and textual similarity thresholds were set to 0.5 and 0.6
  – σ² of the temporal kernel was empirically set to 24 hours
• SCAN parameters were set to μ=2 and ε=0.65
• DivRank’s damping factor was set to d=0.75
86. Evaluation metrics (1)
Precision-oriented metrics
• Precision (P@N): The percentage of images among the top N that are relevant (answers 3 & 4) to the corresponding event, averaged among all events. We calculate precision for N equal to 1, 5, and 10.
• Success (S@N): Percentage of events where there exists at least one relevant image among the top N returned, for N=10.
• Mean Reciprocal Rank (MRR): Computed as 1/r, where r is the rank of the first relevant image returned, averaged over all events.
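The three precision-oriented metrics can be computed as in the sketch below. Each event is assumed to be represented by a ranked list of booleans, where True means the image at that rank was judged relevant (answers 3 or 4); the example data and function names are hypothetical.

```python
def precision_at(relevance, n):
    """Fraction of the top n ranked images that are relevant (divides by n)."""
    return sum(relevance[:n]) / n

def success_at(relevance, n):
    """1.0 if at least one of the top n images is relevant, else 0.0."""
    return 1.0 if any(relevance[:n]) else 0.0

def reciprocal_rank(relevance):
    """1/r for the rank r of the first relevant image, 0.0 if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def mean(values):
    return sum(values) / len(values)

# Hypothetical judgments: one ranked list per event
events = [
    [True, False, True, True, False],
    [False, True, False, False, False],
]
p_at_1 = mean([precision_at(e, 1) for e in events])
p_at_5 = mean([precision_at(e, 5) for e in events])
s_at_5 = mean([success_at(e, 5) for e in events])
mrr = mean([reciprocal_rank(e) for e in events])
```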
87. Evaluation metrics (2): Diversity-oriented metrics
• α-normalized Discounted Cumulative Gain: α-nDCG@N measures the usefulness, or gain, of the returned images based on their position in the summary (N=10).
• Average Visual Similarity: AVS@N measures the average visual similarity among all pairs of images in the top N selected images, averaged over all events. Lower AVS values are preferable since they imply higher diversity in terms of visual content.
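AVS@N for a single event can be sketched as the mean pairwise similarity of the top-N image vectors. The slide does not say which similarity measure was used; cosine similarity is an assumption here, as are the function names and the toy vectors.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def avs_at_n(vectors, n):
    """Average similarity over all pairs among the top-n vectors (needs n >= 2)."""
    pairs = list(combinations(vectors[:n], 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical summaries: three near-duplicates vs. three visually different images.
near_duplicates = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
diverse = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

avs_dup = avs_at_n(near_duplicates, 3)
avs_div = avs_at_n(diverse, 3)
print(avs_dup, avs_div)
```

The near-duplicate summary scores higher than the diverse one, which matches the slide's reading that lower AVS means more visual diversity.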
88. Baselines
• Random: randomly selects N images from the filtered set of images as the summary set
• MostPopular: picks the N most popular images in terms of reposts
• LexRank: uses the items graph GM, ranks the nodes using LexRank and selects the top N nodes that contain images
• TopicBased: selects the N most relevant messages from the most significant topics (S_cov) (relevance, no specificity & diversity)
• P-TWR: ranks images in descending order using the weighting scheme described in McParlane et al. (popularity)
• S-TWR: groups the tweets of each event into sub-clusters and selects the highest ranked item of each cluster using the previous weighting scheme (specificity)
89. Results (1) – Precision-oriented metrics
• MGraph outperforms all of the competing methods
• Popularity-based approach performs well for P@1 but drops significantly for N=5,10
• LexRank and TopicBased approaches achieve lower but more steady results
[Figure annotation: first relevant in positions 1 - 2]
90. Results: Canada Team in #Sochi
[Figure panels: Popularity-based, S-TWR, MGraph]
91. Results (2) – Diversity-oriented metrics
• MGraph achieves the best score for α-nDCG@10
• Best values of AVS achieved by S-TWR
• The worst results in terms of AVS are obtained using LexRank
92. Results (3): Performance of MGraph across different categories
• Best P@10 measure is obtained for events about Science & Technology
• The second best P@10 is obtained for events about Arts & Entertainment
• Difficult to diversify
• The best value of AVS is achieved for events about disasters & accidents, e.g., earthquakes
93. Results (4): Impact of the damping factor d on P@10, S@5, MRR and α-nDCG@10
• The worst results for all metrics are obtained for d=0 (no re-ranking)
• The best results are achieved for 0.7<d<0.8
• slight decrease for d>0.8
• more diverse → less relevant
94. Conclusions
• Graph-based approach for visual summaries for real-world events
• Maximizes relevance and diversity
• Multimodal approach taking into account
– Textual content
– Visual content
– Social
– Interactions (replies)
– Popularity
– Time
• Introduction of user-related features (e.g. influence)
95. Monitoring and intelligence system for Web multimedia verification
96. Can multimedia on the Web be trusted?
Real photo captured April 2011 by WSJ but heavily tweeted during Hurricane Sandy (29 Oct 2012)
Tweeted by multiple sources & retweeted multiple times
Original online at: http://blogs.wsj.com/metropolis/2011/04/28/weather-journal-clouds-gathered-but-no-tornado-damage/
97. The Problem
• Everyone can easily publish content on the Web
• Content can be easily repurposed and manipulated
• News outlets are competing for views and clicks
→ Pressure for airing stories very quickly leaves very little room for verification.
→ Very often, even well-reputed news providers fall for fake news content.
• Multiple tools and services available for individual tasks → complex verification process
Very hard and time consuming to check the veracity of Web multimedia
98. Media REVEALr
• Developed within the REVEAL project: http://revealproject.eu/
• Framework for collecting, indexing and browsing multimedia content from the Web and social media
• Support for verification:
– Near-duplicate detection against an indexed collection
– Clustering of social media posts by visual similarity → comparative view of the same incident
– Aggregation and visualization of Named Entities around an incident
99. Related Work
• Majority of works have focused on the problem of topic detection and summarization:
– TwitInfo (Marcus et al., 2011)
– Twittermonitor (Mathioudakis & Koudas, 2010)
– Meme detection & prediction (Weng et al., 2014)
• Visual memes and clustering
– Visual meme tracking (Xie et al., 2011)
– Supervised multimodal clustering (Petkos et al., 2012)
• Image manipulation tracking
– Internet image archaeology (Kennedy & Chang, 2008)
100. Overview of Media REVEALr
[Architecture diagram layers: Media collection; Media pre-processing & feature extraction; Media analysis, mining & indexing; Persistence (storage, indexing); Access (API); Visualization, front-end. Tracks: TEXT, VISUAL]
101. Named Entity Detection
• Brevity and noisy nature of text in social media pose a serious challenge
• Employed solution:
– Pre-processing: tokenization, user mention resolution, text cleaning
– Stanford NER + user mention resolution
– Regular expressions to remove special characters and symbols (e.g., #, @, URLs, etc.)
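The pre-processing steps named above (URL and symbol removal, user mention resolution) can be sketched with a few regular expressions. The patterns, the function name `clean_tweet` and the handle-to-name lookup table are assumptions for illustration; the slides do not show the expressions that were actually used.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
# Anything that is not a word character, whitespace or basic punctuation.
SYMBOL_RE = re.compile(r"[^\w\s.,'!?-]")

def clean_tweet(text, display_names):
    """Strip URLs and symbols, and replace @handles with display names when known."""
    text = URL_RE.sub(" ", text)
    # Mention resolution: look the handle up, fall back to the bare handle.
    text = MENTION_RE.sub(
        lambda m: display_names.get(m.group(1).lower(), m.group(1)), text
    )
    # Remove '#', ':' and other special characters.
    text = SYMBOL_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())

# Hypothetical lookup table mapping handles to display names.
names = {"barackobama": "Barack Obama"}
cleaned = clean_tweet("RT @BarackObama: Thank you #Ohio! http://t.co/abc123", names)
print(cleaned)
```

Resolving the handle before running NER gives the tagger a proper name ("Barack Obama") instead of an opaque token, which is the motivation for the mention-resolution step.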
102. Visual Indexing
• Content-based image retrieval to solve the Near-Duplicate Search (NDS) problem
• Based on local descriptors (SURF), aggregation (VLAD), dimensionality reduction (PCA), quantization (PQ) and indexing (IVFADC)
• State-of-the-art visual similarity search
– High precision/recall
– Very efficient and scalable implementation (search many millions of images in a few msec, maintain full index in memory using ~1GB/10M images)
103. Improving NDS Resilience (NDS+)
• Often, NDS performance suffers from overlay graphics and fonts
• To address this issue, we integrate a descriptor-level classifier that tries to remove the font/graphic descriptors from the VLAD vector
104. Example: Filtering Out Font Descriptors
• Assuming that in most cases the classifier is correct, the resulting VLAD vector is of much higher quality compared to the one without filtering
105. Classifier Details
• Random Forest used as base classifier
• Cost Sensitive meta-classifier to penalize misclassification of True Positives
• Challenge due to Class Imbalance (overlay descriptors << useful image content descriptors)
– Cost Sensitive meta-classifier performs over-sampling of minority class to balance the training set
• Training set created by collecting images with overlays (e.g., memes) from the Web and manually annotating them (selecting areas with fonts/overlays)
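The over-sampling idea mentioned above can be illustrated independently of any particular classifier. The sketch below shows plain random over-sampling of the minority class; it is not the cost-sensitive meta-classifier from the slide, and the function name, labels and toy data are assumptions.

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        # Draw with replacement until this class reaches the majority size.
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical imbalanced training set: 8 content descriptors, 2 overlay descriptors.
labels = ["content"] * 8 + ["overlay"] * 2
samples = [[float(i)] for i in range(10)]
balanced_samples, balanced_labels = oversample_minority(samples, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

Over-sampling should be applied to the training split only; duplicating samples before splitting would leak copies of the same descriptor into the test set.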
106. Mining: Clustering and Aggregation
• Visual aggregation
– DBSCAN on the visual feature representation (PCA-reduced VLAD vectors)
– Element (tweet) selected based on the largest number of keywords (expected to result in more information)
• Entity aggregation
– NER on individual items
– Entity categorization (→ Persons, Locations, Organizations)
– Entity ranking based on frequency of occurrence
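Entity aggregation as described above amounts to counting entity mentions per category and sorting by frequency. A minimal sketch, assuming the NER step has already produced (text, type) pairs per post; the function name and the toy posts are hypothetical.

```python
from collections import Counter, defaultdict

def rank_entities(items):
    """Rank entities per category by how often they occur across all items.

    items: one list of (entity_text, entity_type) tuples per post.
    Returns {entity_type: [(entity_text, count), ...]} sorted by descending count.
    """
    counts = defaultdict(Counter)
    for entities in items:
        for text, entity_type in entities:
            counts[entity_type][text] += 1
    return {etype: counter.most_common() for etype, counter in counts.items()}

# Hypothetical NER output for three posts.
posts = [
    [("Boston", "LOCATION"), ("FBI", "ORGANIZATION")],
    [("Boston", "LOCATION"), ("Watertown", "LOCATION")],
    [("FBI", "ORGANIZATION"), ("Boston", "LOCATION")],
]
ranking = rank_entities(posts)
print(ranking["LOCATION"])
```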
107. User Interface: Collections View
108. User Interface: Items View & Search
109. User Interface: Clusters View
110. User Interface: Entities View
111. Evaluation: NER
• Manual annotation of 400 tweets from the SNOW Data Challenge dataset (Papadopoulos et al., 2014)
• Measure: Accuracy → an instance is considered correct when both entity and type are correctly identified
• Three competing solutions:
– Base Stanford NER (S-NER)
– S-NER + Extensions/Post-processing (S-NER+)
– Ellogon library (http://www.ellogon.org)
112. Evaluation: NDS
• Benchmark Datasets
– Holidays: 1,491 images, 500 queries (Jegou et al., 2008)
– Oxford: 5,063 images, 55 queries (Philbin et al., 2008)
– Paris: 6,412 images, 55 queries (Philbin et al., 2008)
• Accuracy: mean Average Precision (mAP)
[Result panels: CLEAN DATASET, NOISY DATASET]
113. Evaluation: NDS
• Execution Time (msec)
• Example
[Figure labels: INDEXED IMAGE, QUERY IMAGE; NDS: #27, NDS+: #1]
114. Use Cases: Real-world Datasets
[Datasets shown: sandy, boston, malaysia, ferry]
115. NDS Use Case (boston)
116. Clustering Use Case (boston)
• Visual clustering enables comparative view and analysis over time (in this case showing increasing confidence in the picture).
• When journalists see many similar photos of the same scene, they have more confidence that it is real and not fabricated.
117. Entity Aggregation Use Case (snow)
[Figure panels: LOCATIONS, PERSONS, ORGANIZATIONS]
118. Conclusion
• Key contributions
– Framework and web application offering valuable verification support for Web multimedia
– High-quality individual components for NER, NDS, clustering and aggregation
• Future Work
– Incremental image clustering
– Temporal views to explore evolution of a story
– Multimedia forensics toolbox (splice, copy-move detection)
119. Computational Verification in Social Media
• Create a computational verification framework to classify tweets with unreliable media content.
• Events used for experimentation
– Fake images posted during the Hurricane Sandy natural disaster
– Fake images posted during the Boston Marathon bombings
120. Methodology
Tweet Extraction
• Use Topsy machine to collect tweets with certain keywords
Image Indexing
• Create a predefined set of verified fake and real images
• Keep the tweets with identical or near-duplicate images
Feature Extraction
• Extract Content and User features for each tweet collected and their combination
Dataset
• Annotate each tweet as fake or real based on the image
• Keep only tweets written in English, Spanish or German
Classification
• Test using cross-validation approach
• Test using the two distinct datasets
• Test using different training and testing dataset
Content features
• Length of the tweet
• Number of words
• Contains exclamation mark and their number
• Contains quotation mark and their number
• If the text contains emoticon (happy or sad)
• Number of uppercase characters
• Number of hashtags
• Number of mentions
• Number of pronouns
• Number of urls
• Number of sentiment words
• Number of retweets
User features
• Username
• Number of friends
• Number of followers
• Number of followers/number of friends ratio
• Number of times the user was listed
• If the status of the user contains url
• If the user is verified or not
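Most of the content features listed above are simple counts over the tweet text. The sketch below computes a subset of them; the feature names, the emoticon patterns and the example tweet are assumptions, and the pronoun and sentiment-word counts are left out because they need word lists that the slides do not specify.

```python
import re

def content_features(text, retweets=0):
    """Compute a subset of the content features for one tweet."""
    return {
        "length": len(text),
        "num_words": len(text.split()),
        "num_exclamation": text.count("!"),
        "num_quotation": text.count('"'),
        # Very small emoticon patterns, for illustration only.
        "has_happy_emoticon": bool(re.search(r"[:;]-?[)D]", text)),
        "has_sad_emoticon": bool(re.search(r":-?\(", text)),
        "num_uppercase": sum(ch.isupper() for ch in text),
        "num_hashtags": len(re.findall(r"#\w+", text)),
        "num_mentions": len(re.findall(r"@\w+", text)),
        "num_urls": len(re.findall(r"https?://\S+", text)),
        # The retweet count comes from tweet metadata, not from the text.
        "num_retweets": retweets,
    }

# Hypothetical tweet.
features = content_features(
    "WOW! Shark in the street #Sandy @cnn http://t.co/x1 :)", retweets=12
)
print(features)
```

The resulting dictionary can be turned into a fixed-order numeric vector and passed to any of the classifiers compared in the results.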
121. Results
• Tweet Statistics

                           Hurricane Sandy   Boston Marathon
Tweets with URLs                    343939            112449
Tweets with fake images              10758               281
Tweets with real images               3540               460

• Approaches

Detection accuracy using cross-validation approach, classified correctly (%). Panel labels in slide order: Hurricane Sandy, Boston Marathon.

Classifier       Content features   User features   Total features
J48 tree                    81.41           67.72            80.68
KStar                       81.28           71.16            81.38
Random Forest               80.59           70.15            80.94

Classifier       Content features   User features   Total features
J48 tree                    76.45           70.81            81.25
KStar                       81.28           74.12            75.78
Random Forest               78.59           76.15            79.10
122. Results (2)

Detection accuracy using different training and testing set in Hurricane Sandy, classified correctly (%)

Classifier       Content features   User features   Total features
J48 tree                    73.79           51.06            65.06
KStar                       75.30           62.29            53.31
Random Forest               74.02           63.10            65.96

Detection accuracy using Hurricane Sandy for training and Boston Marathon for testing, classified correctly (%)

Classifier       Content features   User features   Total features
J48 tree                    55.05           50.12            54.10
KStar                       50.01           50.10            50.97
Random Forest               58.75           51.03            58.78
123. Other approaches
• Graph-based multimodal clustering for social event detection in large collections of images
– automatic organization of a multimedia collection into groups of items, each (group) of which corresponds to a distinct event.
• Unsupervised concept learning detection using social media as training data
• Text analysis for entities matching and sentiment analysis
• Placing images based on content-features
• Retrieving diverse images for same entity
124. Demos - Applications
MM News Demo
ClustTour
ThessFest
125. Multimedia Demo
126. Multimedia Demo Architecture
[Architecture diagram labels, in slide order: StreamManager (Twitter, Facebook, Flickr, YouTube, RSS, Instagram); 160.xx.xx.207; MongoDBWrapper; 160.xx.xx.207; TextIndexer (Solr); 160.xx.xx.207; 160.xx.xx.207; MediaFetcher, FeatureExtractor (HDFS); 160.xx.xx.58; 160.xx.xx.107; Social Focused Crawler (HDFS); 160.xx.xx.187; Nutch; Nutch; VLAD; FeatureIndexer (HDFS); 160.xx.xx.207; IVFADC; Data Mining; 160.xx.xx.191; Visual Clust.; Geo Clust.; Statistics; Web server; 160.xx.xx.116; API (3); API (4); API (1); API (2)]
127. MongoDB
Document-oriented database → support of JSON
Current stable version: 3.0.6
https://www.mongodb.org/
Flexible Data Model → schemaless, useful for social media data that change over time
Horizontal scaling via shards and replica sets
Storage of social media items as JSON objects → millions of documents can be handled
Number of different index types → single field, compound, multikey indexes.
Example: Store Facebook posts and index them by publication time and number of likes
Query: get most recent posts sorted by popularity (#likes)
Native support of map-reduce jobs → get most shared images in a collection of tweets
128. Apache Solr
Full-text search platform built on top of Apache Lucene
Current version: 5.3.0
http://lucene.apache.org/solr/
Indexing of social media items, e.g. tweets, FB posts, metadata of YouTube videos etc.
Additional features
• Faceted Search and Filtering → get top N per field, e.g. users
• Spatial index & Search → very useful in geo-tagged documents, e.g. tweets.
• Plugin-based architecture → language detection, NLP etc. as steps of indexing pipeline
Get tweets containing the name “Barack Obama” OR the phrase “us elections” having geo-location around New York
SolrCloud → Cluster of Solr instances
Automatic load balancing and fail-over for queries
ZooKeeper integration for cluster coordination and configuration
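The example query on this slide (two phrases combined with OR, restricted to a radius around New York) could be expressed as a Solr select request using the `geofilt` spatial filter. The sketch below only builds the request URL and does not contact a server. The core name `tweets`, the field names `text` and `location`, the host, the coordinates and the 50 km radius are all assumptions for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_solr_query(base_url, lat, lon, radius_km):
    """Build a Solr /select URL: phrase query plus a geo-distance filter."""
    params = {
        # Two quoted phrases combined with OR on a hypothetical 'text' field.
        "q": 'text:("Barack Obama" OR "us elections")',
        # geofilt keeps documents within d kilometres of point pt on field sfield.
        "fq": "{!geofilt}",
        "sfield": "location",
        "pt": f"{lat},{lon}",
        "d": radius_km,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical local Solr core; coordinates roughly at New York City.
url = build_solr_query("http://localhost:8983/solr/tweets", 40.7128, -74.0060, 50)
decoded = parse_qs(urlparse(url).query)
print(url)
```

Putting the spatial restriction in `fq` rather than in `q` keeps it out of relevance scoring and lets Solr cache the filter separately.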
129. Storm
Distributed real-time computation system
https://storm.apache.org
Topologies → processing logic
Stream: unbounded sequence of tuples, e.g. tweets or URLs
Spouts: source of streams
Bolts: processing, filtering, etc.
Processing of URLs shared in social media → Storm pipeline
• Expand short URLs
• Fetch new URLs
• Extract content, e.g. articles and images
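The spout/bolt data flow of the URL pipeline can be mimicked with chained Python generators. This is only an analogy for the processing logic: real Storm topologies are defined against Storm's own API and run distributed, and the lookup tables below stand in for network calls. All names and URLs are hypothetical.

```python
import re

# Stand-ins for a URL expansion service and a web fetcher.
SHORT_URLS = {"http://sho.rt/a1": "http://example.org/news/story.html"}
PAGES = {
    "http://example.org/news/story.html":
        "<html><title>Story</title><img src='photo.jpg'></html>",
}

def url_spout(tweets):
    """Spout: emit every URL found in the incoming tweets."""
    for tweet in tweets:
        for token in tweet.split():
            if token.startswith("http"):
                yield token

def expand_bolt(urls):
    """Bolt: expand short URLs, pass other URLs through unchanged."""
    for url in urls:
        yield SHORT_URLS.get(url, url)

def new_url_bolt(urls, seen):
    """Bolt: let only URLs through that have not been processed before."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url

def extract_bolt(urls):
    """Bolt: fetch the page (here a dictionary lookup) and extract image links."""
    for url in urls:
        html = PAGES.get(url, "")
        yield {"url": url, "images": re.findall(r"<img src='([^']+)'", html)}

tweets = [
    "Look http://sho.rt/a1",
    "Again http://sho.rt/a1 wow",
    "Direct http://example.org/news/story.html",
]
seen = set()
results = list(extract_bolt(new_url_bolt(expand_bolt(url_spout(tweets)), seen)))
print(results)
```

All three tweets resolve to the same article, so it is fetched and parsed once.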
130. Redis
Key-Value cache and store
Current stable version: 3.0
http://redis.io/
Partitioning → distribution of data among multiple Redis instances
Keys can contain strings, hashes, lists, sets, sorted sets, etc.
Atomic operations: set, increment, push etc.
Store crawling status of URLs, sharing information of URLs and images
Additional Feature
• Implementation of Publisher/Subscriber pattern
• Communication of different components in a system for social media analytics
131. City profile creation (ClustTour)
Example photo metadata: tags: sagrada familia, cathedral, barcelona; taken: 12 May 2009; lat: 41.4036, lon: 2.1743
[Pipeline diagram labels: PHOTOS & METADATA; SPATIAL CLUSTERING + TEMPORAL ANALYSIS; COMMUNITY DETECTION (VISUAL, TAG, HYBRID); CLASSIFICATION TO LANDMARKS/EVENTS. Classification plot axes: #users / #photos, duration; example points: [2 years, 50 users / 120 photos], [1 day, 2 users / 10 photos]]
S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, A. Vakali. “Cluster-based Landmark and Event Detection on Tagged Photo Collections”. In IEEE Multimedia Magazine 18(1), pp. 52-63, 2011
132. City profile creation (ClustTour)
Community detection on image similarity graphs
Nodes: photos
Edges: visual and tag similarity
134. ThessFest
• Thessaloniki International Film Festival
• Support twitter/comment usage within the app
• Ratings and comments per film
• Feedback aggregation
– Votes
– Tweets
• Real-time feedback to the organisation and visitors
135. Fête de la Musique Berlin app
• FETEberlin in App Store and Google Play
• More than 100K visitors
• About 5K musicians
• More than 5K app downloads, 25K sessions
App features
• Browse and filter detailed program
• Interactive maps and routing
• Social Sharing
• Artists’ and Stages Details
• Social Monitoring
Main benefits for attendants
• Visitors can browse through maps and don’t get lost as stages are numerous
• Event schedule is available always and per stage
– Very useful when the server was down and there was no access to the online schedule
136. Topic analysis
• Top-10 topics
• Manual inspection of clusters:
– 53.8% of topic titles considered informative
– 98.5% of clusters were found to be “clean”
• Topics in time
137. Other Application Areas
• Science
– Sociology, machine learning (machine as a teacher), computer vision (annotation)
• Tourism – Leisure – Culture
– Off-the-beaten-path POI extraction
• Marketing
– Brand monitoring, personalised ads
• Prediction
– Politics: election results
• News
– Topics, trends, event detection
• Others
– Environment, emergency response, energy saving, etc.
139. Benchmarking - Datasets
140. dataset: SNOW 2014 Data Challenge
• A set of ~1M tweets collected using a list of 5000 UK-focused “news hounds” and the keywords “Syria”, “terror”, “Ukraine”, and “bitcoin” for a period of 24 hours starting from Feb 25, 18:00.
• Average rate: ~720 tweets/minute
• Number of unique twitter accounts: ~556K
• Number of retweets: ~648K
• Number of replies: ~135K
• Ground truth topics: http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755
141. Overview of Challenge
• Goal: Detection of newsworthy topics in a large and noisy set of tweets
• Topic: a news story represented by a headline + tags + representative tweets + representative images (optional)
• Newsworthy: A topic that ends up being covered by at least some major online news sources
• Topics are detected per timeslot (small equally-sized time intervals)
• We want a maximum number of topics per timeslot
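Detecting topics per timeslot first requires bucketing the tweets into equally-sized intervals. The sketch below assumes 15-minute slots, which is what 96 timeslots over the 24-hour collection period would imply; the slot width is not stated explicitly on the slides. The function name and the example tweets are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def assign_timeslots(tweets, start, slot_minutes=15):
    """Group (timestamp, text) pairs by the index of their timeslot."""
    width = timedelta(minutes=slot_minutes)
    slots = defaultdict(list)
    for timestamp, text in tweets:
        # Floor division of two timedeltas yields the integer slot index.
        index = (timestamp - start) // width
        slots[index].append(text)
    return dict(slots)

# Collection start as given for the dataset: Feb 25, 18:00 (year 2014 assumed).
start = datetime(2014, 2, 25, 18, 0)
tweets = [
    (datetime(2014, 2, 25, 18, 5), "first"),
    (datetime(2014, 2, 25, 18, 14, 59), "second"),
    (datetime(2014, 2, 25, 18, 15), "third"),
    (datetime(2014, 2, 26, 17, 59), "last"),
]
slots = assign_timeslots(tweets, start)
print(sorted(slots))
```

With these settings the slot indices run from 0 to 95, one per 15-minute interval of the 24-hour window.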
143. Some statistics
• Registered participants: 25
– India: 4, Belgium: 3, Germany: 3, UK: 3, Greece: 3, Ireland: 2, USA: 2, France: 2, Italy: 1, Spain: 1, Russia: 1
• Participants that signed the Challenge agreement: 19
• Participants that submitted results: 11
• Participants that submitted papers: 9
144. Evaluation Protocol
• Defined several evaluation criteria:
– Newsworthiness → Precision/Recall, F-score
– Readability → scale [1-5]
– Coherence → scale [1-5]
– Diversity → scale [1-5]
• List of reference topics
• Set up precise evaluation guidelines
• Blind evaluation (i.e. evaluator not aware of which method a topic comes from) based on Web UI
• Participants submitted topics for 96 timeslots, but manual evaluation happened for 5 sample timeslots.
• Result validation and analysis
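The newsworthiness criterion reduces to precision, recall and F-score once each detected topic has been matched (or not) to a reference topic. In the challenge that matching was done by human evaluators; the sketch below assumes it has already happened and that matched topics share an identifier with their reference topic. The identifiers are made up.

```python
def precision_recall_f1(detected, reference):
    """Precision, recall and F1 of detected topic ids against reference topic ids."""
    detected, reference = set(detected), set(reference)
    matched = detected & reference
    precision = len(matched) / len(detected) if detected else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical timeslot: 4 detected topics, 5 reference topics, 3 in common.
detected = ["t1", "t2", "t3", "t9"]
reference = ["t1", "t2", "t3", "t4", "t5"]
precision, recall, f1 = precision_recall_f1(detected, reference)
print(precision, recall, round(f1, 4))
```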
145. social event detection
146. a bit of background...
• mediaeval
– well-known benchmarking activity since 2010 (started as VideoCLEF in 2008)
– consists of several tasks dedicated to specific challenges
• social event detection (SED)
– first run in 2011 (7 participants)
147. task definition & dataset
• 2011
collection: 73,645 flickr photos from five cities, May 2009
find events related to two target categories
> soccer matches in Barcelona and Rome
> concerts in venues Paradiso and Parc del Forum
• 2012
collection: 167,332 flickr photos from five cities, 2009-2011
find events related to three target categories
> technical events (e.g. exhibitions, fairs) in Germany
> soccer events in Hamburg and Madrid
> Indignados movement in Madrid
• 2013
collection 1: 437,370 flickr photos + 1,327 YouTube videos
collection 2: 57,165 Instagram photos
cluster collection 1 into events (attach YouTube videos to them)
categorize collection 2 images into eight event types or non-event
[Side labels: variant 1, variant 4, variant 4]
148. sed2012: evaluation setup
• ground truth: photos clustered around 149 events (18 technical, 79 soccer, 52 Indignados)
• assess the following aspects:
– accuracy of same-event classification
– compare clustering quality between item-to-cluster and the two versions of item-to-item (batch & incremental)
– measure contributions of different features
– study generalization abilities of same event model
149. evaluation: main caveat
• creation strategy of benchmark dataset can dramatically affect how hard (or easy) the problem is
– if events are very sparsely distributed over time, then a simple time-based clustering could be sufficient
– if events correspond to users one-to-one, then a simple user-based look-up could yield very high accuracy
– using the same source for training/testing makes it easy
• need to explore new challenging settings
– multiple sources of multimedia
– huge amounts of non-event content
– very dense coverage of feature space by test events
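The first caveat can be made concrete: when events are far apart in time, splitting the sorted timestamps at large gaps already recovers them. The sketch below is such a naive time-based baseline, not a method from the lecture; the function name, the gap threshold and the timestamps (in hours) are assumptions.

```python
def cluster_by_time_gap(timestamps, max_gap):
    """Start a new cluster whenever the gap to the previous item exceeds max_gap."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters

# Hypothetical capture times (hours) of photos from three well-separated events.
capture_times = [2.5, 1, 50, 2, 200, 51]
clusters = cluster_by_time_gap(capture_times, max_gap=6)
print(clusters)
```

On a densely populated timeline the same baseline merges overlapping events, which is why the slide calls for benchmark settings with dense coverage and large amounts of non-event content.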
150. Conclusions