Processing Large Complex Data
Social Data and Multimedia Analytics for News and Events Applications
Dr. Yiannis Kompatsiaris, ikom@iti.gr
Multimedia, Knowledge and Social Media Analytics Lab, Head
CERTH-ITI
2015 IEEE SPS Italy Chapter Summer School on Signal Processing (S3P)
  
S3P 2015, Garda Lake, Italy    Processing Large Complex Data
  
Overview
•  Introduction
–  Motivation – Challenges
•  Example Use Cases
•  Research Approaches
–  Large-Scale visual search
–  Graphs - Community Detection - Clustering
–  Social Event Detection
–  Verification
•  Demos – Applications
–  MM News Demo
–  ClustTour
–  Thessfest
•  Evaluation - Benchmarking
•  Conclusions
  
Introduction
Motivation
Example Applications
Conceptual Architecture
Challenges
  
Pope Francis
Pope Benedict
2007: iPhone release
2008: Android release
2010: iPad release
  
http://petapixel.com/2013/03/14/a-starry-sea-of-cameras-at-the-unveiling-of-pope-francis/
http://www.puzzlemarketer.com/digital-social-brands-in-60-seconds/ (Apr, 2012)
  
rise of the networks
  
Social Networks as Graphs
social web as a graph
nodes = Twitter users
edges = retweets on #jan25 hashtag
announcement of Mubarak’s resignation
http://gephi.org/2011/the-egyptian-revolution-on-twitter/
Social Networks as Graphs
“Social networks have emergent properties. Emergent properties are new attributes of a whole that arise from the interaction and interconnection of the parts”
•  Emotions, health and sexual relationships do not depend just on our connections (e.g. the number of them) but on our position and structure in the social graph
–  Central – Hub
–  Outlier
–  Transitivity (connections between friends)
  
Social Networks as Real-Life Sensors
•  Social networks are a data source with an extremely dynamic nature that reflects events and the evolution of community focus (users’ interests)
•  The huge penetration of smartphones and mobile devices provides real-time and location-based user feedback
•  Transform individually rare but collectively frequent media into meaningful topics, events, points of interest, emotional states and social connections
•  Present them in an efficient way for a variety of applications (news, marketing, science, health, entertainment)
  
Social Media aspects
Figure labels: Caption, Time, User, Profile, Favs, Comms, Tags
  
Examples - Science
Xin Jin, Andrew Gallagher, Liangliang Cao, Jiebo Luo, and Jiawei Han. The wisdom of social multimedia: using Flickr for prediction and forecast. International Conference on Multimedia (MM '10). ACM.
“…if you're more than 100 km away from the epicenter [of an earthquake] you can read about the quake on Twitter before it hits you…”
Many Twitter examples at: What can Twitter tell us about the real world? Twitter and the Real World, CIKM'13 Tutorial, https://sites.google.com/site/twitterandtherealworld/home
  
Examples - Science
  
Examples - Science
Be careful of correlation diagrams
  
Example – News (Boston bombing)
“Following the Boston Marathon bombings, one quarter of Americans reportedly looked to Facebook, Twitter and other social networking sites for information, according to The Pew Research Center. When the Boston Police Department posted its final “CAPTURED!!!” tweet of the manhunt, more than 140,000 people retweeted it.”
“Authorities have recognized that one of the first places people go in events like this is to social media, to see what the crowd is saying about what to do next”
“I have been following my friend's Facebook [account] who is near the scene and she is updating everyone before it even gets to the news”
  
Example – Crisis – Humanitarian (Syria)
Syria Tracker offers a crisis mapping system that uses crowdsourced text, photo and video reports and data mining techniques, forming a live map of the Syrian conflict since March 2011
…stream of content-filtered media from news, social media (Twitter and Facebook) and official sources
  
Events - Festivals
  
http://www.eventmanagerblog.com/uploads/2012/12/event-technology-infographic.jpg
Many other examples: smellymaps
Smell related words in geo-located social media
http://researchswinger.org/smellymaps/
  
Conceptual Architecture
Stages: CRAWLING, INDEXING, MINING, RANKING, VISUALIZATION
Component labels in the diagram: API Wrapper, Website Wrapper, Scheduler, Crawling Specs, Sources, Media Fetcher, Visual Indexing, Near-duplicates, Text Indexing, SNA, Sentiment - Influence, Trends - Topics, Model Building, Concepts, Relevance, Diversity, Popularity, Veracity, Interaction, Responsiveness, Aggregation, Aesthetics
  
Challenges – Content (Mining)
•  Multi-modality: e.g. image + tags, video, audio
•  Rich social context: spatio-temporal, social connections, relations and social graph
•  Specific messages: short, conversations, errors, no context
•  Inconsistent quality: noise, spam, fake, propaganda
•  Huge volume: massively produced and disseminated
•  Multi-source: may be generated by different applications and user communities
•  Dynamic: fast updates, real-time
  
Policy – Licensing – Legal challenges
•  Fragmented access to data
–  Separate wrappers/APIs for each source (Twitter, Facebook, etc.)
–  Different data collection/crawling policies
•  Limitations imposed by API providers (“Walled Gardens”)
–  Full access to data impossible or extremely expensive (e.g. see data licensing plans for GNIP and DataSift)
–  Non-transparent data access practices (e.g. access is provided to an organization/person if they have a contact in Twitter)
•  Constant change of model and ToS of social APIs
–  No backwards compatibility, additional development costs
•  Ephemeral nature of content
–  Social search results often lead to removed content → inconsistent and unreliable referencing
•  User Privacy & Purpose of use
–  Fuzzy regulatory framework regarding mining user-contributed data
Example Use Cases
Events and News
  
SocialSensor Project Objective
SocialSensor quickly surfaces trusted and relevant material from social media – with context.
Diagram labels: DySCO; behaviour; location; time; content; usage; social context; Massive social media and unstructured web; Social media mining; Aggregation & indexing; News - Infotainment; Personalised access; Ad-hoc P2P networks
  
“It has changed the way we do news” (MSN)
“Social media is the key place for emerging stories –
internationally, nationally, locally” (BBC)
“Social media is transforming the way we do journalism”
(New York Times)
Source: picture alliance / dpa
Source: Getty Images
  
“It’s really hard to find the nuggets of useful stuff
in an ocean of content” (BBC)
“Things that aren’t relevant crowd out the content
you are looking for” (MSN)
“The filters aren’t configurable
enough” (CNN)
Verification was simpler in the past...
Source: Frank Grätz
  
News Use Case Requirements
Quickly surface trusted and relevant material from social media – with context.
•  “quickly”: in real time
•  “surfaces”: automatically discovers, clusters and searches
•  “trusted”: automatic support in verification process
•  “relevant”: to the specific event
•  “material”: any material (text, image, audio, video = multimedia), aggregated with other sources (e.g. web)
•  “social media”: across all relevant social media platforms
•  “with context”: location, time, sentiment, influence
  
Infotainment
•  Events with large numbers of visitors
•  Thessaloniki International Film Festival
–  80,000 viewers / 100,000 visitors in 10 days
–  150 films, 350 screenings
•  Discovery and presentation of relevant aggregated social media
–  Trending Topics
–  Sentiment
–  Tweet – film matching
–  Visualization (Social Walls)
  
Conceptual Architecture and Main components
Diagram layers: SEMANTIC MIDDLEWARE; Public Data; SEARCH & RECOMMENDATION; USER MODELLING & PRESENTATION; INDEXING; MINING; STORAGE; DATA COLLECTION / CRAWLING
•  Real time dynamic topic and event clustering
•  Trend, popularity and sentiment analysis
•  Calculate trust/influence scores around people
•  Personalized search, access & presentation based on social network interactions
•  Semantic enrichment and discovery of services
  
Research Approaches
Large-Scale Visual Search
Graphs – Clustering/Community Detection
Visual Event Summarization
Social Media Verification
  
Scalable visual feature aggregation & indexing
•  Problem: Example-based image search
–  Find images that represent the same or a similar object or scene as a given query image
–  Viewed from different viewpoints, with occlusions and clutter
•  Challenge: Large-scale
–  Searching databases with tens of millions of images
–  Objectives to be fulfilled:
   •  Sufficient discriminative power
   •  Fast response times
   •  Efficient memory usage
  
Large-scale visual search
Pipeline: image collection from social media/Web → image local feature extraction → feature aggregation → feature indexing → kNN visual similarity search
Applications: concept-based image annotation, image clustering, image (geo)tagging, concept-based search/filtering, duplicate detection
  
Framework
•  Implementation and evaluation of the effectiveness of VLAD in combination with SURF
•  Scalable image indexing
E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, G. Tsoumakas, I. Vlahavas, "A Comprehensive Study over VLAD and Product Quantization in Large-scale Image Retrieval", IEEE Transactions on Multimedia 16(6), pp. 1713-1728, October 2014.
Pipeline: image → local descriptor extraction (SIFT / SURF) → set of local descriptors → descriptor aggregation (BOW / VLAD) → fixed size vector → dimensionality reduction (PCA) → low dimensional vector → encoding & indexing (PQ + ADC/IVFADC)
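The descriptor aggregation step of the pipeline can be illustrated with a minimal VLAD sketch. It is not the implementation evaluated in the cited paper: the function name `vlad`, the toy 2-D descriptors and the two centroids are assumptions made for illustration, and only plain L2 normalization is applied (real systems typically use 64- or 128-dimensional SURF/SIFT descriptors and a learned codebook).

```python
import math

def vlad(descriptors, centroids):
    """Aggregate a set of local descriptors into one fixed-size vector.

    Each descriptor is assigned to its nearest centroid (visual word) and the
    residuals (descriptor minus centroid) are summed per centroid.
    """
    k, d = len(centroids), len(centroids[0])
    acc = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Nearest centroid by squared Euclidean distance.
        nearest = min(
            range(k),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])),
        )
        for j in range(d):
            acc[nearest][j] += x[j] - centroids[nearest][j]
    # Concatenate the k residual sums into a k*d vector and L2-normalize it.
    flat = [v for row in acc for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Hypothetical toy data: 2 centroids, 3 two-dimensional descriptors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]]
vec = vlad(descriptors, centroids)
```

The output length depends only on the codebook size and descriptor dimension (here 2 x 2 = 4), not on the number of local descriptors, which is what makes the subsequent PCA and PQ steps applicable.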
  
Scalable indexing of features
•  ADC 16x8 requires 16 bytes per image
–  ~67M images per GB
•  IVFADC requires 4 additional bytes per image
–  ~53.6M images per GB
•  In the current implementation we achieve only half of the above numbers due to using short int[] instead of byte[], but this can be improved.
•  Ideally, 1 billion images could be indexed on a server with 20GB of RAM (projection).
•  Query time (for 1M vectors):
–  Exhaustive search of VLAD vectors (d’=128): 0.50 sec
–  Product Quantization with ADC 16x8: 0.10 sec (x5 faster)
–  Product Quantization with IVFADC 16x8: 0.02 sec (x25 faster)
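The memory figures above follow from simple arithmetic, sketched here. The sketch assumes that "16x8" means 16 sub-quantizers with 8-bit codes each (the usual product quantization notation) and that "GB" means 2^30 bytes; both are assumptions, as are all variable names.

```python
GB = 2 ** 30  # assumed meaning of "GB" on the slide

def images_per_gb(bytes_per_image):
    """Number of whole image codes that fit in one GB."""
    return GB // bytes_per_image

adc_bytes = 16 * 8 // 8        # 16 sub-quantizers x 8 bits = 16 bytes per image
ivfadc_bytes = adc_bytes + 4   # IVFADC stores 4 additional bytes per image

adc_per_gb = images_per_gb(adc_bytes)
ivfadc_per_gb = images_per_gb(ivfadc_bytes)

# Raw code storage for one billion images with IVFADC, in bytes.
billion_images_bytes = 10 ** 9 * ivfadc_bytes

print(adc_per_gb, ivfadc_per_gb, billion_images_bytes)
```

Under these assumptions the results are close to the ~67M and ~53.6M images per GB quoted above, and one billion images need roughly 20 GB for the codes alone, matching the projection.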
  
VLAD+SIFT vs. VLAD+SURF
Accuracy vs. dimensionality
•  VLAD+SURF improves on VLAD+SIFT and FV+SIFT across all dimensions in both the Holidays and Oxford datasets
Results in rows starting with * are taken from Jégou et al., 2011, hence the missing values for some entries.
SIFT corresponds to PCA-reduced SIFT, which yielded better results than standard SIFT in Jégou et al., 2011
  
Clustering – Community Detection
  
graph
G = (V, E)
nodes
edges
An abstract data type representing relationships or connections
  
Some Examples
Webpage www.x.com: href=“www.y.com”, href=“www.z.com”
Webpage www.y.com: href=“www.x.com”, href=“www.a.com”, href=“www.b.com”
Webpage www.z.com: href=“www.a.com”
Graph nodes: x, y, z, a, b
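The web page example can be written down as a small directed graph. The sketch below uses the hyperlinks listed on the slide; the dictionary layout and variable names are illustrative choices, not part of the original material.

```python
# Outgoing hyperlinks per page, as listed in the example.
links = {
    "x": ["y", "z"],
    "y": ["x", "a", "b"],
    "z": ["a"],
}

# V: every page that links or is linked to.
nodes = sorted(set(links) | {t for targets in links.values() for t in targets})

# E: one directed edge per hyperlink (source page, target page).
edges = [(s, t) for s, targets in links.items() for t in targets]

# In-degree: number of pages linking to each page.
in_degree = {n: 0 for n in nodes}
for _, target in edges:
    in_degree[target] += 1

print(nodes)
print(edges)
print(in_degree)
```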
  
Biology example
Nodes – Proteins
Edges – Interactions
Visualization plays an important role
  
blogosphere as a graph
nodes = blogs
edges = hyperlinks
technical - gadgets
society - politics
http://datamining.typepad.com/gallery/blog-map-gallery.html
  
social web as a graph
nodes = Twitter users
edges = retweets on #jan25 hashtag
announcement of Mubarak’s resignation
http://gephi.org/2011/the-egyptian-revolution-on-twitter/
  
emerging structures
•  graphs on the web present certain structural characteristics
•  groups of nodes interacting with each other → dense inter-connections → functional/topical associations
•  what can we gain by studying them?
–  topic analysis
–  photo clustering
–  improved recommendation methods
–  detect influencers
  
Community and graphs
Communities correspond to groups of nodes on a graph that share common properties or have a common role in the organization/operation of the system.
S. Fortunato, C. Castellano. Community structure in graphs. arXiv:0712.2716v1, Dec 2007.
  
Community types
•  Pairs of nodes are more likely to be connected if they are both members of the same community, and less likely to be connected if they do not share communities.
•  explicit
–  the result of conscious human decision
•  implicit
–  emerging from the interactions & activities of users
–  need special methods to be discovered
–  Community detection, partition, clustering
  
communities and graphs
•  Often communities are defined with respect to a graph, G = (V,E), representing a set of objects (V) and their relations (E).
•  Even if such a graph is not explicit in the raw data, it is usually possible to construct one, e.g. feature vectors → distances → thresholding → graph
•  Given a graph, a community is defined as a set of nodes that are more densely connected to each other than to the rest of the network nodes.
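The construction "feature vectors → distances → thresholding → graph" can be sketched as follows. Euclidean distance, the threshold value and the toy 2-D points are assumptions for illustration; any distance or similarity measure could take their place.

```python
import math
from itertools import combinations

def threshold_graph(vectors, max_dist):
    """Connect every pair of items whose Euclidean distance is at most max_dist."""
    edges = set()
    for i, j in combinations(range(len(vectors)), 2):
        if math.dist(vectors[i], vectors[j]) <= max_dist:
            edges.add((i, j))
    return edges

# Hypothetical feature vectors: two visually separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
edges = threshold_graph(points, max_dist=1.5)
print(sorted(edges))
```

With this threshold the first three points form a triangle and the last two a single edge, so the resulting graph already shows two densely connected groups with no links between them.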
  
communities and graphs - example
inter-community edge
intra-community edge
  
community attributes
overlap, weighted participation, roles, hierarchy, evolution
  
graph cuts
•  Given nodes u and v of graph G = (V,E), a cut is a set of edges C ⊂ E such that the two nodes are unconnected on the graph G′ = (V, E−C).
•  Using s to denote a “source” node and t to denote a “terminal” node, a cut (S,T) of G = (V,E) is a partition of V into sets S and T = V−S, such that s ∈ S and t ∈ T.
Figure labels: s, t, S, T
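Both definitions of a cut can be checked on a toy graph. In this sketch the graph, the node names and the helper functions `cut_edges` and `connected` are hypothetical; the code collects the edges crossing a partition (S, T) and verifies that removing them disconnects s from t.

```python
def cut_edges(edges, S):
    """Edges with exactly one endpoint in S, i.e. those crossing the cut (S, V - S)."""
    S = set(S)
    return [(u, v) for u, v in edges if (u in S) != (v in S)]

def connected(nodes, edges, s, t):
    """True if t is reachable from s in the undirected graph (depth-first search)."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return t in seen

nodes = ["s", "a", "b", "t"]
edges = [("s", "a"), ("a", "b"), ("s", "b"), ("b", "t")]

C = cut_edges(edges, S={"s", "a", "b"})          # T = {"t"}
remaining = [e for e in edges if e not in C]     # E - C
print(C, connected(nodes, remaining, "s", "t"))
```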
Modularity maximization
•  A graph can be split into communities in numerous ways, i.e. for each graph there are many possible community structures. In the simple case, a community structure is defined as a graph partition into a set of node sets C = {C_i}
•  To provide a measure of the quality of a community structure, we make use of modularity.
•  The modularity maximization method detects communities by searching over possible divisions of a network for one or more that have particularly high modularity.
•  Modularity quantifies the extent to which a given graph partition into communities presents a systematic tendency to have more intra-community links than the same community structure would present if the links were rewired under the ER (Erdős–Rényi) graph model.
  
graph degrees
node degree: deg(v_i) = k_i = number of neighbors
In directed graphs, we differentiate between in- and out-degree.
adjacency matrix: A_ij = link between nodes i and j
   0 → no link
   1 → link
   α → link with weight equal to α
  
Degrees & Adjacency
Figure: example graph with nodes v1, v2, v3, v4, v5
Adjacency matrix on an undirected graph: A(i,j), i,j <= n
degree of a vertex v (number of edges incident upon it): $k_v = \sum_{w} A(v,w)$
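The degree formula is a row sum of the adjacency matrix. The 5-node matrix below is a hypothetical example (the edges of the slide's figure are not recoverable from the transcript); it only illustrates how k_v and the edge count m are obtained.

```python
# Hypothetical undirected graph on v1..v5 (indices 0..4); A is symmetric.
A = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

# k_v = sum over w of A(v, w): the row sum of the adjacency matrix.
degrees = [sum(row) for row in A]

# Each undirected edge is counted twice in the degree sum, so m = sum(k) / 2.
m = sum(degrees) // 2

print(degrees, m)
```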
modularity computation
•  Modularity is computed as follows:
$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$
–  A_ij: adjacency matrix
–  k_i: degree of node i
–  c_i: community of node i
–  δ(c_i,c_j) = 1 if i, j belong to the same community
–  m: number of edges on the graph
The term A_ij counts the observed number of intra-community edges.
The term k_i k_j / 2m is the expected number of edges between i and j, if edges are placed randomly.
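The formula can be evaluated directly with a double loop over node pairs. The sketch below is a naive O(n^2) illustration on a hypothetical graph (two triangles joined by one edge); the function name `modularity` and the example are assumptions, not taken from the slides.

```python
def modularity(A, communities):
    """Q = 1/(2m) * sum_ij (A_ij - k_i*k_j/(2m)) * delta(c_i, c_j)."""
    n = len(A)
    k = [sum(row) for row in A]   # node degrees
    two_m = sum(k)                # 2m: every edge is counted twice
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:   # delta(c_i, c_j) = 1
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m

# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edge_list = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for u, v in edge_list:
    A[u][v] = A[v][u] = 1

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one community
print(q_split, q_single)
```

Putting all nodes in a single community gives Q = 0, while the two-triangle split gives a clearly positive value (5/14 for this graph).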
modularity - example
•  In a random graph (ER model), we expect that any possible partition would lead to Q = 0.
•  Typically, in non-random graphs modularity takes values between 0.3 and 0.7.
Q = 0.60: clear community structure
Q = 0.37: fuzzy communities
Modularity maximization approaches
•  Exhaustive search over all possible divisions is usually intractable
•  Algorithms based on approximate optimization
–  greedy algorithms
–  simulated annealing
–  spectral optimization
–  local-based optimization
•  Balance between speed and accuracy
  
other means to define communities
•  other community-ness measures:
–  conductance
–  density
•  definitions to satisfy
–  each member should be connected to more nodes within the community than to nodes outside it
–  each member should be connected to all other members (k-clique)
•  result of a process
–  if edges are removed in a certain order, the graph will break into pieces → communities
  
graph partition
•  Given a graph G=(V,E), find a partition of V in k disjoint subsets, such that the number of edges in E of which the endpoints belong to different subsets is minimized.
•  Various solutions: Kernighan-Lin algorithm [Kernighan70], spectral bisection [Pothen90].
•  Multi-level partition (METIS) [Karypis99]: repeated application of bisection until the graph is partitioned into k parts under constraints on the sizes of the subsets.
•  Not a satisfactory solution, since the number of communities needs to be provided as input to the algorithm. Sometimes even the community sizes need to be provided as inputs.
[Kernighan70] B. W. Kernighan, S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal, Vol. 49, No. 2, pp. 291-307, February 1970.
[Pothen90] A. Pothen, H.D. Simon and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11: 430-452, 1990.
[Karypis99] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20 (1): 359–392, 1999.
  
taxonomy
S. Papadopoulos, Y. Kompatsiaris, A. Vakali, P. Spyridonos. “Community detection in Social Media”. In Data Mining and Knowledge Discovery, Springer, 2011
  
subgraph discovery (structure) 1
•  k-clique (figure: k=3 (triangle), k=4, k=5)
•  N-clique (figure: N=2 (star))
•  k-core (figure: 0-core, 1-core, 2-core, 3-core, 4-core)
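The nested k-core structure shown in the figure can be computed by repeatedly peeling off a node of minimum remaining degree. This is a simple quadratic-time sketch of that standard peeling idea; the function name `core_numbers` and the example graph are assumptions for illustration.

```python
def core_numbers(adj):
    """Core number of each node: the largest k such that the node is in the k-core."""
    degree = {u: len(vs) for u, vs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        u = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[u])
        core[u] = k
        remaining.remove(u)
        for w in adj[u]:
            if w in remaining:
                degree[w] -= 1
    return core

# Hypothetical graph: a 4-clique {a,b,c,d}, a pendant node e, an isolated node f.
adj = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a"},
    "f": set(),
}
core = core_numbers(adj)
print(core)
```

The k-core is then the set of nodes whose core number is at least k, so the cores are nested as in the figure.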
  
subgraph discovery 2
•  (μ,ε)-core:
–  based on the concept of structural similarity
Figure labels: (μ,ε)-core with μ = 5, ε = 0.72; (μ,ε)-core with μ = 6, ε = 0.675; hub; outlier; percentage of common neighbors for each edge
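The slide does not spell out the definition of structural similarity. The sketch below assumes the SCAN-style definition, where the similarity of two adjacent nodes is the number of shared members of their closed neighborhoods (each node plus its neighbors), normalized by the geometric mean of the neighborhood sizes, and where a node is a (μ,ε)-core if at least μ members of its closed neighborhood have similarity ≥ ε to it. The function names and the toy graph are hypothetical.

```python
import math

def closed_neighborhood(adj, u):
    """The node itself plus its neighbors."""
    return adj[u] | {u}

def structural_similarity(adj, u, v):
    """Shared closed-neighborhood members, normalized by the geometric mean of sizes."""
    gu, gv = closed_neighborhood(adj, u), closed_neighborhood(adj, v)
    return len(gu & gv) / math.sqrt(len(gu) * len(gv))

def is_core(adj, u, mu, eps):
    """True if at least mu members of u's closed neighborhood are eps-similar to u."""
    similar = [
        v for v in closed_neighborhood(adj, u)
        if structural_similarity(adj, u, v) >= eps
    ]
    return len(similar) >= mu

# Hypothetical graph: triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

sim_ab = structural_similarity(adj, "a", "b")
sim_cd = structural_similarity(adj, "c", "d")
print(sim_ab, sim_cd, is_core(adj, "a", 3, 0.8), is_core(adj, "d", 3, 0.8))
```

Nodes that end up in no cluster built around such cores are the candidates for the hub and outlier roles named in the figure.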
  
centrality
•  Betweenness centrality
–  Being in many shortest paths
•  Closeness
–  Being close to many nodes
•  Eigenvector centrality
–  End of many paths
•  Degree centrality
–  High degree
https://commons.wikimedia.org/wiki/File:6_centrality_measures.png#/media/File:6_centrality_measures.png
Carlos Castillo, Social Media Mining and Retrieval, http://www.slideshare.net/ChaToX/social-media-mining-and-retrieval
  

divisive - use of edge centrality
•  Find edges that stand between communities.
•  Progressively remove more “central” edges until the graph breaks into separate communities.
•  As the graph splitting progresses, new communities emerge that are assigned to a hierarchical structure.
•  Edge centrality is defined similarly to node centrality:
$$bc(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}$$
–  $\sigma_{s,t}(v)$: number of shortest paths from node s to t that include node v
–  $\sigma_{s,t}$: total number of shortest paths from s to t
Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.
[Figure: depiction of node centrality, red (min) → blue (max)]
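As a rough illustration of node betweenness, here is a minimal sketch for an unweighted, undirected graph. It uses Brandes-style accumulation over breadth-first searches, which is a choice made here rather than something the slides prescribe, and it counts each unordered pair (s, t) once. The function name and the input format (a dict of neighbor sets) are assumptions.

```python
from collections import deque


def betweenness(adj):
    """Node betweenness for an unweighted, undirected graph given as {node: set(neighbors)}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        order = []                    # nodes in order of discovery
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both endpoints, so halve the totals.
    return {v: b / 2 for v, b in bc.items()}


# Path graph a - b - c - d - e: the middle node bridges the most pairs.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
print(betweenness(path))
```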

Girvan - Newman algorithm
•  The GN algorithm is one of the most important algorithms, stimulating a whole wave of community detection methods.
•  Basic principle:
–  Compute betweenness centrality for each edge.
–  Remove edge with highest score.
–  Re-compute all scores.
–  Repeat 2nd step.
•  Complexity: O(n³) on sparse graphs
•  Many variations have been presented to improve precision by use of different betweenness measures or reduce complexity, e.g. by sampling or local computations.
Girvan, M., Newman, M.E.J. “Community structure in social and biological networks”. In Proceedings of the National Academy of Sciences, U.S.A. 99(12), 7821–7826, 2002
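The four steps of the basic principle can be sketched as follows. This is a simplified illustration, not the authors' implementation: it performs one level of the hierarchy (it stops as soon as the graph splits into more components), assumes an unweighted, undirected graph without self-loops, and all function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path betweenness per edge (keys are frozenset({u, v}))."""
    eb = {}
    for s in adj:
        sigma, dist, preds, order = {s: 1}, {s: 0}, {}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds.get(w, []):
                credit = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                eb[edge] = eb.get(edge, 0.0) + credit
                delta[v] += credit
    return eb


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(adj):
    """Remove the highest-betweenness edge, recompute, repeat until the graph splits."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        eb = edge_betweenness(adj)
        if not eb:
            break  # no edges left to remove
        u, v = max(eb, key=eb.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles joined by the single bridge 3 - 4.
two_triangles = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(girvan_newman_split(two_triangles))
```

Applying the split recursively to each resulting component would yield the hierarchical structure mentioned above.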
  

Girvan - Newman (example)
Social network in Zachary karate club
Hierarchical community structure detected by the algorithm.
  

Visual Event Summarization on Social Media using Topic Modelling and Graph-based Ranking Algorithms
  

Large-scale real world events (1)
•  Long-running events → consist of several sub-events, e.g. the 10 days of the Sundance Film Festival include opening and awards ceremonies, screenings etc.
•  Many of the people involved use social media → huge amount of event-related micro-blogging messages
•  A growing number of these messages carry multimedia content
•  An image in a micro-post can convey a much better impression of the specific moment of the ongoing event
  

Large-scale real world events (2)
#nbafinals → 2.6M tweets in one month
#BaltimoreRiots, 29 April-2 May 2015 → 1.3M tweets in 5 days
E3 conference 2015, 16-18 June
>5M tweets before conference
2M tweets during conference
new game releases → multimedia content

Large-scale real world events (3)
But…
•  the huge number of messages makes it very challenging for interested users to monitor the evolution of the event
•  many messages can be considered spam or non-informative
•  In the case of multimedia: internet memes, screenshots, images of low quality…
•  Redundancy due to near-duplicate messages and images
  

Large-scale real world events (4)
[Figure: example #nbafinals images labelled Irrelevant, Duplicates with no explicit association, and Non-informative]

Visual Event Summarization
Event related collection is available
Visual Event Summarization is the problem of selecting a concise set of images that are highly relevant to the event and visually capture the key aspects of the event.
List of all event images → Event-based Visual Summarizer → Set of Selected Representative and Diverse Images
  

Existing Approaches: Text-based
Radev et al. (2004)
•  summary consists of messages that are closest to their tf·idf centroid
Erkan et al. (2004), LexRank & Mihalcea et al. (2004), TextRank
•  finding salient sentences by using the centrality of each sentence in a similarity graph
•  adapted for multi-document summarization using each message as a sentence.
•  outperforms naïve centroid-based approach.
Shen et al. (2013)
•  mixture model to detect sub-events at participant level
•  tf·idf centroid to find a summary of each sub-event
Chakrabarti and Punera (2011)
•  Hidden Markov Model to obtain a time-based segmentation of tweets
•  tf·idf centroid to find a summary of each time segment
  

Existing Approaches: Multimedia
Bian et al. (2013)
•  multimodal extension of LDA
•  textual and visual features
Lin et al. (2012)
•  multi-graph of objects capturing visual, textual and temporal proximity
•  time-ordered sequence of important objects via graph optimization
McParlane et al. (2014) – state-of-the-art baseline
•  visual features + SVM to discard irrelevant images
•  clustering in subtopics and selection of popular images for each subtopic based on popularity and specificity
  

MGraph: Framework Overview
1.  create message multi-graph using textual, visual and temporal proximity
2.  find underlying topics using SCAN algorithm
3.  calculate prior scores of images based on topics and popularity (relevance)
4.  diversify using DivRank
  

Pre-processing / Filtering
Text-based filtering
•  heuristic rules for spam filtering → discard very short messages & messages with many mentions, URLs or hashtags.
•  filtering of unstructured messages using POS tagging
   Accept → (determiner? adjective* noun+ verb)+
Visual-based filtering
•  discard small images
•  detect and discard memes, screenshots and images containing heavy text
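The text-based rules could look roughly like the sketch below. The thresholds (`min_tokens`, `max_entities`), the tag names and the decision to accept a message when the POS pattern occurs anywhere in it are all assumptions made for illustration; the slides do not give these details. POS tags are taken as input, since any tagger could supply them.

```python
import re

# Map coarse POS tags to one-letter codes so the slide's pattern becomes a regex.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "VERB": "V"}
# (determiner? adjective* noun+ verb)+ over the coded tag string
CLAUSE = re.compile(r"(D?A*N+V)+")


def passes_text_filter(tokens, tags, min_tokens=4, max_entities=3):
    """Return True if a message survives the heuristic and POS-based filters."""
    if len(tokens) < min_tokens:
        return False  # very short message
    entities = sum(t.startswith(("@", "#", "http://", "https://")) for t in tokens)
    if entities > max_entities:
        return False  # too many mentions, hashtags or URLs
    coded = "".join(TAG_CODES.get(tag, "O") for tag in tags)
    return CLAUSE.search(coded) is not None  # needs at least one noun-verb clause


print(passes_text_filter(
    "The home team wins tonight".split(),
    ["DET", "NOUN", "NOUN", "VERB", "ADV"],
))
```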
  

Pre-processing / Filtering
[Figure: examples of text-based filtering (tweet length, POS tagging filtering) and visual-based filtering]

Multi-graph Generation (1)
Given a set of (original) messages M = {m1, m2, ..., mn} we construct a multi-graph GM = {V, Etextual, Evisual, Esocial, Etime}
•  vertex vi ∈ V corresponds to message mi
•  Etextual → undirected edges expressing the textual similarity (cosine similarity) between nodes (tf·idf vector vm)
•  Evisual → undirected edges that represent the visual similarity (L2 distance) between nodes with images (VLAD+SURF vectors)
Thresholding: add an edge in Etextual or Evisual only if the textual or visual similarity between the corresponding nodes is higher than thtextual or thvisual respectively
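A minimal sketch of the thresholding step for Etextual, assuming each message is already represented as a sparse tf·idf vector (a dict from term to weight). The function names, the toy weights and the threshold value in the example are illustrative.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textual_edges(vectors, threshold):
    """Undirected edges {(mi, mj): similarity} for pairs whose similarity exceeds the threshold."""
    edges = {}
    for i, j in combinations(sorted(vectors), 2):
        sim = cosine(vectors[i], vectors[j])
        if sim > threshold:
            edges[(i, j)] = sim
    return edges


# Toy messages with made-up tf-idf weights.
messages = {
    "m1": {"goal": 1.0, "final": 1.0},
    "m2": {"goal": 1.0, "final": 1.0, "team": 1.0},
    "m3": {"weather": 1.0},
}
print(textual_edges(messages, threshold=0.6))
```

Comparing all pairs is quadratic in the number of messages; restricting the comparison to each node's nearest neighbors is one way to keep the graph construction tractable on large collections.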
  
	
  	
  

Multi-graph Generation (2)

Example multi-modal sub-graph
  

Visual deduplication
•  Visual duplicates for which there is no explicit connection → apply Clique Percolation Method (CPM) on sub-graph Gvisual = {V, Evisual}
•  Represent detected cliques as single messages:
–  VLAD aggregation on SURF descriptors of all images in the clique
–  mean value of publication time
–  aggregated value of reposts of each message.
–  merged tf·idf vector
•  Replace clustered messages in GM with cliques and re-calculate the corresponding edges
  

Visual deduplication
[Figure: GM and Gvisual]

Topic Detection
•  Apply Structural Clustering Algorithm for Networks (SCAN) → identify dense sub-graphs of messages in GM
•  Sub-graphs represent the topics that exist in the stream of messages
•  Each topic i contains messages {Mi} and is represented as a merged tf·idf vector Vi
•  A substantial number of messages is kept outside of the detected clusters
–  Hubs & Outliers are most probably non-informative
–  May include valuable information → also considered in summarization process as single-item clusters
  

Message Selection Score
[Equation not recoverable from the extracted text; annotated terms: reposts, relevance × cluster size × specificity]

Specificity
High specificity: rare across all topics of the event
Low specificity: common across topics
  

Image Ranking & Diversification
variant of PageRank aiming at diversity
  
	
  
	
  

Dataset and Event Description
•  dataset of McMinn et al. containing more than 500 events from different domains
•  we used the 50 largest events in terms of tweets
•  sports events (e.g., the Sochi Winter Olympics), political events (Ukraine crisis, Venezuelan protests), disasters, etc.
•  364,005 tweets, on average 4,730 tweets/event
•  296,160 remaining tweets, due to suspended accounts and deleted messages
•  about 3.51% of these, i.e. 12,772 tweets, contain an embedded image
  

Relevance Judgments
Each image is shown to 3 participants (20 img-20 part) without ranking information
Task Description: You are presented with an image and an event title describing a trending topic in Twitter. For each image and event title, you are asked to answer the following question:
Is this image relevant to the event?
1.  The image is clearly not relevant to the event.
2.  The image is probably not relevant to the event, but I am not entirely sure.
3.  The image is somewhat relevant to the event, but I have my doubts on whether I would like to see it in a photo coverage of the event.
4.  The image is clearly relevant to the event, and I would like to see it in a photo coverage of the event.
  

Experimental Setting
•  VLAD+SURF extraction
–  64-dimensional SURF descriptors
–  four codebooks of 128 visual words (in total 512) to quantize each descriptor
–  aggregate SURF descriptors into a single vector of 64*512 = 32,768 dimensions using VLAD scheme
–  PCA to create a 1024-dimensional L2-normalized reduced vector that represents the visual content of the image
•  Multi-graph generation
–  k = 500 nearest neighbors
–  visual and textual similarity thresholds were set to 0.5 and 0.6
–  σ² of the temporal kernel was empirically set to 24 hours
•  SCAN parameters were set to μ=2 and ε=0.65
•  DivRank’s damping factor was set to d=0.75
  

Evaluation metrics (1)
Precision-oriented metrics
•  Precision (P@N): The percentage of images among the top N that are relevant (answers 3&4) to the corresponding event, averaged among all events. We calculate precision for N equal to 1, 5, and 10.
•  Success (S@N): Percentage of events where there exists at least one relevant image among the top N returned, for N=10.
•  Mean Reciprocal Rank (MRR): Computed as 1/r, where r is the rank of the first relevant image returned, averaged over all events.
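The three precision-oriented metrics can be sketched as below, assuming each event's summary is given as a ranked list of booleans (True where the image was judged relevant, i.e. answers 3 or 4). How the three participants' answers are combined into one boolean is not stated on the slide and is left outside this sketch. The function names are illustrative.

```python
def precision_at(relevant, n):
    """Fraction of the top-n images that are relevant."""
    return sum(relevant[:n]) / n


def success_at(relevant, n):
    """1.0 if at least one of the top-n images is relevant, else 0.0."""
    return 1.0 if any(relevant[:n]) else 0.0


def reciprocal_rank(relevant):
    """1/r for the rank r of the first relevant image, or 0.0 if there is none."""
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean_over_events(metric, events, *args):
    """Average a per-event metric over all events."""
    return sum(metric(e, *args) for e in events) / len(events)


# Two toy events with five ranked images each.
events = [
    [True, False, True, False, False],
    [False, True, False, False, False],
]
print(mean_over_events(precision_at, events, 5))
print(mean_over_events(success_at, events, 1))
print(mean_over_events(reciprocal_rank, events))
```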
  

Evaluation metrics (2)
Diversity-oriented metrics
•  α-normalized Discounted Cumulative Gain: α-nDCG@N measures the usefulness, or gain, of the returned images based on their position in the summary (N=10).
•  Average Visual Similarity: AVS@N measures the average visual similarity among all pairs of images in the top N selected images, averaged over all events. Lower AVS values are preferable since they imply higher diversity in terms of visual content.
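AVS@N for a single event can be sketched as the mean pairwise similarity of the top N image vectors. Cosine similarity over dense feature vectors is used here as an assumption; the slide does not state which similarity function the authors applied. The per-event values would then be averaged over all events.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def avs_at(vectors, n):
    """Average similarity over all pairs among the top-n image vectors of one event."""
    pairs = list(combinations(vectors[:n], 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy summary: the first two images are identical, the third is orthogonal to them.
summary = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(avs_at(summary, 2))  # redundant pair, high AVS
print(avs_at(summary, 3))  # more diverse set, lower AVS
```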
  

Baselines
•  Random: randomly selects N images from the filtered set of images as the summary set
•  MostPopular: picks the N most popular images in terms of reposts
•  LexRank: uses items graph GM, ranks the nodes using LexRank and selects the top N nodes that contain images
•  TopicBased: selects the N most relevant messages from the most significant topics (S_cov) (relevance, no specificity & diversity)
•  P-TWR: ranks images in descending order using the weighting scheme described in McParlane et al. (popularity)
•  S-TWR: groups the tweets of each event into sub-clusters and selects the highest ranked item of each cluster using the previous weighting scheme (specificity)
  

Results (1) – Precision oriented metrics
•  MGraph outperforms all of the competing methods
•  Popularity-based approach performs well for P@1 but drops significantly for N=5,10
•  LexRank and TopicBased approaches achieve lower but more steady results
[Figure annotation: first relevant in positions 1 - 2]

Results: Canada Team in #Sochi
[Figure: image summaries labelled Popularity-based, S-TWR and MGraph]

Results (2) – Diversity oriented metrics
•  MGraph achieves the best score for α-nDCG@10
•  Best values of AVS achieved by S-TWR
•  The worst results in terms of AVS are obtained using LexRank
  	
  

Results (3)
Performance of MGraph across different categories
•  Best P@10 measure is obtained for events about Science & Technology
•  The second best P@10 is obtained for events about Arts & Entertainment
•  Difficult to diversify
•  The best value of AVS is achieved for events about disasters & accidents, e.g., earthquakes
  

Results (4)
Impact of the damping factor d on P@10, S@5, MRR and α-nDCG@10
•  The worst results for all metrics are obtained for d=0 (no re-ranking)
•  The best results are achieved for 0.7<d<0.8
•  slight decrease for d>0.8
•  more diverse → less relevant
  

Conclusions
•  Graph-based approach for visual summaries for real-world events
•  Maximizes relevance and diversity
•  Multimodal approach taking into account
•  Textual content
•  Visual content
•  Social
•  Interactions (replies)
•  Popularity
•  Time
•  Introduction of user related features (e.g. influence)
  

Monitoring and intelligence system for Web multimedia verification
  

Can multimedia on the Web be trusted?
Real photo captured April 2011 by WSJ but heavily tweeted during Hurricane Sandy (29 Oct 2012)
Tweeted by multiple sources & retweeted multiple times
Original online at: http://blogs.wsj.com/metropolis/2011/04/28/weather-journal-clouds-gathered-but-no-tornado-damage/
  	
  

The Problem
•  Everyone can easily publish content on the Web
•  Content can be easily repurposed and manipulated
•  News outlets are competing for views and clicks → Pressure for airing stories very quickly leaves very little room for verification. → Very often, even well-reputed news providers fall for fake news content.
•  Multiple tools and services available for individual tasks → complex verification process
Very hard and time consuming to check the veracity of Web multimedia
  

Media REVEALr
•  Developed within the REVEAL project: http://revealproject.eu/
•  Framework for collecting, indexing and browsing multimedia content from the Web and social media
•  Support for verification:
–  Near-duplicate detection against an indexed collection
–  Clustering of social media posts by visual similarity → comparative view of the same incident
–  Aggregation and visualization of Named Entities around an incident
  

Related Work
•  Majority of works have focused on problem of topic detection and summarization:
–  TwitInfo (Marcus et al., 2011)
–  Twittermonitor (Mathioudakis & Koudas, 2010)
–  Meme detection & prediction (Weng et al., 2014)
•  Visual memes and clustering
–  Visual meme tracking (Xie et al., 2011)
–  Supervised multimodal clustering (Petkos et al., 2012)
•  Image manipulation tracking
–  Internet image archaeology (Kennedy & Chang, 2008)
  

Overview of Media REVEALr
[Figure: layered architecture with TEXT and VISUAL processing paths]
Media collection
Media pre-processing & feature extraction
Media analysis, mining & indexing
Persistence (storage, indexing)
Access (API)
Visualization, front-end
  

Named Entity Detection
•  Brevity and noisy nature of text in social media pose a serious challenge
•  Employed solution:
–  Pre-processing: tokenization, user mention resolution, text cleaning
–  Stanford NER + user mention resolution
–  Regular expressions to remove special characters and symbols (e.g., #, @, URLs, etc.)
  

Visual Indexing
•  Content-based image retrieval to solve Near-Duplicate Search (NDS) problem
•  Based on local descriptors (SURF), aggregation (VLAD), dimensionality reduction (PCA), quantization (PQ) and indexing (IVFADC)
•  State-of-the-art visual similarity search
–  High precision/recall
–  Very efficient and scalable implementation (search many millions of images in a few msec, maintain full index in memory using ~1GB/10M images)
  

Improving NDS Resilience (NDS+)
•  Often, NDS performance suffers from overlay graphics and fonts
•  To address this issue, we integrate a descriptor-level classifier that tries to remove the font/graphic descriptors from the VLAD vector
  

Example: Filtering Out Font Descriptors
•  Assuming that in most cases the classifier is correct, the resulting VLAD vector is of much higher quality compared to the one without filtering
  

Classifier Details
•  Random Forest used as base classifier
•  Cost Sensitive meta-classifier to penalize misclassification of True Positives
•  Challenge due to Class Imbalance (overlay descriptors << useful image content descriptors)
–  Cost Sensitive meta-classifier performs over-sampling of minority class to balance the training set
•  Training set created by collecting images with overlays (e.g., memes) from the Web and manually annotating them (selecting areas with fonts/overlays)
  

Mining: Clustering and Aggregation
•  Visual aggregation
–  DBSCAN on the visual feature representation (PCA-reduced VLAD vectors)
–  Element (tweet) selected based on the largest amount of keywords (expected to result in more information)
•  Entity aggregation
–  NER on individual items
–  Entity categorization (→ Persons, Location, Organizations)
–  Entity ranking based on frequency of occurrence
  

User Interface: Collections View

User Interface: Items View & Search

User Interface: Clusters View

User Interface: Entities View
  

Evaluation: NER
•  Manual annotation of 400 tweets from the SNOW Data Challenge dataset (Papadopoulos et al., 2014)
•  Measure: Accuracy → instance is considered correct when both entity and type are correctly identified
•  Three competing solutions:
–  Base Stanford NER (S-NER)
–  S-NER + Extensions/Post-processing (S-NER+)
–  Ellogon library (http://www.ellogon.org)
  

Evaluation: NDS
•  Benchmark Datasets
–  Holidays: 1,491 images, 500 queries (Jegou et al., 2008)
–  Oxford: 5,063 images, 55 queries (Philbin et al., 2008)
–  Paris: 6,412 images, 55 queries (Philbin et al., 2008)
•  Accuracy: mean Average Precision (mAP)
[Figure: results on CLEAN DATASET and NOISY DATASET]
  

Evaluation: NDS
•  Execution Time (msec)
•  Example
[Figure: INDEXED IMAGE and QUERY IMAGE; NDS: #27, NDS+: #1]
  

Use Cases: Real-world Datasets
[Figure: datasets sandy, boston, malaysia, ferry]
  

NDS Use Case (boston)
  

Clustering Use Case (boston)
•  Visual clustering enables comparative view and analysis over time (in this case showing increasing confidence on picture).
•  When journalists see many similar photos of the same scene, they have more confidence that it is real and not fabricated.
  

Entity Aggregation Use Case (snow)
[Figure: aggregated entities grouped as LOCATIONS, PERSONS, ORGANIZATIONS]
  

Conclusion
•  Key contributions
–  Framework and web application offering valuable verification support for Web multimedia
–  High-quality individual components for NER, NDS, clustering and aggregation
•  Future Work
–  Incremental image clustering
–  Temporal views to explore evolution of a story
–  Multimedia forensics toolbox (splice, copy-move detection)
  

Computational Verification in Social Media
•  Create a computational verification framework to classify tweets with unreliable media content.
•  Events used for experimentation
[Figures: fake images posted during the Hurricane Sandy natural disaster; fake images posted during the Boston Marathon bombings]
  
Methodology
Tweet Extraction
•  Use Topsy to collect tweets with certain keywords
Image Indexing
•  Create a predefined set of verified fake and real images
•  Keep the tweets with identical or near-duplicate images
Feature Extraction
•  Extract Content and User features for each tweet collected and their combination
Dataset
•  Annotate each tweet as fake or real based on the image
•  Keep only tweets written in English, Spanish or German
Classification
•  Test using cross-validation approach
•  Test using the two distinct datasets
•  Test using different training and testing dataset
Content features
•  Length of the tweet
•  Number of words
•  Contains exclamation marks, and their number
•  Contains quotation marks, and their number
•  If the text contains an emoticon (happy or sad)
•  Number of uppercase characters
•  Number of hashtags
•  Number of mentions
•  Number of pronouns
•  Number of URLs
•  Number of sentiment words
•  Number of retweets
User features
•  Username
•  Number of friends
•  Number of followers
•  Number of followers / number of friends ratio
•  Number of times the user was listed
•  If the status of the user contains a URL
•  If the user is verified or not
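A minimal sketch of how some of the content and user features listed above could be computed for a single tweet. The function names, the regular expressions and the example tweet are illustrative assumptions and are not taken from the original framework. Features that need external resources (pronouns, sentiment words, emoticon lists) are left out.

```python
import re


def content_features(text):
    """Compute a subset of the content features for one tweet (illustrative)."""
    return {
        "length": len(text),
        "num_words": len(text.split()),
        "has_exclamation": "!" in text,
        "num_exclamation": text.count("!"),
        "num_quotation": text.count('"'),
        "num_uppercase": sum(1 for ch in text if ch.isupper()),
        # Simple patterns; real tweets may need a proper tokenizer.
        "num_hashtags": len(re.findall(r"#\w+", text)),
        "num_mentions": len(re.findall(r"@\w+", text)),
        "num_urls": len(re.findall(r"https?://\S+", text)),
    }


def followers_friends_ratio(followers, friends):
    """User feature: followers/friends ratio, defined as 0.0 when friends is 0."""
    return followers / friends if friends else 0.0


# Hypothetical tweet text, used only to exercise the functions.
example = content_features("WOW! Shark on the highway?! #Sandy @user http://t.co/abc")
```

Each tweet then becomes one feature vector (content, user, or both concatenated) that can be passed to a classifier.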
  
Results
•  Tweet Statistics

                           Hurricane Sandy    Boston Marathon
Tweets with URLs           343939             112449
Tweets with fake images    10758              281
Tweets with real images    3540               460

•  Approaches

Detection accuracy using cross-validation approach, classified correctly (%)

Hurricane Sandy
Classifier       Content features    User features    Total features
J48 tree         81.41               67.72            80.68
KStar            81.28               71.16            81.38
Random Forest    80.59               70.15            80.94

Boston Marathon
Classifier       Content features    User features    Total features
J48 tree         76.45               70.81            81.25
KStar            81.28               74.12            75.78
Random Forest    78.59               76.15            79.10
  
Results (2)

Detection accuracy using different training and testing set in Hurricane Sandy, classified correctly (%)
Classifier       Content features    User features    Total features
J48 tree         73.79               51.06            65.06
KStar            75.30               62.29            53.31
Random Forest    74.02               63.10            65.96

Detection accuracy using Hurricane Sandy for training and Boston Marathon for testing, classified correctly (%)
Classifier       Content features    User features    Total features
J48 tree         55.05               50.12            54.10
KStar            50.01               50.10            50.97
Random Forest    58.75               51.03            58.78
  	
  
Other approaches
•  Graph-based multimodal clustering for social event detection in large collections of images
–  automatic organization of a multimedia collection into groups of items, each (group) of which corresponds to a distinct event.
•  Unsupervised concept learning detection using social media as training data
•  Text analysis for entities matching and sentiment analysis
•  Placing images based on content-features
•  Retrieving diverse images for same entity
  	
  
Demos - Applications
MM News Demo
ClustTour
ThessFest
  
Multimedia Demo
Multimedia Demo Architecture
[Architecture diagram. Components shown:]
•  StreamManager (sources: Twitter, Facebook, Flickr, YouTube, RSS, Instagram)
•  MongoDBWrapper
•  TextIndexer (Solr)
•  MediaFetcher, FeatureExtractor (HDFS)
•  Social Focused Crawler (HDFS)
•  FeatureIndexer (HDFS)
•  Data Mining (Visual Clust., Geo Clust., Statistics)
•  Web server (API (1), API (2), API (3), API (4))
[Other labels in the diagram: Nutch, VLAD, IVFADC. Host labels: 160.xx.xx.207, 160.xx.xx.58, 160.xx.xx.107, 160.xx.xx.187, 160.xx.xx.191, 160.xx.xx.116]
  
MongoDB
Document-oriented database → support of JSON
Current stable version: 3.0.6    https://www.mongodb.org/
Flexible Data Model → schemaless, useful for social media data that change over time
Horizontal scaling via shards and replica sets
Storage of social media items as JSON objects → millions of documents can be handled
Number of different index types → single field, compound, multikey indexes.
Example: Store Facebook posts and index them by publication time and number of likes
Query: get most recent posts sorted by popularity (#likes)
Native support of map-reduce jobs → get most shared images in a collection of tweets
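The map-reduce idea on this slide ("get most shared images in a collection of tweets") can be sketched in plain Python. This is an in-memory illustration of the map and reduce steps only, not MongoDB's own API. The document layout (an `image_urls` list per tweet) and the example URLs are assumptions.

```python
from collections import defaultdict

# Hypothetical tweet documents; the "image_urls" field is an assumed schema.
tweets = [
    {"id": 1, "image_urls": ["http://example.org/a.jpg"]},
    {"id": 2, "image_urls": ["http://example.org/a.jpg", "http://example.org/b.jpg"]},
    {"id": 3, "image_urls": []},
    {"id": 4, "image_urls": ["http://example.org/a.jpg"]},
]


def map_tweet(tweet):
    """Map step: emit (image_url, 1) for every image shared in a tweet."""
    for url in tweet["image_urls"]:
        yield url, 1


def reduce_counts(key, values):
    """Reduce step: sum the emitted counts for one image URL."""
    return sum(values)


def most_shared_images(docs, top_n=10):
    """Group mapped pairs by key, reduce each group, return the top N."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_tweet(doc):
            grouped[key].append(value)
    counts = {key: reduce_counts(key, values) for key, values in grouped.items()}
    # Sort by descending count, then by URL for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


top_images = most_shared_images(tweets)
```

In a database deployment the grouping between the two steps is done by the engine across shards; here a dictionary stands in for it.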
  
Apache Solr
Full-text search platform built on top of Apache Lucene
Current version: 5.3.0    http://lucene.apache.org/solr/
Indexing of social media items e.g. Tweets, FB posts, metadata of YouTube videos etc.
Additional features
•  Faceted Search and Filtering → get top N per field e.g. users
•  Spatial index & Search → very useful in geo-tagged documents e.g. Tweets.
•  Plugin-based architecture → language detection, NLP etc. as steps of indexing pipeline
Get tweets containing the name “Barack Obama” OR the phrase “us elections” having geo-location around New York
SolrCloud → Cluster of Solr instances
Automatic load balancing and fail-over for queries
ZooKeeper integration for cluster coordination and configuration
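The example query on this slide (tweets mentioning “Barack Obama” OR “us elections”, located around New York) could be expressed as a Solr select request roughly as follows. This sketch only builds the request URL and does not contact a server. The `{!geofilt}` filter with `sfield`, `pt` and `d` is Solr's standard spatial filter; the host, the core name `tweets`, the field names `text` and `location`, the coordinates and the 50 km radius are all assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_solr_query(base_url, lat, lon, radius_km, rows=10):
    """Build a Solr /select URL combining a phrase query with a geo filter."""
    params = {
        # Phrase query on an assumed full-text field named "text".
        "q": 'text:"Barack Obama" OR text:"us elections"',
        # Spatial filter: keep documents within radius_km of the point.
        "fq": "{!geofilt}",
        "sfield": "location",  # assumed name of the spatial field
        "pt": f"{lat},{lon}",
        "d": str(radius_km),
        "rows": str(rows),
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core; approximate coordinates for New York.
url = build_solr_query("http://localhost:8983/solr/tweets", 40.7128, -74.0060, 50)
parsed = parse_qs(urlsplit(url).query)
```

Adding `facet=true` and a `facet.field` parameter to the same request would return the "top N per field" counts mentioned under faceted search.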
  
Storm
Distributed real-time computation system    https://storm.apache.org
Topologies → processing logic
Stream: unbounded sequence of tuples e.g. tweets or URLs
Spouts: source of streams
Bolts: processing, filtering, etc.
Processing of URLs shared in social media → Storm pipeline
•  Expand short URLs
•  Fetch new URLs
•  Extract content e.g. articles and images
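The spout and bolt roles in the URL pipeline above can be illustrated with chained Python generators. This is a single-process analogy and does not use the Storm API. The lookup table for short URLs, the stubbed page fetcher and all example data are hypothetical.

```python
def url_spout(tweets):
    """Spout: emit one tuple per URL found in the incoming tweets."""
    for tweet in tweets:
        for url in tweet["urls"]:
            yield {"tweet_id": tweet["id"], "url": url}


def expand_bolt(stream, short_to_long):
    """Bolt: replace a short URL by its expanded form when one is known."""
    for item in stream:
        yield {**item, "url": short_to_long.get(item["url"], item["url"])}


def new_url_bolt(stream, seen):
    """Bolt: pass on only URLs that have not been processed before."""
    for item in stream:
        if item["url"] not in seen:
            seen.add(item["url"])
            yield item


def extract_bolt(stream, fetch):
    """Bolt: fetch the page and attach extracted title and images."""
    for item in stream:
        page = fetch(item["url"])
        yield {**item, "title": page["title"], "images": page["images"]}


# Hypothetical inputs standing in for the live stream and the web.
tweets = [
    {"id": 1, "urls": ["http://sho.rt/1"]},
    {"id": 2, "urls": ["http://sho.rt/2", "http://news.example/story"]},
]
short_to_long = {
    "http://sho.rt/1": "http://news.example/story",
    "http://sho.rt/2": "http://news.example/other",
}
pages = {
    "http://news.example/story": {"title": "Story", "images": ["s.jpg"]},
    "http://news.example/other": {"title": "Other", "images": []},
}

stream = url_spout(tweets)
stream = expand_bolt(stream, short_to_long)
stream = new_url_bolt(stream, seen=set())
results = list(extract_bolt(stream, fetch=pages.__getitem__))
```

In a real topology each stage would run as parallel tasks on different workers, and the shared `seen` set would have to live in an external store.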
  
Redis
Key-Value cache and store
Current stable version: 3.0    https://redis.io/
Partitioning → distribution of data among multiple Redis instances
Keys can contain strings, hashes, lists, sets, sorted sets, etc.
Atomic operations: set, increment, push etc.
Store crawling status of URLs, sharing information of URLs and images
Additional Feature
•  Implementation of Publisher/Subscriber pattern
•  Communication of different components in a system for social media analytics
  
City profile creation (ClustTour)
[Pipeline diagram. Labels shown:]
•  PHOTOS & METADATA (example: tags: sagrada familia, cathedral, barcelona; taken: 12 May 2009; lat: 41.4036, lon: 2.1743)
•  SPATIAL CLUSTERING + TEMPORAL ANALYSIS
•  COMMUNITY DETECTION (VISUAL, TAG, HYBRID)
•  CLASSIFICATION TO LANDMARKS/EVENTS (#users / #photos, duration; examples: [2 years, 50 users / 120 photos], [1 day, 2 users / 10 photos])
S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, A. Vakali. “Cluster-based Landmark and Event Detection on Tagged Photo Collections”. In IEEE Multimedia Magazine 18(1), pp. 52-63, 2011
  
City profile creation (ClustTour)
Community detection on image similarity graphs
Nodes: photos
Edges: visual and tag similarity
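A minimal sketch of an image similarity graph as described above, with photos as nodes and edges between sufficiently similar photos. Two simplifications are assumed: only tag similarity (Jaccard overlap) is used, with visual similarity omitted, and connected components stand in for a real community detection algorithm. The threshold and the example photos are hypothetical.

```python
from itertools import combinations


def jaccard(tags_a, tags_b):
    """Tag similarity: size of the intersection over size of the union."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def build_graph(photos, threshold=0.5):
    """Connect two photos when their tag similarity reaches the threshold."""
    graph = {photo_id: set() for photo_id in photos}
    for p, q in combinations(photos, 2):
        if jaccard(photos[p], photos[q]) >= threshold:
            graph[p].add(q)
            graph[q].add(p)
    return graph


def connected_components(graph):
    """Group nodes reachable from each other (a stand-in for communities)."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(group)
    return groups


# Hypothetical tagged photos.
photos = {
    "p1": ["sagrada familia", "cathedral", "barcelona"],
    "p2": ["sagrada familia", "barcelona"],
    "p3": ["camp nou", "football", "barcelona"],
    "p4": ["camp nou", "football"],
}
groups = connected_components(build_graph(photos))
```

On dense graphs connected components tend to merge everything into one group, which is why a proper community detection method is needed at scale.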
  
ThessFest
•  Thessaloniki International Film Festival
•  Support Twitter/comment usage within the app
•  Ratings and comments per film
•  Feedback aggregation
•  Votes
•  Tweets
•  Real-time feedback to the organisation and visitors
Fête de la Musique Berlin app
•  FETEberlin in App Store and Google Play
•  More than 100K visitors
•  About 5K musicians
•  More than 5K app downloads, 25K sessions
App features
•  Browse and filter detailed program
•  Interactive maps and routing
•  Social Sharing
•  Artists’ and Stages Details
•  Social Monitoring
Main benefits for attendees
•  Visitors can browse through maps and don’t get lost as stages are numerous
•  Event schedule is always available, also per stage
–  Very useful when the server was down and there was no access to the online schedule
  
Topic analysis
•  Top-10 topics
•  Manual inspection of clusters:
–  53.8% of topic titles considered informative
–  98.5% of clusters were found to be “clean”
•  Topics in time
  
Other Application Areas
•  Science
–  Sociology, machine learning (machine as a teacher), computer vision (annotation)
•  Tourism – Leisure – Culture
–  Off-the-beaten path POI extraction
•  Marketing
–  Brand monitoring, personalised ads
•  Prediction
–  Politics: election results
•  News
–  Topics, trends, event detection
•  Others
–  Environment, emergency response, energy saving, etc.
  
Reusable results
•  Starting point: http://www.socialsensor.eu/results
–  Deliverables
–  Publications
–  Datasets
–  Software
–  e-letter: http://stcsn.ieee.net/e-letter/vol-1-no-3
•  Open-source projects (Apache License v2): https://github.com/socialsensor
–  Data collection (stream-manager, storm-focused-crawler)
–  Indexing (framework-client, multimedia-indexing)
–  Mining (topic-detection, multimedia-analysis, community-evolution-analysis, social-event-detection)
  
Benchmarking - Datasets
dataset: SNOW 2014 Data Challenge
•  A set of ~1M tweets collected using a list of 5000 UK-focused “news hounds” and the keywords “Syria”, “terror”, “Ukraine”, and “bitcoin” for a period of 24 hours starting from Feb 25, 18:00.
•  Average rate: ~720 tweets/minute
•  Number of unique Twitter accounts: ~556K
•  Number of retweets: ~648K
•  Number of replies: ~135K
•  Ground truth topics: http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755
  
Overview of Challenge
•  Goal: Detection of newsworthy topics in a large and noisy set of tweets
•  Topic: a news story represented by a headline + tags + representative tweets + representative images (optional)
•  Newsworthy: A topic that ends up being covered by at least some major online news sources
•  Topics are detected per timeslot (small equally-sized time intervals)
•  We want a maximum number of topics per timeslot
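Assigning tweets to equally-sized timeslots, as described above, can be sketched as follows. The 15-minute slot length is an assumption derived from the 24-hour collection window and the 96 timeslots mentioned for the evaluation. The tweet structure and timestamps are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def timeslot_index(timestamp, start, slot_minutes=15):
    """Return the zero-based index of the timeslot containing the timestamp."""
    return (timestamp - start) // timedelta(minutes=slot_minutes)


def group_by_timeslot(tweets, start, slot_minutes=15):
    """Collect tweet texts per timeslot index."""
    slots = defaultdict(list)
    for tweet in tweets:
        index = timeslot_index(tweet["created_at"], start, slot_minutes)
        slots[index].append(tweet["text"])
    return dict(slots)


# Collection start as stated for the test dataset: Feb 25, 2014, 18:00.
start = datetime(2014, 2, 25, 18, 0)
tweets = [
    {"created_at": datetime(2014, 2, 25, 18, 5), "text": "a"},
    {"created_at": datetime(2014, 2, 25, 18, 14, 59), "text": "b"},
    {"created_at": datetime(2014, 2, 25, 18, 15), "text": "c"},
    {"created_at": datetime(2014, 2, 26, 17, 59), "text": "d"},
]
slots = group_by_timeslot(tweets, start)
```

A topic detector would then run once per slot and return at most the allowed number of topics for that slot.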
  
Challenge Activity Log
•  Challenge definition (Dec 2013)
•  Challenge toolkit and registration (Jan 20, 2014)
•  Development dataset collection (Feb 3, 2014)
•  Rehearsal dataset collection (Feb 17, 2014)
•  Test dataset collection (Feb 25, 2014)
•  Results submission (Mar 4, 2014)
•  Paper submission (Mar 9, 2014)
•  Results evaluation (Mar 5-18, 2014)
•  Workshop (Apr 7, 2014)
  
Some statistics
•  Registered participants: 25
–  India: 4, Belgium: 3, Germany: 3, UK: 3, Greece: 3, Ireland: 2, USA: 2, France: 2, Italy: 1, Spain: 1, Russia: 1
•  Participants that signed the Challenge agreement: 19
•  Participants that submitted results: 11
•  Participants that submitted papers: 9
  
Evaluation Protocol
•  Defined several evaluation criteria:
–  Newsworthiness → Precision/Recall, F-score
–  Readability → scale [1-5]
–  Coherence → scale [1-5]
–  Diversity → scale [1-5]
•  List of reference topics
•  Set up precise evaluation guidelines
•  Blind evaluation (i.e. evaluator not aware of which method a topic comes from) based on Web UI
•  Participants submitted topics for 96 timeslots, but manual evaluation happened for 5 sample timeslots.
•  Result validation and analysis
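The newsworthiness scores named above (precision, recall, F-score) can be computed as in this sketch. It assumes that each detected topic has already been matched to at most one reference topic and is represented by that topic's identifier. In the challenge the matching itself was done by human evaluators, so the identifiers here are purely illustrative.

```python
def precision_recall_f1(detected, reference):
    """Compare detected topic ids against the reference topic ids."""
    detected, reference = set(detected), set(reference)
    matched = detected & reference
    precision = len(matched) / len(detected) if detected else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score as the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical ids: 4 detected topics, 3 reference topics, 2 in common.
precision, recall, f1 = precision_recall_f1(
    detected={"t1", "t2", "t3", "t4"},
    reference={"t1", "t2", "t5"},
)
```

With two matches out of four detected and three reference topics, precision is 0.5, recall is 2/3 and the F-score is 4/7.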
  
social event detection
a bit of background...
•  MediaEval
–  well-known benchmarking activity since 2010 (started as VideoCLEF in 2008)
–  consists of several tasks dedicated to specific challenges
•  social event detection (SED)
–  first run in 2011 (7 participants)
  
task definition & dataset
•  2011 collection: 73,645 Flickr photos from five cities, May 2009
   find events related to two target categories
   > soccer matches in Barcelona and Rome
   > concerts in venues Paradiso and Parc del Forum
•  2012 collection: 167,332 Flickr photos from five cities, 2009-2011
   find events related to three target categories
   > technical events (e.g. exhibitions, fairs) in Germany
   > soccer events in Hamburg and Madrid
   > Indignados movement in Madrid
•  2013 collection 1: 437,370 Flickr photos + 1,327 YouTube videos
   collection 2: 57,165 Instagram photos
   cluster collection 1 into events (attach YouTube videos to them)
   categorize collection 2 images into eight event types or non-event
  
variant	
  1	
  
variant	
  4	
  
variant	
  4	
  
sed2012: evaluation setup
•  ground truth: photos clustered around 149 events (18 technical, 79 soccer, 52 Indignados)
•  assess the following aspects:
–  accuracy of same-event classification
–  compare clustering quality between item-to-cluster and the two versions of item-to-item (batch & incremental)
–  measure contributions of different features
–  study generalization abilities of same event model
  
evaluation: main caveat
•  creation strategy of benchmark dataset can dramatically affect how hard (or easy) the problem is
–  if events are very sparsely distributed over time, then a simple time-based clustering could be sufficient
–  if events correspond to users one-to-one, then a simple user-based look-up could yield very high accuracy
–  using the same source for training/testing makes it easy
•  need to explore new challenging settings
–  multiple sources of multimedia
–  huge amounts of non-event content
–  very dense coverage of feature space by test events
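The "simple time-based clustering" mentioned in the caveat above could look like the sketch below: photos are sorted by capture time and a new cluster starts whenever the gap to the previous photo exceeds a threshold. The 12-hour gap and the example photos are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def cluster_by_time(photos, max_gap=timedelta(hours=12)):
    """Split time-sorted photos into clusters at gaps larger than max_gap."""
    ordered = sorted(photos, key=lambda photo: photo["taken"])
    clusters = []
    for photo in ordered:
        if clusters and photo["taken"] - clusters[-1][-1]["taken"] <= max_gap:
            clusters[-1].append(photo)
        else:
            clusters.append([photo])
    return clusters


# Hypothetical photos from two events eight days apart.
photos = [
    {"id": 3, "taken": datetime(2009, 5, 20, 20, 0)},
    {"id": 1, "taken": datetime(2009, 5, 12, 10, 0)},
    {"id": 4, "taken": datetime(2009, 5, 20, 21, 0)},
    {"id": 2, "taken": datetime(2009, 5, 12, 11, 30)},
]
clusters = cluster_by_time(photos)
cluster_ids = [[photo["id"] for photo in cluster] for cluster in clusters]
```

When events are sparse in time, such a baseline already separates them, which is why benchmark datasets with overlapping events and much non-event content are harder.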
  
Conclusions

Mais conteúdo relacionado

Mais procurados

Researching Social Media – Big Data and Social Media Analysis
Researching Social Media – Big Data and Social Media AnalysisResearching Social Media – Big Data and Social Media Analysis
Researching Social Media – Big Data and Social Media AnalysisFarida Vis
 
30 Tools and Tips to Speed Up Your Digital Workflow
30 Tools and Tips to Speed Up Your Digital Workflow 30 Tools and Tips to Speed Up Your Digital Workflow
30 Tools and Tips to Speed Up Your Digital Workflow Mike Kujawski
 
Social media mining for sensing and responding to real-world trends and events
Social media mining for sensing and responding to real-world trends and eventsSocial media mining for sensing and responding to real-world trends and events
Social media mining for sensing and responding to real-world trends and eventsYiannis Kompatsiaris
 
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013Digital Methods Initiative
 
Event detection in twitter using text and image fusion
Event detection in twitter using text and image fusionEvent detection in twitter using text and image fusion
Event detection in twitter using text and image fusioncsandit
 
"Big Data for Development: Opportunities & Challenges” - UN Global Pulse
"Big Data for Development: Opportunities & Challenges” - UN Global Pulse"Big Data for Development: Opportunities & Challenges” - UN Global Pulse
"Big Data for Development: Opportunities & Challenges” - UN Global PulseUN Global Pulse
 
Techfugees:Group1:RefugeesMap
Techfugees:Group1:RefugeesMapTechfugees:Group1:RefugeesMap
Techfugees:Group1:RefugeesMapChantal MARIN
 
GitHub as Transparency Device in Data Journalism, Open Data and Data Activism
GitHub as Transparency Device in  Data Journalism, Open Data and Data ActivismGitHub as Transparency Device in  Data Journalism, Open Data and Data Activism
GitHub as Transparency Device in Data Journalism, Open Data and Data ActivismLiliana Bounegru
 
Social Media Analysis: Present and Future
Social Media Analysis: Present and FutureSocial Media Analysis: Present and Future
Social Media Analysis: Present and Futurematthewhurst
 
MMRA / QRCA Mobile Qualitative - Using Mobile to Understand Customers
MMRA / QRCA Mobile Qualitative - Using Mobile to Understand CustomersMMRA / QRCA Mobile Qualitative - Using Mobile to Understand Customers
MMRA / QRCA Mobile Qualitative - Using Mobile to Understand CustomersThreads Qualitative Research
 
Sentiment Analysis and Social Media: How and Why
Sentiment Analysis and Social Media: How and WhySentiment Analysis and Social Media: How and Why
Sentiment Analysis and Social Media: How and WhyDavide Feltoni Gurini
 
Social Media in Australia: The Case of Twitter
Social Media in Australia: The Case of TwitterSocial Media in Australia: The Case of Twitter
Social Media in Australia: The Case of TwitterAxel Bruns
 

Mais procurados (13)

Researching Social Media – Big Data and Social Media Analysis
Researching Social Media – Big Data and Social Media AnalysisResearching Social Media – Big Data and Social Media Analysis
Researching Social Media – Big Data and Social Media Analysis
 
30 Tools and Tips to Speed Up Your Digital Workflow
30 Tools and Tips to Speed Up Your Digital Workflow 30 Tools and Tips to Speed Up Your Digital Workflow
30 Tools and Tips to Speed Up Your Digital Workflow
 
Social media mining for sensing and responding to real-world trends and events
Social media mining for sensing and responding to real-world trends and eventsSocial media mining for sensing and responding to real-world trends and events
Social media mining for sensing and responding to real-world trends and events
 
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
 
Event detection in twitter using text and image fusion
Event detection in twitter using text and image fusionEvent detection in twitter using text and image fusion
Event detection in twitter using text and image fusion
 
"Big Data for Development: Opportunities & Challenges” - UN Global Pulse
"Big Data for Development: Opportunities & Challenges” - UN Global Pulse"Big Data for Development: Opportunities & Challenges” - UN Global Pulse
"Big Data for Development: Opportunities & Challenges” - UN Global Pulse
 
Threats_Report_2013
Threats_Report_2013Threats_Report_2013
Threats_Report_2013
 
Techfugees:Group1:RefugeesMap
Techfugees:Group1:RefugeesMapTechfugees:Group1:RefugeesMap
Techfugees:Group1:RefugeesMap
 
GitHub as Transparency Device in Data Journalism, Open Data and Data Activism
GitHub as Transparency Device in  Data Journalism, Open Data and Data ActivismGitHub as Transparency Device in  Data Journalism, Open Data and Data Activism
GitHub as Transparency Device in Data Journalism, Open Data and Data Activism
 
Social Media Analysis: Present and Future
Social Media Analysis: Present and FutureSocial Media Analysis: Present and Future
Social Media Analysis: Present and Future
 
MMRA / QRCA Mobile Qualitative - Using Mobile to Understand Customers
MMRA / QRCA Mobile Qualitative - Using Mobile to Understand CustomersMMRA / QRCA Mobile Qualitative - Using Mobile to Understand Customers
MMRA / QRCA Mobile Qualitative - Using Mobile to Understand Customers
 
Sentiment Analysis and Social Media: How and Why
Sentiment Analysis and Social Media: How and WhySentiment Analysis and Social Media: How and Why
Sentiment Analysis and Social Media: How and Why
 
Social Media in Australia: The Case of Twitter
Social Media in Australia: The Case of TwitterSocial Media in Australia: The Case of Twitter
Social Media in Australia: The Case of Twitter
 

Semelhante a Processing Large Complex Data

From Research to Applications: What Can We Extract with Social Media Sensing?
From Research to Applications: What Can We Extract with Social Media Sensing?From Research to Applications: What Can We Extract with Social Media Sensing?
From Research to Applications: What Can We Extract with Social Media Sensing?Yiannis Kompatsiaris
 
Visual Information Analysis for Crisis and Natural Disasters Management and R...
Visual Information Analysis for Crisis and Natural Disasters Management and R...Visual Information Analysis for Crisis and Natural Disasters Management and R...
Visual Information Analysis for Crisis and Natural Disasters Management and R...Yiannis Kompatsiaris
 
Weather events identification in social media streams: tools to detect their ...
Weather events identification in social media streams: tools to detect their ...Weather events identification in social media streams: tools to detect their ...
Weather events identification in social media streams: tools to detect their ...Alfonso Crisci
 
Social Data and Multimedia Analytics for News and Events Applications
Social Data and Multimedia Analytics for News and Events ApplicationsSocial Data and Multimedia Analytics for News and Events Applications
Social Data and Multimedia Analytics for News and Events ApplicationsYiannis Kompatsiaris
 
Data Science For Social Good: Tackling the Challenge of Homelessness
Data Science For Social Good: Tackling the Challenge of HomelessnessData Science For Social Good: Tackling the Challenge of Homelessness
Data Science For Social Good: Tackling the Challenge of HomelessnessAnita Luthra
 
Future Internet - Webinar UNIFACS Laureate 2015 - With Access Link
Future Internet - Webinar UNIFACS Laureate 2015 - With Access LinkFuture Internet - Webinar UNIFACS Laureate 2015 - With Access Link
Future Internet - Webinar UNIFACS Laureate 2015 - With Access LinkJoberto Martins
 
Vision about Social Networks Content Exploitation (EC Concertation meeting)
Vision about Social Networks Content Exploitation (EC Concertation meeting)Vision about Social Networks Content Exploitation (EC Concertation meeting)
Vision about Social Networks Content Exploitation (EC Concertation meeting)Yiannis Kompatsiaris
 
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...
Twitter analytics: some thoughts on sampling, tools, data, ethics and user re...Farida Vis
 
Fusing text and image for event
Fusing text and image for eventFusing text and image for event
Fusing text and image for eventijma
 
Big data privacy issues in public social media
Big data privacy issues in public social mediaBig data privacy issues in public social media
Big data privacy issues in public social mediaSupriya Radhakrishna
 
Using Data for Science Journalism
Using Data for Science JournalismUsing Data for Science Journalism
Using Data for Science JournalismLiliana Bounegru
 
Using Data for Science Journalism
Using Data for Science JournalismUsing Data for Science Journalism
Using Data for Science JournalismJonathan Gray
 
Geoparsing and Real-time Social Media Analytics - technical and social challe...
Geoparsing and Real-time Social Media Analytics - technical and social challe...Geoparsing and Real-time Social Media Analytics - technical and social challe...
Geoparsing and Real-time Social Media Analytics - technical and social challe...REVEAL - Social Media Verification
 
SoBigData. European Research Infrastructure for Big Data and Social Mining
SoBigData. European Research Infrastructure for Big Data and Social MiningSoBigData. European Research Infrastructure for Big Data and Social Mining
SoBigData. European Research Infrastructure for Big Data and Social MiningResearch Data Alliance
 
Information spreading in FriendFeed
Information spreading in FriendFeedInformation spreading in FriendFeed
Information spreading in FriendFeedLuca Rossi
 
Social Big Data in Government
Social Big Data in GovernmentSocial Big Data in Government
Social Big Data in GovernmentAdegboyega Ojo
 
Spark Social Media
Spark Social Media Spark Social Media
Spark Social Media suresh sood
 
Smart Data Module 1 introduction to big and smart data
Smart Data Module 1 introduction to big and smart dataSmart Data Module 1 introduction to big and smart data
Smart Data Module 1 introduction to big and smart datacaniceconsulting
 
Keystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenanceKeystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenancePaolo Missier
 

Processing Large Complex Data

  • 1. Processing Large Complex Data
       Social Data and Multimedia Analytics for News and Events Applications
       Dr. Yiannis Kompatsiaris, ikom@iti.gr
       Head, Multimedia, Knowledge and Social Media Analytics Lab, CERTH-ITI
       2015 IEEE SPS Italy Chapter Summer School on Signal Processing (S3P)
  • 2. S3P 2015, Garda Lake, Italy: Processing Large Complex Data
       Overview
       •  Introduction
          –  Motivation – Challenges
       •  Example Use Cases
       •  Research Approaches
          –  Large-Scale visual search
          –  Graphs - Community Detection - Clustering
          –  Social Event Detection
          –  Verification
       •  Demos – Applications
          –  MM News Demo
          –  ClustTour
          –  Thessfest
       •  Evaluation - Benchmarking
       •  Conclusions
  • 3. Introduction
       Motivation
       Example Applications
       Conceptual Architecture
       Challenges
  • 4. Pope Francis, Pope Benedict
       2007: iPhone release
       2008: Android release
       2010: iPad release
       http://petapixel.com/2013/03/14/a-starry-sea-of-cameras-at-the-unveiling-of-pope-francis/
  • 5. http://www.puzzlemarketer.com/digital-social-brands-in-60-seconds/ (Apr, 2012)
  • 6. Rise of the networks
  • 7. Social Networks as Graphs
       Social web as a graph
       nodes = Twitter users
       edges = retweets on #jan25 hashtag
       Announcement of Mubarak's resignation
       http://gephi.org/2011/the-egyptian-revolution-on-twitter/
  • 8. Social Networks as Graphs
       “Social networks have emergent properties. Emergent properties are new attributes of a whole that arise from the interaction and interconnection of the parts”
       •  Emotions, health and sexual relationships do not depend just on our connections (e.g. number of them) but on our position and structure in the social graph
          –  Central – Hub
          –  Outlier
          –  Transitivity (connections between friends)
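The structural notions named on this slide (hub, outlier, transitivity) can be made concrete with a small sketch. The retweet records, user names and helper functions below are hypothetical and only illustrate the idea; degree centrality and the local clustering coefficient are used here as common stand-ins for "central/hub" and "transitivity", which the slides do not define formally.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical retweet records: (retweeting_user, original_author).
retweets = [
    ("bob", "alice"), ("carol", "alice"), ("dave", "alice"),
    ("carol", "bob"), ("erin", "dave"),
]

def build_graph(edges):
    """Build undirected adjacency sets (edge direction is dropped for simplicity)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)

def degree_centrality(adj):
    """Fraction of the other nodes each node is directly connected to."""
    denom = max(len(adj) - 1, 1)
    return {node: len(nbrs) / denom for node, nbrs in adj.items()}

def local_clustering(adj, node):
    """Share of a node's neighbour pairs that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

adj = build_graph(retweets)
centrality = degree_centrality(adj)
hub = max(centrality, key=centrality.get)      # most connected user
outlier = min(centrality, key=centrality.get)  # least connected user
print(hub, outlier, local_clustering(adj, hub))
```

In this toy graph the most retweeted user comes out as the hub and a user with a single connection as the outlier; on real data a graph library would normally replace these hand-written helpers.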
  • 9. Social Networks as Real-Life Sensors
       •  Social networks are a data source with an extremely dynamic nature that reflects events and the evolution of community focus (users' interests)
       •  The huge penetration of smartphones and mobile devices provides real-time and location-based user feedback
       •  Transform individually rare but collectively frequent media into meaningful topics, events, points of interest, emotional states and social connections
       •  Present them in an efficient way for a variety of applications (news, marketing, science, health, entertainment)
  • 10. Social Media aspects
       Caption, Time, User Profile, Favs, Comms, Tags
  • 11. Examples - Science
       Xin Jin, Andrew Gallagher, Liangliang Cao, Jiebo Luo, and Jiawei Han. The wisdom of social multimedia: using flickr for prediction and forecast, International conference on Multimedia (MM '10). ACM.
       “…if you're more than 100 km away from the epicenter [of an earthquake] you can read about the quake on twitter before it hits you…”
       Many Twitter examples at: What can Twitter tell us about the real world? Twitter and the Real World CIKM'13 Tutorial, https://sites.google.com/site/twitterandtherealworld/home
  • 12. Examples - Science
  • 13. Examples - Science
       Be careful of correlation diagrams
  • 14. Example – News (Boston bombing)
       “Following the Boston Marathon bombings, one quarter of Americans reportedly looked to Facebook, Twitter and other social networking sites for information, according to The Pew Research Center. When the Boston Police Department posted its final “CAPTURED!!!” tweet of the manhunt, more than 140,000 people retweeted it.”
       “Authorities have recognized that one of the first places people go in events like this is to social media, to see what the crowd is saying about what to do next”
       “I have been following my friend's Facebook [account] who is near the scene and she is updating everyone before it even gets to the news”
  • 15. Example – Crisis – Humanitarian (Syria)
       Syria Tracker offers a crisis mapping system that uses crowdsourced text, photo and video reports and data mining techniques, forming a live map of the Syrian conflict since March 2011
       …stream of content-filtered media from news, social media (Twitter and Facebook) and official sources
  • 16. Events - Festivals
       http://www.eventmanagerblog.com/uploads/2012/12/event-technology-infographic.jpg
  • 17. Many other examples: smellymaps
       Smell-related words in geo-located social media
       http://researchswinger.org/smellymaps/
  • 18. Conceptual Architecture
       Stages (from the diagram): CRAWLING, INDEXING, MINING, RANKING, VISUALIZATION
       Component labels (in the order extracted): API Wrapper, Website Wrapper, Scheduler, Visual Indexing, Near-duplicates, Text Indexing, Media Fetcher, SNA, Sentiment - Influence, Trends - Topics, Model Building, Concepts, Relevance, Diversity, Popularity, Veracity, Crawling Specs, Sources, Interaction, Responsiveness, Aggregation, Aesthetics
  • 19. Challenges – Content (Mining)
       •  Multi-modality: e.g. image + tags, video, audio
       •  Rich social context: spatio-temporal, social connections, relations and social graph
       •  Specific messages: short, conversations, errors, no context
       •  Inconsistent quality: noise, spam, fake, propaganda
       •  Huge volume: massively produced and disseminated
       •  Multi-source: may be generated by different applications and user communities
       •  Dynamic: fast updates, real-time
  • 20. Policy – Licensing – Legal challenges
       •  Fragmented access to data
          –  Separate wrappers/APIs for each source (Twitter, Facebook, etc.)
          –  Different data collection/crawling policies
       •  Limitations imposed by API providers (“Walled Gardens”)
          –  Full access to data impossible or extremely expensive (e.g. see data licensing plans for GNIP and DataSift)
          –  Non-transparent data access practices (e.g. access is provided to an organization/person if they have a contact in Twitter)
       •  Constant change of model and ToS of social APIs
          –  No backwards compatibility, additional development costs
       •  Ephemeral nature of content
          –  Social search results often lead to removed content → inconsistent and unreliable referencing
       •  User Privacy & Purpose of use
          –  Fuzzy regulatory framework regarding mining user-contributed data
  • 21. Example Use Cases
       Events and News
  • 22. SocialSensor Project Objective
       SocialSensor quickly surfaces trusted and relevant material from social media – with context.
       Diagram labels: DySCO (behaviour, location, time, content, usage, social context); Massive social media and unstructured web; Social media mining; Aggregation & indexing; News - Infotainment; Personalised access; Ad-hoc P2P networks
  • 23. “It has changed the way we do news” (MSN)
       “Social media is the key place for emerging stories – internationally, nationally, locally” (BBC)
       “Social media is transforming the way we do journalism” (New York Times)
       Source: picture alliance / dpa
  • 24. “It’s really hard to find the nuggets of useful stuff in an ocean of content” (BBC)
       “Things that aren’t relevant crowd out the content you are looking for” (MSN)
       “The filters aren’t configurable enough” (CNN)
       Source: Getty Images
  • 25. Verification was simpler in the past...
       Source: Frank Grätz
  • 26. News Use Case Requirements
       Quickly surface trusted and relevant material from social media – with context.
       •  “quickly”: in real time
       •  “surfaces”: automatically discovers, clusters and searches
       •  “trusted”: automatic support in verification process
       •  “relevant”: to the specific event
       •  “material”: any material (text, image, audio, video = multimedia), aggregated with other sources (e.g. web)
       •  “social media”: across all relevant social media platforms
       •  “with context”: location, time, sentiment, influence
  • 27. Infotainment • Events with large numbers of visitors • Thessaloniki International Film Festival – 80,000 viewers / 100,000 visitors in 10 days – 150 films, 350 screenings • Discovery and presentation of relevant aggregated social media – Trending topics – Sentiment – Tweet-to-film matching – Visualization (Social Walls)
  • 28. Conceptual Architecture and Main Components. Diagram labels: SEMANTIC MIDDLEWARE; Public Data; SEARCH & RECOMMENDATION; USER MODELLING & PRESENTATION; INDEXING; MINING; STORAGE; DATA COLLECTION / CRAWLING • Real-time dynamic topic and event clustering • Trend, popularity and sentiment analysis • Calculate trust/influence scores around people • Personalized search, access & presentation based on social network interactions • Semantic enrichment and discovery of services
  • 29. Research Approaches: Large-Scale Visual Search; Graphs – Clustering/Community Detection; Visual Event Summarization; Social Media Verification
  • 30. Scalable visual feature aggregation & indexing • Problem: example-based image search – Find images that depict the same or a similar object or scene as a given query image – The object may be seen from different viewpoints, with occlusions and clutter • Challenge: large scale – Searching databases with tens of millions of images – Objectives to be fulfilled: sufficient discriminative power; fast response times; efficient memory usage
  • 31. Large-scale visual search. Diagram labels: image collection from social media/Web; image local feature extraction; feature aggregation; feature indexing; kNN visual similarity search; concept-based image annotation; image clustering; image (geo)tagging; concept-based search/filtering; duplicate detection
  • 32. Framework • Implementation and evaluation of the effectiveness of VLAD in combination with SURF • Scalable image indexing. Pipeline: image → local descriptor extraction (SIFT / SURF) → set of local descriptors → descriptor aggregation (BOW / VLAD) → fixed-size vector → dimensionality reduction (PCA) → low-dimensional vector → encoding & indexing (PQ + ADC/IVFADC). Reference: E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, G. Tsoumakas, I. Vlahavas, "A Comprehensive Study over VLAD and Product Quantization in Large-scale Image Retrieval", IEEE Transactions on Multimedia 16(6), pp. 1713-1728, October 2014.
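The descriptor aggregation step of this pipeline can be illustrated with a minimal VLAD sketch. It assumes a codebook of centroids is already available (in practice learned with k-means) and shows only nearest-word assignment, residual accumulation and L2 normalization; the toy 2-dimensional descriptors and the function name `vlad` are illustrative assumptions, and any further normalization used in the cited study is not reproduced here.

```python
import math

def vlad(descriptors, centroids):
    """Aggregate local descriptors into one fixed-size, L2-normalized vector."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for d in descriptors:
        # Assign the descriptor to its nearest visual word (centroid).
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(d, centroids[c]))
        # Accumulate the residual (descriptor minus centroid) for that word.
        for i in range(dim):
            residuals[nearest][i] += d[i] - centroids[nearest][i]
    # Concatenate the per-word residual sums: k words * dim values.
    flat = [x for block in residuals for x in block]
    norm = math.sqrt(sum(x * x for x in flat))
    return [x / norm for x in flat] if norm > 0 else flat

# Toy data (hypothetical): 2 visual words, 3 local descriptors of dimension 2.
centroids = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (0.0, 1.0), (9.0, 10.0)]
vector = vlad(descriptors, centroids)
```

The output length is the number of visual words times the descriptor dimension, regardless of how many local descriptors the image has, which is what makes the vector suitable for PCA and product quantization afterwards.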
  • 33. Scalable indexing of features • ADC 16x8 requires 16 bytes per image – ~67M images per GB • IVFADC requires 4 additional bytes per image – ~53.6M images per GB • The current implementation achieves only half of the above numbers because it uses short int[] instead of byte[]; this can be improved. • Ideally, 1 billion images could be indexed on a server with 20GB of RAM (projection). • Query time (for 1M vectors): – Exhaustive search of VLAD vectors (d’=128): 0.50 sec – Product Quantization with ADC 16x8: 0.10 sec (5x faster) – Product Quantization with IVFADC 16x8: 0.02 sec (25x faster)
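The memory figures above follow from simple arithmetic, sketched here. The sketch assumes 1 GB = 2^30 bytes for the per-GB counts and a decimal gigabyte for the one-billion-image projection; both conventions are assumptions, since the slide does not state which one it uses.

```python
GIB = 2 ** 30  # bytes per gigabyte (binary convention, assumed)

def images_per_gb(bytes_per_image: int) -> int:
    """Number of encoded images that fit in one gigabyte."""
    return GIB // bytes_per_image

adc_images = images_per_gb(16)         # ADC 16x8: 16 sub-quantizers * 8 bits
ivfadc_images = images_per_gb(16 + 4)  # IVFADC: 4 extra bytes per image

# Projection: RAM needed for one billion IVFADC codes, in decimal gigabytes.
ram_gb = 1_000_000_000 * 20 / 1e9
```

Here `adc_images` is about 67.1M and `ivfadc_images` about 53.7M, in line with the slide's ~67M and ~53.6M, and `ram_gb` evaluates to 20.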
  • 34. VLAD+SIFT vs. VLAD+SURF: accuracy vs. dimensionality • VLAD+SURF outperforms VLAD+SIFT and FV+SIFT across all dimensions on both the Holidays and Oxford datasets. Results in rows starting with * are taken from Jégou et al., 2011, hence the missing values for some entries. SIFT corresponds to PCA-reduced SIFT, which yielded better results than standard SIFT in Jégou et al., 2011.
  • 35. Clustering – Community Detection
  • 36. Graph: $G = (V, E)$, with nodes $V$ and edges $E$. An abstract data type representing relationships or connections.
  • 37. Some Examples. Webpage www.x.com: href="www.y.com", href="www.z.com". Webpage www.y.com: href="www.x.com", href="www.a.com", href="www.b.com". Webpage www.z.com: href="www.a.com". Resulting graph nodes: x, y, z, a, b
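The hyperlink example can be written down as a small directed graph. This is a minimal sketch using plain dictionaries; the short node names stand for the five example sites, and the helper names are illustrative.

```python
# Outgoing links of each page, taken from the example above.
links = {
    "x": ["y", "z"],
    "y": ["x", "a", "b"],
    "z": ["a"],
}

def edge_list(links):
    """Directed edges (source, target), one per href."""
    return sorted((src, dst) for src, targets in links.items() for dst in targets)

def in_degrees(links):
    """Number of incoming links per page, including pages with no outgoing links."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for dst in targets:
            counts[dst] = counts.get(dst, 0) + 1
    return counts

edges = edge_list(links)
incoming = in_degrees(links)
```

Pages a and b appear only as link targets, so they become nodes without outgoing edges.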
  • 38. Biology example. Nodes: proteins. Edges: interactions. Visualization plays an important role.
  • 39. Blogosphere as a graph: nodes = blogs, edges = hyperlinks. Highlighted clusters: technical/gadgets; society/politics. http://datamining.typepad.com/gallery/blog-map-gallery.html
  • 40. Social web as a graph: nodes = Twitter users, edges = retweets on the #jan25 hashtag. Snapshot: announcement of Mubarak’s resignation. http://gephi.org/2011/the-egyptian-revolution-on-twitter/
  • 41. Emerging structures • graphs on the web present certain structural characteristics • groups of nodes interacting with each other → dense inter-connections → functional/topical associations • what can we gain by studying them? – topic analysis – photo clustering – improved recommendation methods – detect influencers
  • 42. Community and graphs: Communities correspond to groups of nodes on a graph that share common properties or have a common role in the organization/operation of the system. S. Fortunato, C. Castellano. Community structure in graphs. arXiv:0712.2716v1, Dec 2007.
  • 43. Community types • Pairs of nodes are more likely to be connected if they are both members of the same community, and less likely to be connected if they do not share communities. • explicit – the result of conscious human decision • implicit – emerging from the interactions & activities of users – need special methods to be discovered – community detection, partition, clustering
  • 44. Communities and graphs • Often communities are defined with respect to a graph $G = (V, E)$ representing a set of objects ($V$) and their relations ($E$). • Even if such a graph is not explicit in the raw data, it is usually possible to construct one, e.g. feature vectors → distances → thresholding → graph • Given a graph, a community is defined as a set of nodes that are more densely connected to each other than to the rest of the network nodes.
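The construction "feature vectors → distances → thresholding → graph" can be sketched as follows. The Euclidean distance, the threshold value and the toy points are illustrative assumptions; any distance or similarity measure suited to the data could be substituted.

```python
import math
from itertools import combinations

def threshold_graph(vectors, max_dist):
    """Connect items i < j whose feature vectors lie within max_dist of each other."""
    edges = set()
    for (i, u), (j, v) in combinations(enumerate(vectors), 2):
        if math.dist(u, v) <= max_dist:
            edges.add((i, j))
    return edges

# Hypothetical 2-D feature vectors forming two tight groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
graph_edges = threshold_graph(points, max_dist=1.5)
```

With this threshold only the two close pairs are linked, so the resulting graph already exposes two candidate communities.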
  • 45. Communities and graphs: example. Figure labels: inter-community edge; intra-community edge
  • 46. Community attributes: overlap; weighted participation; roles; hierarchy; evolution
  • 47. Graph cuts • Given nodes $u$ and $v$ of graph $G = (V, E)$, a cut is a set of edges $C \subset E$ such that the two nodes are unconnected on the graph $G' = (V, E \setminus C)$. • Using $s$ to denote a “source” node and $t$ to denote a “terminal” node, a cut $(S, T)$ of $G = (V, E)$ is a partition of $V$ into sets $S$ and $T = V \setminus S$, such that $s \in S$ and $t \in T$.
  • 48. Modularity maximization • A graph can be split into communities in numerous ways, i.e. for each graph there are many possible community structures. In the simple case, a community structure is defined as a partition of the graph into a set of node sets $C = \{C_i\}$. • To measure the quality of a community structure, we use modularity. • The modularity maximization method detects communities by searching over possible divisions of a network for one or more that have particularly high modularity. • Modularity quantifies the extent to which a given partition of a graph into communities has systematically more intra-community links than the same community structure would have if the links were rewired under the ER (Erdős–Rényi) graph model.
  • 49. Graph degrees. Node degree: $\deg(v_i) = k_i$ = number of neighbors. In directed graphs, we differentiate between in- and out-degree. Adjacency matrix: $A_{ij}$ = link between nodes $i$ and $j$: 0 → no link; 1 → link; $\alpha$ → link with weight equal to $\alpha$
  • 50. Degrees & Adjacency. Example graph with vertices v1, v2, v3, v4, v5. Adjacency matrix of an undirected graph: $A(i,j)$, $i, j \le n$. Degree of a vertex $v$ (number of edges incident upon it): $k_v = \sum_{w} A(v, w)$
  • 51. Modularity computation • Modularity is computed as follows: $Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$ – $A_{ij}$: adjacency matrix – $k_i$: degree of node $i$ – $c_i$: community of node $i$ – $\delta(c_i, c_j) = 1$ if $i$, $j$ belong to the same community – $m$: number of edges on the graph. The $A_{ij}$ term contributes the observed number of intra-community edges; $\frac{k_i k_j}{2m}$ is the expected number of edges between $i$ and $j$ if edges are placed randomly.
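A direct way to evaluate this formula for an unweighted, undirected graph is sketched below. It uses the fact that the double sum over same-community pairs splits into twice the number of intra-community edges minus the sum over communities of the squared total degree divided by 2m; the example graph (two triangles joined by one edge) and the function name are illustrative assumptions.

```python
def modularity(edges, community):
    """Q for an undirected, unweighted graph given as a list of (u, v) edges."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Observed part: each intra-community edge appears as A_ij and A_ji.
    intra = sum(1 for u, v in edges if community[u] == community[v])
    # Expected part: sum over same-community pairs of k_i * k_j / (2m)
    # equals sum over communities of (total degree)^2 / (2m).
    total_degree = {}
    for node, k in degree.items():
        c = community[node]
        total_degree[c] = total_degree.get(c, 0) + k
    expected = sum(k_c * k_c for k_c in total_degree.values()) / (2 * m)
    return (2 * intra - expected) / (2 * m)

# Two triangles connected by a single bridge edge (2, 3).
example_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}

q_split = modularity(example_edges, two_groups)
q_single = modularity(example_edges, one_group)
```

For this graph the two-triangle partition gives Q = 5/14 (about 0.36), while putting every node in one community gives Q = 0.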
  • 52. Modularity: example • In a random graph (ER model), we expect that any possible partition would lead to Q = 0. • Typically, in non-random graphs modularity takes values between 0.3 and 0.7. Figure: Q = 0.60, clear community structure; Q = 0.37, fuzzy communities
  • 53. Modularity maximization approaches • Exhaustive search over all possible divisions is usually intractable • Algorithms based on approximate optimization – greedy algorithms – simulated annealing – spectral optimization – local-based optimization • Each approach trades off speed against accuracy
  • 54. Other means to define communities • other community-ness measures: – conductance – density • definitions to satisfy – each member should be connected to more nodes within the community than to nodes outside it – each member should be connected to all other members (k-clique) • result of a process – if edges are removed in a certain order, the graph will break into pieces → communities
  • 55. Graph partition • Given a graph $G = (V, E)$, find a partition of $V$ into $k$ disjoint subsets, such that the number of edges in $E$ whose endpoints belong to different subsets is minimized. • Various solutions: Kernighan-Lin algorithm [Kernighan70], spectral bisection [Pothen90]. • Multi-level partition (METIS) [Karypis99]: repeated application of bisection until the graph is partitioned into $k$ parts under constraints on the sizes of the subsets. • Not a satisfactory solution, since the number of communities needs to be provided as input to the algorithm. Sometimes even the community sizes need to be provided as inputs. References: [Kernighan70] B. W. Kernighan, S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal, Vol. 49, No. 2, pp. 291-307, February 1970. [Pothen90] A. Pothen, H. D. Simon and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11: 430-452, 1990. [Karypis99] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20 (1): 359–392, 1999.
  • 56. Taxonomy. S. Papadopoulos, Y. Kompatsiaris, A. Vakali, P. Spyridonos. “Community detection in Social Media”. In Data Mining and Knowledge Discovery, Springer, 2011
  • 57. Subgraph discovery (structure) (1) • k-clique • N-clique • k-core. Figure labels: k=3 (triangle), k=4, k=5; N=2 (star); 0-core, 1-core, 2-core, 3-core, 4-core
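The k-core labels in the figure can be computed by repeatedly peeling off the node of smallest remaining degree. The sketch below is a simple quadratic-time version of that peeling procedure, not an optimized implementation; the example graph (a triangle with one pendant node and one isolated node) is an illustrative assumption.

```python
def core_numbers(adj):
    """Core number of every node: the largest k such that the node is in the k-core."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the node with the smallest remaining degree.
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core

# Triangle 0-1-2, pendant node 3 attached to 2, isolated node 4.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
cores = core_numbers(adjacency)
```

The triangle nodes end up in the 2-core, the pendant node only in the 1-core, and the isolated node only in the 0-core.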
  • 58. Subgraph discovery (2) • (μ,ε)-core: – based on the concept of structural similarity. Figure labels: (μ,ε)-core with μ = 5, ε = 0.72; (μ,ε)-core with μ = 6, ε = 0.675; hub; outlier; percentage of common neighbors for each edge
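Structural similarity and the (μ,ε)-core test can be sketched as below. The sketch assumes the definition used by the SCAN algorithm, where each node's neighborhood includes the node itself and similarity is the size of the shared neighborhood divided by the geometric mean of the two neighborhood sizes; the slide itself only describes it as the percentage of common neighbors, so treat the exact normalization as an assumption.

```python
import math

def structural_similarity(adj, u, v):
    """Shared closed neighborhood, normalized by the geometric mean of sizes."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

def is_core(adj, u, mu, eps):
    """True if at least mu nodes (u included) are eps-similar to u."""
    similar = [v for v in adj[u] | {u}
               if structural_similarity(adj, u, v) >= eps]
    return len(similar) >= mu

# Triangle 0-1-2 with a pendant node 3 attached to node 2 (toy example).
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
sim_01 = structural_similarity(toy_adj, 0, 1)
sim_23 = structural_similarity(toy_adj, 2, 3)
```

Nodes inside the triangle share their whole neighborhood, so they qualify as cores for moderate μ and ε, while the pendant node does not.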
  • 59. Centrality • Betweenness centrality – Being in many shortest paths • Closeness – Being close to many nodes • Eigenvector centrality – End of many paths • Degree centrality – High degree. Sources: https://commons.wikimedia.org/wiki/File:6_centrality_measures.png#/media/File:6_centrality_measures.png; Carlos Castillo, Social Media Mining and Retrieval, http://www.slideshare.net/ChaToX/social-media-mining-and-retrieval
  • 60. Divisive: use of edge centrality • Find edges that stand between communities. • Progressively remove the more “central” edges until the graph breaks into separate communities. • As the graph splitting progresses, new communities emerge that are assigned to a hierarchical structure. • Edge centrality is defined similarly to node centrality: $bc(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}$, where $\sigma_{s,t}(v)$ is the number of shortest paths from node $s$ to $t$ that include node $v$ and $\sigma_{s,t}$ is the total number of shortest paths from $s$ to $t$. Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes. Figure: depiction of node centrality, red (min) → blue (max)
  • 61. Girvan-Newman algorithm • The GN algorithm is one of the most important algorithms, having stimulated a whole wave of community detection methods. • Basic principle: 1. Compute betweenness centrality for each edge. 2. Remove the edge with the highest score. 3. Re-compute all scores. 4. Repeat from step 2. • Complexity: $O(n^3)$ on sparse graphs • Many variations have been presented to improve precision by use of different betweenness measures or to reduce complexity, e.g. by sampling or local computations. Girvan, M., Newman, M.E.J. “Community structure in social and biological networks”. In Proceedings of the National Academy of Sciences of the U.S.A. 99(12), 7821–7826, 2002
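The basic principle can be sketched in plain Python as follows. Edge betweenness is computed with a Brandes-style breadth-first accumulation for unweighted graphs, and the loop stops at the first split instead of building the full hierarchy; both simplifications, the function names and the toy graph are assumptions of this sketch rather than details from the slides. The graph is assumed undirected, given as symmetric adjacency sets without self-loops.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge (unweighted, undirected graph)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # accumulate from the farthest nodes back
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return score

def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts

def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        score = edge_betweenness(adj)                # re-computed every round
        if not score:
            break
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by the bridge edge 2-3.
bridge_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
parts = girvan_newman_split(bridge_graph)
```

In this example the bridge carries every shortest path between the two triangles, so it is removed first and the two triangles are returned as communities. Applying the split recursively to each part would yield the hierarchical structure mentioned above.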
  • 62. Girvan-Newman (example): social network in Zachary’s karate club; hierarchical community structure detected by the algorithm.
  • 63. Visual Event Summarization on Social Media using Topic Modelling and Graph-based Ranking Algorithms
  • 64. Large-scale real world events (1) • Long-running events → consist of several sub-events, e.g. the 10 days of the Sundance Film Festival include opening and awards ceremonies, screenings etc. • Many involved persons use social media → huge amount of event-related micro-blogging messages • A growing number of these messages carry multimedia content • An image in a micro-post can convey a much better impression of the specific moment of the ongoing event
  • 65. Large-scale real world events (2) • #nbafinals → 2.6M tweets in one month • #BaltimoreRiots, 29 April–2 May 2015 → 1.3M tweets in 5 days • E3 conference 2015, 16-18 June: >5M tweets before the conference, 2M tweets during the conference; new game releases → multimedia content
  • 66. Large-scale real world events (3). But… • the huge number of messages makes it very challenging for interested users to monitor the evolution of the event • many messages can be considered spam or non-informative • In the case of multimedia: internet memes, screenshots, images of low quality… • Redundancy due to near-duplicate messages and images
  • 67. Large-scale real world events (4): #nbafinals. Example image categories: irrelevant; duplicates with no explicit association; non-informative
  • 68. Visual Event Summarization. Assumption: an event-related collection is available. Visual Event Summarization is the problem of selecting a concise set of images that are highly relevant to the event and visually capture its key aspects. Diagram: list of all event images → event-based visual summarizer → set of selected representative and diverse images
  • 69. Existing Approaches: Text-based. Radev et al. (2004) • summary consists of messages that are closest to their tf·idf centroid. Erkan et al. (2004), LexRank & Mihalcea et al. (2004), TextRank • finding salient sentences by using the centrality of each sentence in a similarity graph • adapted for multi-document summarization using each message as a sentence • outperforms the naïve centroid-based approach. Shen et al. (2013) • mixture model to detect sub-events at participant level • tf·idf centroid to find a summary of each sub-event. Chakrabarti and Punera (2011) • Hidden Markov Model to obtain a time-based segmentation of tweets • tf·idf centroid to find a summary of each time segment
  • 70. Existing Approaches: Multimedia. Bian et al. (2013) • multimodal extension of LDA • textual and visual features. Lin et al. (2012) • multi-graph of objects capturing visual, textual and temporal proximity • time-ordered sequence of important objects via graph optimization. McParlane et al. (2014), state-of-the-art baseline • visual features + SVM to discard irrelevant images • clustering in subtopics and selection of popular images for each subtopic based on popularity and specificity
  • 71. MGraph: Framework Overview 1. create message multi-graph using textual, visual and temporal proximity 2. find underlying topics using the SCAN algorithm 3. calculate prior scores of images based on topics and popularity (relevance) 4. diversify using DivRank
  • 72. Pre-processing / Filtering. Text-based filtering • heuristic rules for spam filtering → discard very short messages & messages with many mentions, URLs or hashtags • filtering of unstructured messages using POS tagging: Accept → (determiner? adjective* noun+ verb)+. Visual-based filtering • discard small images • detect and discard memes, screenshots and images containing heavy text
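The heuristic spam rules can be sketched as a simple predicate. The slides do not give the actual thresholds, so every numeric limit below is a hypothetical placeholder; the POS-pattern check is omitted because it requires a part-of-speech tagger.

```python
import re

def passes_text_filter(text, min_words=4, max_mentions=3,
                       max_urls=2, max_hashtags=4):
    """Heuristic spam rules; all threshold values are illustrative assumptions."""
    tokens = text.split()
    mentions = sum(t.startswith("@") for t in tokens)
    hashtags = sum(t.startswith("#") for t in tokens)
    urls = len(re.findall(r"https?://\S+", text))
    # Count only ordinary words when checking message length.
    words = [t for t in tokens if not t.startswith(("@", "#", "http"))]
    return (len(words) >= min_words
            and mentions <= max_mentions
            and urls <= max_urls
            and hashtags <= max_hashtags)

kept = passes_text_filter("Canada wins gold in the hockey final #Sochi")
too_short = passes_text_filter("wow #nbafinals")
too_many_mentions = passes_text_filter("@a @b @c @d look at this great photo")
```

Messages that survive this step would then go through the POS-based structure check and the visual filters listed above.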
  • 73. Pre-processing / Filtering. Figure labels: text-based filtering (tweet length, POS tagging filtering); visual-based filtering
  • 74. Multi-graph Generation (1). Given a set of (original) messages $M = \{m_1, m_2, ..., m_n\}$ we construct a multi-graph $G_M = \{V, E_{textual}, E_{visual}, E_{social}, E_{time}\}$ • vertex $v_i \in V$ corresponds to message $m_i$ • $E_{textual}$ → undirected edges expressing the textual similarity (cosine similarity) between nodes (tf·idf vector $v_m$) • $E_{visual}$ → undirected edges that represent the visual similarity (L2 distance) between nodes with images (VLAD+SURF vectors). Thresholding: add an edge in $E_{textual}$ or $E_{visual}$ only if the textual or visual similarity between the corresponding nodes is higher than $th_{textual}$ or $th_{visual}$ respectively
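The construction of the textual edge set can be sketched as follows. The sketch assumes each message is already represented as a sparse term-weight dictionary (standing in for its tf·idf vector); the message ids, weights and the threshold value passed in are illustrative, not taken from the slides.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def textual_edges(vectors, th_textual):
    """Undirected edges between messages whose similarity exceeds the threshold."""
    ids = sorted(vectors)
    return {(i, j)
            for pos, i in enumerate(ids)
            for j in ids[pos + 1:]
            if cosine(vectors[i], vectors[j]) > th_textual}

# Hypothetical term weights for three messages.
messages = {
    "m1": {"sochi": 1.0, "gold": 1.0},
    "m2": {"sochi": 1.0, "gold": 1.0, "canada": 1.0},
    "m3": {"nba": 1.0, "finals": 1.0},
}
e_textual = textual_edges(messages, th_textual=0.6)
```

The visual edge set would be built the same way from image vectors, and keeping the edge sets separate is what makes the result a multi-graph.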
  • 75. Multi-graph Generation (2)
  • 76. Example multi-modal sub-graph
  • 77. Visual deduplication • Visual duplicates for which there is no explicit connection → apply the Clique Percolation Method (CPM) on the sub-graph $G_{visual} = \{V, E_{visual}\}$ • Represent detected cliques as single messages: – VLAD aggregation on SURF descriptors of all images in the clique – mean value of publication time – aggregated value of reposts of each message – merged tf·idf vector • Replace clustered messages in $G_M$ with cliques and re-calculate the corresponding edges
  • 78. Visual deduplication. Figure labels: $G_M$; $G_{visual}$
  • 79. Topic Detection • Apply the Structural Clustering Algorithm for Networks (SCAN) → identify dense sub-graphs of messages in $G_M$ • Sub-graphs represent the topics that exist in the stream of messages • Each topic $i$ contains messages $\{M_i\}$ and is represented as a merged tf·idf vector $V_i$ • A substantial number of messages are left outside the detected clusters – Hubs & outliers are most probably non-informative – They may still include valuable information → also considered in the summarization process as single-item clusters
  • 80. Message Selection Score. Annotated terms of the score: reposts; relevance × cluster size × specificity
  • 81. Specificity. High specificity: rare across all topics of the event. Low specificity: common across topics
  • 82. Image Ranking & Diversification: a variant of PageRank that aims at diversity
  • 83. Dataset and Event Description • dataset of McMinn et al., with more than 500 events from different domains • we used the 50 largest events in terms of tweets • sports events (e.g., the Sochi winter Olympics), political events (Ukraine crisis, Venezuelan protests), disasters, etc. • 364,005 tweets, on average 4,730 tweets/event • 296,160 remaining tweets, due to suspended accounts and deleted messages • about 3.51% of these, i.e. 12,772 tweets, contain an embedded image
  • 84. Relevance Judgments. Each image is shown to 3 participants (20 img-20 part) without ranking information. Task description: You are presented with an image and an event title describing a trending topic in Twitter. For each image and event title, you are asked to answer the following question: Is this image relevant to the event? 1. The image is clearly not relevant to the event. 2. The image is probably not relevant to the event, but I am not entirely sure. 3. The image is somewhat relevant to the event, but I have my doubts on whether I would like to see it in a photo coverage of the event. 4. The image is clearly relevant to the event, and I would like to see it in a photo coverage of the event.
  • 85. Experimental Setting • VLAD+SURF extraction – 64-dimensional SURF descriptors – four codebooks of 128 visual words (in total 512) to quantize each descriptor – aggregate SURF descriptors into a single vector of 64*512 = 32,768 dimensions using the VLAD scheme – PCA to create a 1024-dimensional L2-normalized reduced vector that represents the visual content of the image • Multi-graph generation – k = 500 nearest neighbors – visual and textual similarity thresholds were set to 0.5 and 0.6 – $\sigma^2$ of the temporal kernel was empirically set to 24 hours • SCAN parameters were set to μ=2 and ε=0.65 • DivRank’s damping factor was set to d=0.75
  • 86. Evaluation metrics (1): Precision-oriented metrics • Precision (P@N): the percentage of images among the top N that are relevant (answers 3 & 4) to the corresponding event, averaged over all events. We calculate precision for N equal to 1, 5, and 10. • Success (S@N): percentage of events where there exists at least one relevant image among the top N returned, for N=10. • Mean Reciprocal Rank (MRR): computed as 1/r, where r is the rank of the first relevant image returned, averaged over all events.
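These three metrics can be computed from per-event lists of relevance flags, as in the minimal sketch below. It assumes each ranked image has already been reduced to a single boolean (relevant or not); how the three participants' answers were aggregated into that flag is not stated on the slides, and the two example events are made up for illustration.

```python
def precision_at(flags, n):
    """Fraction of the top-n ranked images that are relevant."""
    return sum(flags[:n]) / n

def success_at(flags, n):
    """1.0 if at least one of the top-n images is relevant, else 0.0."""
    return 1.0 if any(flags[:n]) else 0.0

def reciprocal_rank(flags):
    """1/r for the rank r of the first relevant image, 0.0 if none is relevant."""
    for rank, relevant in enumerate(flags, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def mean(values):
    return sum(values) / len(values)

# Hypothetical ranked relevance flags for two events.
rankings = {
    "event_a": [True, False, True, True, False],
    "event_b": [False, True, False, False, False],
}
p_at_5 = mean([precision_at(f, 5) for f in rankings.values()])
s_at_5 = mean([success_at(f, 5) for f in rankings.values()])
mrr = mean([reciprocal_rank(f) for f in rankings.values()])
```

For these two events the averages are P@5 = 0.4, S@5 = 1.0 and MRR = 0.75.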
  • 87. Evaluation metrics (2): Diversity-oriented metrics • α-normalized Discounted Cumulative Gain: α-nDCG@N measures the usefulness, or gain, of the returned images based on their position in the summary (N=10). • Average Visual Similarity: AVS@N measures the average visual similarity among all pairs of images in the top N selected images, averaged over all events. Lower AVS values are preferable since they imply higher diversity in terms of visual content.
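AVS@N for a single event can be sketched as the mean pairwise similarity of the top N image vectors. The sketch assumes L2-normalized vectors, so that the dot product serves as the similarity; the slides do not state which similarity function was used, and the toy vectors are illustrative.

```python
from itertools import combinations

def dot(a, b):
    """Dot product; equals cosine similarity for L2-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

def avs_at(vectors, n, similarity=dot):
    """Average similarity over all pairs among the top-n image vectors."""
    pairs = list(combinations(vectors[:n], 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical ranked image vectors: two identical images and one different.
ranked_vectors = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
avs_3 = avs_at(ranked_vectors, 3)
avs_2 = avs_at(ranked_vectors, 2)
```

A summary made of near-duplicates pushes the value towards 1, which is why lower values indicate a more diverse selection.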
  • 88. Baselines • Random: randomly selects N images from the filtered set of images as the summary set • MostPopular: picks the N most popular images in terms of reposts • LexRank: uses the items graph $G_M$, ranks the nodes using LexRank and selects the top N nodes that contain images • TopicBased: selects the N most relevant messages from the most significant topics (S_cov) (relevance, no specificity & diversity) • P-TWR: ranks images in descending order using the weighting scheme described in McParlane et al. (popularity) • S-TWR: groups the tweets of each event into sub-clusters and selects the highest ranked item of each cluster using the previous weighting scheme (specificity)
  • 89. Results (1): Precision-oriented metrics • MGraph outperforms all of the competing methods • The popularity-based approach performs well for P@1 but drops significantly for N=5,10 • LexRank and TopicBased approaches achieve lower but more steady results. Figure annotation: first relevant in positions 1-2
  • 90. Results: Canada Team in #Sochi. Compared summaries: popularity-based; S-TWR; MGraph
  • 91. Results (2): Diversity-oriented metrics • MGraph achieves the best score for α-nDCG@10 • Best values of AVS are achieved by S-TWR • The worst results in terms of AVS are obtained using LexRank
  • 92. Results (3): Performance of MGraph across different categories • Best P@10 is obtained for events about Science & Technology • The second best P@10 is obtained for events about Arts & Entertainment – Difficult to diversify • The best value of AVS is achieved for events about disasters & accidents, e.g. earthquakes
  • 93. Results (4): Impact of the damping factor d on P@10, S@5, MRR and α-nDCG@10 • The worst results for all metrics are obtained for d=0 (no re-ranking) • The best results are achieved for 0.7<d<0.8 • slight decrease for d>0.8 • more diverse → less relevant
  • 94. Conclusions • Graph-based approach for visual summaries of real-world events • Maximizes relevance and diversity • Multimodal approach taking into account: – Textual content – Visual content – Social: interactions (replies), popularity – Time • Introduction of user related features (e.g. influence)
  • 95. Monitoring and intelligence system for Web multimedia verification
  • 96. Can multimedia on the Web be trusted?
      Real photo captured April 2011 by WSJ, but heavily tweeted during Hurricane Sandy (29 Oct 2012)
      Tweeted by multiple sources & retweeted multiple times
      Original online at: http://blogs.wsj.com/metropolis/2011/04/28/weather-journal-clouds-gathered-but-no-tornado-damage/
  • 97. The Problem
      •  Everyone can easily publish content on the Web
      •  Content can be easily repurposed and manipulated
      •  News outlets are competing for views and clicks → pressure to air stories very quickly leaves very little room for verification → very often, even well-reputed news providers fall for fake news content
      •  Multiple tools and services are available for individual tasks → complex verification process
      It is very hard and time-consuming to check the veracity of Web multimedia
  • 98. Media REVEALr
      •  Developed within the REVEAL project: http://revealproject.eu/
      •  Framework for collecting, indexing and browsing multimedia content from the Web and social media
      •  Support for verification:
         –  Near-duplicate detection against an indexed collection
         –  Clustering of social media posts by visual similarity → comparative view of the same incident
         –  Aggregation and visualization of Named Entities around an incident
  • 99. Related Work
      •  The majority of works have focused on the problem of topic detection and summarization:
         –  TwitInfo (Marcus et al., 2011)
         –  Twittermonitor (Mathioudakis & Koudas, 2010)
         –  Meme detection & prediction (Weng et al., 2014)
      •  Visual memes and clustering
         –  Visual meme tracking (Xie et al., 2011)
         –  Supervised multimodal clustering (Petkos et al., 2012)
      •  Image manipulation tracking
         –  Internet image archaeology (Kennedy & Chang, 2008)
  • 100. Overview of Media REVEALr
      [Architecture diagram, layers as listed on the slide, with separate TEXT and VISUAL tracks:]
      •  Media collection
      •  Media pre-processing & feature extraction
      •  Media analysis, mining & indexing
      •  Persistence (storage, indexing)
      •  Access (API)
      •  Visualization, front-end
  • 101. Named Entity Detection
      •  The brevity and noisy nature of text in social media pose a serious challenge
      •  Employed solution:
         –  Pre-processing: tokenization, user mention resolution, text cleaning
         –  Stanford NER + user mention resolution
         –  Regular expressions to remove special characters and symbols (e.g., #, @, URLs, etc.)
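The text-cleaning step described above can be sketched with a few regular expressions. This is an illustrative sketch only: the exact patterns used in Media REVEALr are not given in the slides, and the `known_users` lookup table for user mention resolution is a hypothetical stand-in.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@(\w+)")
# Anything that is not a word character, whitespace or basic punctuation.
SYMBOL_RE = re.compile(r"[^\w\s.,!?'-]")


def clean_tweet(text, known_users=None):
    """Strip URLs and symbols, and resolve @mentions via a lookup table."""
    known_users = known_users or {}  # hypothetical: screen name -> real name
    text = URL_RE.sub(" ", text)
    # Replace "@handle" with the resolved name, or with the bare handle.
    text = MENTION_RE.sub(
        lambda m: known_users.get(m.group(1).lower(), m.group(1)), text
    )
    text = SYMBOL_RE.sub(" ", text)  # drops '#', ':' and similar symbols
    return " ".join(text.split())   # collapse repeated whitespace
```

The cleaned text would then be passed to the NER tagger, which tends to work better on text that resembles ordinary prose.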
  • 102. Visual Indexing
      •  Content-based image retrieval to solve the Near-Duplicate Search (NDS) problem
      •  Based on local descriptors (SURF), aggregation (VLAD), dimensionality reduction (PCA), quantization (PQ) and indexing (IVFADC)
      •  State-of-the-art visual similarity search
         –  High precision/recall
         –  Very efficient and scalable implementation (search many millions of images in a few msec, maintain the full index in memory using ~1GB/10M images)
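The aggregation step (VLAD) in the pipeline above can be illustrated in a few lines. This is a simplified pure-Python sketch under stated assumptions: descriptors and centroids are plain lists of floats, the codebook is given, and the subsequent PCA, PQ and IVFADC stages are omitted. It is not the implementation used in the system.

```python
import math


def vlad(descriptors, centroids):
    """Aggregate local descriptors into one L2-normalized VLAD vector."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for d in descriptors:
        # Assign the descriptor to its nearest centroid (visual word).
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((x - c) ** 2 for x, c in zip(d, centroids[k])),
        )
        # Accumulate the residual: descriptor minus assigned centroid.
        for i in range(dim):
            residuals[nearest][i] += d[i] - centroids[nearest][i]
    # Concatenate the per-centroid residual sums and L2-normalize.
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat
```

The resulting vector has length (number of centroids) × (descriptor dimension), which is why compression matters: the quoted footprint of ~1GB per 10M images works out to roughly 100 bytes per image.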
  • 103. Improving NDS Resilience (NDS+)
      •  Often, NDS performance suffers from overlay graphics and fonts
      •  To address this issue, we integrate a descriptor-level classifier that tries to remove the font/graphic descriptors from the VLAD vector
  • 104. Example: Filtering Out Font Descriptors
      •  Assuming that in most cases the classifier is correct, the resulting VLAD vector is of much higher quality compared to the one without filtering
  • 105. Classifier Details
      •  Random Forest used as base classifier
      •  Cost-sensitive meta-classifier to penalize misclassification of True Positives
      •  Challenge due to class imbalance (overlay descriptors << useful image content descriptors)
         –  The cost-sensitive meta-classifier performs over-sampling of the minority class to balance the training set
      •  Training set created by collecting images with overlays (e.g., memes) from the Web and manually annotating them (selecting areas with fonts/overlays)
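Over-sampling the minority class, as mentioned above, can be sketched generically as follows. This is plain random over-sampling with replacement, shown only to illustrate the balancing idea; it is not the cost-sensitive meta-classifier used by the authors, and the class labels in the usage note are hypothetical.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies with replacement to reach the majority size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

For example, a training set with 8 "content" descriptors and 2 "overlay" descriptors would come back with 8 of each, the overlay ones being repeated.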
  • 106. Mining: Clustering and Aggregation
      •  Visual aggregation
         –  DBSCAN on the visual feature representation (PCA-reduced VLAD vectors)
         –  Representative element (tweet) of each cluster selected as the one with the largest number of keywords (expected to carry more information)
      •  Entity aggregation
         –  NER on individual items
         –  Entity categorization (→ Persons, Locations, Organizations)
         –  Entity ranking based on frequency of occurrence
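The two aggregation rules above (frequency-based entity ranking, and picking the tweet with the most keywords as a cluster representative) can be sketched as follows. The data layout and the whitespace tokenization are assumptions made for illustration; the DBSCAN step itself is not shown.

```python
from collections import Counter


def rank_entities(items):
    """items: one list of (entity, type) pairs per post.

    Returns {type: [(entity, count), ...]} sorted by descending frequency.
    """
    counts = {}
    for entities in items:
        for name, etype in entities:
            counts.setdefault(etype, Counter())[name] += 1
    return {etype: c.most_common() for etype, c in counts.items()}


def pick_representative(cluster, keywords):
    """Return the tweet text containing the largest number of distinct keywords."""
    def keyword_count(text):
        return len(set(text.lower().split()) & keywords)
    return max(cluster, key=keyword_count)
```

Ties between equally frequent entities keep their first-seen order, and ties between tweets resolve to the earliest one in the cluster.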
  • 107. User Interface: Collections View
  • 108. User Interface: Items View & Search
  • 109. User Interface: Clusters View
  • 110. User Interface: Entities View
  • 111. Evaluation: NER
      •  Manual annotation of 400 tweets from the SNOW Data Challenge dataset (Papadopoulos et al., 2014)
      •  Measure: Accuracy → an instance is considered correct when both entity and type are correctly identified
      •  Three competing solutions:
         –  Base Stanford NER (S-NER)
         –  S-NER + Extensions/Post-processing (S-NER+)
         –  Ellogon library (http://www.ellogon.org)
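The accuracy rule stated above (an instance counts as correct only when both the entity and its type match) can be written down directly. This is a sketch under the assumption that accuracy is computed over the manually annotated instances; the slides do not spell out the denominator.

```python
def ner_accuracy(gold, predicted):
    """gold, predicted: one list of (entity, type) pairs per tweet, aligned by index."""
    total = correct = 0
    for gold_entities, predicted_entities in zip(gold, predicted):
        predicted_set = set(predicted_entities)
        for instance in gold_entities:
            total += 1
            # Correct only if the same entity text AND the same type were predicted.
            if instance in predicted_set:
                correct += 1
    return correct / total if total else 0.0
```

An entity found with the wrong type (for example a location tagged as an organization) therefore counts as an error.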
  • 112. Evaluation: NDS
      •  Benchmark datasets
         –  Holidays: 1,491 images, 500 queries (Jegou et al., 2008)
         –  Oxford: 5,063 images, 55 queries (Philbin et al., 2008)
         –  Paris: 6,412 images, 55 queries (Philbin et al., 2008)
      •  Accuracy: mean Average Precision (mAP)
      [Result tables: clean dataset, noisy dataset]
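Mean Average Precision, the accuracy measure named above, can be sketched as follows. This is the textbook definition; benchmark-specific conventions (such as how the query image itself or "junk" images are handled) are not described in the slides and are left out.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(queries):
    """queries: list of (ranked result list, set of relevant items) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)
```

Because every relevant image contributes a term, relevant images that are never retrieved lower the score just as badly ranked ones do.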
  • 113. Evaluation: NDS
      •  Execution time (msec)
      •  Example
      [Figure: indexed image and query image; NDS: #27, NDS+: #1]
  • 114. Use Cases: Real-world Datasets
      [Table of datasets: sandy, boston, malaysia, ferry]
  • 115. NDS Use Case (boston)
  • 116. Clustering Use Case (boston)
      •  Visual clustering enables a comparative view and analysis over time (in this case showing increasing confidence in the picture).
      •  When journalists see many similar photos of the same scene, they have more confidence that it is real and not fabricated.
  • 117. Entity Aggregation Use Case (snow)
      [Figure panels: LOCATIONS, PERSONS, ORGANIZATIONS]
  • 118. Conclusion
      •  Key contributions
         –  Framework and web application offering valuable verification support for Web multimedia
         –  High-quality individual components for NER, NDS, clustering and aggregation
      •  Future work
         –  Incremental image clustering
         –  Temporal views to explore the evolution of a story
         –  Multimedia forensics toolbox (splice, copy-move detection)
  • 119. Computational Verification in Social Media
      •  Create a computational verification framework to classify tweets with unreliable media content.
      •  Events used for experimentation:
         –  Fake images posted during the Hurricane Sandy natural disaster
         –  Fake images posted during the Boston Marathon bombings
  • 120. Methodology
      1. Tweet extraction
         •  Use Topsy to collect tweets containing certain keywords
      2. Image indexing
         •  Create a predefined set of verified fake and real images
         •  Keep the tweets with identical or near-duplicate images
      3. Feature extraction
         •  Extract Content features, User features and their combination for each collected tweet
      4. Dataset
         •  Annotate each tweet as fake or real based on the image
         •  Keep only tweets written in English, Spanish or German
      5. Classification
         •  Test using a cross-validation approach
         •  Test using the two distinct datasets
         •  Test using different training and testing datasets
      Content features
         •  Length of the tweet
         •  Number of words
         •  Whether it contains exclamation marks, and how many
         •  Whether it contains quotation marks, and how many
         •  Whether the text contains an emoticon (happy or sad)
         •  Number of uppercase characters
         •  Number of hashtags
         •  Number of mentions
         •  Number of pronouns
         •  Number of URLs
         •  Number of sentiment words
         •  Number of retweets
      User features
         •  Username
         •  Number of friends
         •  Number of followers
         •  Ratio of followers to friends
         •  Number of times the user was listed
         •  Whether the user's status contains a URL
         •  Whether the user is verified or not
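Several of the content features listed above can be computed from the tweet text alone, as in the sketch below. The regular expressions are assumptions chosen for illustration. Pronoun and sentiment-word counts (which need word lists) and the retweet count (which comes from tweet metadata) are deliberately left out.

```python
import re


def content_features(text):
    """Compute text-only content features for one tweet."""
    return {
        "length": len(text),
        "num_words": len(text.split()),
        "num_exclamation": text.count("!"),
        "num_quotation": text.count('"'),
        # Simple emoticon patterns such as :) ;-) :D and :( :-(
        "has_happy_emoticon": bool(re.search(r"[:;]-?[)D]", text)),
        "has_sad_emoticon": bool(re.search(r":-?\(", text)),
        "num_uppercase": sum(1 for ch in text if ch.isupper()),
        "num_hashtags": len(re.findall(r"#\w+", text)),
        "num_mentions": len(re.findall(r"@\w+", text)),
        "num_urls": len(re.findall(r"https?://\S+", text)),
    }
```

The resulting dictionary can be flattened into a feature vector and, if needed, concatenated with the user features before it is passed to a classifier.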
  • 121. Results
      •  Tweet statistics

         Event             Tweets with URLs   Tweets with fake images   Tweets with real images
         Hurricane Sandy   343939             10758                     3540
         Boston Marathon   112449             281                       460

      •  Approaches: detection accuracy using the cross-validation approach, classified correctly (%)

         Hurricane Sandy
         Classifier      Content features   User features   Total features
         J48 tree        81.41              67.72           80.68
         KStar           81.28              71.16           81.38
         Random Forest   80.59              70.15           80.94

         Boston Marathon
         Classifier      Content features   User features   Total features
         J48 tree        76.45              70.81           81.25
         KStar           81.28              74.12           75.78
         Random Forest   78.59              76.15           79.10
  • 122. Results (2)
      Detection accuracy using different training and testing sets in Hurricane Sandy, classified correctly (%)

         Classifier      Content features   User features   Total features
         J48 tree        73.79              51.06           65.06
         KStar           75.30              62.29           53.31
         Random Forest   74.02              63.10           65.96

      Detection accuracy using Hurricane Sandy for training and Boston Marathon for testing, classified correctly (%)

         Classifier      Content features   User features   Total features
         J48 tree        55.05              50.12           54.10
         KStar           50.01              50.10           50.97
         Random Forest   58.75              51.03           58.78
  • 123. Other approaches
      •  Graph-based multimodal clustering for social event detection in large collections of images
         –  Automatic organization of a multimedia collection into groups of items, each of which corresponds to a distinct event
      •  Unsupervised concept detection learning using social media as training data
      •  Text analysis for entity matching and sentiment analysis
      •  Placing images based on content features
      •  Retrieving diverse images for the same entity
  • 124. Demos - Applications
      MM News Demo
      ClustTour
      ThessFest
  • 125. Multimedia Demo
  • 126. Multimedia Demo Architecture
      [Architecture diagram. Components shown:]
      •  StreamManager (sources: Twitter, Facebook, Flickr, YouTube, RSS, Instagram)
      •  MongoDBWrapper
      •  TextIndexer (Solr)
      •  MediaFetcher, FeatureExtractor (HDFS)
      •  Social Focused Crawler (HDFS)
      •  FeatureIndexer (HDFS)
      •  Data Mining (Visual Clust., Geo Clust., Statistics)
      •  Web server, with API (1), API (2), API (3), API (4)
      [Technology labels in the diagram: Nutch, VLAD, IVFADC. Hosts shown: 160.xx.xx.58, 160.xx.xx.107, 160.xx.xx.116, 160.xx.xx.187, 160.xx.xx.191, 160.xx.xx.207]
  • 127. MongoDB
      Document-oriented database → JSON support
      Current stable version: 3.0.6    https://www.mongodb.org/
      •  Flexible data model → schemaless, useful for social media data that change over time
      •  Horizontal scaling via shards and replica sets
      •  Storage of social media items as JSON objects → millions of documents can be handled
      •  A number of different index types → single field, compound, multikey indexes
         Example: store Facebook posts and index them by publication time and number of likes
         Query: get the most recent posts sorted by popularity (#likes)
      •  Native support of map-reduce jobs → get the most shared images in a collection of tweets
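The index-and-query example above could look roughly like this with the `pymongo` driver. It is a hedged sketch: the field names `publication_time` and `likes`, and the database and collection names in the usage note, are hypothetical, and the function expects an already connected pymongo `Collection`.

```python
from datetime import datetime

# pymongo uses 1 for ascending and -1 for descending order.
DESCENDING = -1

# Compound index on publication time and number of likes (hypothetical field names).
POST_INDEX = [("publication_time", DESCENDING), ("likes", DESCENDING)]


def recent_popular_query(since):
    """Build the filter and sort spec for 'recent posts sorted by popularity'."""
    return {"publication_time": {"$gte": since}}, [("likes", DESCENDING)]


def fetch_recent_popular(collection, since, limit=10):
    """Run the query on a pymongo Collection (needs a running MongoDB server)."""
    collection.create_index(POST_INDEX)
    query, sort_spec = recent_popular_query(since)
    return list(collection.find(query).sort(sort_spec).limit(limit))
```

A call such as `fetch_recent_popular(MongoClient()["social"]["facebook_posts"], datetime(2015, 9, 1))` would return up to ten posts published since that date, most liked first.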
  • 128. Apache Solr
      Full-text search platform built on top of Apache Lucene
      Current version: 5.3.0    http://lucene.apache.org/solr/
      Indexing of social media items, e.g. tweets, FB posts, metadata of YouTube videos etc.
      Additional features
      •  Faceted search and filtering → get top N per field, e.g. users
      •  Spatial index & search → very useful for geo-tagged documents, e.g. tweets
      •  Plugin-based architecture → language detection, NLP etc. as steps of the indexing pipeline
      Example: get tweets containing the name "Barack Obama" OR the phrase "us elections", with geo-location around New York
      SolrCloud → cluster of Solr instances
      •  Automatic load balancing and fail-over for queries
      •  ZooKeeper integration for cluster coordination and configuration
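The example query above can be expressed as Solr request parameters: a phrase query combined with OR, plus a `geofilt` spatial filter. In this sketch the field names `text` and `location`, the 50 km radius and the New York coordinates are assumptions for illustration; only the parameter construction is shown, not the HTTP call.

```python
from urllib.parse import urlencode


def build_solr_query(phrases, lat, lon, radius_km,
                     text_field="text", geo_field="location", rows=10):
    """Build Solr parameters: quoted phrases OR-ed together, filtered by distance."""
    quoted = " OR ".join('"{}"'.format(p) for p in phrases)
    return {
        "q": "{}:({})".format(text_field, quoted),
        # geofilt keeps documents within radius_km of the given point.
        "fq": "{{!geofilt sfield={} pt={},{} d={}}}".format(
            geo_field, lat, lon, radius_km
        ),
        "rows": rows,
        "wt": "json",
    }


params = build_solr_query(["Barack Obama", "us elections"], 40.7128, -74.0060, 50)
query_string = urlencode(params)  # append to the collection's /select URL
```

Putting the spatial restriction in `fq` rather than in `q` keeps it out of relevance scoring and lets Solr cache the filter independently of the text query.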
  • 129. Storm
      Distributed real-time computation system    https://storm.apache.org
      •  Topologies → processing logic
      •  Stream: unbounded sequence of tuples, e.g. tweets or URLs
      •  Spouts: sources of streams
      •  Bolts: processing, filtering, etc.
      Processing of URLs shared in social media → Storm pipeline
      •  Expand short URLs
      •  Fetch new URLs
      •  Extract content, e.g. articles and images
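The spout/bolt structure above can be mimicked with plain Python generators to show how tuples flow through such a pipeline. This is an analogy only: it does not use the Storm API, it runs in a single process, and the `resolver` function standing in for short-URL expansion is hypothetical. Fetching and content extraction are omitted because they need network access.

```python
import re


def url_spout(tweets):
    """Spout: emit every URL found in a stream of tweet texts."""
    for text in tweets:
        for url in re.findall(r"https?://\S+", text):
            yield url


def expand_bolt(urls, resolver):
    """Bolt: replace each short URL with its expanded form."""
    for url in urls:
        yield resolver(url)


def new_url_bolt(urls):
    """Bolt: pass on only URLs that have not been seen before."""
    seen = set()
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url


def run_topology(tweets, resolver):
    """Wire spout and bolts together and collect the URLs left to fetch."""
    return list(new_url_bolt(expand_bolt(url_spout(tweets), resolver)))
```

In a real topology each stage would run as many parallel tasks across a cluster, with Storm handling the routing of tuples between them.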
  • 130. Redis
      Key-value cache and store
      Current stable version: 3.0    http://redis.io/
      •  Partitioning → distribution of data among multiple Redis instances
      •  Keys can contain strings, hashes, lists, sets, sorted sets, etc.
      •  Atomic operations: set, increment, push etc.
      Example: store the crawling status of URLs and sharing information of URLs and images
      Additional feature
      •  Implementation of the Publisher/Subscriber pattern
      •  Communication between different components in a system for social media analytics
  • 131. City profile creation (ClustTour)
      [Pipeline diagram. Stages shown:]
      •  PHOTOS & METADATA (example: tags: sagrada familia, cathedral, barcelona; taken: 12 May 2009; lat: 41.4036, lon: 2.1743)
      •  SPATIAL CLUSTERING + TEMPORAL ANALYSIS
      •  COMMUNITY DETECTION (VISUAL, TAG, HYBRID)
      •  CLASSIFICATION TO LANDMARKS/EVENTS (based on #users / #photos and duration; examples: [2 years, 50 users / 120 photos], [1 day, 2 users / 10 photos])
      S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, A. Vakali. "Cluster-based Landmark and Event Detection on Tagged Photo Collections". In IEEE Multimedia Magazine 18(1), pp. 52-63, 2011
  • 132. City profile creation (ClustTour)
      Community detection on image similarity graphs
      •  Nodes: photos
      •  Edges: visual and tag similarity
  • 134. ThessFest
      •  Thessaloniki International Film Festival
      •  Support for Twitter/comment usage within the app
      •  Ratings and comments per film
      •  Feedback aggregation
         –  Votes
         –  Tweets
      •  Real-time feedback to the organisation and visitors
  • 135. Fête de la Musique Berlin app
      •  FETEberlin in App Store and Google Play
      •  More than 100K visitors
      •  About 5K musicians
      •  More than 5K app downloads, 25K sessions
      App features
      •  Browse and filter the detailed program
      •  Interactive maps and routing
      •  Social sharing
      •  Artists' and stages' details
      •  Social monitoring
      Main benefits for attendees
      •  Visitors can browse through maps and don't get lost, as the stages are numerous
      •  The event schedule is always available, also per stage
         –  Very useful when the server was down and there was no access to the online schedule
  • 136. Topic analysis
      •  Top-10 topics
      •  Manual inspection of clusters:
         –  53.8% of topic titles considered informative
         –  98.5% of clusters were found to be "clean"
      •  Topics in time
  • 137. Other Application Areas
      •  Science
         –  Sociology, machine learning (machine as a teacher), computer vision (annotation)
      •  Tourism – Leisure – Culture
         –  Off-the-beaten-path POI extraction
      •  Marketing
         –  Brand monitoring, personalised ads
      •  Prediction
         –  Politics: election results
      •  News
         –  Topics, trends, event detection
      •  Others
         –  Environment, emergency response, energy saving, etc.
  • 138. Reusable results
      •  Starting point: http://www.socialsensor.eu/results
         –  Deliverables
         –  Publications
         –  Datasets
         –  Software
         –  e-letter: http://stcsn.ieee.net/e-letter/vol-1-no-3
      •  Open-source projects (Apache License v2): https://github.com/socialsensor
         –  Data collection (stream-manager, storm-focused-crawler)
         –  Indexing (framework-client, multimedia-indexing)
         –  Mining (topic-detection, multimedia-analysis, community-evolution-analysis, social-event-detection)
  • 139. Benchmarking - Datasets
  • 140. dataset: SNOW 2014 Data Challenge
      •  A set of ~1M tweets collected using a list of 5000 UK-focused "news hounds" and the keywords "Syria", "terror", "Ukraine", and "bitcoin" for a period of 24 hours starting from Feb 25, 18:00.
      •  Average rate: ~720 tweets/minute
      •  Number of unique Twitter accounts: ~556K
      •  Number of retweets: ~648K
      •  Number of replies: ~135K
      •  Ground truth topics: http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755
  • 141. Overview of Challenge
      •  Goal: detection of newsworthy topics in a large and noisy set of tweets
      •  Topic: a news story represented by a headline + tags + representative tweets + representative images (optional)
      •  Newsworthy: a topic that ends up being covered by at least some major online news sources
      •  Topics are detected per timeslot (small equally-sized time intervals)
      •  We want a maximum number of topics per timeslot
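Assigning tweets to equally-sized timeslots, as described above, reduces to integer division on timestamps. The 15-minute slot length used here is an inference, not a stated figure: elsewhere the deck mentions a 24-hour test collection and 96 timeslots, and 24 h / 96 = 15 min. The 2014 start date likewise comes from the challenge activity log.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SLOT = timedelta(minutes=15)  # inferred: 24 hours / 96 timeslots


def timeslot_index(timestamp, start, slot=SLOT):
    """Zero-based index of the timeslot that contains the timestamp."""
    return (timestamp - start) // slot


def group_by_timeslot(tweets, start, slot=SLOT):
    """tweets: iterable of (timestamp, text). Returns {slot index: [texts]}."""
    slots = defaultdict(list)
    for timestamp, text in tweets:
        slots[timeslot_index(timestamp, start, slot)].append(text)
    return dict(slots)
```

A topic detector would then be run once per slot on the tweets in that slot.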
  • 142. Challenge Activity Log
      •  Challenge definition (Dec 2013)
      •  Challenge toolkit and registration (Jan 20, 2014)
      •  Development dataset collection (Feb 3, 2014)
      •  Rehearsal dataset collection (Feb 17, 2014)
      •  Test dataset collection (Feb 25, 2014)
      •  Results submission (Mar 4, 2014)
      •  Paper submission (Mar 9, 2014)
      •  Results evaluation (Mar 5-18, 2014)
      •  Workshop (Apr 7, 2014)
  • 143. Some statistics
      •  Registered participants: 25
         –  India: 4, Belgium: 3, Germany: 3, UK: 3, Greece: 3, Ireland: 2, USA: 2, France: 2, Italy: 1, Spain: 1, Russia: 1
      •  Participants that signed the Challenge agreement: 19
      •  Participants that submitted results: 11
      •  Participants that submitted papers: 9
  • 144. Evaluation Protocol
      •  Defined several evaluation criteria:
         –  Newsworthiness → Precision/Recall, F-score
         –  Readability → scale [1-5]
         –  Coherence → scale [1-5]
         –  Diversity → scale [1-5]
      •  List of reference topics
      •  Set up precise evaluation guidelines
      •  Blind evaluation (i.e. the evaluator is not aware of which method a topic comes from) based on a Web UI
      •  Participants submitted topics for 96 timeslots, but manual evaluation happened for 5 sample timeslots.
      •  Result validation and analysis
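The newsworthiness criterion above is scored with precision, recall and F-score against the list of reference topics. A minimal sketch follows, under the assumption that each detected topic has already been matched to a reference topic identifier (a matching step that, per the protocol, was carried out by human evaluators).

```python
def precision_recall_f1(detected, reference):
    """detected, reference: collections of topic identifiers after matching."""
    detected, reference = set(detected), set(reference)
    matched = len(detected & reference)
    precision = matched / len(detected) if detected else 0.0
    recall = matched / len(reference) if reference else 0.0
    # F-score: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With four detected topics of which two match a list of three reference topics, precision is 0.5, recall is 2/3 and the F-score is 4/7.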
  • 145. social event detection
  • 146. a bit of background...
      •  MediaEval
         –  well-known benchmarking activity since 2010 (started as VideoCLEF in 2008)
         –  consists of several tasks dedicated to specific challenges
      •  social event detection (SED)
         –  first run in 2011 (7 participants)
  • 147. task definition & dataset
      •  2011
         collection: 73,645 Flickr photos from five cities, May 2009
         find events related to two target categories
         > soccer matches in Barcelona and Rome
         > concerts in venues Paradiso and Parc del Forum
      •  2012
         collection: 167,332 Flickr photos from five cities, 2009-2011
         find events related to three target categories
         > technical events (e.g. exhibitions, fairs) in Germany
         > soccer events in Hamburg and Madrid
         > Indignados movement in Madrid
      •  2013
         collection 1: 437,370 Flickr photos + 1,327 YouTube videos
         collection 2: 57,165 Instagram photos
         cluster collection 1 into events (attach YouTube videos to them)
         categorize collection 2 images into eight event types or non-event
      [Slide labels: variant 1, variant 4, variant 4]
  • 148. sed2012: evaluation setup
      •  ground truth: photos clustered around 149 events (18 technical, 79 soccer, 52 Indignados)
      •  assess the following aspects:
         –  accuracy of same-event classification
         –  compare clustering quality between item-to-cluster and the two versions of item-to-item (batch & incremental)
         –  measure contributions of different features
         –  study generalization abilities of the same-event model
  • 149. evaluation: main caveat
      •  the creation strategy of a benchmark dataset can dramatically affect how hard (or easy) the problem is
         –  if events are very sparsely distributed over time, then a simple time-based clustering could be sufficient
         –  if events correspond to users one-to-one, then a simple user-based look-up could yield very high accuracy
         –  using the same source for training/testing makes it easy
      •  need to explore new challenging settings
         –  multiple sources of multimedia
         –  huge amounts of non-event content
         –  very dense coverage of feature space by test events
  • 150. Conclusions