ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
  
Motivation
•  Linked Open Data (LOD)
•  Linking entities from text to LOD
– Rich snippets
– Entity-centric search
•  Linking
– Algorithmic
– Manual (NYT articles)
[Figure: HTML+RDFa pages]
22-Apr-12  Gianluca Demartini, eXascale Infolab
  
  
[Figure: LOD graph fragment with the nodes http://dbpedia.org/resource/Facebook, http://dbpedia.org/resource/Instagram, fbase:Instagram, Google, and Android, and an owl:sameAs edge]
  
HTML:
<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>

RDFa enrichment:
<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>
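The enrichment step shown on this slide can be sketched in Python as follows. This is only an illustration under assumptions: the helper name `annotate_paragraph` is hypothetical, the entity mention and its URI are taken as already known, and each paragraph is assumed to carry a single entity. It is not ZenCrowd's actual implementation.

```python
from html import escape


def annotate_paragraph(text: str, mention: str, uri: str) -> str:
    """Wrap a paragraph in an RDFa span about `uri` and mark the first
    occurrence of `mention` as its rdfs:label (hypothetical helper)."""
    before, found, after = text.partition(mention)
    if not found:
        # Mention not present: return the plain paragraph unchanged.
        return f"<p>{escape(text)}</p>"
    label = f'<cite property="rdfs:label">{escape(mention)}</cite>'
    body = f"{escape(before)}{label}{escape(after)}"
    return f'<p><span about="{escape(uri, quote=True)}">{body}</span></p>'


html = annotate_paragraph(
    "Facebook is not waiting for its initial public offering "
    "to make its first big purchase.",
    "Facebook",
    "http://dbpedia.org/resource/Facebook",
)
print(html)
```

The span carries the subject URI in `about`, and the `cite` element marks the surface form as the label, mirroring the markup on the slide.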
  
Crowdsourcing
•  Exploit human intelligence to solve
– Tasks simple for humans, complex for machines
– With a large number of humans (the Crowd)
– Small problems: micro-tasks (Amazon MTurk)
•  Examples
– Wikipedia, Image tagging
•  Incentives
– Financial, fun, visibility
  
  
ZenCrowd
•  Combine both algorithmic and manual linking
•  Automate manual linking via crowdsourcing
•  Dynamically assess human workers with a probabilistic reasoning framework
  
  
[Figure: Crowd, Algorithms, Machines]
  
Outline
•  Related approaches
•  ZenCrowd: System architecture
•  Probabilistic model to combine automatic matching and crowdsourcing results
•  Experimental evaluation
  
  
Related Approaches
•  Entity Linking
•  Ad-hoc Object Retrieval
– Keyword queries
– Looking for a specific entity
– IR indexing and ranking over RDF data
•  Crowdsourcing
– Training set, tagging, annotation
– IR evaluation
– CrowdDB
  
  
ZenCrowd Architecture
[Figure: ZenCrowd architecture. Input: HTML pages. Output: HTML+RDFa pages. Components inside ZenCrowd: Entity Extractors, Algorithmic Matchers, LOD Index (Get Entity), Probabilistic Network, Decision Engine, Micro-Task Manager. External elements: LOD Open Data Cloud, and the Crowdsourcing Platform, which receives micro matching tasks and returns workers' decisions.]
  
Algorithmic Matching
•  Inverted index over LOD entities
– DBPedia, Freebase, Geonames, NYT
•  TF-IDF (IR ranking function)
•  Top ranked URIs linked to entities in docs
•  Threshold on the ranking function or top N
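The matching steps above can be sketched as a toy inverted index with a TF-IDF score, a score threshold, and a top-N cut. Everything here is an assumption made for illustration: the tokenizer, the exact TF-IDF variant (raw term frequency times a smoothed IDF, without length normalization), the function names, and the three-entity index are not taken from the talk.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def build_index(entities):
    """entities: dict mapping URI -> label text.
    Returns (inverted index, number of indexed entities)."""
    index = defaultdict(dict)  # term -> {uri: term frequency}
    for uri, label in entities.items():
        for term, tf in Counter(tokenize(label)).items():
            index[term][uri] = tf
    return index, len(entities)


def rank(query, index, n_docs, top_n=3, threshold=0.0):
    """Score candidate URIs by summed TF-IDF over the query terms,
    keep those scoring above `threshold`, and return at most `top_n`."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for uri, tf in postings.items():
            scores[uri] += tf * idf
    # Highest score first; ties broken alphabetically by URI.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(uri, s) for uri, s in ranked if s > threshold][:top_n]


# Toy index (illustrative labels only).
entities = {
    "http://dbpedia.org/resource/Instagram": "Instagram",
    "http://dbpedia.org/resource/Facebook": "Facebook",
    "http://dbpedia.org/resource/Facebook_Platform": "Facebook Platform",
}
index, n_docs = build_index(entities)
candidates = rank("Facebook", index, n_docs, top_n=2)
```

Raising `threshold` favors precision, while raising `top_n` favors recall, which is the trade-off the last bullet refers to.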
  
  
Entity Factor Graphs
•  Graph components
– Workers, links, clicks
– Prior probabilities
– Link Factors
– Constraints
•  Probabilistic Inference
– Select all links with posterior prob > τ
[Figure: example entity factor graph with 2 workers (w1, w2), 6 clicks (c11, c12, c13, c21, c22, c23), and 3 candidate links (l1, l2, l3). Worker priors: pw1(), pw2(). Link priors: pl1(), pl2(), pl3(). Observed variables: the clicks. Link factors: lf1(), lf2(), lf3(). SameAs constraint: sa1-2(). Dataset unicity constraint: u2-3().]
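To make the selection rule "posterior prob > τ" concrete, here is a deliberately simplified sketch. It assumes workers are independent and each has a single reliability value, and it ignores the SameAs and unicity constraints, so it is a naive-Bayes stand-in and not the factor-graph inference used by ZenCrowd. The function names and the numbers are illustrative.

```python
def link_posterior(link_prior, votes):
    """Posterior probability that a candidate link is valid.

    votes: list of (worker_reliability, clicked_valid) pairs, where
    worker_reliability is the assumed probability that the worker's
    click is correct.
    """
    p_valid, p_invalid = link_prior, 1.0 - link_prior
    for reliability, clicked_valid in votes:
        if clicked_valid:
            p_valid *= reliability
            p_invalid *= 1.0 - reliability
        else:
            p_valid *= 1.0 - reliability
            p_invalid *= reliability
    total = p_valid + p_invalid
    return p_valid / total if total > 0 else 0.0


def select_links(candidates, tau=0.5):
    """Keep every link whose posterior exceeds the threshold tau."""
    return [link for link, prior, votes in candidates
            if link_posterior(prior, votes) > tau]


# 2 workers (reliability 0.9 and 0.6), 3 candidate links, 6 clicks.
candidates = [
    ("l1", 0.5, [(0.9, True), (0.6, True)]),
    ("l2", 0.5, [(0.9, False), (0.6, True)]),
    ("l3", 0.2, [(0.9, False), (0.6, False)]),
]
selected = select_links(candidates, tau=0.5)
```

In this toy setting the two workers disagree on l2, and the more reliable worker's negative click outweighs the other's positive one, so only l1 is selected.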
  
Entity Factor Graphs
•  Training phase
– Initialize worker priors
– with k matches on known answers
•  Updating worker priors
– Use link decisions as new observations
– Compute new worker probabilities
•  Identify (and discard) unreliable workers
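One possible way to picture the training and update steps is a running, smoothed estimate of each worker's reliability. The sketch below is an assumption-laden illustration: the class `WorkerPrior`, the Laplace smoothing, and the 0.5 cutoff for "unreliable" are choices made here, not details given in the talk.

```python
from dataclasses import dataclass


@dataclass
class WorkerPrior:
    correct: int = 0
    total: int = 0

    def update(self, was_correct: bool) -> None:
        """Add one observation (a training answer or a later link decision)."""
        self.total += 1
        self.correct += int(was_correct)

    @property
    def reliability(self) -> float:
        # Laplace smoothing (assumed): an unseen worker starts at 0.5.
        return (self.correct + 1) / (self.total + 2)


def train(prior, answers, gold):
    """Initialize a worker prior from k tasks with known answers."""
    for answer, truth in zip(answers, gold):
        prior.update(answer == truth)
    return prior


# Training phase with k = 5 known answers, 4 of them matched.
worker = train(WorkerPrior(),
               answers=["a", "b", "c", "d", "e"],
               gold=["a", "b", "c", "x", "e"])
initial = worker.reliability      # (4 + 1) / (5 + 2)

# Later, a link decision confirms one more of the worker's answers.
worker.update(True)
updated = worker.reliability      # (5 + 1) / (6 + 2)

# Assumed cutoff for discarding a worker.
unreliable = updated < 0.5
```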
  
  
Experimental Evaluation
•  Datasets
–  25 news articles from
•  CNN.com (Global news)
•  NYTimes.com (Global news)
•  Washington-post.com (US local news)
•  Timesofindia.indiatimes.com (India local news)
•  Swissinfo.com (Switzerland local news)
–  40M entities (Freebase, DBPedia, Geonames, NYT)
•  80 workers on Amazon MTurk, $0.01 per entity
•  Standard evaluation measures: Prec, Recall, Acc
•  Gold Standard: editorial selection
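For reference, the three measures can be computed over a set of candidate links as sketched below. The slide does not spell out the exact definitions used in the experiments, so these are the textbook ones; the function name and the sample URIs are illustrative, and `selected` and `gold` are assumed to be subsets of `candidates`.

```python
def evaluate(selected, gold, candidates):
    """Precision, recall and accuracy of selected links against a gold
    standard, counted over all candidate links."""
    selected, gold, candidates = set(selected), set(gold), set(candidates)
    tp = len(selected & gold)                 # correctly selected
    fp = len(selected - gold)                 # wrongly selected
    fn = len(gold - selected)                 # missed
    tn = len(candidates - selected - gold)    # correctly rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(candidates) if candidates else 0.0
    return precision, recall, accuracy


candidates = ["u1", "u2", "u3", "u4", "u5"]
p, r, a = evaluate(selected=["u1", "u2", "u3"],
                   gold=["u1", "u2", "u4"],
                   candidates=candidates)
```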
  
  
The micro-task
  
  
Experimental Evaluation
•  Automatic Approach: P/R trade-off
[Figure: precision (P) and recall (R), on a 0 to 1 scale, versus top N results (N = 1 to 5)]
  
Experimental Evaluation
•  Entity Linking with Crowdsourcing and agreement vote (at least 2 out of 5 workers select the same URI)

Top-1 precision: 0.70

[...] the URIs with at least 2 votes are selected as valid links (we tried various thresholds and manually picked 2 in the end since it leads to the highest precision scores while keeping good recall values for our experiments). We report on the performance of this crowdsourcing technique in Table 2. The values are averaged over all linkable entities for different document types and worker communities.

Table 2: Performance results for crowdsourcing with agreement vote over linkable entities.
            US Workers          Indian Workers
            P     R     A       P     R     A
GL News     0.79  0.85  0.77    0.60  0.80  0.60
US News     0.52  0.61  0.54    0.50  0.74  0.47
IN News     0.62  0.76  0.65    0.64  0.86  0.63
SW News     0.69  0.82  0.69    0.50  0.69  0.56
All News    0.74  0.82  0.73    0.57  0.78  0.59
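The agreement-vote rule (a URI is accepted when at least 2 of the 5 workers select it) can be sketched as follows. The function name and the placeholder URIs are illustrative assumptions.

```python
from collections import Counter


def agreement_vote(worker_selections, min_votes=2):
    """worker_selections: one collection of selected URIs per worker.
    Returns the URIs chosen by at least `min_votes` distinct workers."""
    votes = Counter(uri
                    for selection in worker_selections
                    for uri in set(selection))  # one vote per worker per URI
    return sorted(uri for uri, n in votes.items() if n >= min_votes)


# Five workers judging the candidate URIs of one entity (toy data).
selections = [
    {"uri:A"},
    {"uri:A", "uri:B"},
    {"uri:B"},
    set(),          # this worker selected nothing
    {"uri:C"},
]
valid = agreement_vote(selections)
```

Here `uri:A` and `uri:B` each collect two votes and are kept, while `uri:C` has a single vote and is dropped.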
The first question we examine is whether there is a difference in reliability between the various populations of workers [...]
  
Experimental Evaluation
•  Worker community and textual context
[Figure: precision and recall (0 to 1) for US and India workers]
[Figure: precision (0 to 1) per document (documents 1 to 10) for the Simple and Snippet task contexts]
  
Experimental Evaluation
•  Worker Selection
[Figure: worker precision (0 to 1) versus number of tasks (0 to 500) for US workers and IN workers, with the top US worker highlighted]
[Figure: precision (0.6 to 0.8) versus top K workers (K = 1 to 9)]
  
Experimental Evaluation
•  Entity Linking with ZenCrowd
– Training with first 5 entities + 5% afterwards
– 3 consecutive bad answers lead to blacklisting
  
[...] clicks on the links, hence generating noise in our system. In our experiments, we consider that 3 consecutive bad answers in the training phase is enough to identify the worker as a spammer and to blacklist him/her. We report the average results of ZenCrowd when exploiting the training phase, constraints, and blacklisting in Table 3. As we can observe, precision and accuracy values are higher in all cases when compared to the agreement vote approach.

Table 3: Performance results for crowdsourcing with ZenCrowd over linkable entities.
            US Workers          Indian Workers
            P     R     A       P     R     A
GL News     0.84  0.87  0.90    0.67  0.64  0.78
US News     0.64  0.68  0.78    0.55  0.63  0.71
IN News     0.84  0.82  0.89    0.75  0.77  0.80
SW News     0.72  0.80  0.85    0.61  0.62  0.73
All News    0.80  0.81  0.88    0.64  0.62  0.76
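The blacklisting rule (3 consecutive bad answers on training tasks mark a worker as a spammer) can be sketched as a small streak counter. The class name `SpamFilter` and its interface are hypothetical; only the threshold of 3 comes from the text.

```python
class SpamFilter:
    def __init__(self, max_consecutive_bad=3):
        self.max_consecutive_bad = max_consecutive_bad
        self.streak = {}        # worker id -> current run of bad answers
        self.blacklist = set()

    def record(self, worker, correct):
        """Record one training answer.
        Returns True if the worker is blacklisted after this answer."""
        if worker in self.blacklist:
            return True
        # A correct answer resets the streak; a bad one extends it.
        self.streak[worker] = 0 if correct else self.streak.get(worker, 0) + 1
        if self.streak[worker] >= self.max_consecutive_bad:
            self.blacklist.add(worker)
        return worker in self.blacklist


spam_filter = SpamFilter()

# w1 makes mistakes, but never three in a row.
for ok in [False, False, True, False, False]:
    spam_filter.record("w1", ok)

# w2 gives three consecutive bad answers and is blacklisted.
for ok in [True, False, False, False]:
    spam_filter.record("w2", ok)
```

Once blacklisted, a worker stays blacklisted in this sketch; whether ZenCrowd ever reinstates workers is not stated in the slides.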
Finally, we compare ZenCrowd to the state of the art [...]
  
Comparing 3 matching techniques
•  ZenCrowd best for 75% of documents
  
  
[Figure: precision (0 to 1) per document (documents 1 to 25) for three techniques: Agr. Vote (simple crowdsourcing), ZenCrowd, and Top 1 (automatic)]
  
Lessons Learnt
•  Crowdsourcing + Prob reasoning works!
•  But
– Different worker communities perform differently
– Many low quality workers
– No differences w/ different contexts
– Completion time may vary (based on reward)
  
  
Conclusions
•  ZenCrowd: Probabilistic reasoning over automatic and crowdsourcing methods for entity linking
•  Standard crowdsourcing improves 6% over automatic
•  4% - 35% improvement over standard crowdsourcing
•  14% average improvement over automatic approaches
•  Next steps
–  Long-term worker behavior analysis
–  More efficient and effective linking by LOD dataset pre-selection
http://diuf.unifr.ch/xi/zencrowd/
  
  

 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

  • 1. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking
      Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux
      eXascale Infolab, University of Fribourg, Switzerland
  • 2. Motivation
      • Linked Open Data (LOD)
      • Linking entities from text to LOD
          – Rich snippets
          – Entity-centric search
      • Linking
          – Algorithmic
          – Manual (NYT articles)
      [Diagram label: HTML + RDFa pages]
      22-Apr-12, Gianluca Demartini, eXascale Infolab
  • 3. RDFa enrichment
      LOD entities: http://dbpedia.org/resource/Facebook, http://dbpedia.org/resource/Instagram, fbase:Instagram (owl:sameAs), Google, Android
      HTML:
      <p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>
      HTML + RDFa:
      <p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>
  • 4. Crowdsourcing
      • Exploit human intelligence to solve
          – Tasks simple for humans, complex for machines
          – With a large number of humans (the Crowd)
          – Small problems: micro-tasks (Amazon MTurk)
      • Examples
          – Wikipedia, image tagging
      • Incentives
          – Financial, fun, visibility
  • 5. ZenCrowd
      • Combine both algorithmic and manual linking
      • Automate manual linking via crowdsourcing
      • Dynamically assess human workers with a probabilistic reasoning framework
      [Diagram labels: Crowd, Algorithms, Machines]
  • 6. Outline
      • Related approaches
      • ZenCrowd: system architecture
      • Probabilistic model to combine automatic matching and crowdsourcing results
      • Experimental evaluation
  • 7. Related Approaches
      • Entity linking
      • Ad-hoc object retrieval
          – Keyword queries
          – Looking for a specific entity
          – IR indexing and ranking over RDF data
      • Crowdsourcing
          – Training set, tagging, annotation
          – IR evaluation
          – CrowdDB
  • 8. ZenCrowd Architecture
      [Architecture diagram. Labels: HTML Pages (input), HTML + RDFa Pages (output), LOD Open Data Cloud, Crowdsourcing Platform, Workers, Micro Matching Tasks; inside ZenCrowd: Entity Extractors, Algorithmic Matchers, LOD Index (Get Entity), Probabilistic Network, Decision Engine, Micro-Task Manager, Workers Decisions]
  • 9. Algorithmic Matching
      • Inverted index over LOD entities
          – DBPedia, Freebase, Geonames, NYT
      • TF-IDF (IR ranking function)
      • Top-ranked URIs linked to entities in docs
      • Threshold on the ranking function or top N
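The matching step above can be sketched roughly as follows. This is a minimal illustration, not the ZenCrowd implementation: the function names (`build_index`, `top_n`), the whitespace tokenizer, the smoothed IDF formula, and the `http://example.org/...` URI are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()


def build_index(entities):
    """Build an inverted index (term -> {uri: term frequency}) and IDF weights.

    entities maps an entity URI to its label text.
    """
    index = defaultdict(dict)
    for uri, label in entities.items():
        for term, tf in Counter(tokenize(label)).items():
            index[term][uri] = tf
    n = len(entities)
    # Assumption: log(N / df) + 1, one common smoothed IDF variant.
    idf = {term: math.log(n / len(postings)) + 1.0
           for term, postings in index.items()}
    return index, idf


def top_n(mention, index, idf, n=3, threshold=0.0):
    """Rank candidate URIs for a mention by summed TF-IDF; keep the top n above threshold."""
    scores = defaultdict(float)
    for term in tokenize(mention):
        for uri, tf in index.get(term, {}).items():
            scores[uri] += tf * idf[term]
    # Sort by descending score, then by URI so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(uri, score) for uri, score in ranked if score > threshold][:n]


entities = {
    "http://dbpedia.org/resource/Facebook": "Facebook",
    "http://dbpedia.org/resource/Instagram": "Instagram",
    "http://example.org/Facebook_Platform": "Facebook Platform",  # hypothetical entity
}
index, idf = build_index(entities)
candidates = top_n("Instagram", index, idf)
```

Raising `threshold` or lowering `n` trades recall for precision, which is the trade-off the slide's last bullet refers to.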
  • 10. Entity Factor Graphs
      • Graph components
          – Workers, links, clicks
          – Prior probabilities
          – Link factors
          – Constraints
      • Probabilistic inference
          – Select all links with posterior prob > τ
      [Figure: example factor graph with 2 workers (w1, w2), 6 clicks (c11, c12, c13, c21, c22, c23), and 3 candidate links (l1, l2, l3). Labels: worker priors pw1, pw2; link priors pl1, pl2, pl3; link factors lf1, lf2, lf3; SameAs constraint sa1-2; dataset unicity constraint u2-3; observed variables]
  • 11. Entity Factor Graphs
      • Training phase
          – Initialize worker priors
          – with k matches on known answers
      • Updating worker priors
          – Use link decisions as new observations
          – Compute new worker probabilities
      • Identify (and discard) unreliable workers
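To give a feel for how worker priors and link priors might combine, here is a deliberately simplified sketch. It is not the factor graph from the slides: it treats workers as independent (a naive Bayes style product), ignores the SameAs and unicity constraints, and uses hypothetical names (`link_posterior`, `update_reliability`) and a Laplace-smoothed update chosen only for illustration.

```python
def link_posterior(link_prior, votes, reliability):
    """Posterior probability that a candidate link is valid.

    link_prior:  prior from the algorithmic matcher, in [0, 1]
    votes:       dict worker -> True (clicked "valid") or False
    reliability: dict worker -> probability that the worker answers correctly
    Assumption: workers are conditionally independent given the link's truth.
    """
    p_valid = link_prior
    p_invalid = 1.0 - link_prior
    for worker, vote in votes.items():
        r = reliability[worker]
        p_valid *= r if vote else (1.0 - r)
        p_invalid *= (1.0 - r) if vote else r
    total = p_valid + p_invalid
    return p_valid / total if total > 0 else 0.0


def update_reliability(agreements, smoothing=1.0):
    """Re-estimate a worker's reliability from past outcomes.

    agreements: list of bools, True when the worker's answer matched the
    final decision (or a known answer during training).
    Assumption: Laplace smoothing, so new workers start near 0.5.
    """
    return (sum(agreements) + smoothing) / (len(agreements) + 2.0 * smoothing)


reliability = {"w1": 0.8, "w2": 0.8}
agree = link_posterior(0.5, {"w1": True, "w2": True}, reliability)
split = link_posterior(0.5, {"w1": True, "w2": False}, reliability)
tau = 0.5  # decision threshold on the posterior
accepted = agree > tau
```

In this toy setting two agreeing reliable workers push the posterior well above τ, while a split vote between equally reliable workers leaves it at the prior.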
  • 12. Experimental Evaluation
      • Datasets
          – 25 news articles from
              • CNN.com (global news)
              • NYTimes.com (global news)
              • Washington-post.com (US local news)
              • Timesofindia.indiatimes.com (India local news)
              • Swissinfo.com (Switzerland local news)
          – 40M entities (Freebase, DBPedia, Geonames, NYT)
      • 80 workers on Amazon MTurk, $0.01 per entity
      • Standard evaluation measures: Prec, Recall, Acc
      • Gold standard: editorial selection
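The three measures named above can be computed as in the sketch below. The slides do not spell out how they are counted, so this assumes the textbook definitions over a set of candidate links; the function name `evaluate` and the toy link identifiers are illustrative only.

```python
def evaluate(predicted, gold, candidates):
    """Precision, recall, and accuracy over a set of candidate links.

    predicted:  links the system selected
    gold:       links marked correct in the gold standard
    candidates: all links that were considered (superset of both)
    """
    tp = len(predicted & gold)               # selected and correct
    fp = len(predicted - gold)               # selected but wrong
    fn = len(gold - predicted)               # correct but missed
    tn = len(candidates - predicted - gold)  # correctly rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(candidates) if candidates else 0.0
    return precision, recall, accuracy


candidates = {"a", "b", "c", "d"}
gold = {"a", "b"}
predicted = {"a", "c"}
precision, recall, accuracy = evaluate(predicted, gold, candidates)
```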
  • 13. The micro-task
  • 14. Experimental Evaluation
      • Automatic approach: P/R trade-off
      [Plot: precision (P) and recall (R), on a 0 to 1 scale, against top N results for N = 1 to 5]
  • 15. Experimental Evaluation
      • Entity linking with crowdsourcing and agreement vote (at least 2 out of 5 workers select the same URI)
      Top-1 precision: 0.70
      [Paper excerpt] the URIs with at least 2 votes are selected as valid links (we tried various thresholds and manually picked 2 in the end since it leads to the highest precision scores while keeping good recall values for our experiments). We report on the performance of this crowdsourcing technique in Table 2. The values are averaged over all linkable entities for different document types and worker communities.
      Table 2: Performance results for crowdsourcing with agreement vote over linkable entities.
                     US Workers          Indian Workers
                     P     R     A       P     R     A
          GL News    0.79  0.85  0.77    0.60  0.80  0.60
          US News    0.52  0.61  0.54    0.50  0.74  0.47
          IN News    0.62  0.76  0.65    0.64  0.86  0.63
          SW News    0.69  0.82  0.69    0.50  0.69  0.56
          All News   0.74  0.82  0.73    0.57  0.78  0.59
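The agreement vote described above (a URI is accepted once at least 2 of the 5 workers select it) can be sketched in a few lines. The function name `agreement_vote`, the use of `None` for "no link selected", and the `http://example.org/...` URI are assumptions for the example.

```python
from collections import Counter


def agreement_vote(selections, min_votes=2):
    """Return the URIs chosen by at least min_votes workers.

    selections: one entry per worker, either a URI or None when the
    worker selected no link for the entity.
    """
    counts = Counter(uri for uri in selections if uri is not None)
    return sorted(uri for uri, votes in counts.items() if votes >= min_votes)


# Five workers judge one entity mention.
selections = [
    "http://dbpedia.org/resource/Facebook",
    "http://dbpedia.org/resource/Facebook",
    "http://example.org/Facebook_Platform",  # hypothetical competing URI
    None,
    "http://dbpedia.org/resource/Instagram",
]
valid_links = agreement_vote(selections)
```

Every worker counts equally here, which is the limitation the probabilistic model in the talk addresses by weighting workers by estimated reliability.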
  • 16. Experimental Evaluation
      • Worker community and textual context
      [Plot 1: recall against precision for US and India workers]
      [Plot 2: precision per document (documents 1 to 10) for the Simple and Snippet contexts]
  • 17. Experimental Evaluation
      • Worker selection
      [Plot 1: worker precision against number of tasks (0 to 500) for US workers and IN workers, with the top US worker highlighted]
      [Plot 2: precision (0.6 to 0.8) against top K workers, K = 1 to 9]
  • 18. Experimental Evaluation
      • Entity linking with ZenCrowd
          – Training with first 5 entities + 5% afterwards
          – 3 consecutive bad answers lead to blacklisting
      [Paper excerpt] ... clicks on the links, hence generating noise in our system. In our experiments, we consider that 3 consecutive bad answers in the training phase is enough to identify the worker as a spammer and to blacklist him/her. We report the average results of ZenCrowd when exploiting the training phase, constraints, and blacklisting in Table 3. As we can observe, precision and accuracy values are higher in all cases when compared to the agreement vote approach.
      Table 3: Performance results for crowdsourcing with ZenCrowd over linkable entities.
                     US Workers          Indian Workers
                     P     R     A       P     R     A
          GL News    0.84  0.87  0.90    0.67  0.64  0.78
          US News    0.64  0.68  0.78    0.55  0.63  0.71
          IN News    0.84  0.82  0.89    0.75  0.77  0.80
          SW News    0.72  0.80  0.85    0.61  0.62  0.73
          All News   0.80  0.81  0.88    0.64  0.62  0.76
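The blacklisting rule above (3 consecutive bad answers on training tasks) amounts to tracking a streak per worker. The sketch below assumes each training answer has already been marked good or bad against a known answer; the function name `is_blacklisted` is hypothetical.

```python
def is_blacklisted(answers, max_consecutive_bad=3):
    """Return True once a worker gives max_consecutive_bad wrong answers in a row.

    answers: training-phase outcomes in order, True for a correct answer
    and False for a bad one.
    """
    streak = 0
    for correct in answers:
        streak = 0 if correct else streak + 1  # a good answer resets the streak
        if streak >= max_consecutive_bad:
            return True
    return False


spammer = is_blacklisted([True, False, False, False])
careless = is_blacklisted([False, False, True, False])
```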
  • 19. Comparing 3 matching techniques
      • ZenCrowd best for 75% of documents
      [Plot: precision per document (documents 1 to 25) for Agr. Vote (simple crowdsourcing), ZenCrowd, and Top 1 (automatic)]
  • 20. Lessons Learnt
      • Crowdsourcing + probabilistic reasoning works!
      • But
          – Different worker communities perform differently
          – Many low-quality workers
          – No differences with different contexts
          – Completion time may vary (based on reward)
  • 21. Conclusions
      • ZenCrowd: probabilistic reasoning over automatic and crowdsourcing methods for entity linking
      • Standard crowdsourcing improves 6% over automatic
      • 4% - 35% improvement over standard crowdsourcing
      • 14% average improvement over automatic approaches
      • Next steps
          – Long-term worker behavior analysis
          – More efficient and effective linking by LOD dataset pre-selection
      http://diuf.unifr.ch/xi/zencrowd/