2. Architecture of Linked Data Applications
[Figure: three-tier architecture with Presentation Tier, Logic Tier and Data Tier. Labelled components: Integrated Dataset; Data Access Component; Republication Component; Data Integration Component (Vocabulary Mapping, Interlinking, Cleansing); SPARQL Wrapper; Physical Wrapper; R2R Transformation; LD Wrapper. Labelled data sources: RDF/XML; Web data accessed via APIs; SPARQL endpoints; relational data; Linked Data.]
EUCLID – Microtask crowdsourcing
applications for Linked Data
3. Data Tier
Data Integration Component
[Figure: Data Access Component and Data Integration Component (Vocabulary Mapping, Interlinking, Cleansing).]
• Consolidates the data retrieved from heterogeneous sources.
• This component may operate at:
– Schema level: Performs vocabulary mappings in order to translate data into a single unified schema. Links correspond to RDFS properties or OWL property and class axioms. (CH 2)
– Instance level: Performs entity linking, e.g., entity resolution via owl:sameAs links. (CH 3)
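The two levels can be illustrated with a minimal Python sketch. It assumes triples are plain (subject, predicate, object) tuples; the source vocabulary terms (ex:fullName, ex:born), the srcA/srcB identifiers, the helper names and the sample data are all hypothetical and serve only as illustration.

```python
# Schema level: map source predicates onto one unified vocabulary.
# Instance level: merge entities connected by owl:sameAs links.
# All prefixes, predicates and sample triples are illustrative assumptions.

VOCAB_MAPPING = {
    "ex:fullName": "foaf:name",
    "ex:born": "dbpedia-owl:birthDate",
}


def map_vocabulary(triples, mapping):
    """Rewrite predicates according to the mapping; unmapped ones are kept."""
    return [(s, mapping.get(p, p), o) for s, p, o in triples]


def canonical_ids(triples):
    """Union-find over owl:sameAs links; returns a lookup for representatives."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "owl:sameAs":
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Keep the lexicographically smaller URI as representative.
                smaller, larger = sorted((root_s, root_o))
                parent[larger] = smaller
    return find


def integrate(triples, mapping):
    """Apply the vocabulary mapping, then collapse owl:sameAs-linked entities."""
    mapped = map_vocabulary(triples, mapping)
    find = canonical_ids(mapped)
    merged = {(find(s), p, find(o)) for s, p, o in mapped if p != "owl:sameAs"}
    return sorted(merged)


source = [
    ("srcA:alice", "ex:fullName", "Alice"),
    ("srcB:alice", "ex:born", "1990-01-01"),
    ("srcA:alice", "owl:sameAs", "srcB:alice"),
]
integrated = integrate(source, VOCAB_MAPPING)
```

After integration both facts are attached to a single identifier and expressed in the unified vocabulary. Choosing the lexicographically smaller URI as representative is an arbitrary design choice made for this sketch.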
4. Data Tier (2)
Data Integration Component
[Figure: Data Access Component and Data Integration Component (Vocabulary Mapping, Interlinking, Cleansing).]
The data integration component can be enhanced with microtask crowdsourcing approaches:
• Cleansing or data assessments: Assessment of DBpedia triples
• Vocabulary mapping: CrowdMAP
• Interlinking: ZenCrowd
5. Other Crowdsourcing-based Solutions for Linked Data Tasks
• Query understanding: CrowdQ
• Ontology population: OntoGame
• Linked Data curation: Urbanopoly
• …
7. Assessing DBpedia Triples
[Figure: triples {s p o .} from the dataset are classified as either Correct or Incorrect + quality issue.]
1. Selecting LD quality issues generated by erroneous extraction
mechanisms and that can be detected by the crowd
2. Selecting the appropriate crowdsourcing approaches
3. Designing and generating the interfaces to present the data to the
crowd
8. Selecting LD Quality Issues to Crowdsource
Three categories of quality problems occur pervasively in DBpedia [Zaveri2013] and can be crowdsourced:
• Incorrect object
Example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”.
• Incorrect data type
Example: dbpedia:Torishima_Izu_Islands foaf:name “鳥島”@en.
• Incorrect link to “external Web pages”
Example: dbpedia:John-Two-Hawks dbpedia-owl:wikiPageExternalLink <http://cedarlakedvd.com/>.
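Some candidates of the second kind (a Japanese string tagged @en) could be pre-screened automatically before a triple is shown to the crowd. The following is a minimal sketch and not the method used in the study: it assumes literals are available as (value, language tag) pairs, and the function name is hypothetical.

```python
import unicodedata


def has_suspicious_en_tag(value, lang):
    """Return True if a literal tagged 'en' contains CJK characters."""
    if lang != "en":
        return False
    # unicodedata.name() returns e.g. 'CJK UNIFIED IDEOGRAPH-9CE5' for such characters.
    return any(unicodedata.name(ch, "").startswith("CJK") for ch in value)


flagged = has_suspicious_en_tag("鳥島", "en")
```

A check like this only narrows down candidates; whether the triple is actually wrong is still a judgement left to the crowd.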
10. Presenting the Data to the Crowd
Microtask interfaces: MTurk tasks
Incorrect object
• Selection of foaf:name or rdfs:label to extract human-readable descriptions
• Real object values extracted automatically from Wikipedia infoboxes
Incorrect data type
• Link to the Wikipedia article via foaf:isPrimaryTopicOf
Incorrect outlink
• Preview of external pages embedded in an HTML iframe
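As an illustration of the iframe preview, the sketch below builds an HTML fragment for one outlink task. The wording, layout, sizes and function name are assumptions; this is not the interface used in the study.

```python
from html import escape


def outlink_task_html(subject_label, external_url):
    """Build an HTML fragment asking a worker to judge an external link."""
    label = escape(subject_label)
    url = escape(external_url, quote=True)
    return (
        f"<p>Is this page related to <b>{label}</b>?</p>\n"
        f'<iframe src="{url}" width="600" height="400"></iframe>\n'
        '<label><input type="radio" name="answer" value="correct"> Correct</label>\n'
        '<label><input type="radio" name="answer" value="incorrect"> Incorrect</label>'
    )


task = outlink_task_html("John Two-Hawks", "http://cedarlakedvd.com/")
```

Escaping the label and the URL keeps values taken from the dataset from breaking the generated markup.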
11. Results
                          Object values   Data types   Interlinks
Linked Data experts       0.7151          0.8270       0.1525
MTurk (majority voting)   0.8977          0.4752       0.9412
• Both forms of crowdsourcing can be applied to detect certain LD quality issues
• The effort of LD experts should be spent on tasks that demand domain-specific skills
• The MTurk crowd is exceptionally good at comparing data entries
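The MTurk figures are aggregated by majority voting. A minimal sketch of that aggregation step follows; returning None on a tie is an assumption made here, since the slides do not say how ties were handled.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer, or None if empty or tied for first place."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


decision = majority_vote(["correct", "incorrect", "correct"])
```

Using an odd number of workers per triple avoids ties for binary questions.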
13. ZenCrowd: Entity Linking by the Crowd
• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a
probabilistic reasoning framework
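One simple way to combine an algorithmic score with worker answers is a naive Bayes style update. The sketch below is a simplification under stated assumptions (independent workers, one known reliability per worker) and is not ZenCrowd's actual probabilistic network; the function name and sample numbers are hypothetical.

```python
def link_probability(prior, votes):
    """Posterior probability that a candidate link is correct.

    prior: score from the algorithmic matcher, with 0 < prior < 1
    votes: list of (is_yes, reliability) pairs, with 0 < reliability < 1,
           where reliability is the assumed chance that the worker is right
    """
    odds = prior / (1.0 - prior)
    for is_yes, reliability in votes:
        ratio = reliability / (1.0 - reliability)
        # A "yes" multiplies the odds by the ratio, a "no" divides by it.
        odds *= ratio if is_yes else 1.0 / ratio
    return odds / (1.0 + odds)


p = link_probability(0.5, [(True, 0.9), (False, 0.6)])
```

A reliable worker's answer moves the probability much more than an unreliable worker's, which is why assessing workers dynamically matters.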
[Figure: Crowd, Machines, Algorithms.]
14. http://dbpedia.org/resource/Facebook
HTML:
<p>Facebook is not waiting for its initial
public offering to make its first big
purchase.</p><p>In its largest
acquisition to date, the social network
has purchased Instagram, the popular
photo-sharing application, for about $1
billion in cash and stock, the company
said Monday.</p>
[Figure: http://dbpedia.org/resource/Instagram owl:sameAs fbase:Instagram; further labels: Google, Android.]
RDFa enrichment:
<p><span about="http://dbpedia.org/resource/Facebook"><cite
property="rdfs:label">Facebook</cite> is not waiting for its initial
public offering to make its first big purchase.</span></p><p><span
about="http://dbpedia.org/resource/Instagram">In its largest
acquisition to date, the social network has purchased <cite
property="rdfs:label">Instagram</cite>, the popular photo-sharing
application, for about $1 billion in cash and stock, the company
said Monday.</span></p>
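The enrichment step can be sketched in Python as follows. This is a simplified illustration: it wraps only the mention itself rather than the surrounding sentence, and it assumes the mention-to-URI dictionary has already been produced by the linking step. The function name is hypothetical.

```python
import re
from html import escape


def annotate_mentions(text, entity_uris):
    """Wrap each known mention in RDFa markup pointing at its entity URI."""
    if not entity_uris:
        return escape(text)
    # Longest mentions first, so overlapping names prefer the longer match.
    names = sorted(entity_uris, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[last:match.start()]))
        mention = match.group(0)
        uri = escape(entity_uris[mention], quote=True)
        pieces.append(
            f'<span about="{uri}">'
            f'<cite property="rdfs:label">{escape(mention)}</cite></span>'
        )
        last = match.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)


entities = {
    "Facebook": "http://dbpedia.org/resource/Facebook",
    "Instagram": "http://dbpedia.org/resource/Instagram",
}
enriched = annotate_mentions("Facebook has purchased Instagram.", entities)
```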
15. ZenCrowd Architecture
[Figure: ZenCrowd architecture. Input: HTML pages. Output: HTML + RDFa pages. Labelled components: Entity Extractors; Algorithmic Matchers; LOD Index (Get Entity); Decision Engine; Probabilistic Network; Micro Matching Tasks; MicroTask Manager; Crowdsourcing Platform; Workers' Decisions; LOD cloud.]
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic
Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on
World Wide Web (WWW 2012).
17. Lessons Learnt
• Crowdsourcing + probabilistic reasoning works!
• But
– Different worker communities perform differently
– Many low quality workers
– Completion time may vary (based on reward)
• Need to find the right workers for your task
(see WWW13 paper)
18. ZenCrowd Summary
• ZenCrowd: probabilistic reasoning over automatic and crowdsourcing methods for entity linking
• Standard crowdsourcing improves 6% over automatic approaches
• ZenCrowd improves 4%-35% over standard crowdsourcing
• ZenCrowd improves 14% on average over automatic approaches
http://exascale.info/zencrowd/
• Follow-up work (VLDBJ):
– Also used for instance matching across datasets
– 3-way blocking with the crowd
20. Motivation
• Web Search Engines can answer simple factual
queries directly on the result page
• Users with complex information needs are
often unsatisfied
• Purely automatic techniques are not enough
• We want to solve it with Crowdsourcing!
21. CrowdQ
• CrowdQ is the first system that uses
crowdsourcing to
– Understand the intended meaning
– Build a structured query template
– Answer the query over Linked Open Data
Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ:
Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems
Research (CIDR 2013).
23. CrowdQ Architecture
Off-line: query template generation with the help of the crowd
On-line: query template matching using NLP and search over open data
[Figure: CrowdQ architecture. Panels: On-line Complex Query Processing; Off-line Complex Query Decomposition. Labels: User; Keyword Query; Complex query classifier (Y/N); POS + NER tagging; Match with existing query templates; Structured Query; Vertical selection, Unstructured Search, ...; Crowd Manager; Query Templates + Answer Types (t1, t2, t3); Template Generation; Answer Composition; Query Template Index; SERP; Query Log; Structured LOD Search; Crowdsourcing Platform; Result Joiner; LOD cloud.]
24. Hybrid Human-Machine Pipeline
Q= birthdate of actors of forrest gump
• Query annotation: Noun, Noun, Named entity
• Verification: Is forrest gump this entity in the query?
• Entity relations: Which is the relation between: actors and forrest gump
– Schema element: Starring
• Verification: Is the relation between:
– Indiana Jones – Harrison Ford
– Back to the Future – Michael J. Fox
of the same type as
– Forrest Gump – actors
• starring → <dbpedia-owl:starring>
25. Structured query generation
Q= birthdate of actors of forrest gump
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?y ?x
WHERE { ?y dbpedia-owl:birthdate ?x .
        ?z dbpedia-owl:starring ?y .
        ?z rdfs:label "Forrest Gump" }
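The query above follows a fixed pattern in which only the attribute, the relation and the entity label vary. A structured query template of that shape could be sketched in Python as below. The template layout, slot names and function name are assumptions for illustration, not CrowdQ's internal representation.

```python
from string import Template

# Slots: $attribute (e.g. birthdate), $relation (e.g. starring), $entity (label).
QUERY_TEMPLATE = Template(
    "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "SELECT ?y ?x\n"
    "WHERE { ?y dbpedia-owl:$attribute ?x .\n"
    "        ?z dbpedia-owl:$relation ?y .\n"
    '        ?z rdfs:label "$entity" }'
)


def fill_template(attribute, relation, entity_label):
    """Instantiate the template; backslashes and quotes in the label are escaped."""
    safe_label = entity_label.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.substitute(
        attribute=attribute, relation=relation, entity=safe_label
    )


query = fill_template("birthdate", "starring", "Forrest Gump")
```

Only the entity label is escaped here; the attribute and relation are expected to come from a controlled set of schema elements rather than from free user input.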
Results from BTC09:
28. Taste IT! Try IT!
• Restaurant review Android app developed in the Insemtives project
• Uses DBpedia concepts to generate structured reviews
• Uses mechanism design/gamification to configure incentives
• User study
– 2274 reviews by 180 reviewers referring to 900 restaurants, using 5667 DBpedia concepts
[Figure: bar chart by venue type (CAFE, FASTFOOD, PUB, RESTAURANT), vertical axis from 0 to 2500. Series: number of reviews; number of semantic annotations (type of cuisine); number of semantic annotations (dishes).]
https://play.google.com/store/apps/details?id=insemtives.android&hl=en
11/11/2013
32. Problems and Challenges
• What is feasible and how can tasks be optimally translated into microtasks?
– Examples: data quality assessment for technical and contextual features; subjective vs objective tasks (also in modeling); open-ended questions
• What to show to users
– Natural language descriptions of Linked Data/SPARQL
– How much context
– What form of rendering
– How about links?
• How to combine with automatic tools
– Which results to validate
  • Low precision (no fun for gamers...)
  • Low recall (vs all possible questions)
• How to embed it into an existing application
– Tasks are fine-grained and perceived as an additional burden on top of the actual functionality
• What to do with the resulting data?
– Integration into existing practices
– Vocabularies!
34. For exercises, quizzes and further material visit our website:
http://www.euclid-project.eu
Course
eBook
Other channels:
@euclid_project
euclidproject
euclidproject