Mining and Understanding (Learning)
Activities and Resources on the Web
Stefan Dietze, Besnik Fetahu, Ujwal Gadiraju
L3S Research Center, Hannover, Germany
14/07/16 1Stefan Dietze, Besnik Fetahu, Ujwal Gadiraju
Research areas
 Web science, Information Retrieval, Semantic Web, Social Web
Analytics, Knowledge Discovery, Human Computation
 Interdisciplinary application areas: digital humanities,
TEL/education, Web archiving, mobility
Some projects
L3S Research Center
 See also: http://www.l3s.de
“Intelligent Access to Information” / L3S
Team & current projects
LA4S LearnWeb
GlycoRec
Ran Yu
Ujwal Gadiraju
Besnik Fetahu
Stefan Dietze
AFEL – Analytics for Everyday (Online) Learning
Figure courtesy of Mathieu d‘Aquin
AFEL – Analytics for Everyday Learning
Work packages:
• WP1: Data Capture
• WP2: Data Enrichment
• WP3: Visual Analytics
• WP4: Cognitive Modelling
• WP5: Use Cases and Evaluation
Stages:
• Collect & Enrich Data
• Detect and Model User & Learning Activities
• Analyse Learning Behaviour
• Apply and Evaluate
Figure courtesy of Mathieu d‘Aquin
AFEL – Analytics for Everyday Learning
Entities/notions, e.g.:
• Learning
• ... Resource
• ... Activity
• ... Performance
• Knowledge
• Competence
• ....
Understanding informal/micro learning on the Web (e.g. Social Web): challenges
• Absence of competence indicators/assessments etc.
• Measuring/detecting progress and competence, i.e. how to distinguish good from bad performance?
• Understanding learning activities requires understanding the learning resources and the entities involved
• Heterogeneity and scale of data/activities/documents to consider (i.e. the Web)
• ...
Overview
Mining & understanding (learning) resources on the Web:
 “Extracting entity-centric knowledge/learning resources from Web Documents“ (Stefan)
 “Automated Wikipedia Entity Enrichment with News Sources” (Besnik)
Mining & understanding (learning) activities on the Web
 Predicting/measuring „competence“: “Behavioral Methods for Improving the Effectiveness of
Microtask Crowdsourcing" (Ujwal)
Understanding knowledge resources on the Web
• Detecting (salient) entities in Web resources/documents
• NLP-based named entity recognition and disambiguation (Babelfy, DBpedia Spotlight etc.)
• Usually uses background knowledge graphs (e.g. DBpedia/Wikipedia, Linked Data)
[Figure labels: Apple, Digital Revolution, Steve Jobs, IT Company, Bank, Band, Jobs Biopic/Movie, Person]
Web documents vs structured entity-centric knowledge graphs
Unstructured Web documents
Linked Data & Knowledge Graphs
• The Web: approx. 46.000.000.000.000 (46 trillion) Web pages indexed by Google
vs
• Linked Data & Knowledge Graphs: structured entity-centric data, approx. 1000 datasets & 100 billion statements (DBpedia, etc.)
• Linking entities (NED/NER) from documents:
  • Computationally complex
  • Error-prone
  • Issues with less popular entities (example: regional news sites)
• Knowledge graphs are less dynamic than Web documents
• Markup: entity-centric data embedded in the Web (30% of all Web documents in 2015)
• Using W3C standards (RDFa, Microdata, Microformats)
• Schema.org: initiative from Google, Yahoo, Bing, Yandex to push a common vocabulary
• Same order of magnitude as the Web itself with respect to scale and dynamics (as opposed to knowledge graphs, DBpedia et al.)
• Rich source of knowledge and data going beyond existing knowledge bases (e.g. Wikipedia)
Entity-centric data on the Web: Web markup (schema.org)
Entity
node2 publisher Pearson Education
node2 publisher Elsevier
node2 published 03-01-2014
Unstructured Web documents
Linked Data & Knowledge Graphs
Embedded Markup (schema.org)
Entity
node1 name French Grammar advanced
node1 publisher The Open University
node1 publisher Nature
node1 datePublished 1956
node1 datePublished 1953
[Plot: # entities and # statements per PLD (ranked); count on log scale]
Example: entity markup of learning resources on the Web
• “Learning Resources Metadata Initiative (LRMI)”: schema.org vocabulary for annotation of learning resources (informal, formal, etc.)
• Approx. 5000 pay-level domains (PLDs) in “Common Crawl”
• LRMI adoption on the Web (WDC) [LILE16]:
  • 2014: 30.599.024 quads, 4.182.541 resources
  • 2013: 10.636.873 quads, 1.461.093 resources
Power law distribution across providers
4805 Provider / PLDs
Taibi, D., Dietze, S., Towards embedded markup of learning resources
on the Web: a quantitative Analysis of LRMI Terms Usage, in
Companion Publication of the IW3C2 WWW 2016 Conference, IW3C2
2016, Montreal, Canada, April 11, 2016
Entity-centric markup on the Web: challenges
Characteristics and examples:
• Coreferences: 18.000 results for <„Iphone 6“, type, s:Product> (8,6 quads on average) in CommonCrawl
• Redundancy: <s, schema:name, „Iphone 6“> occurring 1000 times in CC
• Lack of links: largely unlinked entity descriptions
• Errors (typos & schema violations, see Meusel et al. [ESWC2015]):
  • Wrong namespaces, such as http://schma.org
  • Undefined types & predicates: 9,7 %, less common than in LOD
  • Confusion of datatype and object properties: <s1, s:publisher, „Springer“>, 24,35 % object property issues vs 8 % in LOD
  • Data property range violations, e.g. literals vs numbers (12,6 % vs 4,6 % in LOD)
• Why not use markup as a knowledge graph of the entities involved in (learning) resources (similar to DBpedia/Wikipedia)?
 Improving understanding of resources: consolidating entity-
centric Web data for a given document/resource/entity?
 Markup as distributed knowledge graph/base, e.g. to augment
existing knowledge bases (eg DBpedia/Wikipedia) ?
Data fusion for consolidating entity centric Web markup
Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., S. Dietze, Entity
summarisation on structured web markup. In The
Semantic Web: ESWC 2016 Satellite Events. Springer,
2016.
Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., S. Dietze, Fact
Selection for data fusion on structured web markup.
ICDE2017, IEEE International Conference on Data
Engineering, in progress.
Query
iPhone 6, type:(Product)
Entity Description
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
<e1, s:name, „Iphone 6“>
<e2, s:brand, „Apple Inc.“>
<e3, s:brand, „Apple“> <e4, s:weight, 127>
<e5, s:releaseDate, „1.12.1972“>
Web (crawl)
(i.e. billions of entities/facts)
A supervised ML approach to select entity facts from the Web
 Fact/entity retrieval: BM25 entity retrieval model on markup index (Common Crawl)
 Fact selection: supervised ML classifier (SVM), using 3 feature categories (relevance, authority, clustering)
 Experiments on Common Crawl: products, movies, books (approx. 3 billion facts)
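The retrieval step above names BM25 as the entity retrieval model. As a rough illustration only, the following is a minimal pure-Python sketch of Okapi BM25 scoring over tokenised entity descriptions. The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the toy documents are assumptions made for this example; they are not taken from the authors' system.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with Okapi BM25.

    query: list of query tokens; docs: list of token lists.
    k1 and b are conventional defaults, assumed here for illustration.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that never becomes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy markup descriptions (hypothetical), queried with "iphone 6".
docs = [
    ["iphone", "6", "apple"],
    ["galaxy", "s6", "samsung"],
    ["iphone", "6", "plus", "apple", "inc"],
]
scores = bm25_scores(["iphone", "6"], docs)
```

In the approach described on the slide, retrieved candidates would then be passed to the supervised fact selection step; that classifier is not sketched here.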
1. Retrieval
2. Fact selection
New Queries
Foxconn, type:(Organization)
Cupertino, type:(City)
Apple Inc., type:(Organization)
(trained SVM classifier)
Entity Description
brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB
Query
iPhone 6, type:(Product)
Candidate Facts
node1 brand _node-x
node1 brand Apple Inc.
node1 weight 129
node2 weight 172
node2 manufacturer Foxconn
node3 releasedate 01.12.1972
node3 manufacturer Foxconn
Web page
markup
Web (crawl)
approx. 125.000 facts for „iPhone6“
Evaluation & results
Performance
 Outperforms baselines (BM25F, CBFS)
 Strong variance across types/queries
 Average precision from 75% – 98 %
Evaluation & results: markup vs DBpedia/Wikipedia
Can markup augment existing Knowledge Graphs?
 Comparison of obtained facts with existing
knowledge bases (DBpedia/Wikipedia)
 „new“: fact not existing in DBpedia
(eg a book‘s releaseDate in Wiki/DBpedia)
 „new-p“: property not existing in DBpedia
(eg a book‘s release countries)
 „existing“: fact already in DBpedia
 On average approx. 60% new facts
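The three categories above can be read as a simple lookup against the knowledge base. The sketch below shows one possible reading, assuming each fact is a (property, value) pair and the knowledge base entry for one entity is a mapping from property to known values. The function name `categorise_fact` and the toy data are hypothetical, and real comparisons would also need value normalisation and property alignment, which are omitted.

```python
from collections import Counter


def categorise_fact(fact, kb_entity):
    """Label a (property, value) fact relative to one knowledge base entity.

    kb_entity: dict mapping property name -> set of known values.
    Returns "new-p" if the property is unknown, "existing" if the exact
    value is already present, and "new" otherwise.
    """
    prop, value = fact
    if prop not in kb_entity:
        return "new-p"
    if value in kb_entity[prop]:
        return "existing"
    return "new"


# Toy data (hypothetical), not results from the evaluation.
kb_book = {"author": {"Author A"}, "releaseDate": {"2001"}}
fused_facts = [
    ("author", "Author A"),
    ("releaseDate", "2003"),
    ("releaseCountry", "Country X"),
]
summary = Counter(categorise_fact(f, kb_book) for f in fused_facts)
```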
Conclusions
 Data fusion on markup as means to extract
rich descriptions of entities in Web documents
 Understanding semantics of activities and
resources (particularly learning resources)
 Markup: rich source of entity centric data
(30% of the Web, i.e. 16 trillion Web pages)
 Potential training data for NED/NER
approaches
 Potential for augmenting existing knowledge
graphs/bases (DBpedia/Wikipedia et al)
Next
Mining & understanding (learning) resources on the Web:
 “Extracting entity-centric knowledge/learning
resources from Web Documents“ (Stefan)
 “Automated Wikipedia Entity Enrichment with News
Sources” (Besnik)
Mining & understanding (learning) activities on the Web
 Predicting/measuring „competence“: “Behavioral
Methods for Improving the Effectiveness of Microtask
Crowdsourcing" (Ujwal)
Outline
Wikipedia Entity
Enrichment
Besnik Fetahu, Katja Markert, Avishek Anand: Automated News Suggestions for Populating Wikipedia Entity Pages. CIKM 2015: 323-332
Introduction
• Human fatalities: 10k vs. 1.8k (Odisha cyclone vs. Hurricane Katrina)
• Estimated damages: $4.5 billion vs. $108 billion
• ‘Odisha cyclone’ has no coverage in the entity page of its location, ‘Odisha’
• ‘Hurricane Katrina’ finds broad coverage in the entity page of its location, ‘New Orleans’
New Orleans
Odisha
Hurricane Katrina
Odisha Cyclone
Introduction
• Entity pages consist of facts and statements supported by external references
• News articles are authoritative sources of emerging facts and events
• Delay between the reporting of an event in news and its inclusion in entity pages [1]
• Incomplete section structure for long-tail entities
• Several implications for real-world applications that make use of Wikipedia, e.g. KB maintenance, entity disambiguation etc.
[1] Besnik Fetahu, Abhijat Anand, Avishek Anand: How much is Wikipedia lagging behind news? WebSci 2015
Motivation: News Density in Wikipedia
• Citation templates (‘news’,
‘books’, ‘web’, ‘journal’ etc.)
• ~60% vs. 20% ‘web’ and
‘news’ citations
• On average there are ~6.5
news citations per entity
• On average a news article is
assigned to ~1.3 entities
• The most cited news article
is cited by 81 entities
Besnik Fetahu, Abhijat Anand, Avishek Anand: How much is Wikipedia lagging behind news?. WebSci 2015
Problem Definition
news
Pub.date: tk
entity pages
Rev.date: tk-1
news article
• news title
• headline
• paragraphs
• named entities
entity page
• section template
• categories
• entities (anchors)
• …..
suggest news n to entity e ?
specify the section in e for n
Automated news suggestion to entity pages
feature extraction
Some half a million people were evacuated
from the southeastern Indian coast as
Cyclone Phailin, a tropical storm from the
Bay of Bengal, bore down on India. The
states of Orissa and Andhra Pradesh, both
of which have large coastal populations, were
on high alert ahead of the storm’s expected
arrival.
entities
news article
sections
wikipedia
entity page
article entity
placement
Odisha
Bay of Bengal Phailin
Task#1
one classifier per
entity type
article section
placement
[state]:geography
[city]:climate
…
Task#2
Article-Entity Placement
Task#1
News Suggestion Attributes: Task#1
Entity Salience
Nikola Tesla
Elon Musk
Larry Page
John B. Kennedy
Entity Salience: Relative Entity Frequency
• reward entity appearing throughout the text
• reward entity appearing in the top paragraphs
• weigh an entity w.r.t its co-occurring entities
Tesla is a central
concept in the given
news article
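The three salience criteria listed above (presence throughout the text, presence in the top paragraphs, weight relative to co-occurring entities) can be combined in many ways. The sketch below is one illustrative combination and should not be read as the formula used in the cited paper: the function name `salience`, the 1/position weighting and the toy paragraphs are all assumptions.

```python
def salience(entity, paragraphs):
    """Illustrative relative-entity-frequency score.

    paragraphs: list of lists of entity mentions, in document order.
    - spread: fraction of paragraphs mentioning the entity
    - share: the entity's share of all mentions within a paragraph
    - 1 / (position + 1): earlier paragraphs count more (assumed weighting)
    """
    if not paragraphs:
        return 0.0
    containing = [i for i, p in enumerate(paragraphs) if entity in p]
    spread = len(containing) / len(paragraphs)
    weighted = 0.0
    for i in containing:
        mentions = paragraphs[i]
        share = mentions.count(entity) / len(mentions)
        weighted += share / (i + 1)
    return spread * weighted


# Toy article (hypothetical): entity mentions per paragraph.
article = [
    ["Nikola Tesla", "Nikola Tesla", "Elon Musk"],
    ["Nikola Tesla", "Larry Page"],
    ["Nikola Tesla"],
]
ranked = sorted({e for p in article for e in p},
                key=lambda e: salience(e, article), reverse=True)
```

With this toy input the entity mentioned in every paragraph ranks first, which matches the intuition of the slide that Tesla is the central concept of the article.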
News Suggestion Attributes: Task#1
Relative Entity Authority
Example entities: Elias Taban, Hillary Clinton
• entities with ‘low authority’ have a lower entry barrier for a news article
• a news article in which an entity co-occurs with ‘high authority’ entities conveys the importance of the news
• entity authority as an a priori probability or any centrality-based measure
News Suggestion Attributes: Task#1
Novelty & Redundancy
previously added news articles
• novelty is measured w.r.t previously added news articles
in an entity page
• major events have wide coverage in news media
• place the news article into the correct section
Novelty and Redundancy Measure
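The slide does not spell out the novelty measure itself. As a stand-in, the sketch below scores a candidate article by how little its vocabulary overlaps with any article already cited in the entity page, using Jaccard similarity over lowercased word sets. The choice of Jaccard, the function name `novelty` and the example strings are assumptions for illustration, not the measure from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def novelty(candidate, previously_added):
    """1 minus the highest similarity to any previously added article.

    candidate: text of the new article.
    previously_added: texts of articles already cited in the entity page.
    Returns 1.0 when nothing has been added yet.
    """
    cand_tokens = set(candidate.lower().split())
    if not previously_added:
        return 1.0
    best = max(jaccard(cand_tokens, set(p.lower().split()))
               for p in previously_added)
    return 1.0 - best


# Toy texts (hypothetical).
already_cited = ["Cyclone Phailin hits the coast of Odisha"]
score = novelty("Evacuation begins ahead of Cyclone Phailin", already_cited)
```

A low score indicates a redundant article, for example one of many reports on the same major event.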
Article-Section Placement
Task#2
Task#2: Section-template Generation
Germanwings, Adria, Lufthansa
• Section templates per entity type
• Pre-determined number of main sections
• Canonicalize sections
• Generate ‘complete’ section templates based on similar entities
• Cluster based on the X-means [3] algorithm
[3] D. Pelleg, A. W. Moore, et al. X-means: Extending k-means
with efficient estimation of the number of clusters. In ICML,
pages 727–734, 2000.
Task#2: Overall news-section fit
• What is the best section to append a given news article?
• measure overall similarity between n and the pre-computed sections in
the section templates
• Similarity aspects between news articles and sections
• Topic similarity (LDA models over the sections and news documents)
• Syntactic similarity
• Lexical similarity
• Entity-based similarity (overlap of named entities)
• Frequency
Evaluation Strategy
What constitutes the ground truth for such a task?
Challenges
• ‘Invasive’: add news articles and wait for a period of time until each is either accepted or deleted by the Wikipedia editors
• Long tail vs. trunk entities: long-tail entities might not be of particular interest to editors; hence, many ‘false positives’ will go unnoticed
• Crowdsourcing: challenging to find knowledgeable workers for long-tail entities
Approach
• Use already referenced news articles from entity pages
• Avoid the uncertainty of judgements and expertise of crowd workers
• Non-invasive approach for entity pages
• Reusable test bed for similar approaches
Experimental Setup
Datasets
[Figure: distribution of news articles, entities, and sections across the years]
Evaluation Plan
• train on years [t0, ti], test on (ti, tk]
• P/R/F1 metrics
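The evaluation plan above combines a temporal train/test split with precision, recall and F1. A minimal sketch of both pieces follows, assuming each instance is a dict with a `year` field and that predictions and ground truth are sets of (news, entity) pairs. The names `temporal_split` and `precision_recall_f1` and the toy data are hypothetical.

```python
def temporal_split(instances, t_i):
    """Train on instances with year <= t_i, test on instances after t_i."""
    train = [x for x in instances if x["year"] <= t_i]
    test = [x for x in instances if x["year"] > t_i]
    return train, test


def precision_recall_f1(predicted, actual):
    """Set-based precision, recall and F1 over suggested pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): (news id, entity) pairs.
instances = [{"id": "n1", "year": 2009}, {"id": "n2", "year": 2011},
             {"id": "n3", "year": 2013}]
train, test = temporal_split(instances, 2011)
p, r, f1 = precision_recall_f1(
    {("n1", "Odisha"), ("n2", "Odisha"), ("n3", "Phailin")},
    {("n1", "Odisha"), ("n3", "Phailin"), ("n4", "Bay of Bengal")},
)
```

Splitting by time rather than at random keeps the test articles strictly later than the training articles, which mirrors the setting in which news is suggested for an entity page at a given revision date.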
Baselines
Task#1: AEP (article-entity placement)
• B1: AEP based on Dunietz and Gillick
• B2: AEP if the entity appears in the news title
Task#2: ASP (article-section placement)
• S1: ASP based on maximum similarity to one of the sections
• S2: ASP to the most frequent section
Task#1: Article-Entity Placement
Performance
Robustness
Feature Analysis
Number of Instances
Task#2: Article-Section Placement
Conclusions
• Two-stage news suggestion approach for Wikipedia entity pages
• Model and define what makes a good news suggestion
  • Model functions for salience, relative authority, novelty and section placement, defined as attributes of a ‘good news suggestion’
• Entity profile expansion
• Extensive evaluation over 350k news articles and 73k entity pages, for the different Wikipedia states between 2009 and 2014
• A publicly available and reusable test bed for similar tasks
Next
Mining & understanding (learning) resources on the Web:
 “Extracting entity-centric knowledge/learning
resources from Web Documents“ (Stefan)
 “Automated Wikipedia Entity Enrichment with News
Sources” (Besnik)
Mining & understanding (learning) activities on the Web
 Predicting/measuring „competence“: “Behavioral
Methods for Improving the Effectiveness of Microtask
Crowdsourcing" (Ujwal)
Crowdsourcing - A Brief Introduction
Portmanteau of "crowd" and "outsourcing", first coined by Jeff Howe in a June 2006 Wired magazine article.
Accumulating small contributions from each crowd worker to solve a bigger problem.
Crowdsourcing - The Means to Many Ends
The Paid Crowdsourcing Paradigm
❏ Small monetary rewards in exchange for completing short tasks online
❏ Entertainment-driven workers primarily seek diversion by taking up
interesting, possibly challenging tasks
❏ Money-driven workers mainly attracted by monetary incentives
❏ A crowdsourcing platform acts as a marketplace for such tasks
❏ About five million tasks are completed per year at 1-5 cents each
❏ Some jobs can contain more than 300K tasks
Microtask Crowdsourcing Platforms as Online Social Environments
Crowd worker as a learner in an atypical learning environment:
❏ No information regarding the background, knowledge, or skills of a worker.
❏ Due to the short nature of crowdsourced microtasks, workers face an ‘on-the-fly’ learning situation.
❏ Comparable to experiential learning and microlearning.
❏ In many cases, workers have no time to apply their gained experience.
❏ Tasks are often for single use; high percentage of new requesters.
Training Workers for Improving Performance in
Crowdsourcing Microtasks. Ujwal Gadiraju, Besnik
Fetahu, Ricardo Kawase. ECTEL 2015; Toledo, Spain.
Crowd Workers as Learners
Challenges
○ Diverse pool of workers
○ Wide range of behavior
○ Various motivations
Ross, J., Irani, L., Silberman, M., Zaldivar, A. and Tomlinson, B.
Who are the crowdworkers?: shifting demographics in mechanical
turk. In CHI'10 Extended Abstracts on Human factors in computing
systems. ACM.
Kazai, Gabriella, Jaap Kamps, and Natasa Milic-Frayling. The face of
quality in crowdsourcing relevance labels: demographics,
personality and labeling accuracy. Proceedings of CIKM’12. ACM.
Quality Control in Crowdsourcing
➢ Typically adopted solution to prevent/flag malicious activity: Gold-Standard Questions
➢ Flourishing crowdsourcing markets, advances in malicious activity
“workers with ulterior motives, who either simply sabotage a task, or provide poor responses in an attempt to quickly attain task completion for monetary gains”
Need to understand workers' behavior and types of malicious activity.
Malicious Workers
Malicious Workers - Behavioral Patterns in a Survey
❏ Ineligible Workers (IW)
  Instruction: Please attempt this microtask ONLY IF you have successfully completed 5 microtasks previously.
  Response: ‘this is my first task’
❏ Fast Deceivers (FD)
  e.g. copy-pasting the same text in response to multiple questions, entering gibberish, etc.
  Response: ‘What’s your task?’, ‘adasd’, ‘fgfgf gsd ljlkj’
❏ Rule Breakers (RB)
  Instruction: Identify 5 keywords that represent this task (separated by commas).
  Response: ‘survey, tasks, history’, ‘previous task yellow’
❏ Smart Deceivers (SD)
  Instruction: Identify 5 keywords that represent this task (separated by commas).
  Response: ‘one, two, three, four, five’
❏ Gold Standard Preys (GSP)
  These workers abide by the instructions and provide valid responses, but stumble at the gold-standard questions!
Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. Ujwal Gadiraju, Ricardo Kawase, Stefan Dietze, Gianluca Demartini. CHI 2015; Seoul, Korea.
Workers Behavioral Patterns - Experimental Results
Automatic Classification of Worker Type
Image Transcription & Information Finding Tasks
Low-level features through
keystroke & mouse-tracking
❏ timeBeforeInput
❏ timeBeforeClick
❏ tabSwitchFreq
❏ windowToggleFreq
❏ openNewTabFreq
❏ totalMouseMovements
❏ scrollUpFreq
❏ scrollDownFreq
❏ . . .
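Low-level features of this kind can be derived from a timestamped event log. The sketch below assumes a hypothetical log format, a list of `(seconds_since_task_start, event_type)` tuples with event types such as `"keypress"` or `"tab_switch"`. Only the feature names come from the slide; the log format and the function `extract_features` are illustrative assumptions.

```python
def extract_features(events):
    """Derive a few behavioral features from a hypothetical event log.

    events: list of (seconds_since_task_start, event_type) tuples.
    Time features are None when the event never occurred.
    """
    def first_time(kind):
        times = [t for t, e in events if e == kind]
        return min(times) if times else None

    def count(kind):
        return sum(1 for _, e in events if e == kind)

    return {
        "timeBeforeInput": first_time("keypress"),
        "timeBeforeClick": first_time("click"),
        "tabSwitchFreq": count("tab_switch"),
        "scrollUpFreq": count("scroll_up"),
        "scrollDownFreq": count("scroll_down"),
        "totalMouseMovements": count("mouse_move"),
    }


# Toy log (hypothetical) for one worker on one task.
log = [(0.8, "mouse_move"), (2.5, "scroll_down"), (4.0, "click"),
       (6.2, "keypress"), (7.0, "tab_switch"), (9.1, "keypress")]
features = extract_features(log)
```

Feature dictionaries of this shape could then be turned into vectors for a supervised classifier of worker types.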
Competent Worker
Fast Deceiver
Crowd Anatomy: Behavioral Traces for Crowd Worker Modeling and Pre-selection. Ujwal Gadiraju, Gianluca Demartini, Ricardo Kawase, and Stefan Dietze. (Under review at AAAI HCOMP 2016, Austin, Texas, USA.)
Capturing Behavioral Traces ⇒ Behavioral Patterns
Worker Behavioral Patterns
❏ Multitaskers
❏ Divers & Feelers
❏ Wanderers
❏ Copy-Pasters & Typers
❏ . . .
Worker Types
❏ Competent Workers
❏ Diligent Workers
❏ Ineligible Workers
❏ Fast Deceivers
❏ Smart Deceivers
❏ Rule Breakers
❏ Incompetent Workers
❏ Sloppy Workers
Automatic Worker Type
Classification
Behavioral Traces for
Crowd Worker Modeling
and Pre-selection
Evaluation of Automatic Worker Type Classification
Supervised Machine Learning
Model
❏ Automatic classification at scale
❏ Random forest classifier
❏ Classifiers evaluated using 10-fold
cross validation
❏ Information Finding & Content
Creation Tasks
Evaluation for Information Finding Tasks
Benefit of Automatic Worker Type Classification
Information Finding
Tasks (finding
middle names)
Content Creation
Tasks
(image transcription)
PRE-SELECTION OF DESIRED WORKER TYPES
Task Turnover Time
“the amount of time required to acquire the full set of
judgments from crowd workers, thereby completing and
finalizing a task considering pre-defined criteria (such as
qualification tests or pre-selection)”
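Read literally, this definition is the time from launching a task until the last required judgment arrives. A minimal sketch under that reading follows, assuming timestamps in seconds and a list of arrival times for judgments that already passed any pre-defined criteria. The function name `task_turnover_time` and the numbers are hypothetical.

```python
def task_turnover_time(launch_time, judgment_times, required):
    """Time from launch until the required number of judgments is reached.

    judgment_times: arrival times of accepted judgments (any order).
    Returns None if fewer than `required` judgments have arrived.
    """
    arrived = sorted(t for t in judgment_times if t >= launch_time)
    if required <= 0 or len(arrived) < required:
        return None
    return arrived[required - 1] - launch_time


# Toy example (hypothetical): three judgments are required.
turnover = task_turnover_time(0, [30, 10, 50, 20], required=3)
```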
Task Turnover Time
Information Finding
Tasks (finding
middle names)
Content Creation
Tasks
(image transcription)
Cognitive Theories & Entailing Data
Paradox of Choice in the Crowd
❏ Many available platforms and tasks
❏ Overload of choices for workers
❏ Detrimental effects on decision
making (psychology & social theory
works)
❏ Workers settle for less suitable tasks
❏ More capable workers are deprived
of an opportunity to work on suitable
tasks
❏ Overall effectiveness of the
crowdsourcing paradigm decreases
Typically Adopted Solution:
Crowd Worker Pre-selection
The Dunning-Kruger Effect
❏ Cognitive bias: incompetent individuals exhibit inflated self-assessments and illusory superiority.
❏ Incompetence in a particular domain reduces the metacognitive ability of individuals to realize it.
❏ Incompetent individuals miscalibrate by erroneously assessing themselves, while competent individuals miscalibrate by erroneously assessing others.
Self-Assessments for Pre-selection of Crowd Workers
❏ Crowd workers often lack awareness about their true level of
competence
❏ Novel worker pre-selection method based on self-assessments
& performance
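One way to picture pre-selection based on both self-assessment and performance is to keep workers who score well on a qualification test and whose own estimate of that score is close to the actual score. The sketch below follows that reading. The thresholds `min_score` and `max_gap`, the field names and the function `preselect` are assumptions for illustration and are not taken from the cited paper.

```python
def preselect(workers, min_score=0.8, max_gap=0.1):
    """Select workers who perform well and assess themselves accurately.

    workers: list of dicts with
      'id'              - worker identifier
      'score'           - qualification-test accuracy in [0, 1]
      'self_assessment' - the worker's own estimate of that accuracy
    min_score and max_gap are illustrative thresholds.
    """
    selected = []
    for w in workers:
        gap = abs(w["self_assessment"] - w["score"])
        if w["score"] >= min_score and gap <= max_gap:
            selected.append(w["id"])
    return selected


# Toy workers (hypothetical).
workers = [
    {"id": "w1", "score": 0.9, "self_assessment": 0.85},  # competent, calibrated
    {"id": "w2", "score": 0.4, "self_assessment": 0.9},   # overestimates
    {"id": "w3", "score": 0.9, "self_assessment": 0.5},   # underestimates
]
chosen = preselect(workers)
```

The second toy worker illustrates the Dunning-Kruger pattern described earlier: low performance paired with a high self-assessment.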
Evaluation in
a Sentiment
Analysis Task
Worker
Performance Data
Using Worker Self-Assessments for Competence-based Pre-Selection. Ujwal Gadiraju, Besnik Fetahu, Ricardo Kawase, Patrick Siehndel and Stefan Dietze. (Under review at ACM CSCW 2017, Portland, Oregon, USA.)
Summary
Mining & understanding (learning) resources on the Web:
 “Extracting entity-centric knowledge/learning
resources from Web Documents“ (Stefan)
 “Automated Wikipedia Entity Enrichment with News
Sources” (Besnik)
Mining & understanding (learning) activities on the Web
 Predicting/measuring „competence“: “Behavioral
Methods for Improving the Effectiveness of Microtask
Crowdsourcing" (Ujwal)
Thank you!
Stefan Dietze, Besnik Fetahu, Ujwal Gadiraju
• http://www.l3s.de
• http://stefandietze.net
• http://l3s.de/~fetahu
• http://www.l3s.de/~gadiraju/

More Related Content

What's hot

Combining a co-occurrence-based and a semantic measure for entity linking
Combining a co-occurrence-based and a semantic measure for entity linkingCombining a co-occurrence-based and a semantic measure for entity linking
Combining a co-occurrence-based and a semantic measure for entity linkingBesnik Fetahu
 
Interpreting Data Mining Results with Linked Data for Learning Analytics
Interpreting Data Mining Results with Linked Data for Learning AnalyticsInterpreting Data Mining Results with Linked Data for Learning Analytics
Interpreting Data Mining Results with Linked Data for Learning AnalyticsMathieu d'Aquin
 
Semantic Web, Linked Data and Education: A Perfect Fit?
Semantic Web, Linked Data and Education: A Perfect Fit?Semantic Web, Linked Data and Education: A Perfect Fit?
Semantic Web, Linked Data and Education: A Perfect Fit?Mathieu d'Aquin
 
Supporting the use of data: From data repositories to service discovery
Supporting the use of data: From data repositories to service discoverySupporting the use of data: From data repositories to service discovery
Supporting the use of data: From data repositories to service discoveryMathieu d'Aquin
 
LinkedUp - Linked Data & Education
LinkedUp - Linked Data & EducationLinkedUp - Linked Data & Education
LinkedUp - Linked Data & EducationStefan Dietze
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 
Web Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebWeb Science Synergies: Exploring Web Knowledge through the Semantic Web
Web Science Synergies: Exploring Web Knowledge through the Semantic WebStefan Dietze
 
WWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedWWW2014 Tutorial: Online Learning & Linked Data - Lessons Learned
WWW2014 Tutorial: Online Learning & Linked Data - Lessons LearnedStefan Dietze
 
Working with Social Media Data: Ethics & good practice around collecting, usi...
Working with Social Media Data: Ethics & good practice around collecting, usi...Working with Social Media Data: Ethics & good practice around collecting, usi...
Working with Social Media Data: Ethics & good practice around collecting, usi...Nicola Osborne
 
Lcwebinar rise of-the_databrarian_73961
Lcwebinar rise of-the_databrarian_73961Lcwebinar rise of-the_databrarian_73961
Lcwebinar rise of-the_databrarian_73961Sigaard
 
Research Life Cycle for GeoData 2014
Research Life Cycle for GeoData 2014Research Life Cycle for GeoData 2014
Research Life Cycle for GeoData 2014Carly Strasser
 
Keystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenanceKeystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenancePaolo Missier
 
Search, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving DataSearch, Exploration and Analytics of Evolving Data
Search, Exploration and Analytics of Evolving DataNattiya Kanhabua
 
LUCERO - Building the Open University Web of Linked Data
LUCERO - Building the Open University Web of Linked DataLUCERO - Building the Open University Web of Linked Data
LUCERO - Building the Open University Web of Linked DataMathieu d'Aquin
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesLaura Po
 
Tragedy of the (Data) Commons
Tragedy of the (Data) CommonsTragedy of the (Data) Commons
Tragedy of the (Data) CommonsJames Hendler
 

Mining and Understanding Activities and Resources on the Web

  • 1. Mining and Understanding (Learning) Activities and Resources on the Web. Stefan Dietze, Besnik Fetahu, Ujwal Gadiraju, L3S Research Center, Hannover, Germany, 14/07/16
  • 2. L3S Research Center
      Research areas: Web science, Information Retrieval, Semantic Web, Social Web Analytics, Knowledge Discovery, Human Computation
      Interdisciplinary application areas: digital humanities, TEL/education, Web archiving, mobility
      Some projects. See also: http://www.l3s.de
  • 3. “Intelligent Access to Information” / L3S
  • 4. Team & current projects
      Projects: LA4S, LearnWeb, GlycoRec
      Team: Ran Yu, Ujwal Gadiraju, Besnik Fetahu, Stefan Dietze
  • 5. AFEL – Analytics for Everyday (Online) Learning. Figure courtesy of Mathieu d'Aquin
  • 6. AFEL – Analytics for Everyday Learning
      Work packages: WP1 Data Capture, WP2 Data Enrichment, WP3 Visual Analytics, WP4 Cognitive Modelling, WP5 Use Cases and Evaluation
      Stages: Collect & Enrich Data; Detect and Model User & Learning Activities; Analyse Learning Behaviour; Apply and Evaluate
      Figure courtesy of Mathieu d'Aquin
  • 7. AFEL – Analytics for Everyday Learning
      Entities/notions, e.g.: Learning, Learning Resource, Learning Activity, Learning Performance, Knowledge, Competence, ...
      [Figure: Collect & Enrich Data; Detect and Model User & Learning Activities; Analyse Learning Behaviour; WP2 Data Enrichment; WP4 Cognitive Modelling]
  • 8. AFEL – Analytics for Everyday Learning
      Entities/notions, e.g.: Learning, Learning Resource, Learning Activity, Learning Performance, Knowledge, Competence, ...
      [Figure: Collect & Enrich Data; Detect and Model User & Learning Activities; Analyse Learning Behaviour; WP2 Data Enrichment; WP4 Cognitive Modelling]
      Understanding informal/micro learning on the Web (e.g. Social Web). Challenges:
        • Absence of competence indicators/assessments etc.?
        • Measuring/detecting progress/competence etc., i.e. distinguishing good from bad performance?
        • Understanding learning activities requires understanding the learning resources and the entities involved
        • Heterogeneity and scale of the data/activities/documents to consider (i.e. the Web)
        • ...
  • 9. Overview
      Mining & understanding (learning) resources on the Web:
        • “Extracting entity-centric knowledge/learning resources from Web Documents” (Stefan)
        • “Automated Wikipedia Entity Enrichment with News Sources” (Besnik)
      Mining & understanding (learning) activities on the Web:
        • Predicting/measuring “competence”: “Behavioral Methods for Improving the Effectiveness of Microtask Crowdsourcing” (Ujwal)
      [Figure: Collect & Enrich Data; Detect and Model User & Learning Activities; Analyse Learning Behaviour; WP2 Data Enrichment; WP4 Cognitive Modelling]
  • 10. Understanding knowledge resources on the Web
        • Detecting (salient) entities in Web resources/documents
        • NLP-based named entity recognition and disambiguation (Babelfy, DBpedia Spotlight, etc.)
        • Usually uses background knowledge graphs (e.g. DBpedia/Wikipedia, Linked Data)
      [Figure: example entity mentions and candidate types: Apple, Digital Revolution, Steve Jobs, IT Company, Bank, Jobs, Biopic/Movie, Person, Band?]
  • 11. Web documents vs structured entity-centric knowledge graphs
        • The Web: approx. 46,000,000,000,000 (46 trillion) Web pages indexed by Google
        • Linked Data & Knowledge Graphs: structured entity-centric data, approx. 1000 datasets & 100 billion statements (DBpedia, etc.)
        • Linking entities (NED/NER) from documents is:
            • Computationally complex
            • Error-prone
            • Problematic for less popular entities (example: regional news sites)
        • Knowledge graphs are less dynamic than Web documents
      [Figure: Unstructured Web documents; Linked Data & Knowledge Graphs]
  • 12. Entity-centric data on the Web: Web markup (schema.org)
        • Markup: entity-centric data embedded in the Web (30% of all Web documents in 2015)
        • Using W3C standards (RDFa, Microdata, Microformats)
        • Schema.org: initiative from Google, Yahoo, Bing and Yandex to push a common vocabulary
        • Same order of magnitude as the Web itself with respect to scale and dynamics (as opposed to knowledge graphs, DBpedia et al.)
        • Rich source of knowledge and data going beyond existing knowledge bases (e.g. Wikipedia)
      Example markup statements (entity, property, value):
        node1  name           French Grammar advanced
        node1  publisher      The Open University
        node1  publisher      Nature
        node1  datePublished  1956
        node1  datePublished  1953
        node2  publisher      Pearson Education
        node2  publisher      Elsevier
        node2  published      03-01-2014
      [Figure: Unstructured Web documents; Linked Data & Knowledge Graphs; Embedded Markup (schema.org)]
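To make the idea of embedded markup concrete, the following is a minimal sketch of how statements like those on slide 12 could be read out of a page that uses Microdata. It is not the authors' extraction pipeline: the `MicrodataParser` class, the sample HTML and the `CreativeWork` type are assumptions made for illustration, and nested items and RDFa/Microformats are not handled.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"meta", "link", "br", "img", "input", "hr"}


class MicrodataParser(HTMLParser):
    """Collect (node, property, value) statements from Microdata attributes."""

    def __init__(self):
        super().__init__()
        self.statements = []  # extracted (node_id, property, value) triples
        self._nodes = []      # stack of open itemscope nodes as (node_id, depth)
        self._depth = 0       # current element nesting depth
        self._pending = None  # itemprop whose text content is still expected
        self._count = 0       # counter used to mint node ids

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in VOID_TAGS:
            # e.g. <meta itemprop="datePublished" content="1956">
            if self._nodes and "itemprop" in attrs and "content" in attrs:
                node = self._nodes[-1][0]
                self.statements.append((node, attrs["itemprop"], attrs["content"]))
            return
        self._depth += 1
        if "itemscope" in attrs:
            self._count += 1
            node = f"node{self._count}"
            self._nodes.append((node, self._depth))
            if attrs.get("itemtype"):
                self.statements.append((node, "type", attrs["itemtype"]))
        elif "itemprop" in attrs and self._nodes:
            self._pending = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._pending and self._nodes and text:
            self.statements.append((self._nodes[-1][0], self._pending, text))
            self._pending = None

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self._pending = None
        if self._nodes and self._nodes[-1][1] == self._depth:
            self._nodes.pop()  # the element that opened this item is closing
        self._depth -= 1


# Hypothetical page fragment; the values are taken from the slide example.
SAMPLE = """
<div itemscope itemtype="http://schema.org/CreativeWork">
  <span itemprop="name">French Grammar advanced</span>
  <span itemprop="publisher">The Open University</span>
  <meta itemprop="datePublished" content="1956">
</div>
"""

parser = MicrodataParser()
parser.feed(SAMPLE)
parser.close()
for statement in parser.statements:
    print(statement)
```

Each page yields its own locally identified nodes, which is why the same real-world entity ends up described by many unlinked, partly conflicting statements across the Web.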
  • 13. Example: entity markup of learning resources on the Web
        • “Learning Resources Metadata Initiative (LRMI)”: schema.org vocabulary for annotation of learning resources (informal, formal, etc.)
        • Approx. 5000 PLDs (pay-level domains) in “Common Crawl”
        • LRMI adoption on the Web (WDC) [LILE16]:
            • 2014: 30,599,024 quads, 4,182,541 resources
            • 2013: 10,636,873 quads, 1,461,093 resources
      [Figure: power law distribution across providers (4805 providers/PLDs); x-axis: PLD (ranked); y-axis: count (log); series: # entities, # statements]
      [LILE16] Taibi, D., Dietze, S., Towards embedded markup of learning resources on the Web: a quantitative Analysis of LRMI Terms Usage, in Companion Publication of the IW3C2 WWW 2016 Conference, IW3C2 2016, Montreal, Canada, April 11, 2016
  • 14. Entity-centric markup on the Web: challenges
      • Coreferences: 18,000 results for <"Iphone 6", type, s:Product> (8.6 quads on average) in Common Crawl
      • Redundancy: <s, schema:name, "Iphone 6"> occurs 1,000 times in Common Crawl
      • Lack of links: largely unlinked entity descriptions
      • Errors (typos and schema violations, see Meusel et al. [ESWC2015]):
          • Wrong namespaces, such as http://schma.org
          • Undefined types and predicates: 9.7%, less common than in LOD
          • Confusion of datatype and object properties, e.g. <s1, s:publisher, "Springer">: 24.35% object property issues vs 8% in LOD
          • Data property range violations, e.g. literals vs numbers: 12.6% vs 4.6% in LOD
      • Why not use markup as a knowledge graph of the entities involved in (learning) resources (similar to DBpedia/Wikipedia)?
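The "wrong namespaces" error class above (e.g. http://schma.org) lends itself to a small illustration. The following is a minimal sketch, not the method used by the authors or by Meusel et al.: it assumes a hand-picked list of known namespaces (`KNOWN_NAMESPACES`, a hypothetical name) and uses fuzzy string matching from Python's standard library to suggest a likely correction.

```python
import difflib

# Hypothetical whitelist of vocabularies; a real pipeline would use a much larger list.
KNOWN_NAMESPACES = [
    "http://schema.org/",
    "http://purl.org/dc/terms/",
    "http://xmlns.com/foaf/0.1/",
]


def suggest_namespace(namespace, known=KNOWN_NAMESPACES, cutoff=0.8):
    """Return the closest known namespace, or None if nothing is similar enough."""
    if namespace in known:
        return namespace
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(namespace, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(suggest_namespace("http://schma.org/"))
```

The cutoff of 0.8 is an arbitrary assumption; a stricter value reduces false corrections at the cost of missing heavier typos.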
  • 15. Data fusion for consolidating entity-centric Web markup
      • Improving understanding of resources: consolidating entity-centric Web data for a given document/resource/entity?
      • Markup as a distributed knowledge graph/base, e.g. to augment existing knowledge bases (e.g. DBpedia/Wikipedia)?
      • Example:
          • Query: iPhone 6, type:(Product)
          • Facts on the Web (crawl, i.e. billions of entities/facts): <e1, s:name, "Iphone 6">, <e2, s:brand, "Apple Inc.">, <e3, s:brand, "Apple">, <e4, s:weight, 127>, <e5, s:releaseDate, "1.12.1972">
          • Resulting entity description: brand Apple Inc.; weight 129; date 30.09.2015; manufacturer Foxconn; storage 16 GB
      • Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Entity summarisation on structured web markup. In The Semantic Web: ESWC 2016 Satellite Events. Springer, 2016.
      • Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Fact Selection for data fusion on structured web markup. ICDE2017, IEEE International Conference on Data Engineering, in progress.
  • 16. A supervised ML approach to select entity facts from the Web
      • 1. Fact/entity retrieval: BM25 entity retrieval model on a markup index (Common Crawl)
      • 2. Fact selection: supervised ML classifier (SVM), using 3 feature categories (relevance, authority, clustering)
      • Experiments on Common Crawl: products, movies, books (approx. 3 billion facts)
      • Example:
          • Query: iPhone 6, type:(Product)
          • Web page markup from the Web (crawl): approx. 125,000 facts for "iPhone6"
          • Candidate facts after retrieval: node1 brand _node-x; node1 brand Apple Inc.; node1 weight 129; node2 weight 172; node2 manufacturer Foxconn; node3 releasedate 01.12.1972; node3 manufacturer Foxconn
          • Entity description after fact selection (trained SVM classifier): brand Apple Inc.; weight 129; date 30.09.2015; manufacturer Foxconn; storage 16 GB
          • New queries: Foxconn, type:(Organization); Cupertino, type:(City); Apple Inc., type:(Organization)
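The retrieval step on the slide above relies on BM25. As a rough illustration only, the sketch below scores a handful of toy entity descriptions against a query with the standard Okapi BM25 formula. The whitespace tokenizer, the parameter values `k1=1.2` and `b=0.75`, and the flat-text representation of an entity description are assumptions made for this example; the slides do not specify how the authors' index is built.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_rank(query, documents, k1=1.2, b=0.75):
    """Return (score, document index) pairs, best match first.

    Assumes a non-empty list of documents.
    """
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for index, doc in enumerate(docs):
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append((score, index))
    return sorted(scores, key=lambda pair: -pair[0])


if __name__ == "__main__":
    descriptions = ["iphone 6 apple inc", "galaxy s6 samsung", "iphone case leather"]
    for score, index in bm25_rank("iphone 6", descriptions):
        print(round(score, 3), descriptions[index])
```

The candidate facts returned by such a ranking would then be passed to the supervised fact selection step, which is not sketched here.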
  • 17. Evaluation & results
      • Performance:
          • Outperforms baselines (BM25F, CBFS)
          • Strong variance across types/queries
          • Average precision between 75% and 98%
  • 18. Evaluation & results: markup vs DBpedia/Wikipedia
      • Can markup augment existing knowledge graphs?
      • Comparison of the obtained facts with existing knowledge bases (DBpedia/Wikipedia):
          • "new": fact not existing in DBpedia (e.g. a book's releaseDate in Wiki/DBpedia)
          • "new-p": property not existing in DBpedia (e.g. a book's release countries)
          • "existing": fact already in DBpedia
      • On average approx. 60% new facts
  • 19. Conclusions
      • Data fusion on markup as a means to extract rich descriptions of entities in Web documents
      • Understanding the semantics of activities and resources (particularly learning resources)
      • Markup: rich source of entity-centric data (30% of the Web, i.e. 16 trillion Web pages)
      • Potential training data for NED/NER approaches
      • Potential for augmenting existing knowledge graphs/bases (DBpedia/Wikipedia et al.)
  • 20. Next
      • Mining & understanding (learning) resources on the Web:
          • "Extracting entity-centric knowledge/learning resources from Web Documents" (Stefan)
          • "Automated Wikipedia Entity Enrichment with News Sources" (Besnik)
      • Mining & understanding (learning) activities on the Web:
          • Predicting/measuring "competence": "Behavioral Methods for Improving the Effectiveness of Microtask Crowdsourcing" (Ujwal)
  • 21. Outline: Wikipedia Entity Enrichment
      • Besnik Fetahu, Katja Markert, Avishek Anand: Automated News Suggestions for Populating Wikipedia Entity Pages. CIKM 2015: 323-332
  • 22. Introduction
      • Odisha cyclone (location entity: Odisha) vs Hurricane Katrina (location entity: New Orleans)
      • Human fatalities: approx. 10k (Odisha cyclone) vs 1.8k (Hurricane Katrina)
      • Estimated damages: $4.5 billion (Odisha cyclone) vs $108 billion (Hurricane Katrina)
      • The 'Odisha cyclone' has no coverage in the page of the location entity 'Odisha'
      • 'Hurricane Katrina' finds broad coverage in the page of the location entity 'New Orleans'
  • 23. Introduction
      • Entity pages consist of facts and statements supported by external references
      • News as authoritative sources with emerging facts and events
      • Delay between the reporting of an event in the news and its inclusion in entity pages [1]
      • Incomplete section structure for long-tail entities
      • Several implications for real-world applications that make use of Wikipedia, e.g. KB maintenance, entity disambiguation, etc.
      • [1] Besnik Fetahu, Abhijat Anand, Avishek Anand: How much is Wikipedia lagging behind news? WebSci 2015
  • 24. Motivation: News Density in Wikipedia
      • Citation templates ('news', 'books', 'web', 'journal', etc.)
      • ~60% 'web' vs. 20% 'news' citations
      • On average there are ~6.5 news citations per entity
      • On average a news article is assigned to ~1.3 entities
      • The most cited news article is cited by 81 entities
      • Besnik Fetahu, Abhijat Anand, Avishek Anand: How much is Wikipedia lagging behind news? WebSci 2015
  • 25. Problem Definition
      • Input: news with publication date t_k and entity pages with revision date t_(k-1)
      • News article: news title, headline, paragraphs, named entities
      • Entity page: section template, categories, entities (anchors), ...
      • Task 1: suggest news article n to entity e?
      • Task 2: specify the section in e for n
  • 26. Automated news suggestion to entity pages
      • Example news article: "Some half a million people were evacuated from the southeastern Indian coast as Cyclone Phailin, a tropical storm from the Bay of Bengal, bore down on India. The states of Orissa and Andhra Pradesh, both of which have large coastal populations, were on high alert ahead of the storm's expected arrival."
      • Feature extraction: entities from the news article, sections from the Wikipedia entity page
      • Task#1, article-entity placement: e.g. Odisha, Bay of Bengal, Phailin
      • One classifier per entity type
      • Task#2, article-section placement: e.g. [state]:geography, [city]:climate, ...
  • 28. News Suggestion Attributes: Task#1, Entity Salience
      • Entity salience: relative entity frequency
          • Reward an entity appearing throughout the text
          • Reward an entity appearing in the top paragraphs
          • Weigh an entity w.r.t. its co-occurring entities
      • Example entities: Nikola Tesla, Elon Musk, Larry Page, John B. Kennedy; Tesla is a central concept in the given news article
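To make the three salience criteria above more concrete, here is a minimal sketch of a position-weighted relative entity frequency. It is an illustrative assumption, not the formula from the CIKM 2015 paper: each mention is weighted by `1 / (paragraph position + 1)` so that top paragraphs count more, and the weighted counts are normalised over all co-occurring entities.

```python
from collections import defaultdict


def relative_entity_frequency(paragraph_entities):
    """Map each entity to a salience score in [0, 1]; scores sum to 1.

    paragraph_entities: list of paragraphs, each a list of entity mentions,
    ordered from the top of the article to the bottom.
    """
    weighted = defaultdict(float)
    for position, entities in enumerate(paragraph_entities):
        weight = 1.0 / (position + 1)  # earlier paragraphs weigh more (assumed decay)
        for entity in entities:
            weighted[entity] += weight
    total = sum(weighted.values())
    if not total:
        return {}
    # Normalising by the total makes each score relative to co-occurring entities.
    return {entity: value / total for entity, value in weighted.items()}


if __name__ == "__main__":
    article = [
        ["Nikola Tesla", "Elon Musk"],
        ["Nikola Tesla"],
        ["Nikola Tesla", "Larry Page"],
    ]
    print(relative_entity_frequency(article))
```

In this toy article the entity mentioned in every paragraph, including the first, receives the highest score, which matches the intuition that it is the central concept.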
  • 29. News Suggestion Attributes: Task#1, Relative Entity Authority
      • Example entities: Elias Taban, Hillary Clinton
      • Entities with 'low authority' have a lower entry barrier for a news article
      • A news article in which an entity co-occurs with 'high authority' entities conveys the importance of the news
      • Entity authority as an a priori probability or any centrality-based measure
  • 30. News Suggestion Attributes: Task#1, Novelty & Redundancy
      • Novelty is measured w.r.t. previously added news articles in an entity page
      • Major events have wide coverage in news media
      • Place the news article into the correct section
      • [Figure: novelty and redundancy measure over previously added news articles]
  • 32. Task#2: Section-template Generation
      • Example entities: Germanwings, Adria, Lufthansa
      • Section templates per entity type
      • Pre-determined number of main sections
      • Canonicalize sections
      • Generate 'complete' section templates based on similar entities
      • Cluster based on the X-means [3] algorithm
      • [3] D. Pelleg, A. W. Moore, et al. X-means: Extending k-means with efficient estimation of the number of clusters. In ICML, pages 727–734, 2000.
  • 33. Task#2: Overall news-section fit
      • What is the best section to append a given news article to?
      • Measure the overall similarity between n and the pre-computed sections in the section templates
      • Similarity aspects between news articles and sections:
          • Topic similarity (LDA models over the sections and news documents)
          • Syntactic similarity
          • Lexical similarity
          • Entity-based similarity (overlap of named entities)
          • Frequency
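Two of the similarity aspects listed above are easy to illustrate. The sketch below combines an entity-based similarity (Jaccard overlap of named entities) with a lexical similarity (cosine over bag-of-words counts) and picks the best-fitting section. The equal 0.5/0.5 weighting, the dictionary layout, and the function names are assumptions for this example; the topic, syntactic and frequency aspects are omitted, and the authors learn the placement with a classifier rather than a fixed formula.

```python
import math
from collections import Counter


def jaccard(items_a, items_b):
    """Overlap of two collections, e.g. of named entities."""
    a, b = set(items_a), set(items_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cosine(text_a, text_b):
    """Cosine similarity over simple bag-of-words term counts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[term] * vb[term] for term in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def best_section(news, sections):
    """Return the name of the section with the highest combined similarity."""
    def fit(section):
        entity_sim = jaccard(news["entities"], section["entities"])
        lexical_sim = cosine(news["text"], section["text"])
        return 0.5 * entity_sim + 0.5 * lexical_sim  # equal weights are an assumption
    return max(sections, key=lambda name: fit(sections[name]))


if __name__ == "__main__":
    news = {"text": "cyclone phailin hits the coast", "entities": ["Phailin", "Bay of Bengal"]}
    sections = {
        "Geography": {"text": "the coast and rivers of the state", "entities": ["Bay of Bengal"]},
        "Climate": {"text": "cyclone season and monsoon", "entities": ["Phailin", "Bay of Bengal"]},
    }
    print(best_section(news, sections))
```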
  • 34. Evaluation Strategy
      • What constitutes the ground truth for such a task?
      • Challenges:
          • 'Invasive': add news articles and wait for a time period until they are either accepted or deleted by the Wikipedia editors
          • Long-tail vs. trunk entities: long-tail entities might not be of particular interest to editors, hence many 'false positives' will go unnoticed
          • Crowdsourcing: challenging to find knowledgeable workers for long-tail entities
      • Approach:
          • Use already referenced news articles from entity pages
          • Avoid the uncertainty of judgements and expertise of crowd workers
          • Non-invasive approach for entity pages
          • Reusable test bed for similar approaches
  • 35. Experimental Setup
      • Datasets: [Table: distribution of news articles, entities, and sections across the years]
      • Evaluation plan:
          • Train on years [t_0, t_i], test on (t_i, t_k]
          • P/R/F1 metrics
      • Baselines, Task#1 (AEP):
          • B1: AEP based on Dunietz and Gillick
          • B2: AEP if the entity appears in the news title
      • Baselines, Task#2 (ASP):
          • S1: ASP based on max similarity to one of the sections
          • S2: ASP to the most frequent section
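The evaluation plan above (train on earlier years, test on later ones, report P/R/F1) could be sketched as follows. The instance layout with a `year` key and the helper names are assumptions for this illustration; only the metric definitions are standard.

```python
def temporal_split(instances, last_training_year):
    """Train on instances up to and including a year, test on the later ones."""
    train = [x for x in instances if x["year"] <= last_training_year]
    test = [x for x in instances if x["year"] > last_training_year]
    return train, test


def precision_recall_f1(predicted, actual):
    """Compute P/R/F1 for parallel lists of boolean predictions and labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    data = [{"year": y} for y in (2009, 2010, 2011, 2012, 2013, 2014)]
    train, test = temporal_split(data, last_training_year=2011)
    print(len(train), len(test))
    print(precision_recall_f1([True, True, False, True], [True, False, True, True]))
```

Splitting by time rather than at random keeps the test articles strictly newer than the training articles, which mirrors how news would arrive in practice.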
  • 36. Task#1: Article-Entity Placement
      • [Figures: performance; robustness; feature analysis; number of instances]
  • 37. Task#2: Article-Section Placement
  • 38. Conclusions
      • Two-stage news suggestion approach for Wikipedia entity pages
      • Model and define what makes a good news suggestion
      • Model functions for salience, relative authority, novelty and section placement, defined as attributes of a 'good news suggestion'
      • Entity profile expansion
      • Extensive evaluation over 350k news articles, 73k entity pages and the different Wikipedia states between 2009 and 2014
      • A publicly available and reusable test bed for similar tasks
  • 39. Next
      • Mining & understanding (learning) resources on the Web:
          • "Extracting entity-centric knowledge/learning resources from Web Documents" (Stefan)
          • "Automated Wikipedia Entity Enrichment with News Sources" (Besnik)
      • Mining & understanding (learning) activities on the Web:
          • Predicting/measuring "competence": "Behavioral Methods for Improving the Effectiveness of Microtask Crowdsourcing" (Ujwal)
  • 40. Crowdsourcing - A Brief Introduction
      • Portmanteau of "crowd" and "outsourcing", first coined by Jeff Howe in a June 2006 Wired magazine article
      • Accumulating small contributions from each crowd worker to solve a bigger problem
  • 41. Crowdsourcing - The Means to Many Ends
  • 42. The Paid Crowdsourcing Paradigm
      • Small monetary rewards in exchange for completing short tasks online
      • Entertainment-driven workers primarily seek diversion by taking up interesting, possibly challenging tasks
      • Money-driven workers are mainly attracted by monetary incentives
      • A crowdsourcing platform acts as a marketplace for such tasks
      • About five million tasks are completed per year at 1-5 cents each
      • Some jobs can contain more than 300K tasks
  • 43. Crowd Workers as Learners: Microtask Crowdsourcing Platforms as Online Social Environments
      • Crowd worker as a learner in an atypical learning environment:
          • No information regarding the background, knowledge, or skills of a worker
          • Due to the short nature of crowdsourced microtasks, workers face an 'on-the-fly' learning situation
          • Comparable to experiential learning and microlearning
          • In many cases, workers have no time to apply their gained experience
          • Often for single use; high % of new requesters
      • Training Workers for Improving Performance in Crowdsourcing Microtasks. Ujwal Gadiraju, Besnik Fetahu, Ricardo Kawase. ECTEL 2015; Toledo, Spain.
  • 44. Quality Control in Crowdsourcing
      • Challenges:
          • Diverse pool of workers
          • Wide range of behavior
          • Various motivations
      • Ross, J., Irani, L., Silberman, M., Zaldivar, A. and Tomlinson, B. Who are the crowdworkers?: shifting demographics in mechanical turk. In CHI'10 Extended Abstracts on Human factors in computing systems. ACM.
      • Kazai, Gabriella, Jaap Kamps, and Natasa Milic-Frayling. The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy. Proceedings of CIKM'12. ACM.
  • 45. Malicious Workers
      • Typically adopted solution to prevent/flag malicious activity: gold-standard questions
      • Flourishing crowdsourcing markets are accompanied by advances in malicious activity
      • Malicious workers: "workers with ulterior motives, who either simply sabotage a task, or provide poor responses in an attempt to quickly attain task completion for monetary gains"
      • Need to understand workers' behavior and the types of malicious activity
  • 46. Malicious Workers - Behavioral Patterns in a Survey
      • Ineligible Workers (IW). Instruction: "Please attempt this microtask ONLY IF you have successfully completed 5 microtasks previously." Response: 'this is my first task'
      • Fast Deceivers (FD), e.g. copy-pasting the same text in response to multiple questions, entering gibberish, etc. Responses: 'What's your task?', 'adasd', 'fgfgf gsd ljlkj'
      • Rule Breakers (RB). Instruction: "Identify 5 keywords that represent this task (separated by commas)." Responses: 'survey, tasks, history', 'previous task yellow'
      • Smart Deceivers (SD). Instruction: "Identify 5 keywords that represent this task (separated by commas)." Response: 'one, two, three, four, five'
      • Gold Standard Preys (GSP): these workers abide by the instructions and provide valid responses, but stumble at the gold-standard questions
      • Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. Ujwal Gadiraju, Ricardo Kawase, Stefan Dietze, Gianluca Demartini. CHI 2015; Seoul, Korea.
  • 47. Workers' Behavioral Patterns - Experimental Results
  • 48. Automatic Classification of Worker Type
      • Image Transcription & Information Finding Tasks
  • 49. Capturing Behavioral Traces ⇒ Behavioral Patterns
      • Low-level features through keystroke & mouse-tracking:
          • timeBeforeInput
          • timeBeforeClick
          • tabSwitchFreq
          • windowToggleFreq
          • openNewTabFreq
          • totalMouseMovements
          • scrollUpFreq
          • scrollDownFreq
          • ...
      • [Figure: behavioral traces of a competent worker vs a fast deceiver]
      • Crowd Anatomy: Behavioral Traces for Crowd Worker Modeling and Pre-selection. Ujwal Gadiraju, Gianluca Demartini, Ricardo Kawase, and Stefan Dietze. Under review at AAAI HCOMP 2016, Austin, Texas, USA.
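As a rough illustration of how some of the low-level features named above might be derived, the sketch below aggregates a list of timestamped interaction events. The event format (dictionaries with `t`, `type` and, for scroll events, `delta`) and the sign convention for scrolling are hypothetical; the slides do not describe the authors' logging format.

```python
def extract_features(events, task_start=0.0):
    """Aggregate raw interaction events into a few behavioral features.

    events: list of dicts such as {"t": 4.2, "type": "keydown"} or
    {"t": 2.0, "type": "scroll", "delta": 120} (hypothetical format;
    a positive delta is assumed to mean scrolling down).
    """
    def first_time(kind):
        times = [e["t"] for e in events if e["type"] == kind]
        return min(times) - task_start if times else None

    scrolls = [e for e in events if e["type"] == "scroll"]
    return {
        "timeBeforeInput": first_time("keydown"),
        "timeBeforeClick": first_time("click"),
        "scrollUpFreq": sum(1 for e in scrolls if e["delta"] < 0),
        "scrollDownFreq": sum(1 for e in scrolls if e["delta"] > 0),
        "totalMouseMovements": sum(1 for e in events if e["type"] == "mousemove"),
    }


if __name__ == "__main__":
    log = [
        {"t": 1.5, "type": "mousemove"},
        {"t": 2.0, "type": "scroll", "delta": 120},
        {"t": 3.0, "type": "scroll", "delta": -120},
        {"t": 4.2, "type": "keydown"},
        {"t": 6.0, "type": "click"},
    ]
    print(extract_features(log))
```

Feature vectors of this kind could then serve as input to a supervised classifier of worker types.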
  • 50. Capturing Behavioral Traces ⇒ Behavioral Patterns: Automatic Worker Type Classification
      • Worker behavioral patterns:
          • Multitaskers
          • Divers & Feelers
          • Wanderers
          • Copy-Pasters & Typers
          • ...
      • Worker types:
          • Competent Workers
          • Diligent Workers
          • Ineligible Workers
          • Fast Deceivers
          • Smart Deceivers
          • Rule Breakers
          • Incompetent Workers
          • Sloppy Workers
      • Behavioral Traces for Crowd Worker Modeling and Pre-selection
  • 51. Evaluation of Automatic Worker Type Classification
      • Supervised machine learning model:
          • Automatic classification at scale
          • Random forest classifier
          • Classifiers evaluated using 10-fold cross-validation
          • Information Finding & Content Creation Tasks
      • [Figure: evaluation for Information Finding Tasks]
  • 52. Benefit of Automatic Worker Type Classification: Pre-selection of Desired Worker Types
      • [Figures: Information Finding Tasks (finding middle names); Content Creation Tasks (image transcription)]
  • 53. Task Turnover Time
      • "the amount of time required to acquire the full set of judgments from crowd workers, thereby completing and finalizing a task considering pre-defined criteria (such as qualification tests or pre-selection)"
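The definition above can be read as a simple computation over timestamps. The following sketch assumes that a task is published at a known time, that each accepted judgment (i.e. one from a worker who passed the pre-defined criteria) carries a submission timestamp, and that the task is complete once a required number of such judgments has arrived. The function and parameter names are hypothetical.

```python
from datetime import datetime


def task_turnover_time(published_at, judgment_times, required_judgments):
    """Time from publication until the full set of judgments is acquired.

    Returns a timedelta, or None if the task has not been completed yet.
    """
    accepted = sorted(judgment_times)
    if len(accepted) < required_judgments:
        return None
    # The task is finalized when the last required judgment arrives.
    return accepted[required_judgments - 1] - published_at


if __name__ == "__main__":
    published = datetime(2016, 7, 14, 9, 0)
    judgments = [
        datetime(2016, 7, 14, 11, 0),
        datetime(2016, 7, 14, 9, 30),
        datetime(2016, 7, 14, 10, 0),
    ]
    print(task_turnover_time(published, judgments, required_judgments=3))
```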
  • 54. Task Turnover Time
      • [Figures: Information Finding Tasks (finding middle names); Content Creation Tasks (image transcription)]
  • 55. Cognitive Theories & Entailing Data: Paradox of Choice in the Crowd
      • Many available platforms and tasks
      • Overload of choices for workers
      • Detrimental effects on decision making (psychology & social theory works)
      • Workers settle for less suitable tasks
      • More capable workers are deprived of an opportunity to work on suitable tasks
      • Overall effectiveness of the crowdsourcing paradigm decreases
      • Typically adopted solution: crowd worker pre-selection
  • 56. Cognitive Theories & Entailing Data: The Dunning-Kruger Effect
      • Cognitive bias: incompetent individuals exhibit inflated self-assessments and illusory superiority
      • Incompetence in a particular domain reduces the metacognitive ability of individuals to realize it
      • Incompetent individuals miscalibrate by erroneously assessing themselves, while competent individuals miscalibrate by erroneously assessing others
  • 58. Cognitive Theories & Entailing Data: Self-Assessments for Pre-selection of Crowd Workers
      • Crowd workers often lack awareness about their true level of competence
      • Novel worker pre-selection method based on self-assessments & performance
      • [Figures: evaluation in a sentiment analysis task; worker performance data]
      • Using Worker Self-Assessments for Competence-based Pre-Selection. Ujwal Gadiraju, Besnik Fetahu, Ricardo Kawase, Patrick Siehndel and Stefan Dietze. Under review at ACM CSCW 2017, Portland, Oregon, USA.
  • 59. Summary
      • Mining & understanding (learning) resources on the Web:
          • "Extracting entity-centric knowledge/learning resources from Web Documents" (Stefan)
          • "Automated Wikipedia Entity Enrichment with News Sources" (Besnik)
      • Mining & understanding (learning) activities on the Web:
          • Predicting/measuring "competence": "Behavioral Methods for Improving the Effectiveness of Microtask Crowdsourcing" (Ujwal)
  • 60. Thank you!
      • http://www.l3s.de
      • http://stefandietze.net
      • http://l3s.de/~fetahu
      • http://www.l3s.de/~gadiraju/