1. Knowledge Discovery for the Semantic Web
An Application to Web Usage Mining
&
How to use semantics in the Preprocessing stage
[KDD pipeline diagram: Input Data → Data Preprocessing and Transformation (data fusion from multiple sources; data cleaning: noise, missing values; feature selection; dimensionality reduction; data normalization) → Data Mining → Interpretation and Evaluation (filtering patterns; visualization; statistical analysis: hypothesis testing, attribute evaluation, comparing learned models, computing confidence intervals) → Information / Taking Action]
Claudia D’Amato - University of Bari, IT.
Laura Hollink - Centrum Wiskunde & Informatica, Amsterdam, NL.
3. An application to Web Usage Mining
Web Usage Mining = discovering patterns in logs of user interaction with Web resources
• logs typically contain an identifier for users (e.g. IP address), their queries and clicks
• What about usage of Linked Open Data?
• Can we use semantics to improve mining of Web usage?
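To make the log contents concrete, here is a minimal parsing sketch. The combined log format, the /sparql endpoint and the example line are assumptions for illustration; the exact format differs per server.

```python
# A minimal sketch of extracting the fields usage mining needs from a
# server log line: user identifier, timestamp, resource, SPARQL query.
import re
from urllib.parse import urlparse, parse_qs

LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "GET (?P<path>\S+) HTTP/[\d.]+"')

def parse_entry(line):
    m = LOG_LINE.match(line)
    if not m:
        return None
    url = urlparse(m.group("path"))
    query = parse_qs(url.query).get("query", [None])[0]  # SPARQL query, if any
    return {"user": m.group("ip"), "time": m.group("ts"),
            "resource": url.path, "sparql": query}

example = ('198.51.100.7 - - [01/Jun/2015:10:00:00 +0200] '
           '"GET /sparql?query=SELECT%20%2A%20WHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D HTTP/1.1"')
print(parse_entry(example))
```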
6. Mining Usage of Linked Open Data in USEWOD
USEWOD: http://usewod.org/ [B. Berendt, L. Hollink, M. Luczak-Roesch, et al.]
1. USEWOD workshop series @ ESWC / WWW since 2011
2. USEWOD dataset: server logs of DBpedia, BioPortal, LinkedGeoData, etc., and client-side logs from YASGUI.
8. Mining Usage of Linked Open Data in USEWOD
• Results of USEWOD: LOD usage mining for more efficient indexing [1], caching [2], auto-completion [3], etc.
• Issues:
• what is the difference between queries by machines and humans? [4]
• what is the meaning of repeated queries by bots/tools?
• a lot of the usage is invisible due to data dump downloads [5]
[1] Arias, M., Fernández, J. D., Martínez-Prieto, M. A., & de la Fuente, P. An empirical study of real-world SPARQL queries. USEWOD @ WWW 2011.
[2] Lorey, J., & Naumann, F. Caching and prefetching strategies for SPARQL queries. USEWOD @ ESWC 2013.
[3] Kramer, K., Dividino, R. Q., & Gröner, G. SPACE: SPARQL Index for Efficient Autocompletion. ISWC (Posters & Demos) 2013.
[4] Rietveld, L., & Hoekstra, R. Man vs. Machine: Differences in SPARQL Queries. USEWOD @ ESWC 2014.
[5] Huelss, J., & Paulheim, H. What SPARQL Query Logs Tell and do not Tell about Semantic Relatedness in LOD. NoISE @ ESWC 2015.
9. Usage mining example 1: clustering rdf:properties in DBpedia
Instead of listing all DBpedia properties alphabetically, can we display them in a more meaningful way? Can we use query logs for this? [5]
Approach: Hierarchical Clustering of properties, where the distance between a pair of properties is based on how often they co-occur in a SPARQL query in the USEWOD2015 logs.
Evaluation: run an experiment to measure how quickly and accurately people identify facts when looking at the standard view or the clustered view.
Result: no significant differences ☹
[5] Huelss, J., & Paulheim, H. What SPARQL Query Logs Tell and do not Tell about Semantic Relatedness in LOD. NoISE @ ESWC 2015.
Disclaimer: simplified discussion of this paper!
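To make the approach concrete, a minimal clustering sketch: the toy queries and the 1/(1+count) distance are invented stand-ins, not the paper's exact setup.

```python
# Hierarchically cluster properties by how often they co-occur in a query.
from collections import Counter
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

queries = [  # properties mentioned together in one SPARQL query (toy data)
    {"dbo:birthPlace", "dbo:birthDate"},
    {"dbo:birthPlace", "dbo:deathPlace"},
    {"dbo:starring", "dbo:director"},
]
props = sorted(set().union(*queries))
cooc = Counter()
for q in queries:
    for a, b in combinations(sorted(q), 2):
        cooc[(a, b)] += 1

# Turn co-occurrence counts into a distance: frequent pairs are close.
n = len(props)
dist = [[0.0] * n for _ in range(n)]
for i, a in enumerate(props):
    for j in range(i + 1, n):
        b = props[j]
        dist[i][j] = dist[j][i] = 1.0 / (1 + cooc[(min(a, b), max(a, b))])

Z = linkage(squareform(dist), method="average")
print(dict(zip(props, fcluster(Z, t=2, criterion="maxclust"))))
```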
15. Usage mining example 2: mining semantically enriched query logs
Data: queries and clicks on the Yahoo! search engine.
Problem when mining ‘raw’ logs: low support of even the most frequent patterns.
Laura Hollink, Peter Mika and Roi Blanco. Web Usage Mining with Semantic Analysis. WWW 2013.
19. Usage mining example 2: mining semantically enriched query logs
Approach:
1. link queries to entities in the LOD cloud
2. choose class of entity + selected properties
3. detect modifier words (download, trailer, cast, date, etc.)
Step 1, linking queries to entities in the LOD cloud, uses:
• Freebase (has a lot of movie-related info)
• DBpedia (Wikipedia is widely used)
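A sketch of what step 1 can look like, using the public DBpedia Spotlight endpoint as a stand-in; the paper uses its own linking pipeline to Freebase and DBpedia, so treat this as an illustration only.

```python
# Link a raw query string to DBpedia entities via DBpedia Spotlight.
import requests

def link_query(text, confidence=0.5):
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    return [r["@URI"] for r in resp.json().get("Resources", [])]

print(link_query("the matrix trailer"))
# e.g. ['http://dbpedia.org/resource/The_Matrix']
```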
23. Usage mining example 2: mining semantically enriched query logs
• Discover frequent patterns on the class level: sequential pattern mining using the efficient PrefixSpan algorithm to mine all possible subsequence patterns (a sketch follows below).
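A compact, self-contained PrefixSpan sketch on toy class-level sessions; the sessions are invented, and a production run would use an optimized implementation.

```python
# Minimal PrefixSpan: mine frequent subsequences of query classes.
def prefixspan(db, min_support):
    """db: list of sequences (lists of items). Returns {pattern: support}."""
    results = {}

    def project(db, item):
        # Suffix of each sequence after the first occurrence of `item`.
        proj = []
        for seq in db:
            for i, it in enumerate(seq):
                if it == item:
                    proj.append(seq[i + 1:])
                    break
        return proj

    def mine(db, prefix):
        counts = {}  # support = number of sequences containing the item
        for seq in db:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, sup in counts.items():
            if sup >= min_support:
                pattern = prefix + (item,)
                results[pattern] = sup
                mine(project(db, item), pattern)

    mine(db, ())
    return results

sessions = [  # toy class-level sessions, one per user
    ["Film", "Actor", "Film"],
    ["Film", "Actor"],
    ["Actor", "Film"],
]
for pattern, sup in sorted(prefixspan(sessions, 2).items()):
    print(pattern, sup)  # e.g. ('Film', 'Actor') 2
```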
26. Usage mining example 3: semantic patterns of query modification
• Goal: identify frequent query modifications in an image archive
• state of the art = 3 classes: generalization, specification, reformulation
• Approach:
1. link queries to entities in the LOD cloud
2. choose class of entity
3. determine shortest path between consecutive queries Q1 and Q2
4. rank property-paths according to support and confidence.
Conclusions:
• Identified patterns not visible in the raw data.
• but “the method is only moderately successful in identifying the most prominent relations for a given query pair”
Hollink, V., Tsikrika, T., & de Vries, A. P. (2011). Semantic search log analysis: a method and a study on professional image search. JASIST 62(4), 691-713.
See also: Huurnink, B., Hollink, L., Van Den Heuvel, W., & De Rijke, M. (2010). Search behavior of media professionals at an audiovisual archive: A transaction log analysis. JASIST, 61(6), 1180-1197.
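A sketch of step 3 on a toy graph; the triples and the undirected-graph choice are assumptions for illustration, not the paper's exact setup.

```python
# Shortest property-path between the entities of two consecutive queries.
import networkx as nx

triples = [  # invented RDF-style triples
    ("tgn:Groot-Zundert", "tgn:broader", "tgn:North_Brabant"),
    ("tgn:North_Brabant", "tgn:broader", "tgn:Netherlands"),
    ("ulan:Van_Gogh", "ulan:born_in", "tgn:Groot-Zundert"),
]

G = nx.Graph()  # undirected: modifications may follow properties either way
for s, p, o in triples:
    G.add_edge(s, o, prop=p)

q1, q2 = "ulan:Van_Gogh", "tgn:Netherlands"
path = nx.shortest_path(G, q1, q2)
props = [G.edges[a, b]["prop"] for a, b in zip(path, path[1:])]
print(path)   # entities on the path
print(props)  # the property-path, to be ranked by support/confidence
```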
29. The feature selection issue when using LOD
[KDD pipeline diagram from slide 1, here highlighting the Feature Selection step]
30. Feature Selection
• Feature selection = limiting the number of features for faster computation times, more understandable models, and better predictive value.
• Using Linked Open Data can lead to a large number of features per data point:
• a DBpedia resource easily has 50 property-value pairs.
• more are easily added using reasoning.
• note: these numbers are not large compared to the number of features in DNA strings, or all words in a text corpus!
• Still, many of them are irrelevant or redundant.
31. Feature Selection Example
• Goal: learn a relation R between x and y.
• In this paper, R = ‘occupation’, ‘gender’, ‘instance_of’, ‘acted_in’, ‘genre’, ‘position_played_on_team’
• Approach: given a training set of pairs (x, y), learn a “whitelist” of properties in DBpedia, WikiData, YAGO and WordNet that indicate a relation R between x and y.
• Cast as a subset selection problem (a sketch follows below):
• E = the set of possible properties
• local search over the power set of E (i.e. all subsets) to find the optimal subset.
Learning to Exploit Structured Resources for Lexical Inference. Vered Shwartz, Omer Levy, Ido Dagan and Jacob Goldberger. CoNLL 2015 (to appear).
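A minimal hill-climbing sketch of the subset selection idea; the scoring rule and toy data are invented stand-ins for the paper's objective.

```python
# Local search over subsets of properties E for a whitelist predicting R.
import random

def score(subset, train):
    """Fraction of training pairs classified correctly by the rule:
    R(x, y) holds iff x and y are connected by a whitelisted property."""
    correct = 0
    for x, y, label, props in train:
        predicted = bool(subset & props)
        correct += (predicted == label)
    return correct / len(train)

def local_search(E, train, iters=1000, seed=0):
    rng = random.Random(seed)
    current = set(rng.sample(sorted(E), k=len(E) // 2))
    best = score(current, train)
    for _ in range(iters):
        flip = rng.choice(sorted(E))   # add or drop one property
        candidate = current ^ {flip}
        s = score(candidate, train)
        if s >= best:                  # accept non-worsening moves
            current, best = candidate, s
    return current, best

E = {"dbo:occupation", "dbo:birthPlace", "dbo:starring"}
train = [  # (x, y, does R hold?, properties connecting x and y) -- toy data
    ("Keanu_Reeves", "The_Matrix", True, {"dbo:starring"}),
    ("Keanu_Reeves", "Beirut", False, {"dbo:birthPlace"}),
]
print(local_search(E, train))
```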
32. Data Fusion
[KDD pipeline diagram from slide 1, here highlighting the Data Fusion step]
38. Methods for Data Fusion (ontology alignment)
[diagram: two ontologies with labelled concepts, to be matched]
39. Methods for Data Fusion: structural matchers
[diagram: two ontologies with labelled concepts, to be matched]
• E.g. Similarity Flooding [11]: the similarity of a matched pair s1 and t1 propagates to their respective neighbors s2 and t2.
• neighbors can be defined as subclasses, superclasses, instances, domain/ranges, etc.
• Structural measures are in practice never used stand-alone.
[10] Ngo, Duy Hoa, and Zohra Bellahsene. YAM++ - results for OAEI 2012. OAEI @ ISWC 2012.
[11] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. ICDE 2002.
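A toy sketch of a single similarity-flooding propagation step, with invented graphs and an assumed damping factor; Melnik et al. [11] define the full fixpoint computation and normalization.

```python
# One propagation step: similarity of (s1, t1) spills over to (s2, t2).
s_edges = {("s1", "s2")}   # neighbour relation in the source ontology
t_edges = {("t1", "t2")}   # neighbour relation in the target ontology
sim = {("s1", "t1"): 1.0, ("s2", "t2"): 0.1}

def flood_once(sim, s_edges, t_edges, alpha=0.5):
    new = dict(sim)
    for (sa, sb) in s_edges:
        for (ta, tb) in t_edges:
            # propagate along both directions of the paired edges
            new[(sb, tb)] = new.get((sb, tb), 0) + alpha * sim.get((sa, ta), 0)
            new[(sa, ta)] = new.get((sa, ta), 0) + alpha * sim.get((sb, tb), 0)
    norm = max(new.values()) or 1.0  # renormalize to [0, 1]
    return {pair: v / norm for pair, v in new.items()}

print(flood_once(sim, s_edges, t_edges))
# (s2, t2) gains similarity because its neighbours (s1, t1) match well
```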
41. Methods for Data Fusion: instance based matchers
[diagram: two ontologies with labelled concepts, to be matched]
• Match classes based on similarity of their instances
• note: you need a way to assess similarity of the instances!
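A minimal instance-based matching sketch using Jaccard overlap of instance sets; the class extensions are invented, and it assumes instances have already been reconciled across the two ontologies, which is itself a linking problem.

```python
# Score class pairs by the Jaccard overlap of their instance sets.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

source = {"museum:Painter": {"van_gogh", "vermeer", "rembrandt"}}
target = {"archive:Artist": {"van_gogh", "vermeer", "escher"},
          "archive:Building": {"rijksmuseum"}}

matches = sorted(
    ((s, t, jaccard(si, ti)) for s, si in source.items()
                             for t, ti in target.items()),
    key=lambda m: -m[2])
for s, t, score in matches:
    print(f"{s} ~ {t}: {score:.2f}")  # Painter ~ Artist: 0.50
```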
44. Methods for Data Fusion: string based
[diagram: two ontologies with labelled concepts, to be matched]
• String similarity is the most important feature in ontology alignment.
• “nearly all [ontology alignment systems] use a string similarity metric” [12]
• stop-word removal and stemming are not helpful! Nor is using WordNet synonyms. [12]
• In [13] we took an even less semantic approach: linking based on URL syntax, e.g. http://www.dbpedia.org/page/Judith_Sargentini
[12] Cheatham, M., & Hitzler, P. String similarity metrics for ontology alignment. ISWC 2013.
[13] The debates of the European Parliament as Linked Open Data. Under review. See http://www.talkofeurope.eu/data/ for details.
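A minimal string-based matching sketch using the standard library's similarity ratio as a stand-in for the metrics surveyed in [12]; labels and threshold are invented.

```python
# Compare labels with a similarity ratio; accept pairs above a threshold.
from difflib import SequenceMatcher

def label_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

source_labels = ["architects", "paintings"]
target_labels = ["architecten", "schilderijen", "buildings"]

THRESHOLD = 0.8
for s in source_labels:
    for t in target_labels:
        sim = label_sim(s, t)
        if sim >= THRESHOLD:
            print(f"{s} ~ {t} ({sim:.2f})")  # architects ~ architecten (0.86)
```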
47. Link types
• Equality: sameAs, equivalentClasses, equivalentProperties
• “Den Haag” = “The Hague”
• wood-material = wood
• conf:has_the_last_name = edas:hasLastName
• Hierarchical: rdfs:subClassOf, rdf:type, rdfs:subPropertyOf
• aat:Artist ⊇ wn:Artist
• tgn:Africa ∈ wn:Continent
• Weaker semantics: skos:closeMatch / exactMatch / broadMatch / narrowMatch / relatedMatch
• geonames:Italy skos:closeMatch librarytopics:Italy
• Domain specific links: e.g. born-in, hasStyle, hasPart
• Van Gogh (ULAN) born-in Groot-Zundert (TGN)
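The same link types written down as triples with rdflib; the wn:/librarytopics: namespaces are placeholders, as noted in the comments.

```python
# One link of each type from the slide, as RDF triples.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDFS, SKOS

AAT = Namespace("http://vocab.getty.edu/aat/")
WN = Namespace("http://example.org/wordnet/")          # placeholder
GEO = Namespace("http://sws.geonames.org/")
LIB = Namespace("http://example.org/librarytopics/")   # placeholder
EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.Den_Haag, OWL.sameAs, EX.The_Hague))          # equality
g.add((WN.Artist, RDFS.subClassOf, AAT.Artist))         # hierarchical
g.add((GEO.Italy, SKOS.closeMatch, LIB.Italy))          # weaker semantics
g.add((EX.Van_Gogh, EX.born_in, EX.Groot_Zundert))      # domain specific

print(g.serialize(format="turtle"))
```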
50. Representation of links
[diagram: two ways to represent a link between the concepts “architecten” (Dutch: architects) and “architects”: as a direct triple (architecten skos:exactMatch architects), or as a reified link object (Link 001) with concept1, concept2, link type (skos:exactMatch), link method (manual) and author (L. Hollink)]
• Open Question: how valid are the patterns we discover in data when the quality of the links is low?
• Even more important to be critical and evaluate the data
• source criticism
• tool criticism (see http://event.cwi.nl/toolcriticism/)
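A sketch of the two representations from the diagram in rdflib; the ex: property names are placeholders for whatever link-metadata vocabulary is used.

```python
# Direct triple vs. reified link object carrying provenance.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/")
g = Graph()

# direct representation: one triple, no room for provenance
g.add((EX.architecten, SKOS.exactMatch, EX.architects))

# reified representation: the link itself can be described and evaluated
link = EX.link001
g.add((link, EX.concept1, EX.architecten))
g.add((link, EX.concept2, EX.architects))
g.add((link, EX.linkType, SKOS.exactMatch))
g.add((link, EX.linkMethod, Literal("manual")))
g.add((link, EX.author, Literal("L. Hollink")))

print(g.serialize(format="turtle"))
```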
54. Evaluation of Data Fusion / Linking
1. Manually rating (a sample of) mappings
• relatively cheap and easy to interpret
• only precision, no recall
2. Comparison to a reference alignment
• precision and recall
• used in OAEI on the SEALS platform
• more expensive if a reference alignment has to be created (but: crowd sourcing!)
3. End-to-end evaluation (a.k.a. evaluating an application that uses the mappings)
• arguably the best method!
• need to have access to an application + users
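For option 2, precision and recall reduce to simple set arithmetic over mapping pairs; a toy sketch with invented alignments:

```python
# Precision and recall of a produced alignment A against a reference R.
A = {("architecten", "architects"), ("gebouw", "building"),
     ("schilder", "carpenter")}
R = {("architecten", "architects"), ("gebouw", "building"),
     ("schilder", "painter")}

tp = len(A & R)                    # correctly found mappings
precision = tp / len(A)
recall = tp / len(R)
print(f"precision={precision:.2f} recall={recall:.2f}")  # 0.67 / 0.67
```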
58. Evaluation of Data Fusion / Linking
• Comparison to a reference alignment: alternative measures:
• 1. instead of a binary classification into correct/incorrect mappings, take into account how wrong a link is, e.g. a semantically relaxed precision of the form
precision_sem(A) = ( Σ_{a ∈ A} r(a) ) / |A|
• where r(a) is the semantic distance between correspondence a and correspondence a′ in the reference alignment, and |A| is the number of correspondences.
• 2. weight the score of mappings based on the frequency of their use
• e.g. from usage logs!
Laura Hollink, Mark van Assem, Shenghui Wang, Antoine Isaac, Guus Schreiber. Two Variations on Ontology Alignment Evaluation: Methodological Issues. ESWC 2008.
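A toy sketch of measure 1; the proximity function r below (1.0 exact, 0.5 for an off-by-one-superclass mapping, else 0) is an invented stand-in for the semantic distance defined in the paper.

```python
# Credit each produced mapping by its proximity to the reference,
# instead of a strict correct/incorrect judgement.
superclass = {"painter": "artist"}  # toy hierarchy

def r(found, reference):
    if found == reference:
        return 1.0
    if superclass.get(found[1]) == reference[1] or \
       superclass.get(reference[1]) == found[1]:
        return 0.5  # mapped to a neighbour of the right concept
    return 0.0

A = [("schilder", "painter"), ("gebouw", "tree")]
R = {"schilder": ("schilder", "artist"), "gebouw": ("gebouw", "building")}

precision_sem = sum(r(a, R[a[0]]) for a in A) / len(A)
print(precision_sem)  # 0.25: half credit for schilder, none for gebouw
```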
63. Discovering links from text
Pointers to what happens in other communities
• Word2Vec: efficient deep learning algorithm to learn vector representations of words
• vector similarity captures semantics between words
• no explicit semantics, but we can’t deny that there is meaning there!
• success seems to be mostly due to big data
Example: Vec(Madrid) - Vec(Spain) + Vec(France) is closer to Vec(Paris) than to any other vector
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.
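The analogy can be reproduced with gensim's word2vec interface, assuming a pretrained vector file is available locally; the file name below is a placeholder.

```python
# Vector-arithmetic analogy with pretrained word2vec vectors.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # placeholder path

# Vec(Madrid) - Vec(Spain) + Vec(France) ~ Vec(Paris)
print(vectors.most_similar(positive=["Madrid", "France"],
                           negative=["Spain"], topn=3))
```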
66. NELL: Never-Ending Language Learning
• several machine learning approaches to discover facts (beliefs) from text on the web
• string features, distribution of context words, HTML structure, visual image analysis.
• Running since 2010, has so far learned over 80 million beliefs
T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, J. Welling. Never-Ending Learning. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2015.
68. Research Task Format
Work in 6 groups of 10 students
• 5 people design an approach to association rules with semantics
• 5 people focus on how that approach should be evaluated
The idea is to work together! E.g. which measures are best for this approach? Which versions of the approach should be evaluated? Will this approach score high on these measures? In which cases?
• We would like one presentation per group of 10 people
• of 3 or 4 slides
• of max 4 minutes (less is fine too!)
• Send me the slides in PDF, with your group number in the title, by email to l.hollink@cwi.nl, today before 16:30.
• The presentation should show clearly:
1. the AR method
2. how did you take into account semantics?
3. the evaluation method
• BONUS: argue when and why your approach will score high.
• BONUS: discuss how the newly learned links can be represented and used.
Tips:
• you may pick a dataset that you will use as an example