Exploiting Linked Open Data as Background Knowledge in Data Mining
1. 10/08/13 Heiko Paulheim 1
Exploiting Linked Open Data
as Background Knowledge in Data Mining
Heiko Paulheim, University of Mannheim
2. 10/08/13 Heiko Paulheim 2
Outline
• Motivation
• The original FeGeLOD framework
• Experiments
• Applications
• The RapidMiner Linked Open Data Extension
• Challenges and Future Work
3. 10/08/13 Heiko Paulheim 3
Motivation: An Example Data Mining Task
• Analyzing book sales
ISBN            City        Sold
3-2347-3427-1   Darmstadt   124
3-43784-324-2   Mannheim    493
3-145-34587-0   Roßdorf     14
...

ISBN            City        Population  ...  Genre   Publisher     ...  Sold
3-2347-3427-1   Darmstadt   144402      ...  Crime   Bloody Books  ...  124
3-43784-324-2   Mannheim    291458      ...  Crime   Guns Ltd.     ...  493
3-145-34587-0   Roßdorf     12019       ...  Travel  Up&Away       ...  14
...
→ Crime novels sell better in larger cities
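The enrichment shown in the two tables can be sketched as a simple join. This is a minimal Python sketch, not the author's implementation; the lookup tables `CITY_FACTS` and `BOOK_FACTS` are hypothetical stand-ins for background knowledge and only contain the values from the example above.

```python
# Hypothetical background knowledge, copied from the example table.
CITY_FACTS = {
    "Darmstadt": {"Population": 144402},
    "Mannheim": {"Population": 291458},
    "Roßdorf": {"Population": 12019},
}
BOOK_FACTS = {
    "3-2347-3427-1": {"Genre": "Crime", "Publisher": "Bloody Books"},
    "3-43784-324-2": {"Genre": "Crime", "Publisher": "Guns Ltd."},
    "3-145-34587-0": {"Genre": "Travel", "Publisher": "Up&Away"},
}

sales = [
    {"ISBN": "3-2347-3427-1", "City": "Darmstadt", "Sold": 124},
    {"ISBN": "3-43784-324-2", "City": "Mannheim", "Sold": 493},
    {"ISBN": "3-145-34587-0", "City": "Roßdorf", "Sold": 14},
]


def enrich(row):
    """Return a copy of the row extended with city and book attributes."""
    enriched = dict(row)
    enriched.update(CITY_FACTS.get(row["City"], {}))
    enriched.update(BOOK_FACTS.get(row["ISBN"], {}))
    return enriched


enriched_sales = [enrich(row) for row in sales]
for row in enriched_sales:
    print(row)
```

With the extra columns in place, a learner can pick up patterns such as the one about crime novels and city size, which the original three columns cannot express.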
4. 10/08/13 Heiko Paulheim 4
Motivation
• Many data mining problems are solved better
– when you have more background knowledge
(leaving scalability aside)
• Problems:
– Tedious work
– Selection bias: what to include?
6. 10/08/13 Heiko Paulheim 6
Motivation
• Idea:
– reuse background knowledge from Linked Open Data
– include it in the data mining process as needed
• Two main variants:
– develop mining/learning algorithms that run directly on Linked Data
– create relational features from Linked Data
7. 10/08/13 Heiko Paulheim 7
Motivation
• Develop mining/learning algorithms
– e.g., DL Learner
– e.g., dedicated kernel functions
• Advantages:
– can be quite efficient
– no reduction to “flat” table structure
– semantics can be respected directly
8. 10/08/13 Heiko Paulheim 8
Motivation
• Create relational features
– e.g., LiDDM
– e.g., AutoSPARQL
– e.g., FeGeLOD / RapidMiner Linked Open Data Extension
• Advantages:
– Easy combination of knowledge from various sources
• including relational features in the original data
– Arbitrary mining algorithms/tools possible
9. 10/08/13 Heiko Paulheim 9
FeGeLOD – Feature Generation from LOD
Input record:
  ISBN: 3-2347-3427-1 | City: Darmstadt | # sold: 124
→ Named Entity Recognition
  ISBN: 3-2347-3427-1 | City: Darmstadt | # sold: 124
  City_URI: http://dbpedia.org/resource/Darmstadt
→ Feature Generation
  ISBN: 3-2347-3427-1 | City: Darmstadt | # sold: 124
  City_URI: http://dbpedia.org/resource/Darmstadt
  City_URI_dbpedia-owl:populationTotal: 141471
  City_URI_...: ...
→ Feature Selection
  ISBN: 3-2347-3427-1 | City: Darmstadt | # sold: 124
  City_URI: http://dbpedia.org/resource/Darmstadt
  City_URI_dbpedia-owl:populationTotal: 141471
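The first two stages of this pipeline can be sketched in a few lines. This is a minimal Python sketch under stated assumptions, not the FeGeLOD code: the URI pattern (label with underscores appended to the DBpedia resource namespace) is one plausible way of guessing URIs, and the helper names `guess_uri` and `data_property_query` are hypothetical. The query is only built as a string and is not sent to any endpoint.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def guess_uri(label):
    """Guess a DBpedia URI from a label: spaces become underscores."""
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"))


def data_property_query(uri):
    """Build a SPARQL query for all literal-valued properties of a resource."""
    return (
        "SELECT ?p ?v WHERE { "
        f"<{uri}> ?p ?v . "
        "FILTER(isLiteral(?v)) }"
    )


row = {"ISBN": "3-2347-3427-1", "City": "Darmstadt", "# sold": 124}
# Stage 1: named entity recognition by URI guessing.
row["City_URI"] = guess_uri(row["City"])
# Stage 2: each (?p, ?v) result would become a new column City_URI_<p>.
query = data_property_query(row["City_URI"])
print(row["City_URI"])
print(query)
```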
10. 10/08/13 Heiko Paulheim 10
FeGeLOD – Feature Generation from LOD
• Original prototype, based on Weka:
– Simple NER (guessing URIs)
– Seven generators:
• direct types
• data properties
• unqualified relations (boolean, numeric)
• qualified relations (boolean, numeric)
• individuals (dangerous!) - may be restricted to specific property
– Simple feature selection: filtering features
• that have only* different values (except numerical)
• that have only* identical values
• that are mostly missing*
*) 95% or 99%
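The three filtering rules can be sketched as a single predicate. This is a minimal Python sketch, not the original Weka-based code; how exactly "only different", "only identical" and "mostly missing" are measured is an assumption here (ratio of distinct values, share of the most frequent value, and share of `None` entries, each compared against the threshold). The function name `keep_feature` is hypothetical.

```python
from collections import Counter


def keep_feature(values, threshold=0.95, numeric=False):
    """Return False if a generated feature column is unlikely to be useful."""
    n = len(values)
    if n == 0:
        return False
    present = [v for v in values if v is not None]
    # Rule 3: the feature is mostly missing.
    if (n - len(present)) / n >= threshold:
        return False
    counts = Counter(present)
    # Rule 2: (almost) all present values are identical.
    if counts.most_common(1)[0][1] / len(present) >= threshold:
        return False
    # Rule 1: (almost) all present values are different; numeric
    # features are exempt because distinct numbers can still be useful.
    if not numeric and len(counts) / len(present) >= threshold:
        return False
    return True


print(keep_feature(["a", "b"] * 50))               # balanced nominal feature
print(keep_feature([str(i) for i in range(100)]))  # every value different
print(keep_feature(list(range(100)), numeric=True))
```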
11. 10/08/13 Heiko Paulheim 11
Experiments
• Testing with two* standard machine learning data sets
– Zoo: classifying animals
– AAUP: predicting income of university employees
(regression task)
• Question: how much improvement do additional features bring?
*) standard ML datasets with meaningful, human-readable labels are scarce!
14. 10/08/13 Heiko Paulheim 14
Experiments: Early Insights
• Additional features often improve the results
• Zoo dataset (classification accuracy in %):
– Ripper: 89.11 to 96.04
– SMO: 93.07 to 97.03
– No improvement for Naive Bayes
• AAUP dataset (compensation; prediction error, lower is better):
– M5: 59.88 to 51.28
– SMO: 74.12 to 61.97
– No improvement for linear regression
• ...but they may also cause problems
– extreme example: error increases from 6.54 to 189.90 for linear regression
– memory and timeouts due to large datasets
15. 10/08/13 Heiko Paulheim 15
Experiments: Quality of Features
• Information gain of features on Zoo dataset
16. 10/08/13 Heiko Paulheim 16
Experiments: Quality of Features
• Information gain of features on AAUP dataset (compensation)
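For readers unfamiliar with the measure, information gain for a nominal feature can be computed as follows. This is a minimal Python sketch on hypothetical toy data (four animals, two made-up features); it does not reproduce the values plotted for the Zoo or AAUP datasets.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: class labels and two generated boolean features.
labels = ["mammal", "mammal", "bird", "bird"]
type_mammal = [True, True, False, False]  # separates the classes perfectly
has_label = [True, False, True, False]    # carries no information

print(information_gain(type_mammal, labels))
print(information_gain(has_label, labels))
```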
17. 10/08/13 Heiko Paulheim 17
Application: Classifying Events from Wikipedia
• Event Extraction from Wikipedia
• Joint work with Dennis Wegener and Daniel Hienert (GESIS)
• Task: event classification (e.g., Politics, Sports, ...)
http://www.vizgr.org/historical-events/timeline/
18. 10/08/13 Heiko Paulheim 18
Application: Classifying Events from Wikipedia
• Source Material:
http://www.vizgr.org/historical-events/timeline/
19. 10/08/13 Heiko Paulheim 19
Application: Classifying Events from Wikipedia
• Positive Examples for class politics:
– 2011, March 15 - German chancellor Angela Merkel shuts down the seven
oldest German nuclear power plants.
– 2010, June 3 – Christian Wulff is nominated for President of Germany by
Angela Merkel.
• Negative Examples for class politics:
– 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first
time, along with Netherlands make the 2010 FIFA World Cup Final.
– 2012, February 16 – Roman Lob is selected to represent Germany in the
Eurovision Song Contest.
20. 10/08/13 Heiko Paulheim 20
Application: Classifying Events from Wikipedia
• Possible learned model:
– "Angela Merkel" → Politics
21. 10/08/13 Heiko Paulheim 21
Application: Classifying Events from Wikipedia
• Possibly Learned Model:
– "Angela Merkel" → Politics
• How can we do better?
• Background knowledge from Linked Open Data
– 2011, March 15 - German chancellor Angela Merkel [class: Politician] shuts
down the seven oldest German nuclear power plants.
– 2012, May 13, Elections in North Rhine-Westphalia – Hannelore Kraft [class:
Politician] is elected to continue as Minister-President, heading an SPD-
Green coalition.
• Model learned in that case:
– "[class: Politician]" → Politics
22. 10/08/13 Heiko Paulheim 22
Application: Classifying Events from Wikipedia
• Model learned in that case:
– "[class: Politician]" → Politics
• Much more general
– Can also classify events with politicians
not contained in the training set
• Fewer training examples required
– A few events with politicians, athletes, singers, ... are enough
23. 10/08/13 Heiko Paulheim 23
Application: Classifying Events from Wikipedia
• Experiments on Wikipedia data
– >10 categories
– 1,000 labeled examples as training set
– Classification accuracy: 80%
• Plus:
– We have trained a language-independent model!
• purely text-based models often look like "elect*" → Politics, which ties them to one language
– 22. Mai 2012: Peter Altmaier [class: Politician] wird als Nachfolger von
Norbert Röttgen [class: Politician] zum Bundesumweltminister ernannt.
– 6 januari 2012: Jonas Sjöstedt [class: Politician] väljs till ny partiledare för
Vänsterpartiet efter Lars Ohly [class: Politician].
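The effect of class annotations can be illustrated with a small sketch. This is a minimal Python sketch, not the system used in the experiments: the lookup table `ENTITY_CLASSES` is a hypothetical stand-in for types retrieved from Linked Open Data, and the hand-written rule in `classify` stands in for a learned model.

```python
import re

# Hypothetical entity-to-class lookup; in practice the classes would
# come from a Linked Open Data source.
ENTITY_CLASSES = {
    "Angela Merkel": "Politician",
    "Hannelore Kraft": "Politician",
    "Peter Altmaier": "Politician",
    "Norbert Röttgen": "Politician",
}


def annotate(text):
    """Append a [class: X] marker after each known entity mention."""
    for entity, cls in ENTITY_CLASSES.items():
        text = text.replace(entity, f"{entity} [class: {cls}]")
    return text


def classify(text):
    """Apply the rule "[class: Politician]" -> Politics to an event text."""
    classes = set(re.findall(r"\[class: (\w+)\]", annotate(text)))
    return "Politics" if "Politician" in classes else "Other"


english = "Hannelore Kraft is elected to continue as Minister-President."
german = "Peter Altmaier wird als Nachfolger von Norbert Röttgen ernannt."
other = "Roman Lob is selected to represent Germany in the Eurovision Song Contest."

print(classify(english))
print(classify(german))
print(classify(other))
```

Because the rule only looks at the class marker, it applies to politicians that never occurred in the training data and to event texts in any language for which entities can be linked.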
24. 10/08/13 Heiko Paulheim 24
Application: Classifying Tweets
• Joint work with Axel Schulz and Petar Ristoski (SAP Research)
• Goal: using Twitter for emergency management
• Example tweets containing "fire":
– "fire at #mannheim #university"
– "omg two cars on fire #A5 #accident"
– "fire at train station still burning"
– "my heart is on fire!!!"
– "come on baby light my fire"
– "boss should fire that stupid moron"
25. 10/08/13 Heiko Paulheim 25
Application: Classifying Tweets
• Social media contains data on many incidents
– But keyword search is not enough
– Detecting small incidents is hard
– Manual inspection is too expensive (and slow)
• Machine learning could help
– Train a model to classify incident vs. non-incident tweets
– Apply the model for detecting incident-related tweets
• Training data:
– Traffic accidents
– ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.),
hand labeled (50% related to traffic incidents)
26. 10/08/13 Heiko Paulheim 26
Application: Classifying Tweets
• Learning to classify tweets:
– Positive and negative examples
– Features:
• Stemming
• POS tagging
• Word n-grams
• …
• Accuracy ~90%
• But
– Accuracy drops to ~85% when applying the model to a different city
27. 10/08/13 Heiko Paulheim 27
Application: Classifying Tweets
• Example set:
– “Again crash on I90”
– “Accident on I90”
• Model:
– “I90” → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → not related to traffic accident
28. 10/08/13 Heiko Paulheim 28
Using LOD for Preventing Overfitting
• Example set:
– “Again crash on I90”
– “Accident on I90”
• Background knowledge:
– dbpedia:Interstate_90 rdf:type dbpedia-owl:Road
– dbpedia:Interstate_51 rdf:type dbpedia-owl:Road
• Model:
– dbpedia-owl:Road → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → indicates traffic accident
• Using DBpedia Spotlight + FeGeLOD
– Accuracy keeps up at 90%
– Overfitting is avoided
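The difference between the two models can be reproduced on the example tweets. This is a minimal Python sketch, not the DBpedia Spotlight + FeGeLOD setup itself: the table `MENTION_TYPES` is a hypothetical stand-in for entity linking plus type lookup, and the "model" is simply the set of features shared by all positive training tweets.

```python
# Hypothetical stand-in for entity linking plus rdf:type lookup.
MENTION_TYPES = {
    "I90": "dbpedia-owl:Road",
    "I51": "dbpedia-owl:Road",
}


def tokens(tweet):
    """Split a tweet into a set of whitespace-separated tokens."""
    return set(tweet.split())


def with_types(tweet):
    """Replace linked mentions by their type so they cannot be memorized."""
    return {MENTION_TYPES.get(tok, tok) for tok in tokens(tweet)}


def shared_features(examples, featurize):
    """Features that occur in every positive training example."""
    return set.intersection(*(featurize(e) for e in examples))


training = ["Again crash on I90", "Accident on I90"]
new_tweet = "Two cars crashed on I51"

raw_model = shared_features(training, tokens)
typed_model = shared_features(training, with_types)

# The raw model memorizes "I90", which the new tweet does not contain.
print("I90" in raw_model, "I90" in tokens(new_tweet))
# The typed model relies on the road type, which the new tweet shares.
print("dbpedia-owl:Road" in typed_model, "dbpedia-owl:Road" in with_types(new_tweet))
```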
29. 10/08/13 Heiko Paulheim 29
Explaining Statistics
• Statistics are very widespread
– Quality of living in cities
– Corruption by country
– Fertility rate by country
– Suicide rate by country
– Box office revenue of films
– ...
30. 10/08/13 Heiko Paulheim 30
Explaining Statistics
• Questions we are often interested in
– Why does city X have a high/low quality of living?
– Why is the corruption higher in country A than in country B?
– Will a new film create a high/low box office revenue?
• i.e., we are looking for
– explanations
– forecasts (e.g., extrapolations)
33. 10/08/13 Heiko Paulheim 33
Explaining Statistics
• There are powerful tools for finding correlations etc.
– but many statistics cannot be interpreted directly
– background knowledge is missing
• Approach:
– use Linked Open Data for enriching statistical data (e.g., FeGeLOD)
– run analysis tools for finding explanations
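One of the simplest analysis steps after enrichment is to rank the generated features by their correlation with the statistical indicator. This is a minimal Python sketch on hypothetical numbers; the feature names `feat_a` and `feat_b` and all values are made up for illustration and do not come from any of the datasets discussed here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if one series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def rank_features(features, target):
    """Sort features by absolute correlation with the target, strongest first."""
    scored = [(name, pearson(vals, target)) for name, vals in features.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical enriched data: one indicator value and two features per entity.
target = [10, 20, 30, 40]
features = {
    "feat_a": [1, 2, 3, 4],
    "feat_b": [5, 1, 4, 2],
}

ranked = rank_features(features, target)
for name, r in ranked:
    print(name, round(r, 3))
```

A high correlation is only a candidate explanation; it still needs to be interpreted by a human.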
35. 10/08/13 Heiko Paulheim 35
Statistical Data: Examples
• Data Set: Mercer Quality of Living
– Quality of living in 216 cities worldwide
– norm: NYC=100 (value range 23-109)
– As of 1999
– http://across.co.nz/qualityofliving.htm
• LOD data sets used in the examples:
– DBpedia
– CIA World Factbook for statistics by country
36. 10/08/13 Heiko Paulheim 36
Statistical Data: Examples
• Examples for low quality cities
– big hot cities (junHighC >= 27 and areaTotalKm >= 334)
– cold cities where no music has ever been recorded
(recordedIn_in = false and janHighC <= 16)
– latitude <= 24 and longitude <= 47
• a very accurate rule
• but what's the interpretation?
[Illustration text: "Next Record Studio 2547 miles"]
38. 10/08/13 Heiko Paulheim 38
Statistical Data: Examples
• Data Set: Transparency International
– 177 Countries and a corruption perception indicator
(between 1 and 10)
– As of 2010
– http://www.transparency.org/cpi2010/results
39. 10/08/13 Heiko Paulheim 39
Statistical Data: Examples
• Example rules for countries with low corruption
– HDI > 78%
• Human Development Index, calculated from
life expectancy, education level, economic performance
– OECD member states
– Foundation place of more than nine organizations
– More than ten mountains
– More than ten companies with their headquarters in that state,
but fewer than two cargo airlines
40. 10/08/13 Heiko Paulheim 40
Statistical Data: Examples
• Data Set: Burnout rates
– 16 German DAX companies
– Absolute and relative numbers
– As of 2011
– http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out-erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/
41. 10/08/13 Heiko Paulheim 41
Evaluation of Feature Quality
• Quality of living dataset
[Chart: values on an axis from 1 to 5 for each feature generator (data values, type, unqualified relation (boolean), unqualified relation (numeric), qualified relation (boolean), qualified relation (numeric), joint), shown separately for correlation and rule learning]
43. 10/08/13 Heiko Paulheim 43
Statistical Data: Examples
• Findings for burnout rates
– Positive correlation between turnover and burnout rates
– Car manufacturers are less prone to burnout
– German companies are less prone to burnout than international ones
• Exception: Frankfurt
44. 10/08/13 Heiko Paulheim 44
Statistical Data: Examples
• Data Set: Antidepressant consumption
– In European countries
– Source: OECD
– http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance-2011/pharmaceutical-consumption_health_glance-2011-39-en
45. 10/08/13 Heiko Paulheim 45
Statistical Data: Examples
• Findings for antidepressant consumption
– Larger countries have higher consumption
– Low HDI → high consumption
– By geography:
• Nordic countries, countries at the Atlantic: high
• Mediterranean: medium
• Alpine countries: low
– High average age → high consumption
– High birth rates → high consumption
46. 10/08/13 Heiko Paulheim 46
Statistical Data: Examples
• Data Set: Suicide rates
– By country
– OECD states
– As of 2005
– http://www.washingtonpost.com/wp-srv/world/suiciderate.html
47. 10/08/13 Heiko Paulheim 47
Statistical Data: Examples
• Findings for suicide rates
– Democracies have lower suicide rates than countries with other forms of government
– High HDI → low suicide rate
– High population density → high suicide rate
– By geography:
• At the sea → low
• In the mountains → high
– High Gini index → low suicide rate
• High Gini index ↔ unequal distribution of wealth
– High usage of nuclear power → high suicide rates
48. 10/08/13 Heiko Paulheim 48
Statistical Data: Examples
• Data set: sexual activity
– Percentage of people having sex weekly
– By country
– Survey by Durex 2005-2009
– http://chartsbin.com/view/uya
49. 10/08/13 Heiko Paulheim 49
Statistical Data: Examples
• Findings on sexual activity
– By geography:
• High in Europe, low in Asia
• Low in Island states
– By language:
• English speaking: low
• French speaking: high
– Low average age → high activity
– High GDP per capita → low activity
– High unemployment rate → high activity
– High number of ISPs → low activity
50. 10/08/13 Heiko Paulheim 50
Try it... but be careful!
• Download from
http://www.ke.tu-darmstadt.de/resources/explain-a-lod
• including a demo video, papers, etc.
http://xkcd.com/552/
51. 10/08/13 Heiko Paulheim 51
RapidMiner Linked Open Data Extension
• August 16th, 2013: FeGeLOD celebrates its 2nd birthday
• Problems
– still no nice UI
– special configurations are tricky
– difficult to enhance
• Decision
– Reimplementation on RapidMiner platform
– September 13th, 2013: Release of RapidMiner Linked Open Data Extension
– Available from RapidMiner marketplace
• http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/
52. 10/08/13 Heiko Paulheim 52
RapidMiner Linked Open Data Extension
• Simple wiring of operators
– linkers
– generators
• Combination with powerful RapidMiner operators
53. 10/08/13 Heiko Paulheim 53
RapidMiner Linked Open Data Extension
• Easy SPARQL endpoint definitions
• Support of custom SPARQL statements
54. 10/08/13 Heiko Paulheim 54
Challenges and Future Work
• SPARQL variants
– Some endpoints support special/non-standard SPARQL constructs
– COUNT(...)
– transitive closure
– exploit where applicable
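The following sketch shows how a generator might pick a query depending on what an endpoint supports. This is a minimal Python sketch under stated assumptions: the function names and the `supports_aggregates` flag are hypothetical, the queries are only built as strings, and the COUNT and property path syntax shown is the SPARQL 1.1 form, which some endpoints may not accept.

```python
def relation_count_query(uri, prop):
    """Count the objects of a relation on the endpoint (needs aggregates)."""
    return f"SELECT (COUNT(?o) AS ?n) WHERE {{ <{uri}> <{prop}> ?o }}"


def relation_list_query(uri, prop):
    """Fallback: fetch all objects and count them on the client side."""
    return f"SELECT ?o WHERE {{ <{uri}> <{prop}> ?o }}"


def transitive_types_query(uri):
    """All types including superclasses, using a property path."""
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
        f"SELECT DISTINCT ?t WHERE {{ <{uri}> rdf:type/rdfs:subClassOf* ?t }}"
    )


def build_count_query(uri, prop, supports_aggregates):
    """Use COUNT where the endpoint supports it, otherwise list and count."""
    if supports_aggregates:
        return relation_count_query(uri, prop)
    return relation_list_query(uri, prop)


uri = "http://dbpedia.org/resource/Darmstadt"
prop = "http://dbpedia.org/ontology/country"
print(build_count_query(uri, prop, supports_aggregates=True))
print(build_count_query(uri, prop, supports_aggregates=False))
print(transitive_types_query(uri))
```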
• Implementations without SPARQL
– Freebase
– OpenCyc
55. 10/08/13 Heiko Paulheim 55
Challenges and Future Work
• Linking is still challenging
– URI patterns are not flexible
– Search by label is time-consuming
– Services like DBpedia Lookup are scarce
• Limitations of completely unsupervised linking
– e.g., Hurricanes
– how to use headlines/attribute names?
56. 10/08/13 Heiko Paulheim 56
Challenges and Future Work
• Linking as optimization problem
– find candidates for all entities, e.g., by DBpedia lookup
– find a selection of candidates that are most similar to each other
• e.g., all of them are U.S. cities
– some experiments with types and categories
• problem: not complete
– some problems cannot be addressed (e.g.: Hurricanes)
• Alternatives:
– semi-supervised linking – user provides some example links
– active learning
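Linking as an optimization problem can be sketched with a brute-force search. This is a minimal Python sketch, not the approach used in the experiments mentioned above: the candidate lists and type sets in `CANDIDATES` are hypothetical, and similarity is assumed to be the Jaccard overlap of type sets. Enumerating all combinations is exponential in the number of entities, so this only illustrates the objective.

```python
from itertools import product

# Hypothetical candidates per label, each with a set of types.
CANDIDATES = {
    "Paris": [
        ("dbpedia:Paris", {"City", "Place"}),
        ("dbpedia:Paris_Hilton", {"Person"}),
    ],
    "Washington": [
        ("dbpedia:George_Washington", {"Person", "Politician"}),
        ("dbpedia:Washington,_D.C.", {"City", "Place"}),
    ],
    "Berlin": [
        ("dbpedia:Berlin", {"City", "Place"}),
    ],
}


def coherence(selection):
    """Sum of pairwise Jaccard similarities between the chosen type sets."""
    score = 0.0
    for i in range(len(selection)):
        for j in range(i + 1, len(selection)):
            a, b = selection[i][1], selection[j][1]
            score += len(a & b) / len(a | b)
    return score


def link(labels):
    """Pick one candidate per label so that the selection is most coherent."""
    options = [CANDIDATES[label] for label in labels]
    best = max(product(*options), key=coherence)
    return {label: uri for label, (uri, _) in zip(labels, best)}


links = link(["Paris", "Washington", "Berlin"])
print(links)
```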
57. 10/08/13 Heiko Paulheim 57
Challenges and Future Work
• Exploiting semantics for feature selection
• Given two features:
– f1: type(RoadsInAlaska)
– f2: type(Road)
• and the schema definition RoadsInAlaska rdfs:subClassOf Road
• Exploit that information for feature selection
– e.g., gain(f1) ≈ gain(f2), f1<f2 → remove f1
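The pruning rule can be sketched as follows. This is a minimal Python sketch of the idea, assuming that RoadsInAlaska is the more specific class (f1 < f2); the `SUBCLASS_OF` table, the gain values and the tolerance used for "≈" are hypothetical.

```python
# Hypothetical schema excerpt: child class -> direct parent class.
SUBCLASS_OF = {"RoadsInAlaska": "Road"}


def is_subclass(child, parent):
    """Follow subclass links upward and check whether parent is reached."""
    seen = set()
    while child in SUBCLASS_OF and child not in seen:
        seen.add(child)
        child = SUBCLASS_OF[child]
        if child == parent:
            return True
    return False


def prune(gains, tolerance=0.01):
    """Drop a type feature if a superclass feature has about the same gain."""
    removed = set()
    for f1 in gains:
        for f2 in gains:
            if (
                f1 != f2
                and is_subclass(f1, f2)
                and abs(gains[f1] - gains[f2]) <= tolerance
            ):
                removed.add(f1)
    return {f: g for f, g in gains.items() if f not in removed}


# Similar gain: the more specific feature adds nothing and is removed.
print(prune({"RoadsInAlaska": 0.30, "Road": 0.30}))
# Clearly different gain: both features are kept.
print(prune({"RoadsInAlaska": 0.50, "Road": 0.10}))
```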
58. 10/08/13 Heiko Paulheim 58
Challenges and Future Work
• Incompleteness of LOD
– e.g., type information in DBpedia
– may lead to findings such as
• if a city is of type Place, the quality of living is high
– possible remedy: autocomplete on the dataset
(e.g., Paulheim/Bizer 2013)
• Biases in LOD
– e.g., DBpedia has a bias towards western culture
– may lead to findings such as
• if many records have been made in a city, the quality of living is high
59. 10/08/13 Heiko Paulheim 59
Challenges and Future Work
• Features not used for scalability reasons:
– features for single entities
• e.g., “Roman Polanski directorOf X”
– features more than one hop away
• e.g., “Cities with a university which has a computer science department”
– some are covered by YAGO types, e.g., “AustralianBandsFoundedIn1990”
• but subject to YAGO's selection bias
• Approaches are required to use such features
– which respect scalability
– “generate first, filter later” is not the best solution
• e.g., “Cities with at least one of ArtSchoolsInParis”
– on-the-fly filtering may be more suitable
• e.g., sampling
60. 10/08/13 Heiko Paulheim 60
Challenges and Future Work
• Automatically exploit data sources with non-simple structures
(real example from the CORDIS dataset):
    EU18931 a Funding .
    EU18931 has-grant-value [
        has-amount 1300000 ;
        has-unit-of-measure EUR
    ] .
• Support geo/temporal features
– e.g., Data Cubes
– e.g., Linked Geo Data
• Construct complex features (in a scalable way!)
– e.g., cinemas per inhabitant
61. 10/08/13 Heiko Paulheim 61
Wrap-up
• Linked Data is useful as background knowledge
– especially for problems whose data carries little knowledge in itself
• Unsupervised methods
– avoid biases and work without knowledge about LOD
– but: scalability and generality problems
• RapidMiner LOD extension
– a constantly growing toolkit
62. 10/08/13 Heiko Paulheim 62
Credits & Thanks
• Past contributors of FeGeLOD:
– Johannes Fürnkranz
– Raad Bahmani
– Alexander Gabriel
– Simon Holthausen
• Current team of RapidMiner Linked Open Data Extension:
– Chris Bizer
– Petar Ristoski
– Evgeny Mitichkin