What you Can Make Out of Linked Data

Marco Fossati <fossati@spaziodati.eu>
Steven R. Loomis <srloomis@us.ibm.com>

Let's meet the presenters first!

Marco Fossati
Natural Language Processing Advocate
Recommender Systems Aficionado
Open Data Apologist

Steven R. Loomis
IBM
Chair, Unicode ULI-TC
Projects: ICU, CLDR, ULI

Outline
1. Linked Open Data 101
2. DBpedia
3. The ULI use case

Warning!
Highly interactive tutorial

Linked Open Data 101
The Big Picture

What is data?
Data is how we express facts in a reusable form

Why data? The ingredients for...
...Information
...Knowledge
...Wisdom

OK it's data, what else?
Big: billions of facts, e.g. “Santa Clara is a city”
Linked: richly structured
Open: open licenses

Facts, not words
A fact is...
An assertion about the world
Subject + predicate + object
A triple
Human mind
Natural language
Machine

Human mind
Perceiving relationships between entities

Natural language
"Elvis Presley sings Jailhouse Rock"

Machine
The triple
Elvis Presley --sings--> Jailhouse Rock

The graph
Rich structure made of triples
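A minimal sketch of how a handful of triples adds up to a graph, assuming plain Python tuples stand in for RDF statements. Only the first triple comes from the slides; the second triple and the helper name `objects_of` are illustrative.

```python
from collections import defaultdict

# Each fact is a (subject, predicate, object) triple.
# Only the first triple comes from the slides; the second is made up for illustration.
triples = [
    ("Elvis Presley", "sings", "Jailhouse Rock"),
    ("Jailhouse Rock", "is a", "song"),
]

# Index the triples as a graph: subject -> list of (predicate, object) edges.
graph = defaultdict(list)
for subject, predicate, obj in triples:
    graph[subject].append((predicate, obj))


def objects_of(subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [o for p, o in graph.get(subject, []) if p == predicate]


print(objects_of("Elvis Presley", "sings"))
```

Because the object of one triple can be the subject of another, following edges this way walks the graph one relationship at a time.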

From the web of documents...

...to the web of entities

The web of entities
An entity can be...
Identified
Described through relationships
Understood both by humans and machines

Towards a WWW of entities
Identify via HTTP URIs
http://dbpedia.org/resource/Elvis_Presley
Describe via RDF statements
:Elvis_Presley :sings :Jailhouse_Rock
Understand via
HTML for humans
RDF for machines
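A small sketch of the two steps above: identify entities with HTTP URIs, then describe them with an RDF statement, here written out as one N-Triples line. The `http://example.org/vocab/` namespace for `:sings` is hypothetical, and the `Jailhouse_Rock` resource URI is only assumed to follow the same pattern as the Elvis Presley URI shown on the slide.

```python
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
EXAMPLE_VOCAB = "http://example.org/vocab/"  # hypothetical namespace for :sings


def to_ntriple(subject_uri, predicate_uri, object_uri):
    """Serialize one statement as an N-Triples line (URIs only, no literals)."""
    return f"<{subject_uri}> <{predicate_uri}> <{object_uri}> ."


line = to_ntriple(
    DBPEDIA_RESOURCE + "Elvis_Presley",
    EXAMPLE_VOCAB + "sings",
    DBPEDIA_RESOURCE + "Jailhouse_Rock",  # assumed to follow the same URI pattern
)
print(line)
```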

Hands-on Time!
https://pad.okfn.org/p/DBpediaULI

DBpedia
Extracting Knowledge from Wikipedia

DBpedia is…
A. …a framework for extracting data from semi-structured Wikipedia content
B. …an open-source community effort

Wikipedia can’t answer simple questions
“What do Santa Clara and San Francisco have in common?”

Wikipedia can’t answer complex questions
“Which are the black and white movies produced in Italy that have soundtracks which were composed by musicians who were born in a city of the Trentino-Alto Adige region with fewer than 40,000 inhabitants?”

The story so far
Project started in 2007
From good ol’ PHP to Java + Scala
Steadily growing community
Internationalization Committee
Freely available on GitHub

Data in Wikipedia
Title
Short abstract
Long abstract

Structure in Wikipedia
Infobox
Images

Structure in Wikipedia
Links
Categories

Structure in Wikipedia
Interlanguage Links

Much more at http://dbpedia.org/Datasets

DBpedia Extraction Framework (DEF)
Wikipedia dump -> Extractors -> RDF graph

Extractors
Article Features
Abstract, redirects, categories, geo-coordinates, interlanguage links, etc.
Infobox
Raw
Mapping-based

Raw Infobox Extractor
:Elvis_Presley
:born “Elvis Aaron Presley…”
:died “August 16, 1977…”
:restingPlace “Graceland…”
:education “L.C. Humes…”
:occupation “Singer…”
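A rough sketch of what a raw infobox extractor does: every `| key = value` field is turned into a triple as-is, with no cleaning or typing. The wikitext sample, the regular expression, and the function name are all hypothetical simplifications; this is not the extraction framework's actual code.

```python
import re

# Hypothetical, heavily simplified infobox wikitext; real infoboxes nest
# templates and links, which this sketch does not handle.
WIKITEXT = """{{Infobox person
| born = Elvis Aaron Presley
| occupation = Singer
}}"""

# One "| key = value" pair per line.
FIELD = re.compile(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$", re.MULTILINE)


def raw_infobox_triples(subject, wikitext):
    """Turn every infobox field into a (subject, :key, value) triple, as-is."""
    return [(subject, ":" + key, value) for key, value in FIELD.findall(wikitext)]


for triple in raw_infobox_triples(":Elvis_Presley", WIKITEXT):
    print(triple)
```

The predicate names simply mirror whatever keys the infobox author chose, which is why the raw output is heterogeneous across templates and languages.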

The Big Issues
Data is heterogeneous!
Data is multilingual!

Solution
• The DBpedia ontology as a multilingual glue
• Wikipedia-to-ontology Mapping

DBpedia Ontology
Encoding worldwide encyclopedic knowledge

Mapping-based Extractor
Combines what belongs together
Separates what is different
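One way to picture the mapping idea, as a sketch only: different raw infobox keys, possibly from different language editions, are mapped to one shared ontology property, while keys with a different meaning get a different property. The mapping table below is hypothetical; the real mappings are maintained by the community on the mappings wiki.

```python
# Hypothetical mapping table: (language, raw infobox key) -> ontology property.
# The real mappings live on the mappings wiki; these entries are illustrative only.
MAPPINGS = {
    ("en", "born"): "birthDate",
    ("en", "birth_date"): "birthDate",
    ("it", "data_nascita"): "birthDate",
    ("en", "died"): "deathDate",
}


def map_property(lang, raw_key):
    """Return the shared ontology property, or None when no mapping exists."""
    return MAPPINGS.get((lang, raw_key))


# Combines what belongs together...
print(map_property("en", "born"), map_property("it", "data_nascita"))
# ...and separates what is different.
print(map_property("en", "died"))
```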

DIEF - Mapping-based Infobox Extractor

The Mappings Wiki
Anybody can contribute to
mappings.dbpedia.org

Download the latest DBpedia dump at
http://downloads.dbpedia.org/current/

English SPARQL endpoint
dbpedia.org/sparql

Language chapters
DBpedia in your mother tongue

Active chapters
International (English-based)
Basque, Czech, Dutch, French, German, Greek, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Spanish

Host your own language chapter!

Applications
Get the best out of DBpedia data

Knowledge Graphs
Highly informative summaries in your own language

Question Answering
“Who is Bram Stoker?”

Entity Linking
Detecting Things in Text

Automatic Huge Gazetteers
Language and Domain-specific Resources for Short Sentences Classification

DBpedia Stakeholders
Who is using the knowledge base?

Open Government
Linking Local Data

Digital Libraries
Enriching the Catalogue

Data-driven Journalism
Building Infographics

Hands-on Time!

The ULI use case
Putting Linked Open Data to work

What’s wrong with Localization Interoperability?
Inconsistent application, implementation, and interpretation of standards
Lack of clear requirements for localization data interchange

Unicode Localization Interoperability
Technical Committee of Unicode
Focus Areas:
1. Translation memory
2. Translation source strings / translations
3. Segmentation rules

ULI Suppression: Abbreviations
English: Mr., Mrs., Dr., St., …
Spanish: Sr., Dto., Sra., Avda., …
Russian: проф., февр., тел., кв., …
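To illustrate why a suppression list matters for segmentation, here is a naive sketch: break after sentence-ending punctuation unless the token right before the break is a known abbreviation. It uses the English abbreviations from the slide; the splitting logic is a deliberately simple stand-in, not the ICU or Unicode segmentation rules.

```python
import re

# Abbreviations taken from the English column of the slide.
SUPPRESSIONS = {"Mr.", "Mrs.", "Dr.", "St."}


def split_sentences(text, suppressions=SUPPRESSIONS):
    """Break after '.', '!' or '?' followed by whitespace, unless the token
    ending at that point is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start() + 1].split()
        if words and words[-1] in suppressions:
            continue  # suppress the break: the full stop belongs to an abbreviation
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


print(split_sentences("Mr. Smith met Dr. Jones. They talked."))
```

With an empty suppression set the same input would also break after "Mr." and "Dr.", which is exactly the error the suppression data is meant to prevent.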

Demo: ULI Breaks
http://demo.icu-project.org/icu-bin/icusegments
DEMO

DBpedia applied to ULI
(University of Leipzig)
Sebastian Hellmann,
Martin Brümmer,
Dimitris Kontokostas
Opportunity:
Help segmentation by supplying abbreviation data

Yes!
Evaluation shows that, especially for small texts, abbreviations can contribute to precision and recall of segmentation

multilingual with over 100 languages
structured data eases extraction
additional data like entity types and categories

Example: Mr.
“MR” disambiguation page links to “Mr.” article.
Ends in full stop, so may be an abbreviation.

The “Mr.” SPARQL query
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entryExample ?exampleTested ?indegreeRanking
WHERE {
  <http://dbpedia.org/resource/Mr.>
      rdfs:label ?entryExample ;
      rdfs:comment ?exampleTested .
  FILTER ( lang(?entryExample) = lang(?exampleTested) )
  # Subselect: count the incoming links to the resource
  { SELECT (COUNT(?in) AS ?indegreeRanking)
    WHERE { ?in ?p <http://dbpedia.org/resource/Mr.> }
  }
}
LIMIT 100
DEMO
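A sketch of how such a query could be prepared for the English endpoint from a script. It only builds the request URL and performs no network call. The `query` parameter follows the SPARQL protocol; the `format` parameter and its value are an assumption about this particular endpoint, and the query is a shortened variant that fetches labels only.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # English endpoint named in the slides

# Shortened variant of the query: labels of the "Mr." resource only.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Mr.> rdfs:label ?label .
}
LIMIT 100
"""


def build_request_url(endpoint, query):
    """Encode the query as a GET request in the style of the SPARQL protocol.
    The `format` parameter is an assumption about this endpoint."""
    params = {"query": query, "format": "text/tab-separated-values"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
print(url)
# Decoding the URL again shows which parameters would be sent.
print(sorted(parse_qs(urlsplit(url).query)))
```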

Example DBpedia data
(English)
St.
Street
<http://en.wikipedia.org/wiki/Street>
<http://schema.org/Place>
<http://dbpedia.org/ontology/Place>
<http://dbpedia.org/ontology/PopulatedPlace>

Example DBpedia data
(Russian)
Проф.
Профессор (Professor)
<http://ru.wikipedia.org/wiki/Профессор>

2. Load DBpedia data into a local DB

3. Query the data with SPARQL and output TSV
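A minimal sketch of the TSV output step, assuming the query results are already available as rows of (abbreviation, meaning, language). The two rows mirror the English and Russian example data; the column names and the function name are hypothetical.

```python
import csv
import io

# Hypothetical query results: (abbreviation, meaning, language) rows.
# "St." / "Street" and "Проф." / "Профессор" mirror the example data in the slides.
ROWS = [
    ("St.", "Street", "en"),
    ("Проф.", "Профессор", "ru"),
]


def to_tsv(rows, header=("abbreviation", "meaning", "lang")):
    """Serialize result rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


print(to_tsv(ROWS), end="")
```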

22859 abbreviations with 78197 meanings in 99 languages

Long Tail
Only 25 languages >100 abbrevs.
Only 7 languages >1000 abbrevs.

Long tail (total abbrevs) (zoom)

ULI Process
[Diagram: process flow with the labels Wikipedia, DBpedia, Extraction, Translation Memory, Comparison, Manual review, ULI Review, CLDR, CLDR abbrs., CLDR Suppressions]
"Lupa.na.encyklopedii" by Julo - Own work. Licensed under Public domain via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Lupa.na.encyklopedii.jpg#mediaviewer/File:Lupa.na.encyklopedii.jpg

Comparison with
Translation Memory
Entry      % in TM
Corp.      0.0307%
St.        0.0023%
P.T.T.C.   0%
"Trichtermitfilter" by Gmhofmann - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Trichtermitfilter.jpg#mediaviewer/File:Trichtermitfilter.jpg
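A sketch of how a "% in TM" figure could be computed, assuming it means the share of translation-memory segments that contain the entry as a whitespace-delimited token. The slides do not say how the percentage was actually measured, and the segments below are made up.

```python
# Hypothetical translation-memory segments; the real TM data is not in the slides.
TM_SEGMENTS = [
    "Acme Corp. announced results.",
    "Turn left on Main St. and continue.",
    "Nothing abbreviated here",
    "Acme Corp. hired staff.",
]


def percent_in_tm(entry, segments):
    """Percentage of TM segments containing `entry` as a whitespace-delimited token."""
    if not segments:
        return 0.0
    hits = sum(1 for segment in segments if entry in segment.split())
    return 100.0 * hits / len(segments)


for entry in ("Corp.", "St.", "P.T.T.C."):
    print(entry, percent_in_tm(entry, TM_SEGMENTS))
```

An entry that never shows up in the TM, like "P.T.T.C." in the table, is a natural candidate for dropping or for closer manual review.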

CLDR Input
Extract abbreviations from CLDR localized data
Days of week: Sun. Mon. Tue. Wed. Thu. …
Months: Jan. Feb. Mar. …
etc…
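The extraction step can be sketched as a simple filter: keep localized names that end in a full stop. The input list below uses the day and month abbreviations from the slide plus two hypothetical names without a full stop for contrast; how the real CLDR data is read is left out.

```python
# Abbreviated day and month names as listed on the slide, plus two
# hypothetical entries without a trailing full stop for contrast.
CLDR_NAMES = ["Sun.", "Mon.", "Tue.", "Jan.", "Feb.", "Mar.", "Sunday", "May"]


def abbreviation_candidates(names):
    """Keep names that end in a full stop, deduplicated and sorted."""
    return sorted({name for name in names if name.endswith(".") and len(name) > 1})


print(abbreviation_candidates(CLDR_NAMES))
```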

CLDR output format
<segmentations>
  <segmentation type="SentenceBreak">
    <suppressions type="standard">
      <suppression>Port.</suppression>
      <suppression>Alt.</suppression>
      <suppression>Di.</suppression>
      <suppression>Ges.</suppression>
      <suppression>frz.</suppression>
    </suppressions>
  </segmentation>
</segmentations>
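A sketch of producing this structure from a list of abbreviations with Python's standard `xml.etree.ElementTree`. Element and attribute names are copied from the slide; the function name is hypothetical, and any further attributes or wrapper elements a real CLDR file needs are not covered.

```python
import xml.etree.ElementTree as ET


def build_suppressions(abbreviations):
    """Build the <segmentations> tree with one <suppression> per abbreviation."""
    root = ET.Element("segmentations")
    segmentation = ET.SubElement(root, "segmentation", {"type": "SentenceBreak"})
    suppressions = ET.SubElement(segmentation, "suppressions", {"type": "standard"})
    for abbreviation in abbreviations:
        ET.SubElement(suppressions, "suppression").text = abbreviation
    return root


root = build_suppressions(["Port.", "Alt.", "Di.", "Ges.", "frz."])
print(ET.tostring(root, encoding="unicode"))
```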

CLDR 26 Output
http://cldr.unicode.org
“Break Suppression”
de 239
en 151
es 164
fr 82
it 45
pt 170
ru 18

Challenges
"Long Tail" Languages
harder to find existing TM data
harder to find linguistic rules/review
harder to find tagged corpora to benchmark
Systematic issues with using redirects/disambiguation

Opportunity
Scope:
Non-full-stop punctuation, e.g. "Yahoo!"
Language-specific abbreviation rules
Context (Medical, Business, …)
Leverage Schema/Taxonomy (“Place” vs “Person” etc.) to filter DBpedia lists
Additional LOD
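The taxonomy idea can be sketched as a type filter over abbreviation candidates. The "St." entry mirrors the English example data (Street, typed as a Place); the other candidates, their types, and the function name are invented for illustration.

```python
# Hypothetical candidates with ontology types; "St." -> Place mirrors the
# English example data, the other entries are made up for illustration.
CANDIDATES = [
    {"abbr": "St.", "meaning": "Street", "types": {"Place", "PopulatedPlace"}},
    {"abbr": "Dr.", "meaning": "Doctor", "types": {"Person"}},
    {"abbr": "Corp.", "meaning": "Corporation", "types": set()},
]


def filter_by_type(candidates, wanted_type):
    """Return the abbreviations whose meaning carries the wanted ontology type."""
    return [c["abbr"] for c in candidates if wanted_type in c["types"]]


print(filter_by_type(CANDIDATES, "Place"))
print(filter_by_type(CANDIDATES, "Person"))
```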

Thank You!
Further Q&A?
Slides & contact info:
