19. The web of entities
An entity can be...
Identified
Described through relationships
Understood both by humans and machines
20. Towards a WWW of entities
Identify via HTTP URIs
http://dbpedia.org/resource/Elvis_Presley
Describe via RDF statements
:Elvis_Presley :sings :Jailhouse_Rock
Understand via
HTML for humans
RDF for machines
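To make the "describe via RDF statements" step concrete, here is a minimal sketch that expands the prefixed statement above into full URIs and prints it as one N-Triples line. The slide only spells out the URI for Elvis_Presley; putting `sings` and `Jailhouse_Rock` under the same base URI is an assumption made for illustration, and `to_ntriple` is a hypothetical helper.

```python
# Assumed namespace: the slide only gives the full URI for Elvis_Presley.
BASE = "http://dbpedia.org/resource/"

def to_ntriple(subject, predicate, obj, base=BASE):
    """Expand three local names against one base URI and format an N-Triples line."""
    return " ".join(f"<{base}{name}>" for name in (subject, predicate, obj)) + " ."

line = to_ntriple("Elvis_Presley", "sings", "Jailhouse_Rock")
print(line)
```

The same identifier that a browser dereferences to an HTML page is what a machine reads as the subject of the statement.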
26. Wikipedia can’t answer simple questions
“What do Santa Clara and San Francisco have in common?”
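Once facts are held as statements rather than prose, a question like this one reduces to a set intersection. The sketch below runs on a tiny hand-written set of triples; the property names and values are illustrative assumptions, not actual DBpedia data, and `facts_about` and `in_common` are hypothetical helpers.

```python
# Toy triples (subject, predicate, object); names are made up for illustration.
TRIPLES = {
    ("Santa_Clara", "type", "City"),
    ("Santa_Clara", "locatedIn", "California"),
    ("Santa_Clara", "county", "Santa_Clara_County"),
    ("San_Francisco", "type", "City"),
    ("San_Francisco", "locatedIn", "California"),
    ("San_Francisco", "county", "San_Francisco_County"),
}

def facts_about(entity, triples=TRIPLES):
    """Return the (predicate, object) pairs stated about one entity."""
    return {(p, o) for s, p, o in triples if s == entity}

def in_common(a, b, triples=TRIPLES):
    """Facts shared by both entities, sorted for stable output."""
    return sorted(facts_about(a, triples) & facts_about(b, triples))

common = in_common("Santa_Clara", "San_Francisco")
print(common)
```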
27. Wikipedia can’t answer complex questions
“Which are the black and white movies produced in Italy that have soundtracks which were composed by musicians who were born in a city of the Trentino-Alto-Adige region with less than 40,000 inhabitants?”
28. The story so far
Project started in 2007
From good ol’ PHP to Java + Scala
Steadily growing community
Internationalization Committee
Freely available on GitHub
60. The ULI use case
Putting Linked Open Data to work
61. What’s wrong with Localization Interoperability?
Inconsistent application, implementation, and interpretation of standards
Lack of clear requirements for localization data interchange
63. ULI: Segmentation
Given:
Thanks to Dr. Jones for this effort.
UAX #29 Segmentation:
|Thanks to Dr.| Jones for this effort.|
English:
|Thanks to Dr. Jones for this effort.|
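The difference between the two segmentations above can be sketched with a deliberately simplified splitter. It breaks after `.`, `!` or `?` followed by whitespace, which only stands in for the default rules and is not the Unicode algorithm itself; the `suppressions` parameter is a hypothetical name for a list of abbreviations after which no break should occur.

```python
import re

def segment(text, suppressions=()):
    """Split after '.', '!' or '?' followed by whitespace, unless the token
    ending at that point is in the suppression list."""
    segments, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.end()].rstrip()
        last_token = candidate.split()[-1]
        if last_token in suppressions:
            continue  # e.g. "Dr." is an abbreviation, not a sentence end
        segments.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

text = "Thanks to Dr. Jones for this effort."
default_segments = segment(text)
english_segments = segment(text, suppressions={"Dr."})
print(default_segments)
print(english_segments)
```

Without the suppression list the sentence is cut after "Dr."; with it, the sentence stays in one piece.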
66. DBpedia applied to ULI
(University of Leipzig)
Sebastian Hellmann,
Martin Brümmer,
Dimitris Kontokostas
Opportunity:
Help segmentation by supplying abbreviation data
67. Yes!
Evaluation shows that, especially for small texts, abbreviations can contribute to precision and recall of segmentation
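For readers unfamiliar with the two measures, the sketch below computes precision and recall over sets of break offsets. It reuses the "Dr. Jones" sentence from the segmentation slide as a one-sentence illustration; it is not the evaluation data, and treating an empty prediction set as precision 1.0 is a convention chosen here, not something the slides specify.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted break positions against gold ones."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Convention (assumption): an empty set scores 1.0 rather than dividing by zero.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall

# Break offsets for "Thanks to Dr. Jones for this effort." (36 characters).
gold = {36}                 # only the end of the sentence
without_abbrevs = {13, 36}  # also breaks after "Dr."
with_abbrevs = {36}         # "Dr." suppressed

print(precision_recall(without_abbrevs, gold))
print(precision_recall(with_abbrevs, gold))
```

In this toy case the spurious break after "Dr." halves precision, and suppressing it restores a perfect score.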
72. Example DBpedia data
(English)
St.
Street
<http://en.wikipedia.org/wiki/Street>
<http://schema.org/Place>
<http://dbpedia.org/ontology/Place>
<http://dbpedia.org/ontology/PopulatedPlace>
73. Example DBpedia data
(Russian)
Проф.
Профессор (Professor)
<http://ru.wikipedia.org/wiki/Профессор>
79. 22859 abbreviations with 78197 meanings in 99 languages
Long Tail
Only 25 languages >100 abbrevs.
Only 7 languages >1000 abbrevs.
82. ULI Process
DBpedia
Wikipedia
ULI
Review
Extraction
Translation Memory
Comparison
Manual review
CLDR
"Lupa.na.encyklopedii" by Julo - Own work. Licensed under Public domain via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Lupa.na.encyklopedii.jpg#mediaviewer/File:Lupa.na.encyklopedii.jpg
CLDR abbrs.
CLDR Suppressions
83. Comparison with Translation Memory
Entry       % in TM
Corp.       0.0307%
St.         0.0023%
P.T.T.C.    0%
"Trichtermitfilter" by Gmhofmann - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Trichtermitfilter.jpg#mediaviewer/File:Trichtermitfilter.jpg
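The slide does not say how "% in TM" was computed. One plausible reading, assumed here, is the share of translation-memory segments that contain the entry as a token. The sketch below implements that reading on a toy TM; the segments are hypothetical, so the resulting percentages are unrelated to the figures above.

```python
def percent_in_tm(entry, segments):
    """Percentage of TM segments that contain `entry` as a whitespace-delimited token."""
    if not segments:
        return 0.0
    hits = sum(1 for segment in segments if entry in segment.split())
    return 100.0 * hits / len(segments)

# Toy translation memory (hypothetical segments, not real TM data).
tm_segments = [
    "Acme Corp. announced results.",
    "Turn left on Main St. and continue.",
    "The meeting starts at noon.",
    "Contact Acme Corp. for details.",
]

for entry in ("Corp.", "St.", "P.T.T.C."):
    print(entry, percent_in_tm(entry, tm_segments))
```

Under this reading, an entry that never occurs in the TM (like P.T.T.C. on the slide) is a natural candidate for filtering out before manual review.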
84. CLDR Input
Extract abbreviations from CLDR localized data
Days of week: Sun. Mon. Tue. Wed. Thu. …
Months: Jan. Feb. Mar. …
etc…
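A minimal sketch of this extraction step, assuming the localized strings have already been loaded into plain Python lists: it keeps every string that ends in a full stop. The day and month values are the ones shown on the slide; `full_names` is a hypothetical addition showing what gets filtered out, and reading the real CLDR files is left out entirely.

```python
def extract_abbreviations(localized_strings):
    """Keep strings that end in a full stop; those are suppression candidates."""
    return sorted({s for s in localized_strings if s.endswith(".") and len(s) > 1})

# Values copied from the slide; real CLDR data would be read from its locale files.
days = ["Sun.", "Mon.", "Tue.", "Wed.", "Thu."]
months = ["Jan.", "Feb.", "Mar."]
full_names = ["Sunday", "January"]  # hypothetical non-abbreviated entries

candidates = extract_abbreviations(days + months + full_names)
print(candidates)
```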
87. CLDR 26 Output
http://cldr.unicode.org
“Break Suppression”
de 239
en 151
es 164
fr 82
it 45
pt 170
ru 18
88. Challenges
"Long Tail" Languages
harder to find existing TM data
harder to find linguistic rules/review
harder to find tagged corpora to benchmark
Systematic issues with using redirects/disambiguation
89. Opportunity
Scope:
Non-full-stop punctuation, e.g. "Yahoo!"
Language-specific abbreviation rules
Context (Medical, Business, …)
Leverage Schema/Taxonomy (“Place” vs “Person” etc.) to filter DBpedia lists
Additional LOD
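The schema-based filtering idea can be sketched as a set intersection over type URIs. The "St." / Street record follows the example data shown earlier in the deck; the other two records, the record layout, and the `filter_by_type` helper are hypothetical and only illustrate the technique.

```python
PLACE = "http://dbpedia.org/ontology/Place"
PERSON = "http://dbpedia.org/ontology/Person"

# Candidate records: (abbreviation, meaning, set of type URIs).
# Only the first record mirrors the deck's example; the rest are made up.
candidates = [
    ("St.", "Street", {"http://schema.org/Place", PLACE}),
    ("St.", "Saint", {PERSON}),
    ("Corp.", "Corporation", set()),
]

def filter_by_type(records, allowed_types):
    """Keep (abbreviation, meaning) pairs whose types overlap the allowed set."""
    return [(abbr, meaning) for abbr, meaning, types in records if types & allowed_types]

places_only = filter_by_type(candidates, {PLACE})
print(places_only)
```

Records without any type information drop out under this scheme, so a real pipeline would need a policy for untyped entries.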