1. Semantic Representations for Research
Rinke Hoekstra and Stefan Schlobach
VU University Amsterdam/University of Amsterdam
http://www.data2semantics.org
2. About us...
• Knowledge Representation and Reasoning Group
Frank van Harmelen
• Modeling of complex domains
• Querying and reasoning over these models
• ... at a very large scale (the Web)
3. About us...
• Knowledge Representation and Reasoning Group
Frank van Harmelen
• Experience
Among others: CATCH, STICH, LarKC, CEDAR and Data2Semantics
• Premier group for provenance and linked data at scale
4. Overview
• Research Lifecycle
Data2Semantics and LarKC
• Historical Census Data
CEDAR and Data2Semantics
• Short Title Catalogue of The Netherlands (STCN)
Inger Leemans, Fernie Maas, Paul Huygen, Albert Meroño-Peñuela
5. How to share, publish, access, analyse, interpret and reuse data?
Increase the ease of sharing scientific data ...
... of accessing, analysing and interpreting data ...
... and thereby increase the reuse of data
8. EASY Data Repository
Enrich datasets: census data
Large volumes of publications
Improve services to clients
Automated services
9. EASY Data Repository
Enrich datasets: census data
Large volumes of publications
Improve services to clients
Automated services
Build systems for hospitals
11. Linked Data
• “Semantic Hyperlinks” between data items
• Every data item has a global identifier ...
• ... that looks like a web address (URI) ...
• ... is linked and described using shared vocabularies
• Resource Description Framework (RDF)
• SPARQL query language & endpoint
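A minimal sketch of the ideas on this slide, in plain Python rather than a real RDF library. Data items are named by URIs, properties come from shared vocabularies (Dublin Core Terms and FOAF), and a pattern match with wildcards plays the role of a SPARQL query. The `http://example.org/` namespace, the sample data and the helper names `match` and `creator_names` are assumptions made for illustration.

```python
# Shared vocabularies: property URIs that many datasets reuse.
DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_CREATOR = "http://purl.org/dc/terms/creator"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

EX = "http://example.org/"  # hypothetical namespace for the example data

# An RDF graph is a set of (subject, predicate, object) triples.
triples = {
    (EX + "book/1", DCT_TITLE, "An example title"),
    (EX + "book/1", DCT_CREATOR, EX + "person/1"),
    (EX + "person/1", FOAF_NAME, "An example author"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts like a SPARQL variable."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def creator_names(graph, book):
    """Follow book -> creator -> name, as a two-pattern SPARQL query would."""
    return [
        name
        for (_, _, person) in match(graph, s=book, p=DCT_CREATOR)
        for (_, _, name) in match(graph, s=person, p=FOAF_NAME)
    ]
```

Because the creator is itself a URI, the second pattern can be answered from any dataset that describes the same identifier, which is what makes the hyperlink "semantic".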
12. Linked Data
[Figure: Linking Open Data cloud diagram, as of September 2011. Legend: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences.]
13. Research Lifecycle
[Figure: research lifecycle diagram. Stages: Semi-Automatic Conversion, RDF Conversion, RDF Cleaning, Internal Linking, Link to Other Data, Semi-Automatic Annotation, Provenance Enrichment, Provenance Tracking, Linked Data Cloud, Analysis and Metrics, Querying and Ranking, Visualization, User Interfaces, RDF Feedback. Tools and techniques named: "tablinker", xml2rdf, d2rq, rdb2rdf, graph rewriting, Amalgame, SILK, GATE, OpenCalais. Open question noted: acquiring data from text?]
14. Challenges
• Build useful services and tools for data publishers ...
• ... that maintain provenance information ...
• ... and cater for the entire research cycle ...
• ... including a feedback loop to new research
16. Large Knowledge Collider
• Data analysis pipeline
• Custom workflows
• Highly scalable
• Query driven
• Exposed as SPARQL endpoint
20. Historical Census Data
• Gathered from 1795 to 1971
• Demographics, houses, occupations
• 507 Excel files
• 2288 tables
• 33283 annotations
21. Annotations
• Created at data entry time
• Created as we speak
• Corrections to original census tables
• Corrections to Excel version of census table
• Any additional remarks...
22. Harmonization
• Enable historical research across census years
• Query across multiple heterogeneous datasets
• Accommodate multiple interpretations
24. Current Situation
• Iterative refinement of MySQL database tables
• Harmonization against existing codifications
• Expensive manual process
• Loss of information between harmonization steps
• Loss of detail in mapping to existing codification
• Not repeatable
25. Requirements
• (Semi-)automatic conversion and harmonization
• Repeatable
• Conservation of information (only add)
• Provenance (who did what)
• Flexible model
• Linking to other datasets
• Publish as open data
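One possible reading of "only add", "who did what" and "repeatable", sketched in Python under stated assumptions: an append-only log in which a correction is a new statement next to the original, every statement records an agent and an activity, and any earlier state can be rebuilt by replaying the log. The class and field names are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Statement:
    triple: Tuple[str, str, str]
    agent: str      # who
    activity: str   # did what


class AppendOnlyStore:
    """Statements can be added, never changed or removed."""

    def __init__(self):
        self._log = []

    def add(self, triple, agent, activity):
        self._log.append(Statement(triple, agent, activity))

    def statements(self, agent=None):
        """All statements, optionally restricted to one agent."""
        return [s for s in self._log if agent is None or s.agent == agent]

    def replay(self, upto=None):
        """Rebuild the triples from the first `upto` log entries (all if None)."""
        return {s.triple for s in self._log[:upto]}


store = AppendOnlyStore()
store.add(("cell:B2", "value", "12"), "data-entry", "transcription")
# A correction is added next to the original value instead of replacing it.
store.add(("cell:B2", "correctedValue", "13"), "historian-1", "correction")
```

Replaying the log up to a given entry reproduces the dataset as it stood before a later harmonization step, so no step destroys information.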
26. Research Cycle
[Figure: the research lifecycle diagram of slide 13, shown again.]
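30. TabLinker
http://github.com/Data2Semantics/TabLinker
[Figure: annotated excerpt of a census spreadsheet. Labels shown, some marked with "?": leeftijd (age), geboortejaar (year of birth), geslacht (sex), huwelijkse staat (marital status), beroep (occupation), positie (position), nummer der beroepsklasse (number of the occupational class), letter der beroepsklasse (letter of the occupational class). Cell values shown: 12, 1878, M, O, I, E, D, 1, pannenbakkers (roof-tile makers).]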
31. TabLinker
• Verbatim graph representation of spreadsheet
• Separate layer for semantics of spreadsheet
• Separate graphs for any annotations, interpretations and harmonizations of the underlying data
• Round-tripping from Excel to RDF and back
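The bullets above can be illustrated with a small Python sketch. It is not TabLinker's actual code or vocabulary: the `http://example.org/census/` namespace, the property names and the function names are assumptions, and column letters only work for the first 26 columns. Each cell becomes a triple in a verbatim graph, annotations go into a separate graph, and the verbatim graph alone is enough to rebuild the sheet.

```python
from collections import defaultdict

BASE = "http://example.org/census/"  # hypothetical namespace


def cell_uri(sheet, row, col):
    """URI for a cell, e.g. row 0, column 1 of Sheet1 -> .../Sheet1/B1."""
    return f"{BASE}{sheet}/{chr(ord('A') + col)}{row + 1}"


def sheet_to_graphs(sheet, rows):
    """Store every non-empty cell verbatim, one triple per cell."""
    graphs = defaultdict(set)
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value is not None:
                graphs["verbatim"].add(
                    (cell_uri(sheet, r, c), BASE + "value", str(value))
                )
    return graphs


def annotate(graphs, cell, note, author):
    """Add an annotation in its own graph; the verbatim graph is untouched."""
    graphs["annotations"].add((cell, BASE + "note", note))
    graphs["annotations"].add((cell, BASE + "annotatedBy", author))


def graph_to_rows(graph, sheet, n_rows, n_cols):
    """Round trip: rebuild the table (as strings) from the verbatim graph."""
    values = {s: o for (s, p, o) in graph if p == BASE + "value"}
    return [
        [values.get(cell_uri(sheet, r, c)) for c in range(n_cols)]
        for r in range(n_rows)
    ]
```

Keeping interpretations out of the verbatim graph means that several, possibly conflicting, annotation graphs can refer to the same cell.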
36. Harmonization within a year
[Figure: SKOS hierarchy of occupation classes for one census year. Class nodes I, D, E and A and four occupation groups are connected by skos:broader. The groups are: Fabricage van kalk (lime); Fabricage van steen (molensteen, steenbakkers, tegelbakkers) (stone: millstone, brick makers, tile makers); Fabricage van dakpannen (pannenbakkers) (roof tiles: roof-tile makers); Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.) (pottery, incl. porcelain, terracotta, stove makers, potters, etc.). The same hierarchy is shown a second time with every node prefixed "Sheet1:".]
37. Harmonization across years
[Figure: the occupation hierarchies of 1889 and 1899, each connecting class nodes I, D, E and A and four occupation groups with skos:broader. Concepts of the two years are mapped to each other with skos:exactMatch, skos:closeMatch and skos:narrowMatch. Two group labels differ between the years: 1889 has "Fabricage van steen (molensteen, steenbakkers, tegelbakkers)" and "Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.)", 1899 has "Fabricage van steen (steenbakkers, tegelbakkers)" and "Fabricage van aardewerk (incl. porcelein, kachelbakkers, pottenbakkers, enz.)". "Fabricage van kalk" and "Fabricage van dakpannen (pannenbakkers)" appear in both.]
38. Harmonization: external linking
[Figure: the occupation hierarchy (class nodes I, D, E, A and the four occupation groups, connected by skos:broader) mapped to HISCO codes with skos:exactMatch, skos:broadMatch and skos:closeMatch. Codes shown: HISCO:23810, HISCO:23811, HISCO:25281, HISCO:26340, HISCO:26345.]
HISCO: Historical International Standard Classification of Occupations
40. Open Issues
• Create the necessary mappings between graphs
... this is historical research
• Mappings are interpretations
• Query within a specified interpretation space
• How to reliably perform statistical analysis across mappings?
• How to study concept drift across years?
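To make "query within a specified interpretation space" concrete, here is a small Python sketch under loud assumptions: the counts, the class identifiers and the two interpretations ("strict" and "lenient") are invented for illustration and do not come from the census data. Each interpretation is a separate set of SKOS mapping triples, and a query names both the interpretation and the mapping relations it is willing to follow.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
EXACT = SKOS + "exactMatch"
CLOSE = SKOS + "closeMatch"

# Hypothetical head counts per year-specific occupation class.
counts = {"1889:E": 10, "1899:E": 14, "1899:A": 3}

# Hypothetical interpretations: each is its own graph of mappings.
interpretations = {
    "strict": {("1889:E", EXACT, "1899:E")},
    "lenient": {
        ("1889:E", EXACT, "1899:E"),
        ("1889:E", CLOSE, "1899:A"),
    },
}


def matched(mappings, source, relations):
    """Classes that `source` maps to through one of the given relations."""
    return sorted(o for (s, p, o) in mappings if s == source and p in relations)


def count_in_later_year(interpretation, source, relations):
    """Sum the counts of all classes matched within one interpretation."""
    targets = matched(interpretations[interpretation], source, relations)
    return sum(counts.get(t, 0) for t in targets)
```

With these made-up numbers the strict interpretation yields 14 and the lenient one 17 for the same source class, which is the difficulty behind the question on statistical analysis across mappings.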
41. Short Title Catalogue
• All books published in the Netherlands until 1800
• Digitized over a period of 30 years
• 139817 publications (KB says >190000)
• 9962 publishers
• 23627 authors
• 96024 links to scanned title pages
44. Research Cycle
[Figure: the research lifecycle diagram of slide 13, shown again.]
45. Procedure
• Convert to MySQL database
Paul Huygen
• Specify mapping to RDF
D2RQ mapping language
• Interlink with other data sources
Bibliografisch Portaal, Rijksmuseum, Iconclass, Ecartico
• Publish as browsable and queryable dataset
http://stcn.data2semantics.org
46. Procedure
• Convert to MySQL database ✓
Paul Huygen
• Specify mapping to RDF ✓
D2RQ mapping language
• Interlink with other data sources
Bibliografisch Portaal, Rijksmuseum, Iconclass, Ecartico
• Publish as browsable and queryable dataset ✓
http://stcn.data2semantics.org
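The "specify mapping to RDF" step can be pictured with the Python sketch below. It is not D2RQ and does not use the D2RQ mapping language, which is declarative; it only mimics the idea of a URI pattern per table plus one property per column. SQLite stands in for MySQL so the example is self-contained, and the table, columns, namespace and sample rows are all hypothetical.

```python
import sqlite3

BASE = "http://example.org/stcn/"  # hypothetical, not the real STCN namespace
DCT = "http://purl.org/dc/terms/"

# Hypothetical mapping: one table, a URI pattern, and a property per column.
MAPPING = {
    "table": "publication",
    "uri_pattern": BASE + "publication/{id}",
    "columns": {"title": DCT + "title", "year": DCT + "date"},
}


def rows_to_triples(conn, mapping):
    """Turn every row into triples; NULL columns produce no triple."""
    cols = ["id"] + list(mapping["columns"])
    query = f"SELECT {', '.join(cols)} FROM {mapping['table']} ORDER BY id"
    triples = []
    for row in conn.execute(query):
        record = dict(zip(cols, row))
        subject = mapping["uri_pattern"].format(id=record["id"])
        for col, prop in mapping["columns"].items():
            if record[col] is not None:
                triples.append((subject, prop, str(record[col])))
    return triples


def demo():
    """Build a tiny in-memory database and convert it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE publication (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute("INSERT INTO publication VALUES (1, 'Example title', 1650)")
    conn.execute("INSERT INTO publication VALUES (2, 'Untitled', NULL)")
    return rows_to_triples(conn, MAPPING)
```

Because the mapping is data rather than code, re-running the conversion after a database update is repeatable, in line with the requirements listed earlier.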
50. Summary
• We use a highly flexible modeling framework that ...
• ... allows for rapid data publication and integration ...
• ... is extensible and distributed (DB = Web) ...
• ... allows for co-existing diverging interpretations ...
• ... adheres to the law of conservation of information ...
• ... offers existing methods for capturing provenance ...
• ... allows for a closed loop research cycle.