This tutorial on Open Government Data was a 4-hour session at the Conferencia Latinoamericana en Informática (CLEI 2013), http://clei2013.org.ve/, divided into 5 parts:
1 - Introduction
http://www.slideshare.net/jpane/open-government-data-tutorial-at-clei-2013-part-1-introduction
2 - Issues
https://www.slideshare.net/jpane/02-issues-v4slideshare
3 - Real Experience
http://www.slideshare.net/jpane/open-government-data-tutorial-03-real-experience
4 - Applications
http://www.slideshare.net/jpane/open-government-data-tutorial-at-clei-2013-part-4-applications
5 - Semantic Issues
http://www.slideshare.net/jpane/open-government-data-tutorial-at-clei-2013-part-5-semantic-issues
This is part 5 - Semantic Issues
4. Lack of explicit semantics
The real meaning of the data was kept in the developer's mind when creating the data.
http://goo.gl/npEHKr (Thanks to Moaz Reyad)
Juan Pane, Lorenzino Vaccari
08/10/2013
5. Lack of explicit semantics
Can lead to things like:
http://goo.gl/npEHKr (Thanks to Moaz Reyad)
7. Issues when Opening Trentino Data
Each department has authority over only part of the data.
Datasets originally created for internal use only.
Datasets created for a specific need.
Datasets created with custom formats:
For structure (some exceptions)
For data
Lack of reuse -> duplication.
Lack of programmers.
We cannot (always) TELL them what to do or how to do it.
Data changes.
9. Entity centric: Added value
Aggregated data
Accurate data, manually curated
Unique identifiers, distributed perspectives
Re-think identifiers
Semantified values
[Figure: two entities, E1 and E2, described by the attributes name (Ignacio P. F.; Juan Pane), nationality (italian), born in (Paraguay), lives in (Trento), date of birth (1980) and affiliation (Univ. Trento; PF-UNA).]
10. Entities
Real world: something that has a distinct, separate existence, although it need not be a material (physical) existence. It has a set of properties, which evolve over time.
Example:
Mental: a personal (local) model, created and maintained by a person, that references and describes a real world entity.
Digital: captures the semantics of real world entities, as provided by people.
11. Entity Centric Semantic Layer:
• Address the integration problems due to semantic heterogeneity:
• Different formats
• Different identifiers
• Implicit semantics
• Homonyms, synonyms, aliases
• Partial knowledge
• Knowledge evolution
http://www.webfoundation.org/2011/11/5-staropen-data-initiatives/
12. Entity-based Integration
• Focus on entities as first class citizens
• Entities are objects so important in our everyday life that we refer to them by name
• Each entity has its own metadata (e.g. name, latitude, longitude, …)
• Each entity is in relation with many other entities (e.g. Einstein was born in Ulm, his affiliation was Charles University, Ulm is a city in Germany)
• There are relatively “few” commonsense entity types (person, …, event)
• There are many domain specific entities (bus stops, cycling paths, …)
• All components have explicit semantics: schema, entities, attributes, values
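As an illustration of entities with their own metadata and explicit relations, here is a minimal Python sketch. The class name `Entity`, the identifiers `e1`..`e3`, the relation label `located in` and the helper `follow` are all assumptions made for this example; they are not taken from the tutorial's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A digital entity: a typed object with attributes and relations."""
    entity_id: str      # hypothetical identifier scheme
    entity_type: str    # e.g. "person", "location"
    attributes: dict = field(default_factory=dict)
    relations: dict = field(default_factory=dict)  # relation name -> target entity id


# The Einstein example from the slide, with made-up identifiers.
germany = Entity("e3", "location", {"name": "Germany"})
ulm = Entity("e2", "location", {"name": "Ulm"}, {"located in": "e3"})
einstein = Entity("e1", "person", {"name": "Einstein"}, {"born in": "e2"})

store = {e.entity_id: e for e in (germany, ulm, einstein)}


def follow(entity, relation):
    """Return the entity a relation points to, or None if it is unknown."""
    return store.get(entity.relations.get(relation))


birthplace = follow(einstein, "born in")
print(birthplace.attributes["name"])
```

Because relations store identifiers rather than text, "born in Ulm" points at the Ulm entity itself and not at a string that would need to be interpreted again.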
13. Importing pipeline, Macro Steps
1. Domain analysis
Study the needed entity types, adapt the knowledge base accordingly. First time bootstrapping.
2. Import entities
Semi-automatic tool.
Domain experts are expensive.
Human attention is a scarce resource.
Incremental enrichment and aggregation of entities.
14. Open Data Peculiarities
All data comes from a CKAN repository (DCAT).
Process one data file at a time.
Each data file can be represented as a table.
Each row in the table represents a (partial) entity.
The format of the values might not be enforced in the data files.
Not all data is relevant.
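The idea that each data file is a table whose rows are (partial) entities can be sketched in a few lines of Python. This assumes the file is CSV; the function name `rows_as_partial_entities` is hypothetical, and the sample values are borrowed from the schema matching slide later in this tutorial.

```python
import csv
import io

# Two sample rows in CSV form (values taken from the LocalitaTuristica example).
SAMPLE = (
    "nome,provincia\n"
    "Andalo (1047),Provincia di Trento\n"
    "Canazei (1450),Trento Prov.\n"
)


def rows_as_partial_entities(text):
    """Read a CSV table and return one dict per row.

    Empty cells are dropped, so each dict only states what the row
    actually knows about the entity (a partial description).
    """
    reader = csv.DictReader(io.StringIO(text))
    return [{k: v for k, v in row.items() if v} for row in reader]


entities = rows_as_partial_entities(SAMPLE)
print(entities[0]["nome"])
```

All values stay as strings at this point, which reflects the slide's remark that the format of the values might not be enforced in the data files.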
18. 2. Schema Matching
Select a target type of entity -> correspondences between the input columns and the output attributes
Input table: LocalitaTuristica
nome | provincia | descrizione | lat | long | funivie
Andalo (1047) | Provincia di Trento | Sorge su un'ampia sella prativa al centro... | 654463 | 511504 | 3
Canazei (1450) | Trento Prov. | Situato all'estremità settentrionale della... | 712857 | 147444 | 2
Target attributes:
• Nome
• Provincia
• Quota
• Coordinate
• Descrizione
• popolazione
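A minimal sketch of how correspondences between input columns and target attributes could be proposed automatically. It uses plain string similarity from Python's standard `difflib` as a stand-in; the tutorial does not say which matching technique its schema matching services use, and the function name `match_schema` and the `threshold` value are assumptions.

```python
from difflib import SequenceMatcher

INPUT_COLUMNS = ["nome", "provincia", "descrizione", "lat", "long", "funivie"]
TARGET_ATTRIBUTES = ["Nome", "Provincia", "Quota", "Coordinate",
                     "Descrizione", "popolazione"]


def match_schema(input_columns, target_attributes, threshold=0.8):
    """Propose, for each input column, the most similar target attribute.

    Columns whose best similarity is below the threshold map to None,
    meaning a person has to decide (or the column is not relevant).
    """
    correspondences = {}
    for column in input_columns:
        best_attr, best_score = None, 0.0
        for attr in target_attributes:
            score = SequenceMatcher(None, column.lower(), attr.lower()).ratio()
            if score > best_score:
                best_attr, best_score = attr, score
        correspondences[column] = best_attr if best_score >= threshold else None
    return correspondences


correspondences = match_schema(INPUT_COLUMNS, TARGET_ATTRIBUTES)
print(correspondences["nome"], correspondences["lat"])
```

Name similarity alone cannot discover that lat and long together correspond to Coordinate, which is one reason the import tool is described as semi-automatic.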
19. 3. Data Validation
Applies format and structure validation, plus any automatic transformations needed to bring the input data into the expected format.
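A small sketch of validation with automatic transformation, using the LocalitaTuristica sample values. It assumes that the number in parentheses after the place name is its elevation (the target attribute Quota) and that lat and long should be integers; both are assumptions for illustration, as is the function name `validate_row`.

```python
import re

# Assumed input pattern: "Name (elevation)", e.g. "Andalo (1047)".
NAME_WITH_QUOTA = re.compile(r"^\s*(?P<name>.+?)\s*\((?P<quota>\d+)\)\s*$")


def validate_row(row):
    """Return (clean_row, errors) for one input row given as a dict of strings."""
    clean, errors = {}, []

    # Transformation: split the name column into Nome and Quota.
    match = NAME_WITH_QUOTA.match(row.get("nome", ""))
    if match:
        clean["Nome"] = match.group("name")
        clean["Quota"] = int(match.group("quota"))
    else:
        errors.append("nome: expected 'Name (elevation)'")

    # Format validation: coordinates must be whole numbers.
    for column in ("lat", "long"):
        value = row.get(column, "").strip()
        if value.isdigit():
            clean[column] = int(value)
        else:
            errors.append(f"{column}: expected an integer, got {value!r}")

    return clean, errors


clean, errors = validate_row(
    {"nome": "Andalo (1047)", "lat": "654463", "long": "511504"}
)
print(clean, errors)
```

Returning the errors instead of raising lets a tool show every problem in a row at once, so a person can fix or skip it.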
20. 4. Semantic Enrichment (1/2)
Entity disambiguation: Transform text references into links to existing entities.
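A minimal sketch of turning text references into links, using the two spellings of the province from the schema matching example. The identifier `entity:trento-province`, the alias table and the function names are hypothetical; a real disambiguation service would also need context to handle homonyms.

```python
import re

# Hypothetical entity store: identifier -> known names and aliases.
KNOWN_ENTITIES = {
    "entity:trento-province": ["Provincia di Trento", "Trento Prov."],
}


def normalize(text):
    """Lowercase and collapse punctuation and spacing so variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


# Index from normalized alias to entity identifier.
ALIAS_INDEX = {
    normalize(alias): entity_id
    for entity_id, aliases in KNOWN_ENTITIES.items()
    for alias in aliases
}


def disambiguate(reference):
    """Return the identifier of the entity a text reference points to, or None."""
    return ALIAS_INDEX.get(normalize(reference))


print(disambiguate("Provincia di Trento"))
print(disambiguate("Trento Prov."))
```

Both spellings resolve to the same identifier, so after this step the two rows refer to one entity instead of two different strings.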
21. 4. Semantic Enrichment (2/2)
Natural Language Processing: Extract concepts and entity references from
free-text.
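As a rough illustration only, extraction from free text can be approximated with a dictionary (gazetteer) lookup over tokens. This is far simpler than a real Natural Language Processing pipeline, which would also handle multi-word expressions and word senses; the vocabulary, the identifiers and the function name `annotate` are made up for this sketch.

```python
import re

# Hypothetical vocabulary: surface form -> concept or entity identifier.
VOCABULARY = {
    "sella": "concept:saddle",
    "trento": "entity:trento",
}


def annotate(text):
    """Return (start, end, identifier) for every token found in the vocabulary."""
    annotations = []
    for match in re.finditer(r"\w+", text.lower()):
        target = VOCABULARY.get(match.group())
        if target:
            annotations.append((match.start(), match.end(), target))
    return annotations


print(annotate("Sorge su un'ampia sella prativa"))
```

Keeping the character offsets makes it possible to replace or highlight the recognized spans in the original description later.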
22. 5. Reconciliation
Run Identity Management Algorithms to identify each row as a new or existing
entity.
Result:
• No Match
• Match
• Multiple Matches
Action:
• Use ID
• New ID
• Ignore Row
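The reconciliation step can be sketched as a function that classifies a row and suggests an action. Exact equality on one attribute stands in for the Identity Management Algorithms, which the slides do not detail. Pairing "no match" with "new id" and "match" with "use id", and leaving multiple matches to a person, is an assumption of this sketch.

```python
def reconcile(row, store, key="Nome"):
    """Classify a row against existing entities.

    store maps entity id -> attribute dict. Returns a tuple
    (result, suggested_action, candidate_ids).
    """
    value = row.get(key)
    matches = [] if value is None else [
        entity_id for entity_id, attrs in store.items()
        if attrs.get(key) == value
    ]
    if not matches:
        return "no match", "new id", []
    if len(matches) == 1:
        return "match", "use id", matches
    # Ambiguous: no automatic suggestion, a person chooses the action.
    return "multiple matches", None, matches


# Hypothetical entity store with a deliberate duplicate.
existing = {
    "e1": {"Nome": "Andalo"},
    "e2": {"Nome": "Canazei"},
    "e3": {"Nome": "Canazei"},
}

print(reconcile({"Nome": "Andalo"}, existing))
print(reconcile({"Nome": "Canazei"}, existing))
```

Returning the candidate identifiers in the ambiguous case gives the person reviewing the row what they need to pick Use ID, New ID or Ignore Row.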
23. 6. Exporting
At this point:
We know what to export.
All values for target attributes conform to the expected format.
All text has been semantified (NLP).
All textual references to entities are converted to links.
Each row has an identifier.
[Figure labels: v0, i, i+1]
24. 7. Publishing
Put the semantified entities back into CKAN so that the entities can be Open Data and can be found in the same catalog as the original data.
Developers can find the data files of the cleaned, aggregated entities, but can also interact with the entities via the Entitypedia APIs.
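One possible way to put an exported file back into the catalog is CKAN's action API, which offers a `resource_create` action for attaching a resource to an existing dataset. The sketch below only builds the HTTP request and does not send it. The site URL, API key, dataset id and file URL are placeholders, and the function name is hypothetical; the tutorial does not say how its tool publishes to CKAN.

```python
import json
import urllib.request


def build_resource_create_request(ckan_url, api_key, dataset_id, file_url, name):
    """Build (but do not send) a POST request for CKAN's resource_create action."""
    payload = {
        "package_id": dataset_id,  # dataset that receives the new resource
        "url": file_url,           # where the exported entity file is hosted
        "name": name,
        "format": "CSV",
    }
    return urllib.request.Request(
        ckan_url.rstrip("/") + "/api/3/action/resource_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values, for illustration only.
request = build_resource_create_request(
    "https://ckan.example.org/",
    "my-api-key",
    "localita-turistiche",
    "https://files.example.org/localita-entities.csv",
    "Cleaned, aggregated entities",
)
print(request.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(request)` against a real CKAN instance with a valid key.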
8. Visualization
Search and Navigation
25. Semantic Layer: Services
Tool for aiding the “semantification” of the datasets in the catalog
based on:
• Schema matching services
• Identity Management services
• Entity Matching services
• Global Unique Identifier services
• Semantic search and indexing services
• Natural Language Processing
• Entity store