OGD: Part 5 – Semantic Issues
Juan Pane: jpane@pol.una.py
Lorenzino Vaccari: lorenzino.vaccari@gmail.com

Juan Pane, Lorenzino Vaccari

http://dati.trentino.it/

08/10/2013
Outline
• Overview
• Issues of opening data
• Entity centric Semantic layer
• Importing pipeline

• Importing tool

Linked Open Data
Levels: Available, Structured, Open formats, Dereferenceable, Linked

“The best data is open data” vs. “All data must be perfect”

Lack of explicit semantics
The real meaning of the data stayed in the developer's mind when the data was created.

http://goo.gl/npEHKr (Thanks to Moaz Reyad)

Lack of explicit semantics
Can lead to things like:

http://goo.gl/npEHKr (Thanks to Moaz Reyad)

Semantic heterogeneity
Difference in the meaning of local data

Issues when Opening Trentino Data
• Each department has authority over only part of the data.
• Datasets were originally created for internal use only.
• Datasets were created for a specific need.
• Datasets were created with custom formats:
  • for structure (with some exceptions)
  • for data
• Lack of reuse -> duplication.
• Lack of programmers.
• We cannot (always) tell them what to do or how to do it.
• Data changes.

[Figure: open data levels (Available, Structured, Open formats, Dereferenceable, Linked) shown alongside the Data Catalog and the Entity Centric Semantic Layer]

Entity centric: Added value
• Aggregated data
• Accurate data, manually curated
• Unique identifiers, distributed perspectives
• Re-think identifiers
• Semantified values

[Figure: two entity records, E1 and E2, with the attributes name, nationality, born in, lives in, date of birth, and affiliation. Values shown: Ignacio P. F., Juan Pane, Italian, Paraguay, Trento, 1980, Univ. Trento, PF-UNA]

Entities
• Real world: something that has a distinct, separate existence, although it need not be a material (physical) existence. It has a set of properties, which evolve over time.
• Mental: a personal (local) model created and maintained by a person that references and describes a real world entity.
• Digital: captures the semantics of real world entities, as provided by people.
Entity Centric Semantic Layer:
• Address the integration problems due to semantic heterogeneity:
  • Different formats
  • Different identifiers
  • Implicit semantics
  • Homonyms, synonyms, aliases
  • Partial knowledge
  • Knowledge evolution
http://www.webfoundation.org/2011/11/5-staropen-data-initiatives/

Entity-based Integration
• Focus on entities as first-class citizens
• Entities are objects important enough in our everyday life to be referred to by name
• Each entity has its own metadata (e.g. name, latitude, longitude, …)
• Each entity is related to many other entities (e.g. Einstein was born in Ulm, his affiliation was Charles University, Ulm is a city in Germany)
• There are relatively “few” commonsense entity types (person, …, event)
• There are many domain-specific entity types (bus stops, cycling paths, …)
• All components have explicit semantics: schema, entities, attributes, values
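
As a rough illustration of the points above, an entity with a type, its own attributes, and relations to other entities could be modeled as follows. This is only a sketch: the class name `Entity`, its fields, the relation names, and identifiers such as `e:einstein` are hypothetical and are not taken from the tutorial's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical entity record: a typed object with attributes and relations."""
    entity_id: str
    entity_type: str
    attributes: dict = field(default_factory=dict)
    # relation name -> list of target entity ids
    relations: dict = field(default_factory=dict)

    def relate(self, relation: str, target: "Entity") -> None:
        """Record a named relation from this entity to another entity."""
        self.relations.setdefault(relation, []).append(target.entity_id)


# Example taken from the slide: Einstein was born in Ulm, a city in Germany.
germany = Entity("e:germany", "country", {"name": "Germany"})
ulm = Entity("e:ulm", "city", {"name": "Ulm"})
einstein = Entity("e:einstein", "person", {"name": "Einstein"})

ulm.relate("located in", germany)
einstein.relate("born in", ulm)
```

Relations hold identifiers rather than copies of the target, so every reference to Ulm points to the same entity.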

Importing pipeline, Macro Steps
1. Domain analysis
   Study the needed entity types and adapt the knowledge base accordingly. First-time bootstrapping.
2. Import entities
   Semi-automatic tool.
   • Domain experts are expensive.
   • Human attention is a scarce resource.
   • Incremental enrichment and aggregation of entities.

Open Data Peculiarities
• All data comes from a CKAN repository (DCAT).
• Process one data file at a time.
• Each data file can be represented as a table.
• Each row in the table represents a (partial) entity.
• The format of the values might not be enforced in the data files.
• Not all data is relevant.
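
The "one file, one table, one row per partial entity" view can be sketched with Python's standard `csv` module. The sample content below is a hypothetical stand-in for a data file already downloaded from the catalog, and the function name is an assumption made for illustration.

```python
import csv
import io

# Hypothetical sample content standing in for one data file from the catalog.
SAMPLE = "nome,provincia,funivie\nAndalo,Provincia di Trento,3\nCanazei,,2\n"


def rows_as_partial_entities(text):
    """Treat each table row as a partial entity: keep only non-empty values."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {column: value.strip() for column, value in row.items() if value and value.strip()}
        for row in reader
    ]


entities = rows_as_partial_entities(SAMPLE)
```

Empty cells are dropped rather than stored, which keeps a row "partial" instead of filling it with placeholder values.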

[Figure: open data levels (Available, Structured, Open formats, Dereferenceable, Linked) shown alongside the Data Catalog and the Entity centric Importing tool]

Importing tool process

1. Source Selection
Import one data file at a time

2. Schema Matching
Select a target entity type, then define the correspondences between the input columns and the output attributes.

Input table: LocalitaTuristica

nome           | provincia           | descrizione                                   | lat    | long   | funivie
Andalo (1047)  | Provincia di Trento | Sorge su un'ampia sella prativa al centro...  | 654463 | 712857 | 3
Canazei (1450) | Trento Prov.        | Situato all'estremità settentrionale della... | 511504 | 147444 | 2

Target attributes:
• Nome
• Provincia
• Quota
• Coordinate
• Descrizione
• popolazione

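
A first pass at proposing correspondences could compare column names with attribute names and leave the rest to a person, in line with the semi-automatic approach. The sketch below uses name similarity from Python's `difflib` as a stand-in; it is not the matcher used by the importing tool, and the function name and the 0.8 cutoff are assumptions.

```python
from difflib import get_close_matches

INPUT_COLUMNS = ["nome", "provincia", "descrizione", "lat", "long", "funivie"]
TARGET_ATTRIBUTES = ["Nome", "Provincia", "Quota", "Coordinate", "Descrizione", "popolazione"]


def suggest_correspondences(columns, attributes, cutoff=0.8):
    """Suggest a target attribute per input column, or None if no name is close enough."""
    by_lowercase = {attribute.lower(): attribute for attribute in attributes}
    suggestions = {}
    for column in columns:
        close = get_close_matches(column.lower(), list(by_lowercase), n=1, cutoff=cutoff)
        suggestions[column] = by_lowercase[close[0]] if close else None
    return suggestions


correspondences = suggest_correspondences(INPUT_COLUMNS, TARGET_ATTRIBUTES)
```

Columns such as `lat` and `long` get no suggestion here: mapping them to `Coordinate` requires knowing what they mean, which is the kind of decision left to the user.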
3. Data Validation
Apply format and structure validation, plus any automatic transformations needed to bring the input data into the expected format.
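
A validation step of this kind might look like the sketch below. It assumes, without confirmation from the slides, that the number in parentheses after a place name (as in "Andalo (1047)") is the elevation and belongs in `Quota`, and that `funivie` must be an integer. The function name and the error messages are hypothetical.

```python
import re

# Assumption: "Name (number)" holds a place name followed by its elevation.
NAME_WITH_NUMBER = re.compile(r"^\s*(?P<name>.+?)\s*\((?P<number>\d+)\)\s*$")


def validate_row(row):
    """Return (clean_row, errors) for one input row of string values."""
    clean, errors = {}, []

    raw_name = row.get("nome", "")
    match = NAME_WITH_NUMBER.match(raw_name)
    if match:
        # Automatic transformation: split the combined value into two attributes.
        clean["Nome"] = match.group("name")
        clean["Quota"] = int(match.group("number"))
    elif raw_name.strip():
        clean["Nome"] = raw_name.strip()
    else:
        errors.append("nome: missing value")

    raw_count = row.get("funivie", "").strip()
    try:
        clean["funivie"] = int(raw_count)
    except ValueError:
        errors.append(f"funivie: expected an integer, got {raw_count!r}")

    return clean, errors


good = validate_row({"nome": "Andalo (1047)", "funivie": "3"})
bad = validate_row({"nome": "", "funivie": "many"})
```

Errors are collected instead of raised, so one bad value does not stop the rest of the file from being checked.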

4. Semantic Enrichment (1/2)
Entity disambiguation: Transform text references into links to existing entities.
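
To make the idea concrete, the sketch below resolves a text reference to an entity identifier through a table of known names. The table, the identifiers, and the rule "link only when exactly one candidate exists" are assumptions for illustration; the slides do not describe the actual algorithm.

```python
# Hypothetical index from normalized names to candidate entity ids.
KNOWN_NAMES = {
    "provincia di trento": ["e:trento-province"],
    "trento prov.": ["e:trento-province"],
    "trento": ["e:trento-city", "e:trento-province"],
}


def disambiguate(text):
    """Return the entity id for an unambiguous reference, else None (left to a person)."""
    normalized = " ".join(text.lower().split())
    candidates = KNOWN_NAMES.get(normalized, [])
    return candidates[0] if len(candidates) == 1 else None


link_a = disambiguate("Provincia di Trento")
link_b = disambiguate("Trento Prov.")
```

Both spellings from the sample table resolve to the same identifier, while the bare name "Trento" stays unresolved because it has two candidates.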

4. Semantic Enrichment (2/2)
Natural Language Processing: Extract concepts and entity references from free text.
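
The simplest form of entity-reference extraction is a dictionary (gazetteer) lookup over the words of a text, sketched below. A real pipeline would rely on a proper NLP toolkit; the gazetteer contents and identifiers here are hypothetical.

```python
import re

# Hypothetical gazetteer: lowercase surface form -> entity id.
GAZETTEER = {"trento": "e:trento-city", "paraguay": "e:paraguay"}


def find_entity_references(text):
    """Return (surface form, entity id) pairs for known names found in free text."""
    references = []
    for token in re.findall(r"\w+", text):
        entity_id = GAZETTEER.get(token.lower())
        if entity_id:
            references.append((token, entity_id))
    return references


found = find_entity_references("Born in Paraguay, lives in Trento.")
```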

5. Reconciliation
Run identity management algorithms to identify each row as a new or an existing entity.
Result:
• No Match
• Match
• Multiple Matches

Action:
• Use ID
• New ID
• Ignore Row
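
The three possible results can be sketched with an exact, case-insensitive comparison on a single attribute. This is a deliberately naive stand-in for the identity management algorithms, and the pairing of results with default actions is an assumption: the slide lists results and actions without saying which action follows which result.

```python
# Assumed defaults; None means the user chooses (use ID, new ID, or ignore row).
DEFAULT_ACTION = {"no match": "new ID", "match": "use ID", "multiple matches": None}


def reconcile(row, known_entities, key="Nome"):
    """Return (result, matching entity ids) for one row against the known entities."""
    value = row.get(key, "").strip().lower()
    if not value:
        return "no match", []
    matches = sorted(
        entity_id
        for entity_id, attributes in known_entities.items()
        if attributes.get(key, "").strip().lower() == value
    )
    if not matches:
        return "no match", matches
    if len(matches) == 1:
        return "match", matches
    return "multiple matches", matches


# Hypothetical entity store contents.
KNOWN = {
    "e:1": {"Nome": "Andalo"},
    "e:2": {"Nome": "Canazei"},
    "e:3": {"Nome": "canazei"},
}

result, ids = reconcile({"Nome": "Canazei"}, KNOWN)
suggested = DEFAULT_ACTION[result]
```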

6. Exporting
At this point:
• We know what to export.
• All values for target attributes conform to the expected format.
• All text has been semantified (NLP).
• All textual references to entities have been converted to links.
• Each row has an identifier.

7. Publishing
Put the semantified entities back into CKAN so that the entities can be Open Data and can be found in the same catalog as the original data.
• Developers can find the data files of the cleaned, aggregated entities.
• They can also interact with the entities via the Entitypedia APIs.
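
Registering a new dataset in CKAN goes through its action API. The sketch below only builds the HTTP request for the `package_create` action and does not send it. The dataset name, the title, and the API key are placeholders, and whether the Trentino catalog was updated this way is not stated in the slides.

```python
import json
from urllib.request import Request


def build_package_create_request(site_url, api_key, name, title):
    """Build (but do not send) a request for CKAN's package_create action."""
    payload = json.dumps({"name": name, "title": title}).encode("utf-8")
    return Request(
        url=site_url.rstrip("/") + "/api/3/action/package_create",
        data=payload,
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


request = build_package_create_request(
    "http://dati.trentino.it/",
    "PLACEHOLDER-API-KEY",  # placeholder, not a real key
    "localita-turistiche-entities",  # hypothetical dataset name
    "Tourist locations (semantified entities)",
)
```

Sending the request would be a call to `urllib.request.urlopen(request)` with a valid key and write access to the catalog.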

8. Visualization
Search and Navigation
Semantic Layer: Services
Tool for aiding the “semantification” of the datasets in the catalog
based on:
• Schema matching services
• Identity Management services
• Entity Matching services

• Global Unique Identifier services

• Semantic search and indexing services
• Natural Language Processing
• Entity store

Our Goal
[Figure labels: TN, UK, ES, BE]

http://www.shabra.com/wp-content/uploads/2011/03/lets-work-together.jpg
Gracias!

Grazie!
Merci!

Thanks!
Kiitos!

Dank u!
Gràcies!

Gratias!
Danke!

ευχαριστώ

We thank in particular CLEI 2013, Autonomous Province of Trento, TrentoRise association,
Universidad Nacional de Asuncion, and University of Trento


Open Government Data Tutorial at CLEI 2013. Part 5 Semantic Issues

  • 1. OGD: Part 5 – Semantic Issues Juan Pane: jpane@pol.una.py Lorenzino Vaccari: lorenzino.vaccari@gmail.com 1 Juan Pane, Lorenzino Vaccari http://dati.trentino.it/ 08/10/2013
  • 2. Outline • Overview • Issues of opening data • Entity centric Semantic layer • Importing pipeline • Importing tool 2 Juan Pane, Lorenzino Vaccari 08/10/2013
  • 3. Available Structured Linked Open Data Open formats Redefenceable Linked The best data is an open data Vs. All data must be perfect 3 Juan Pane, Lorenzino Vaccari 08/10/2013
  • 4. Lack of explicit semantics The real meaning of the data was kept in the developers mind when creating the data http://goo.gl/npEHKr (Thanks to Moaz Reyad) 4 Juan Pane, Lorenzino Vaccari 08/10/2013
  • 5. Lack of explicit semantics Can lead to things like: http://goo.gl/npEHKr (Thanks to Moaz Reyad) 5 Juan Pane, Lorenzino Vaccari 08/10/2013
  • 6. Semantic heterogeneity Difference in the meaning of local data 6 Juan Pane, Lorenzino Vaccari 08/10/2013
  • 7. Issues when Opening Trentino Data • Each department has authority over only part of the data. • Datasets were originally created for internal use only. • Datasets were created for a specific need. • Datasets were created with custom formats, both for structure (with some exceptions) and for data. • Lack of reuse leads to duplication. • Lack of programmers. • We cannot (always) TELL them what to do or how to do it. • Data changes.
  • 9. Entity centric: Added value • Aggregated data • Accurate data, manually curated • Unique identifiers, distributed perspectives • Re-think identifiers • Semantified values. Example: two entity records, E1 and E2, with attributes name (Juan Pane; Ignacio P. F.), nationality (Italian), born in (Paraguay), lives in (Trento), date of birth (1980), affiliation (Univ. Trento; PF-UNA)
  • 10. Entities • Real world: something that has a distinct, separate existence, although it need not be a material (physical) existence. It has a set of properties, which evolve over time. • Mental: a personal (local) model, created and maintained by a person, that references and describes a real world entity. • Digital: captures the semantics of real world entities, as provided by people.
  • 11. Entity Centric Semantic Layer: • Addresses the integration problems due to semantic heterogeneity: • Different formats • Different identifiers • Implicit semantics • Homonyms, synonyms, aliases • Partial knowledge • Knowledge evolution http://www.webfoundation.org/2011/11/5-staropen-data-initiatives/
  • 12. Entity-based Integration • Focus on entities as first class citizens • Entities are objects so important in our everyday life that we refer to them by name • Each entity has its own metadata (e.g. name, latitude, longitude, …) • Each entity is related to many other entities (e.g. Einstein was born in Ulm, his affiliation was Charles University, Ulm is a city in Germany) • There are relatively "few" commonsense entity types (person, …, event) • There are many domain specific entities (bus stops, cycling paths, ...) • All components have explicit semantics: schema, entities, attributes, values
  • 13. Importing pipeline, Macro Steps: 1. Domain analysis: study the needed entity types, adapt the knowledge base accordingly. First time bootstrapping. 2. Import entities: semi-automatic tool. • Domain experts are expensive. • Human attention is a scarce resource. • Incremental enrichment and aggregation of entities.
  • 14. Open Data Peculiarities • All data comes from a CKAN repository (DCAT). • Process one data file at a time. • Each data file can be represented as a table. • Each row in the table represents a (partial) entity. • The format of the values might not be enforced in the data files. • Not all data is relevant.
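The "one file, one table, one (partial) entity per row" view on slide 14 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample file content, the `funivie` blank cell, and the helper name `rows_as_partial_entities` are invented here and are not part of the authors' importing tool.

```python
import csv
import io

# Hypothetical content of one data file. The names loosely follow the
# schema matching slide; the empty cell is invented for illustration.
SAMPLE_CSV = (
    "nome,provincia,funivie\n"
    "Andalo (1047),Provincia di Trento,\n"
    "Canazei (1450),Trento Prov.,2\n"
)

def rows_as_partial_entities(text):
    """Read one tabular data file and return each row as a dict of raw values."""
    reader = csv.DictReader(io.StringIO(text))
    entities = []
    for row in reader:
        # Keep only non-empty string cells: a row may describe just part of an entity.
        entity = {
            column: value.strip()
            for column, value in row.items()
            if isinstance(value, str) and value.strip()
        }
        entities.append(entity)
    return entities

partial_entities = rows_as_partial_entities(SAMPLE_CSV)
```

Values stay as raw strings at this point, since the slide notes that their format might not be enforced in the data files.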
  • 15. Available • Data Catalog • Structured • Open formats • Dereferenceable • Linked. Entity centric Importing tool
  • 16. Importing tool process
  • 17. 1. Source Selection: Import one data file at a time
  • 18. 2. Schema Matching: Select a target type of entity -> correspondences between the input columns and the output attributes. Input table LocalitaTuristica, columns: nome, provincia, descrizione, lat, long, funivie. Sample rows: Andalo (1047), Provincia di Trento, "Sorge su un'ampia sella prativa al centro..."; Canazei (1450), Trento Prov., "Situato all'estremità settentrionale della...". Coordinate values shown: 654463, 712857, 511504, 147444; funivie values shown: 3, 2. Output attributes: • Nome • Provincia • Quota • Coordinate • Descrizione • popolazione
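As a rough illustration of how column-to-attribute correspondences might be suggested, the sketch below compares names with Python's standard `difflib.SequenceMatcher`. It is a name-similarity heuristic only; the slides do not say which matching technique the tool uses, and the function name `suggest_correspondences` and the 0.6 threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher

def suggest_correspondences(columns, attributes, threshold=0.6):
    """Suggest, for each input column, the most similar target attribute name."""
    suggestions = {}
    for column in columns:
        best_attr, best_score = None, 0.0
        for attr in attributes:
            # Case-insensitive string similarity in the range [0, 1].
            score = SequenceMatcher(None, column.lower(), attr.lower()).ratio()
            if score > best_score:
                best_attr, best_score = attr, score
        # Below the threshold (an assumed value) the column is left
        # unmatched so that a person can decide.
        suggestions[column] = best_attr if best_score >= threshold else None
    return suggestions

# Column and attribute names as listed on the slide.
columns = ["nome", "provincia", "descrizione", "lat", "long", "funivie"]
attributes = ["Nome", "Provincia", "Quota", "Coordinate", "Descrizione", "popolazione"]
suggestions = suggest_correspondences(columns, attributes)
```

With these inputs only the identically named columns clear the threshold; `lat`, `long` and `funivie` come back as `None`, which is where a semi-automatic tool would ask the user (for example, to map `lat` and `long` onto Coordinate).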
  • 19. 3. Data Validation: Applies format and structure validation, plus any automatic transformations needed to bring the input data into the expected format.
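A minimal sketch of one such validation and transformation, using the sample names from the schema matching slide. It assumes, without confirmation from the slides, that the number in parentheses in "Andalo (1047)" is the elevation and could feed the Quota attribute; the function name `validate_name` and the output keys are hypothetical.

```python
import re

# 'Name (1234)': a name followed by an integer in parentheses.
NAME_WITH_NUMBER = re.compile(r"^\s*(?P<name>.+?)\s*\((?P<number>\d+)\)\s*$")

def validate_name(raw):
    """Split 'Name (1234)' into a clean name and an integer; report problems."""
    if raw is None or not raw.strip():
        return {"nome": None, "quota": None}, ["empty name"]
    match = NAME_WITH_NUMBER.match(raw)
    if match:
        # Assumption: the parenthesised number is the elevation (Quota).
        values = {"nome": match.group("name"), "quota": int(match.group("number"))}
        return values, []
    # No embedded number: keep the trimmed name and leave quota unknown.
    return {"nome": raw.strip(), "quota": None}, []

values, errors = validate_name("Andalo (1047)")
```

Returning the error list alongside the values lets the tool show problems to the user instead of silently dropping a row.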
  • 20. 4. Semantic Enrichment (1/2) Entity disambiguation: Transform text references into links to existing entities.
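The idea of turning text references into links can be illustrated with a plain alias lookup. This is a deliberately simplified sketch: the identifiers, the `ALIASES` table and the function names are hypothetical, and a real disambiguation service would also have to use context to choose between candidates.

```python
# Hypothetical alias index: normalized surface form -> entity identifier.
ALIASES = {
    "provincia di trento": "entity:province-of-trento",
    "trento prov.": "entity:province-of-trento",
    "trento": "entity:city-of-trento",
}

def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def disambiguate(reference, aliases=ALIASES):
    """Return the entity identifier for a textual reference, or None if unknown."""
    return aliases.get(normalize(reference))

link_a = disambiguate("Provincia di Trento")
link_b = disambiguate("Trento  Prov.")
```

Both spellings from the sample table resolve to the same identifier, so the two rows end up pointing at one entity rather than carrying two different strings.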
  • 21. 4. Semantic Enrichment (2/2) Natural Language Processing: Extract concepts and entity references from free-text.
  • 22. 5. Reconciliation: Run Identity Management Algorithms to identify each row as a new or existing entity. Result: • No Match • Match • Multiple Matches. Action: • Use ID • New ID • Ignore Row
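The three results and three actions listed on the slide can be sketched as follows. The slide does not state which action goes with which result, so the default pairing below (match -> use ID, no match -> new ID, multiple matches -> ignore row) is an assumption, as is the exact-name comparison that stands in for the identity management algorithms.

```python
from enum import Enum

class MatchResult(Enum):
    NO_MATCH = "no match"
    MATCH = "match"
    MULTIPLE_MATCHES = "multiple matches"

def reconcile(row, known_entities, key="nome"):
    """Compare one row against known entities on a single key attribute."""
    value = row.get(key, "").strip().lower()
    if not value:
        return MatchResult.NO_MATCH, []
    candidates = [
        entity_id
        for entity_id, entity in known_entities.items()
        if entity.get(key, "").strip().lower() == value
    ]
    if not candidates:
        return MatchResult.NO_MATCH, []
    if len(candidates) == 1:
        return MatchResult.MATCH, candidates
    return MatchResult.MULTIPLE_MATCHES, candidates

def choose_action(result, candidates, next_id):
    """Default action per result; this pairing is an assumption."""
    if result is MatchResult.MATCH:
        return "use_id", candidates[0]
    if result is MatchResult.NO_MATCH:
        return "new_id", next_id
    return "ignore_row", None

# Hypothetical entity store with one deliberate duplicate.
known = {"e1": {"nome": "Andalo"}, "e2": {"nome": "Trento"}, "e3": {"nome": "trento"}}
result, candidates = reconcile({"nome": "Andalo"}, known)
action = choose_action(result, candidates, next_id="e4")
```

In a semi-automatic tool the multiple-matches case would more plausibly be shown to a person than ignored outright; the default here only keeps the sketch short.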
  • 23. 6. Exporting: At this point: • We know what to export. • All values for target attributes conform to the expected format. • All text has been semantified (NLP). • All textual references to entities are converted to links. • Each row has an identifier.
  • 24. 7. Publishing: Put the semantified entities back into CKAN so that the entities can be Open Data and can be found in the same catalog as the original data. • Developers can find the data files of the cleaned, aggregated entities • But they can also interact with the entities via the Entitypedia APIs. 8. Visualization: Search and Navigation
  • 25. Semantic Layer: Services. Tool for aiding the "semantification" of the datasets in the catalog, based on: • Schema matching services • Identity Management services • Entity Matching services • Global Unique Identifier services • Semantic search and indexing services • Natural Language Processing • Entity store
  • 26. Our Goal: TN, UK, ES, BE
  • 27. http://www.shabra.com/wp-content/uploads/2011/03/lets-work-together.jpg
  • 28. Gracias! Grazie! Merci! Thanks! Kiitos! Dank u! Gràcies! Gratias! Danke! ευχαριστώ. We thank in particular CLEI 2013, Autonomous Province of Trento, TrentoRise association, Universidad Nacional de Asuncion, and University of Trento.