Linked Statistical Data: does it actually pay off?

Keynote at
3rd International Workshop on Semantic Statistics
(SemStats 2015)
11/10/2015
Oscar Corcho
ocorcho@fi.upm.es
@ocorcho
https://www.slideshare.com/ocorcho

Disclaimers
• I am convinced of the potential benefits of joining Semantics and Statistics
• I will provide a very practical point of view
• From my own experience in working with semantics and statistics
• I may not have followed all recent advances
• I am here to learn as well
• I will be provocative
• Food for thought…

Structure of the talk
Part I. Some RDF Data Cube datasets that I have
created and (simple) applications on top of them
Part II. Lessons learned, reflections and a view towards
the future

Why did they call me?
• I did not participate actively in the W3C RDF Data Cube discussions…
• I have not submitted any papers to SemStats
• I have not participated in the yearly challenges
• Even though I always say “I have to do it…”
• So what? Let’s look at what I have done in this area…

A few places where I have worked on data cubes
From the lab…
…to the market
Map4RDF
Map4RDF-iOS

Visualisation tools from the lab…
• Map4RDF and Map4RDF-iOS
• http://oeg-upm.github.io/map4rdf/
• Visualisation tools originally created for Geographical Linked Data
• Faceted browsing
• Map-based visualisation
• Data inspection
• Data curation
• Extra features: bounding boxes, route planning, etc.
• And extended to RDF Data Cube-related data

Visualisation tools from the lab…
• https://youtu.be/us8wsG8HfKg

Geomarketing at Localidata
• https://youtu.be/DyLk3jInfkI

RDF Data Cube visualisations at Localidata
• https://youtu.be/aPqg_eoLVt4

Statistical data in Aragón
• Early work already available…
• http://opendata.aragon.es/
• Land use
• Recycling
• Lodging
• Work in progress now with a list of 1940 reports

BBVA challenge
• https://youtu.be/sqfSsGQ3De8

Data Cube in the UNE 178301:2015 standard
• UNE 178301:2015
• Standard on Open Data for Smart Cities
• Organised by
• AENOR CTN 178 group
• Government and Mobility
• Government
• Open Data (led by Localidata)
• Formed by
• Several cities
• Private companies
• Nation-wide organisations
W3C RDF Data Cube proposed as the vocabulary to use for publishing open data about population

The Good, The Bad and the Ugly
Note: Not sure about the license of this image

Linked Statistical Data: The Good
• URIs everywhere
• Easier treatment and linking
• Effective visualisations
• Especially map-based
• They allow breaking the data silos of a statistical office (even when micro-data, macro-data and indicators are published)
• Easier cross-dataset querying
• E.g., give me the recycling statistics for places with more than 5000 inhabitants that are governed by political party X.
• See my talk at the COLD workshop tomorrow to learn more
• Simplified way of accessing SDMX/PC-Axis/TSV data for outsiders
• Non-statisticians who know a bit of SPARQL and don’t dislike SKOS (XKOS)

The Good: URIs everywhere and ontologies
• What do the columns mean?
• unit PER
• geotime FI
• What are their units of measurement?
• All of this should be attached to a methodology page, but this is not always the case (e.g. Eurostat)

The Good: a single language for cross-dataset querying
• Get municipalities and the number of hectares dedicated to airports for those municipalities with an area smaller than 50 square kilometers
PREFIX aragodef: <http://opendata.aragon.es/def/Aragopedia#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT DISTINCT ?municipio ?ha
WHERE {
  ?x a qb:Observation .
  ?x qb:dataSet <http://opendata.aragon.es/recurso/DataSet/UsoSuelo> .
  ?x aragodef:hectareasAeropuertos ?ha .
  ?x aragodef:refArea ?municipio .
  ?municipio a dbpedia:Municipality .
  ?municipio aragodef:areaTotal ?area .
  FILTER(?area < 50 && ?ha != 0)
}
ORDER BY DESC(?ha)
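
To show how an outsider might send the query above from a script, here is a minimal sketch that wraps it in a SPARQL 1.1 Protocol GET request URL. The endpoint address is a placeholder (the slides do not say where the query was run), and the helper name build_sparql_get_url is invented for this example. Nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint: an assumption, not taken from the slides.
ENDPOINT = "http://example.org/sparql"

# The query from the slide, unchanged apart from layout.
QUERY = """\
PREFIX aragodef: <http://opendata.aragon.es/def/Aragopedia#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT DISTINCT ?municipio ?ha
WHERE {
  ?x a qb:Observation .
  ?x qb:dataSet <http://opendata.aragon.es/recurso/DataSet/UsoSuelo> .
  ?x aragodef:hectareasAeropuertos ?ha .
  ?x aragodef:refArea ?municipio .
  ?municipio a dbpedia:Municipality .
  ?municipio aragodef:areaTotal ?area .
  FILTER(?area < 50 && ?ha != 0)
}
ORDER BY DESC(?ha)
"""


def build_sparql_get_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_sparql_get_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could then be fetched with any HTTP client, typically asking for JSON results through an Accept header of application/sparql-results+json.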

Linked Statistical Data: The Bad
• RDF Data Cube datasets are too large
• Rather simple datasets easily reach 1 GB in Turtle
• Obviously, they can always be HDT-ed, compressed, etc.
• RDF Data Cube lacks a simple property to let us know how to aggregate values of a dimension
• Can the values of dimension X in this dataset be aggregated by AVG, SUM, or something else?
• What about performance in general-purpose triple stores?
• Are analytical queries to be done on RDF Data Cube data?
• Challenge and opportunity for improved data structures
• See also the work of Benedikt Kämpgen
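
The aggregation point can be made concrete with a small sketch of what a client has to maintain by hand today: a side table saying which aggregators make sense for which measure. The table, the measure names and the function aggregate are all invented for illustration. They are not part of RDF Data Cube, which is exactly the gap the slide points at.

```python
from statistics import mean

# Hypothetical side table kept outside the dataset, because the vocabulary
# offers no simple property for it. Measure names are invented.
ALLOWED_AGGREGATORS = {
    "numberOfInhabitants": {"SUM"},   # counts can be added across areas
    "recyclingRatePercent": {"AVG"},  # percentages should not be summed
}

AGGREGATORS = {"SUM": sum, "AVG": mean}


def aggregate(measure, values, how):
    """Aggregate values only if `how` is declared as applicable for `measure`."""
    allowed = ALLOWED_AGGREGATORS.get(measure, set())
    if how not in allowed:
        raise ValueError(f"{how} is not declared as applicable to {measure}")
    return AGGREGATORS[how](values)


total = aggregate("numberOfInhabitants", [10, 20, 30], "SUM")
print(total)
```

If the applicable aggregators were published with the dataset, a generic client could build this table automatically instead of guessing.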

Linked Statistical Data: The Ugly
• Generating (and validating) Data Structure Definitions is time-consuming and error-prone
• People in the audience, how do you do it?
• Manual/ad-hoc transformations (e.g., OpenRefine, Kettle) into RDF Data Cube may lead to errors when loading in CubeViz, OpenCube, etc.
• How can I run tests?
• We need simple services for developers to use
• Easy-to-understand REST API
• And Linked Data for observations
• Does it make sense?

Ugly things can always become pretty
Welcome to our Linked Statistical Data beauty center

Generating data structure definitions (I)

Generating data structure definitions (II)

Tests and validators for RDF Data Cube datasets
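
As an illustration of what such tests could look like, the sketch below checks three things in the spirit of the integrity constraints of the W3C RDF Data Cube recommendation: every observation belongs to a data set, every declared dimension has a value, and no two observations share the same dimension values. It is not the validator shown in the talk. Observations are plain dictionaries rather than RDF, and the sample values and the refPeriod dimension are invented.

```python
from collections import Counter


def validate_observations(observations, dimensions, measure):
    """Return a list of human-readable problems found in the observations.

    observations: list of dicts mapping property name to value
    dimensions:   property names assumed to be declared as dimensions
    measure:      property name of the measure
    """
    problems = []
    seen = Counter()
    for i, obs in enumerate(observations):
        if not obs.get("dataSet"):
            problems.append(f"observation {i}: not attached to a data set")
        missing = [d for d in dimensions if d not in obs]
        for d in missing:
            problems.append(f"observation {i}: missing dimension {d}")
        if measure not in obs:
            problems.append(f"observation {i}: missing measure {measure}")
        if not missing:
            # Only complete observations take part in the duplicate check.
            seen[(obs.get("dataSet"),) + tuple(obs[d] for d in dimensions)] += 1
    for key, count in seen.items():
        if count > 1:
            problems.append(f"{count} observations share dimension values {key[1:]}")
    return problems


# Invented sample data, for illustration only.
DIMENSIONS = ["refArea", "refPeriod"]
SAMPLE = [
    {"dataSet": "UsoSuelo", "refArea": "A", "refPeriod": "2012", "hectareasAeropuertos": 3.0},
    {"dataSet": "UsoSuelo", "refArea": "A", "refPeriod": "2012", "hectareasAeropuertos": 4.0},
    {"dataSet": "UsoSuelo", "refArea": "B", "hectareasAeropuertos": 1.0},
]

problems = validate_observations(SAMPLE, DIMENSIONS, "hectareasAeropuertos")
for problem in problems:
    print(problem)
```

Running checks of this kind before loading a dataset into a visualisation tool should surface transformation errors earlier than the tool itself would.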

Simple APIs to make use of RDF Data Cube (I)
• Get servers
• http://stats.linkeddata.es/services/getServers
• Get available datasets from a server or all servers
• http://stats.linkeddata.es/services/getStatistics?Server=http://sandbox.linkeddata.es/sparql
• http://stats.linkeddata.es/services/getStatistics?Server=ALL
• Get available datasets from a geo resource
• http://stats.linkeddata.es/services/getStatistics?Server=http://localidata.oeg-upm.net/sparql&URI=http://datos.localidata.com/recurso/territorio/Provincia/Madrid/Municipio/madrid/Distrito/09/Seccion/043

Simple APIs to make use of RDF Data Cube (II)
• Get dimensions from server, resource and dataset
• http://stats.linkeddata.es/services/getDimensions?Server=http://localidata.oeg-upm.net/sparql&Statistic=http://datos.localidata.com/recurso/CityStats/Provincia/Madrid/Poblacion/2012/12
• Get values for X axis
• http://stats.linkeddata.es/services/getStatisticsXValues?Server=http://localidata.oeg-upm.net/sparql&
cia/Madrid/Poblacion/2012/12&Dimension=http://datos.localidata.com/def/CityStats/dimension%23refPaisNacionalidad

Simple APIs to make use of RDF Data Cube (III)
• Get values for the X and Y axis, aggregation: SUM
• http://stats.linkeddata.es/services/getStatisticsValues?Server=http://localidata.oeg-upm.net/sparql&
cia/Madrid/Poblacion/2012/12&Dimension=http://datos.localidata.com/def/CityStats/dimension%23refPaisNacionalidad&URI=http://datos.localidata.com/recurso/territorio/Provincia/Madrid/Municipio/madrid/Distrito/09/Seccion/043&DimensionY=http://datos.localidata.com/def/CityStats/stats%23numeroHabitantes&aggr=SUM
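
A developer would probably wrap these calls in a small helper. The sketch below only builds request URLs from the operation and parameter names visible on the slides (getStatistics, getDimensions, Server, Statistic). The helper service_url is invented, no request is sent, and whether the service is still online is not something the slides can tell us. Note that urlencode percent-encodes the nested URLs, whereas the slides show them unencoded. The sketch assumes the service accepts both forms.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address as shown on the slides.
BASE = "http://stats.linkeddata.es/services"


def service_url(operation: str, **params: str) -> str:
    """Build a request URL for one of the operations listed on the slides."""
    return f"{BASE}/{operation}?{urlencode(params)}"


# Parameter values copied from the slide examples.
SERVER = "http://localidata.oeg-upm.net/sparql"
STATISTIC = "http://datos.localidata.com/recurso/CityStats/Provincia/Madrid/Poblacion/2012/12"

all_datasets_url = service_url("getStatistics", Server="ALL")
dimensions_url = service_url("getDimensions", Server=SERVER, Statistic=STATISTIC)

print(all_datasets_url)
print(dimensions_url)
```

Passing the parameters as keyword arguments keeps their spelling identical to the slides, including the capitalised Server and Statistic.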

Linked Data for RDF Data Cube
• ELDA profiles for datasets and observations

My wish list to make this guy even prettier
• A (SKOS/XKOS) codelist finder
• RAMON (http://ec.europa.eu/eurostat/ramon/) for SKOS
• Given the values for this dimension, tell me which codelists I may want to use
• Specifying applicable aggregators for dimensions
• SDMX/PC-Axis connectors to automate transformations
• I hate starting from CSVs
• JSON-stat converter in OpenCube?
• Optimised operators and data structures to deal with queries
• A paper at the COLD workshop talks about this: Optimizing RDF Data Cubes for Efficient Processing of Analytical Queries
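
The codelist finder wish could start from something as simple as the sketch below, which ranks candidate code lists by the share of a dimension's values they cover. The function rank_codelists and both code lists are invented for illustration. A real finder would presumably work over SKOS/XKOS concept schemes and match labels as well as codes.

```python
def rank_codelists(dimension_values, codelists):
    """Rank code lists by the share of the dimension's values they cover."""
    values = set(dimension_values)
    scored = []
    for name, codes in codelists.items():
        coverage = len(values & set(codes)) / len(values) if values else 0.0
        scored.append((name, coverage))
    # Best-covering code list first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented code lists, for illustration only.
CODELISTS = {
    "country-codes": {"ES", "FI", "FR", "PT"},
    "sex-codes": {"M", "F", "T"},
}

ranking = rank_codelists(["ES", "FI", "XX"], CODELISTS)
for name, coverage in ranking:
    print(name, round(coverage, 2))
```

A coverage below 1.0, as for the unknown value XX here, is also a useful hint that the source data contains codes that need cleaning.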

does it actually pay off?
or… The Good, The Bad and The not-so Ugly
