Invited keynote at the ISWC2015 Workshop on Semantics and Statistics (SemStats 2015). http://semstats.github.io/2015/
The release of the W3C RDF Data Cube recommendation was a significant milestone towards improving the maturity of the area of Linked Statistical Data. Many Data Cube-based datasets have been released since then, and tools for generating and exploiting such datasets have also appeared. While the benefits of using RDF Data Cube and generating Linked Data in this area seem clear, there are still many challenges associated with the generation and exploitation of such data. In this talk we will reflect on them, based on our experience in generating and exploiting this type of data, and hopefully provoke some discussion about what the next steps should be.
Linked Statistical Data: does it actually pay off?
1. Linked Statistical Data:
does it actually pay off?
Keynote at
3rd International Workshop on Semantic Statistics
(SemStats 2015)
11/10/2015
Oscar Corcho
ocorcho@fi.upm.es
@ocorcho
https://www.slideshare.com/ocorcho
2. Disclaimers
• I am convinced of the
potential benefits of
joining Semantics and
Statistics
• I will provide a very
practical point of view
• From my own experience
in working with semantics
and statistics
• I may not have followed
all recent advances
• I am here to learn as well
• I will be provocative
• Food for thought…
3. Structure of the talk
Part I. Some RDF Data Cube datasets that I have
created and (simple) applications on top of them
Part II. Lessons learned, reflections and a view towards
the future
5. Why did they call me?
• I did not participate actively in the W3C RDF Data
Cube discussions…
• I have not submitted any papers to SemStats
• I have not participated in the yearly challenges
• Even if I always say “I have to do it…”.
• So what? Let’s look into what I have done in this
area…
6. A few places where I have worked on data cubes
From the lab…
…to the market
Map4RDF
Map4RDF-iOS
7. Visualisation tools from the lab…
• Map4RDF and Map4RDF-iOS
• http://oeg-upm.github.io/map4rdf/
• Visualisation tools originally created for Geographical
Linked Data
• Faceted browsing
• Map-based visualisation
• Data inspection
• Data curation
• Extra features: bounding boxes, route planning, etc.
• And extended to RDF Data Cube-related data
10. RDF Data Cube visualisations at Localidata
• https://youtu.be/aPqg_eoLVt4
11. Statistical data in Aragón
• Early work already available…
• http://opendata.aragon.es/
• Land use
• Recycling
• Lodging
• Work in progress now
with a list of 1940 reports
13. Data Cube in the UNE 178301:2015 standard
• UNE 178301:2015
• Standard on Open Data for
Smart Cities
• Organised by
• AENOR CTN 178 group
• Government and Mobility
• Government
• Open Data
(led by Localidata)
• Formed by
• Several cities
• Private companies
• Nation-wide
organisations
W3C RDF Data Cube proposed as the vocabulary to use for publishing open data about population
14. Structure of the talk
Part I. Some RDF Data Cube datasets that I have
created and (simple) applications on top of them
Part II. Lessons learned, reflections and a view towards
the future
15. Linked Statistical Data:
The Good, The Bad and the Ugly
Note: Not sure about the license of this image
16. Linked Statistical Data: The Good
• URIs everywhere
• Easier treatment and linking
• Effective visualisations
• Especially map-based
• They allow breaking the data silos of a statistical office (even
when micro-data, macro-data and indicators are published)
• Easier cross-dataset querying
• E.g., give me the statistics about recycling of places with
more than 5000 inhabitants and ruled by political party X.
• See my talk at the COLD workshop tomorrow to learn more
• Simpler access to SDMX/PC-Axis/TSV
data for outsiders
• Non-statisticians who know a bit of SPARQL and don’t
dislike SKOS (XKOS)
17. The Good: URIs everywhere and ontologies
• What do the columns
mean?
• unit PER
• geotime FI
• Which are their units of
measurement?
• All these should be
attached to a
methodology page, but
this is not always the
case (e.g. Eurostat)
18. The Good: a single language for cross-dataset querying
• Get municipalities and the number of hectares dedicated to airports for
those municipalities with an area smaller than 50 square kilometers
PREFIX aragodef: <http://opendata.aragon.es/def/Aragopedia#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT DISTINCT ?municipio ?ha
WHERE {
?x a qb:Observation .
?x qb:dataSet <http://opendata.aragon.es/recurso/DataSet/UsoSuelo> .
?x aragodef:hectareasAeropuertos ?ha .
?x aragodef:refArea ?municipio .
?municipio a dbpedia:Municipality .
?municipio aragodef:areaTotal ?area .
FILTER(?area<50 && ?ha != 0)
} ORDER BY DESC(?ha)
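As a rough sketch of how a developer might send the query above to a SPARQL endpoint, the snippet below only builds the HTTP request, following the standard SPARQL protocol convention of a GET request with a `query` parameter. The endpoint address is an assumption (the slides do not give it), and `build_sparql_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# ASSUMPTION: the slides do not state the endpoint address; this one is a placeholder.
ENDPOINT = "http://opendata.aragon.es/sparql"

# The query from the slide, unchanged.
QUERY = """\
PREFIX aragodef: <http://opendata.aragon.es/def/Aragopedia#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT DISTINCT ?municipio ?ha
WHERE {
  ?x a qb:Observation .
  ?x qb:dataSet <http://opendata.aragon.es/recurso/DataSet/UsoSuelo> .
  ?x aragodef:hectareasAeropuertos ?ha .
  ?x aragodef:refArea ?municipio .
  ?municipio a dbpedia:Municipality .
  ?municipio aragodef:areaTotal ?area .
  FILTER(?area<50 && ?ha != 0)
} ORDER BY DESC(?ha)
"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` and parsing the JSON body; that step is left out here because it depends on the endpoint actually being reachable.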
19. Linked Statistical Data: The Bad
• RDF Data Cube datasets are too large in size
• Rather simple datasets easily go up to 1 GB in Turtle
• Obviously, they can always be HDT-ed, compressed, etc.
• RDF Data Cube lacks a
simple property that tells us
how values may be aggregated
along a dimension
• Can the values along dimension X in
this dataset be aggregated by
AVG, SUM, or something else?
• What about performance in general-purpose triple stores?
• Are analytical queries to be done on RDF Data Cube data?
• Challenge and opportunity for improved data structures
• See also the work of Benedikt Kämpgen
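To illustrate the aggregation point on this slide, here is a minimal sketch of what a client has to do today: keep its own side table saying which aggregator is valid for each measure. The `AGGREGATORS` table and `roll_up` function are hypothetical; RDF Data Cube defines no such annotation, which is exactly the gap described above. `numeroHabitantes` appears later in the talk, while `recyclingRate` is an invented measure name.

```python
from statistics import mean

# HYPOTHETICAL side table: the vocabulary itself offers no property for this.
AGGREGATORS = {
    "numeroHabitantes": sum,  # head counts add up across areas
    "recyclingRate": mean,    # invented measure; rates must not be summed
}


def roll_up(measure, values):
    """Aggregate observation values using the aggregator declared for the measure."""
    try:
        aggregate = AGGREGATORS[measure]
    except KeyError:
        raise ValueError(f"no aggregator declared for {measure!r}") from None
    return aggregate(values)


population = roll_up("numeroHabitantes", [1200, 800, 500])
rate = roll_up("recyclingRate", [0.25, 0.75])
```

Even this toy version is a simplification: a plain mean of rates ignores the size of each area, so a real annotation would probably need to express weighted aggregations too.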
20. Linked Statistical Data: The Ugly
• Generating (and validating) Data Structure
Definitions is time-consuming and error-prone
• People in the audience, how do you do it?
• Manual/ad-hoc transformations (e.g., OpenRefine,
Kettle) into RDF Data Cube may lead to errors when
loading in CubeViz, OpenCube, etc.
• How can I run tests?
• We need simple services for
developers to use
• Easy-to-understand REST API
• And Linked Data for
observations
• Does it make sense?
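One possible answer to "How can I run tests?" is a small check run before loading data into a viewer. The sketch below, in the spirit of the integrity constraints in the RDF Data Cube specification, flags observations that lack a value for a dimension declared in the Data Structure Definition. It works on plain Python dictionaries rather than RDF; the function name and the toy data are assumptions made for illustration only.

```python
def missing_dimensions(dsd_dimensions, observations):
    """Map each incomplete observation id to the sorted list of dimensions it lacks."""
    problems = {}
    for obs_id, obs in observations.items():
        missing = sorted(d for d in dsd_dimensions if d not in obs)
        if missing:
            problems[obs_id] = missing
    return problems


# Toy data, invented for this example: two dimensions declared in the DSD.
dsd = {"refArea", "refPeriod"}
observations = {
    "obs1": {"refArea": "Zaragoza", "refPeriod": "2012", "value": 10},
    "obs2": {"refArea": "Huesca", "value": 7},  # refPeriod is missing
}

report = missing_dimensions(dsd, observations)
```

A check like this can live in an ordinary unit test suite, so a manual or ad-hoc transformation fails early instead of when the cube is opened in a visualisation tool.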
21. Ugly things can always become pretty
Welcome to our Linked Statistical Data beauty center
25. Simple APIs to make use of RDF Data Cube (I)
• Get servers
• http://stats.linkeddata.es/services/getServers
• Get available datasets from a server or all servers
• http://stats.linkeddata.es/services/getStatistics?Server=http://sandbox.linkeddata.es/sparql
• http://stats.linkeddata.es/services/getStatistics?Server=ALL
• Get available datasets from a geo resource
• http://stats.linkeddata.es/services/getStatistics?Server=http://localidata.oeg-upm.net/sparql&URI=http://datos.localidata.com/recurso/territorio/Provincia/Madrid/Municipio/madrid/Distrito/09/Seccion/043
26. Simple APIs to make use of RDF Data Cube (II)
• Get dimensions from server, resource and dataset
• http://stats.linkeddata.es/services/getDimensions?Server=http://localidata.oeg-upm.net/sparql&Statistic=http://datos.localidata.com/recurso/CityStats/Provincia/Madrid/Poblacion/2012/12
• Get values for X axis
• http://stats.linkeddata.es/services/getStatisticsXValues?Server=http://localidata.oeg-upm.net/sparql&Statistic=http://datos.localidata.com/recurso/CityStats/Provincia/Madrid/Poblacion/2012/12&Dimension=http://datos.localidata.com/def/CityStats/dimension%23refPaisNacionalidad
27. Simple APIs to make use of RDF Data Cube (III)
• Get values for the X and Y axis, aggregation: SUM
• http://stats.linkeddata.es/services/getStatisticsValues?Server=http://localidata.oeg-upm.net/sparql&Statistic=http://datos.localidata.com/recurso/CityStats/Provincia/Madrid/Poblacion/2012/12&Dimension=http://datos.localidata.com/def/CityStats/dimension%23refPaisNacionalidad&URI=http://datos.localidata.com/recurso/territorio/Provincia/Madrid/Municipio/madrid/Distrito/09/Seccion/043&DimensionY=http://datos.localidata.com/def/CityStats/stats%23numeroHabitantes&aggr=SUM
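The request URLs on these three slides follow one pattern: an operation name plus a few query parameters (`Server`, `Statistic`, `Dimension`, `URI`, `DimensionY`, `aggr`). A minimal sketch of composing them from Python follows. The base address and parameter names are taken from the slides; the helper `service_url` is hypothetical, and the choice to escape only `#` (as `%23`) while leaving `:` and `/` readable simply mirrors how the slides write the URLs, not a documented requirement of the service.

```python
from urllib.parse import quote, urlencode

BASE = "http://stats.linkeddata.es/services"  # base address as shown on the slides


def service_url(operation, **params):
    """Compose a request URL for one of the operations listed on the slides."""
    # Keep ':' and '/' unescaped, as on the slides; '#' becomes %23.
    query = urlencode(params, safe=":/", quote_via=quote)
    return f"{BASE}/{operation}?{query}" if query else f"{BASE}/{operation}"


servers_url = service_url("getServers")

x_values_url = service_url(
    "getStatisticsXValues",
    Server="http://localidata.oeg-upm.net/sparql",
    Statistic="http://datos.localidata.com/recurso/CityStats/Provincia/Madrid/Poblacion/2012/12",
    Dimension="http://datos.localidata.com/def/CityStats/dimension#refPaisNacionalidad",
)
```

The response format of the services is not described in the talk, so fetching and parsing the results is deliberately left out.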
28. Linked Data for RDF Data Cube
• ELDA profiles for datasets and observations
29. My wish list to make this guy even prettier
• A (SKOS/XKOS) codelist finder
• RAMON (http://ec.europa.eu/eurostat/ramon/) for SKOS
• Given the values for this dimension, tell me which codelists I
may want to make use of
• Specifying applicable aggregators for dimensions
• SDMX/PC-Axis connectors to automate
transformations
• I hate starting from CSVs
• JSON-stat converter in
OpenCube?
• Optimised operators and data
structures to deal with queries
• A paper at the COLD workshop talking about this: Optimizing
RDF Data Cubes for Efficient Processing of Analytical
Queries
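The first wish, a codelist finder, could start from something as simple as ranking candidate codelists by how many of a dimension's observed values they cover. The sketch below is only that: the function name, the scoring rule and the two toy codelists are all invented for illustration, and a real finder would work over SKOS/XKOS concept schemes rather than Python sets.

```python
def rank_codelists(values, codelists):
    """Rank codelists by the share of observed values each one covers."""
    observed = set(values)
    scored = []
    for name, codes in codelists.items():
        coverage = len(observed & set(codes)) / len(observed) if observed else 0.0
        scored.append((name, coverage))
    # Highest coverage first; ties are broken alphabetically by codelist name.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy codelists, invented for this example.
toy_codelists = {
    "countries-alpha2": {"ES", "FR", "PT", "IT"},
    "languages": {"es", "fr", "pt"},
}

ranking = rank_codelists(["ES", "PT", "FR"], toy_codelists)
```

Exact string matching is the weakest part of this sketch; matching on labels or known mappings between codelists would be the obvious next step.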
30. Linked Statistical Data:
does it actually pay off?
or… The Good, The Bad and The
not-so-Ugly