Information management systems today deal with data originating from different sources, including relational databases, NoSQL data stores, and Web data formats, which vary not only in data format but also in the underlying data model. Integrating data from heterogeneous sources is a time-consuming and error-prone engineering task; part of this process requires transforming the data from its original form into other forms, a step that is repeated throughout the life cycle. With this report we provide a principled overview of the fundamental data shapes (tabular, tree, and graph) and of the transformations between them, in order to gain a better understanding of how to perform these transformations more efficiently and effectively.
1. Data Shapes and Data Transformations
Michael Hausenblas (1), Boris Villazón-Terrazas (2), and Richard Cyganiak (1)
(1) DERI, NUI Galway, Ireland, firstname.lastname@deri.org
(2) iSOCO, Madrid, Spain, bvillazon@isoco.com
Paper available at: http://arxiv.org/abs/1211.1565
4. Motivation
Current data systems combine data from a tremendous number of resources [1].
[Figure: extract, transform, load]
1. Pat Helland. If You Have Too Much Data, then 'Good Enough' Is Good Enough. Queue, 9:40:40-40:50, May 2011.
http://queue.acm.org/detail.cfm?id=1988603
5. Motivation
We use the term data shape to refer to how data is arranged and structured.
[Figure: a resource and its data shape]
7. Tabular
A tabular data shape organizes data items into a
table.
Location | Environmental Services
Carlow County Council | 40
Cavan County Council | 36
Clare County Council | 38
Cork City Council | 51
Cork County Council | 47
Donegal County Council | 45
Dublin City Council | 43
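As a minimal sketch, the table above could be held in memory as a header plus a list of row tuples. The helper names column and select_where are hypothetical and only illustrate the kind of access a tabular shape supports.

```python
# Hypothetical in-memory model of the table above: a header plus row tuples.
header = ("Location", "Environmental Services")
rows = [
    ("Carlow County Council", 40),
    ("Cavan County Council", 36),
    ("Clare County Council", 38),
    ("Cork City Council", 51),
    ("Cork County Council", 47),
    ("Donegal County Council", 45),
    ("Dublin City Council", 43),
]


def column(name):
    """Return every value of the named column, in row order."""
    idx = header.index(name)
    return [row[idx] for row in rows]


def select_where(name, predicate):
    """Return the rows whose value in the named column satisfies predicate."""
    idx = header.index(name)
    return [row for row in rows if predicate(row[idx])]


print(column("Environmental Services"))
print(select_where("Environmental Services", lambda v: v > 45))
```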
8. Tree
A tree data shape organizes data items into a
hierarchy. A data item is designated to be the root of
the tree while the remaining data items are
partitioned into non-empty sets each of which is a
subtree of the root.
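The recursive definition above can be sketched as a node type whose children are themselves roots of subtrees. The class name Node, the labels, and the helpers size and height are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    # Each child is the root of its own subtree (hypothetical representation).
    children: list = field(default_factory=list)


def size(node):
    """Count the data items in the tree rooted at node."""
    return 1 + sum(size(child) for child in node.children)


def height(node):
    """Number of levels in the tree rooted at node (a single leaf has height 1)."""
    return 1 + max((height(child) for child in node.children), default=0)


# Illustrative hierarchy: a root with two subtrees of two leaves each.
root = Node("services", [
    Node("council", [Node("Carlow County Council"), Node("40")]),
    Node("council", [Node("Cavan County Council"), Node("36")]),
])

print(size(root), height(root))
```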
9. Graph
A graph data shape consists of a set of vertexes,
and a set of edges. An edge is a pair of vertexes.
The two vertexes are called edge endpoints.
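A minimal sketch of this definition, assuming edges are stored as ordered pairs and vertex labels are arbitrary strings; the labels and the helper names endpoints and neighbours are hypothetical.

```python
# A set of vertexes and a set of edges; every edge is a pair of vertexes.
vertices = {"a", "b", "c", "d"}
edges = {("a", "b"), ("b", "c"), ("a", "c")}


def endpoints(edge):
    """Return the two endpoints of an edge."""
    u, v = edge
    return u, v


def neighbours(vertex):
    """Return all vertexes sharing an edge with vertex, ignoring direction."""
    result = set()
    for u, v in edges:
        if u == vertex:
            result.add(v)
        elif v == vertex:
            result.add(u)
    return result


# Sanity check: every endpoint must be a known vertex.
well_formed = all(u in vertices and v in vertices for u, v in edges)
print(well_formed, neighbours("a"), neighbours("d"))
```

Unlike in a tree, a vertex such as "d" may be unconnected, and edges may form cycles.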
12. Features
Lossless transformation: all queries that are
possible on the original shape are also possible
on the resultant shape. A transformation for which
this does not hold is lossy.
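To make the query criterion above concrete, one sufficient (not necessary) test is whether the original can be reconstructed from the result: if it can, every query on the original remains answerable. The sketch below rests on that assumption; the function names and the two-row sample are hypothetical. Renaming columns loses no information, while projecting a column away does.

```python
original = [
    {"Location": "Carlow County Council", "EServices": 40},
    {"Location": "Cavan County Council", "EServices": 36},
]


def rename(rows):
    """Rename both columns; every value survives under a new name."""
    return [{"Region": r["Location"], "EnvServices": r["EServices"]} for r in rows]


def rename_back(rows):
    """Inverse of rename, showing that the original can be reconstructed."""
    return [{"Location": r["Region"], "EServices": r["EnvServices"]} for r in rows]


def project(rows):
    """Keep only the region column; the numeric values are dropped."""
    return [{"Region": r["Location"]} for r in rows]


def max_services(rows, column):
    """Example query: the largest value in the given column."""
    return max(r[column] for r in rows)


renamed = rename(original)
projected = project(original)

print(max_services(original, "EServices"), max_services(renamed, "EnvServices"))
try:
    max_services(projected, "EnvServices")
except KeyError:
    print("query no longer possible after projection")
```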
13. Tabular - Tabular
• RDB – RDB
• SQL Select
SELECT Location as Region, EServices as EnvServices
FROM services
Before (table services):
Location | EServices
Carlow County Council | 40
Cavan County Council | 36
Clare County Council | 38
Cork City Council | 51
Cork County Council | 47
Donegal County Council | 45
Dublin City Council | 43
After the data shape transformation:
Region | EnvServices
Carlow County Council | 40
Cavan County Council | 36
Clare County Council | 38
Cork City Council | 51
Cork County Council | 47
Donegal County Council | 45
Dublin City Council | 43
• Declarative
• No Information loss
• No provenance
• Standard language, SQL
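The SELECT statement above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the in-memory database and the column types are choices made here; the table name, column names and data come from the slide.

```python
import sqlite3

# In-memory database holding the source table (column types are an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (Location TEXT, EServices INTEGER)")
conn.executemany(
    "INSERT INTO services VALUES (?, ?)",
    [
        ("Carlow County Council", 40),
        ("Cavan County Council", 36),
        ("Clare County Council", 38),
        ("Cork City Council", 51),
        ("Cork County Council", 47),
        ("Donegal County Council", 45),
        ("Dublin City Council", 43),
    ],
)

# The tabular-to-tabular transformation from the slide: only the column names change.
cur = conn.execute(
    "SELECT Location AS Region, EServices AS EnvServices FROM services"
)
columns = [description[0] for description in cur.description]
result = cur.fetchall()

print(columns)
for row in result:
    print(row)
```

The result carries no trace of the source table or original column names, which is what the "No provenance" bullet refers to.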
14. Tabular - Tree
• RDB – XML
• XML representation of a relational database
Location | EnvironmentalServices
Carlow County Council | 40
Cavan County Council | 36
Clare County Council | 38
Cork City Council | 51
Cork County Council | 47
Donegal County Council | 45
Dublin City Council | 43
[Figure: data shape transformation of the table into an XML document]
• Operational
• No Information loss
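A minimal, operational sketch of such a table-to-XML conversion, using Python's xml.etree.ElementTree. The element names services and row and the function name table_to_xml are assumptions; the slide does not show the target XML layout.

```python
import xml.etree.ElementTree as ET

header = ("Location", "EnvironmentalServices")
rows = [
    ("Carlow County Council", 40),
    ("Cavan County Council", 36),
    ("Clare County Council", 38),
]


def table_to_xml(table_name, header, rows):
    """Build one <row> element per table row and one child element per column."""
    root = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for name, value in zip(header, row):
            ET.SubElement(row_el, name).text = str(value)
    return root


root = table_to_xml("services", header, rows)
print(ET.tostring(root, encoding="unicode"))
```

Every cell value ends up in exactly one leaf element, so the table can be rebuilt from the tree, although the integer type becomes text unless it is recorded separately.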
15. Tabular - Graph
• RDB – RDF
• W3C RDB2RDF WG – R2RML [1]
ID | Name
10 | Venus
20 | Felipe
[Figure: data shape transformation via an R2RML mapping]
• Declarative
• No Information loss
• W3C Recommendation
1. http://www.w3.org/TR/r2rml/
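The idea of turning rows into subject-predicate-object triples can be sketched in plain Python. This is not R2RML and not an R2RML processor: the base IRI http://example.org/, the table name person, and the IRI patterns are hypothetical stand-ins for what a mapping document would declare.

```python
BASE = "http://example.org/"  # hypothetical base IRI

header = ("ID", "Name")
rows = [(10, "Venus"), (20, "Felipe")]


def rows_to_triples(table, key, header, rows):
    """Emit one triple per cell; the key column determines the subject IRI."""
    triples = set()
    for row in rows:
        record = dict(zip(header, row))
        subject = f"{BASE}{table}/{record[key]}"
        for column_name, value in record.items():
            predicate = f"{BASE}{table}#{column_name}"
            triples.add((subject, predicate, value))
    return triples


triples = rows_to_triples("person", "ID", header, rows)
for triple in sorted(triples, key=str):
    print(triple)
```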
16. Tree - Tabular
• XML - RDB
• A technique and tool that rely on the XSD of the XML document [1]
[Figure: data shape transformation of an XML document into the table below]
Location | EnvironmentalServices
Carlow County Council | 40
Cavan County Council | 36
Clare County Council | 38
Cork City Council | 51
Cork County Council | 47
Donegal County Council | 45
Dublin City Council | 43
• Operational
• No Information loss
1. Amy Flik. Transforming XML into a Relational Database Using XML Schema Document Type, 2009.
http://scholarworks.gvsu.edu/cistechlib/48/
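A rough sketch of the XML-to-table direction. The cited technique derives the table structure from the XSD; here a hand-written COLUMNS list stands in for the schema, and the XML layout (services, row) is an assumption.

```python
import xml.etree.ElementTree as ET

XML = """<services>
  <row><Location>Carlow County Council</Location><EnvironmentalServices>40</EnvironmentalServices></row>
  <row><Location>Cavan County Council</Location><EnvironmentalServices>36</EnvironmentalServices></row>
</services>"""

# Column names and types; a stand-in for what an XSD would declare.
COLUMNS = [("Location", str), ("EnvironmentalServices", int)]


def xml_to_rows(xml_text):
    """Turn each <row> element into a tuple, casting values per column type."""
    root = ET.fromstring(xml_text)
    return [
        tuple(cast(row.findtext(name)) for name, cast in COLUMNS)
        for row in root.findall("row")
    ]


table = xml_to_rows(XML)
print([name for name, _ in COLUMNS])
for row in table:
    print(row)
```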
17. Tree - Tree
• XML - XML
• XSLT [1]
• Declarative
• No Information loss
• W3C Recommendation
1. http://www.w3.org/TR/xslt
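XSLT itself is not part of the Python standard library, so the sketch below is not XSLT. It only illustrates a tree-to-tree transformation procedurally, by rebuilding an XML tree under a renaming table; the element names and the RENAMES mapping are hypothetical.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    "<services>"
    "<row><Location>Carlow County Council</Location>"
    "<EnvironmentalServices>40</EnvironmentalServices></row>"
    "</services>"
)

# Hypothetical renaming rules: source element name -> target element name.
RENAMES = {
    "services": "regions",
    "row": "region",
    "Location": "name",
    "EnvironmentalServices": "envServices",
}


def rename_tree(elem):
    """Recursively copy elem, renaming each element according to RENAMES."""
    new = ET.Element(RENAMES.get(elem.tag, elem.tag), dict(elem.attrib))
    new.text = elem.text
    new.tail = elem.tail
    for child in elem:
        new.append(rename_tree(child))
    return new


target = rename_tree(ET.fromstring(SOURCE))
print(ET.tostring(target, encoding="unicode"))
```

An XSLT stylesheet would state the same renaming as declarative template rules instead of a recursive function.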
18. Tree - Graph
• XML - RDF
• Gleaning Resource Descriptions from Dialects of Languages (GRDDL) [1]
• Declarative
• No Information loss
• W3C Recommendation
1. http://www.w3.org/TR/grddl/
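The following is not a GRDDL implementation; it is a hand-written sketch of the underlying idea, extracting triples from an XML tree. The XML layout, the base IRI http://example.org/, and the position-based subject IRIs are assumptions.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/"  # hypothetical base IRI

XML = """<services>
  <row><Location>Carlow County Council</Location><EnvironmentalServices>40</EnvironmentalServices></row>
  <row><Location>Cavan County Council</Location><EnvironmentalServices>36</EnvironmentalServices></row>
</services>"""


def xml_to_triples(xml_text):
    """Emit one triple per leaf element; each <row> becomes one subject."""
    root = ET.fromstring(xml_text)
    triples = set()
    for position, row in enumerate(root.findall("row"), start=1):
        subject = f"{BASE}{root.tag}/{position}"
        for child in row:
            triples.add((subject, BASE + child.tag, child.text))
    return triples


triples = xml_to_triples(XML)
for triple in sorted(triples):
    print(triple)
```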
19. Graph - Tabular
• RDF - RDB
• SPARQL [1] SELECT
• Declarative
• Information loss
• W3C Recommendation
1. http://www.w3.org/TR/rdf-sparql-query/
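A SELECT-style projection from triples to rows can be imitated in plain Python. This sketch does not use a SPARQL engine; the IRIs, the relatedTo triple and the function name select are hypothetical. It also shows where information is lost: triples whose predicate is not selected do not appear in the table.

```python
EX = "http://example.org/"  # hypothetical namespace

triples = {
    (EX + "council/1", EX + "location", "Carlow County Council"),
    (EX + "council/1", EX + "envServices", 40),
    (EX + "council/2", EX + "location", "Cavan County Council"),
    (EX + "council/2", EX + "envServices", 36),
    # Hypothetical extra edge between two resources; not selected below.
    (EX + "council/2", EX + "relatedTo", EX + "council/1"),
}


def select(triples, predicates):
    """Return one row per subject that has a value for every given predicate."""
    by_subject = {}
    for s, p, o in triples:
        # Simplification: a repeated predicate on one subject would be overwritten.
        by_subject.setdefault(s, {})[p] = o
    rows = []
    for s in sorted(by_subject):
        properties = by_subject[s]
        if all(p in properties for p in predicates):
            rows.append(tuple(properties[p] for p in predicates))
    return rows


table = select(triples, [EX + "location", EX + "envServices"])
for row in table:
    print(row)
```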
20. Graph - Tree
• RDF - XML
• Rhizomik ReDeFer RDF2XHTML [1], relies on XSLT
• Declarative (XSLT)
• Information loss
• Ad-hoc tool
1. http://rhizomik.net/html/redefer/
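The sketch below is not ReDeFer and does not use XSLT; it only illustrates why forcing a graph into a tree can lose information. Triples are grouped by subject into XML elements (the element and attribute names are hypothetical), and a link between two resources survives only as a text value, so the cycle in the sample graph is no longer a structural part of the result.

```python
import xml.etree.ElementTree as ET

EX = "http://example.org/"  # hypothetical namespace

# A small cyclic graph: a knows b, and b knows a.
triples = {
    (EX + "a", EX + "knows", EX + "b"),
    (EX + "b", EX + "knows", EX + "a"),
}


def triples_to_xml(triples):
    """Group triples by subject; each object is written as plain text."""
    root = ET.Element("graph")
    by_subject = {}
    for s, p, o in sorted(triples, key=lambda t: (t[0], t[1], str(t[2]))):
        by_subject.setdefault(s, []).append((p, o))
    for s, properties in by_subject.items():
        resource = ET.SubElement(root, "resource", {"about": s})
        for p, o in properties:
            ET.SubElement(resource, "property", {"name": p}).text = str(o)
    return root


root = triples_to_xml(triples)
print(ET.tostring(root, encoding="unicode"))
```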
21. Graph - Graph
• RDF - RDF
• SPARQL [1] CONSTRUCT
• Declarative
• No Information loss
• W3C Recommendation
1. http://www.w3.org/TR/rdf-sparql-query/
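A CONSTRUCT-style rewrite builds a new graph from matches on the old one. The sketch below imitates this in plain Python without a SPARQL engine; the IRIs and the TEMPLATE mapping from source to target predicates are hypothetical.

```python
EX = "http://example.org/"  # hypothetical namespace

source = {
    (EX + "council/1", EX + "location", "Carlow County Council"),
    (EX + "council/1", EX + "envServices", 40),
}

# Hypothetical template: each matched predicate is rewritten to a new one.
TEMPLATE = {
    EX + "location": EX + "region",
    EX + "envServices": EX + "environmentalServices",
}


def construct(triples, template):
    """Emit a new triple for every source triple whose predicate is in template."""
    return {(s, template[p], o) for (s, p, o) in triples if p in template}


target = construct(source, TEMPLATE)
for triple in sorted(target, key=str):
    print(triple)
```

Because the template covers every predicate in the source, no triple is dropped; a template covering only some predicates would make the rewrite lossy.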
24. Discussion
We can perform lossless data shape transformations
between certain shapes.
A number of data shape transformations are already
standardized:
- For RDB2RDF, see R2RML and Direct Mapping.
- For XML2XML, see XSLT.
- For XML2RDF, see GRDDL.
Some data shape transformations are declarative in nature.
In certain cases we have to deal with lossy transformations.