(1) Standardizing for Open Data
Ivan Herman, W3C
Open Data Week, Marseille, France, June 26, 2013
Slides at: http://www.w3.org/2013/Talks/0626-Marseille-IH/
(2) Data is everywhere on the Web!
- Public, private, behind enterprise firewalls
- Ranges from informal to highly curated
- Ranges from machine readable to human readable
  - HTML tables, Twitter feeds, local vocabularies, spreadsheets, …
- Expressed in diverse models
  - tree, graph, table, …
- Serialized in many ways
  - XML, CSV, RDF, PDF, HTML tables, microdata, …
(8) W3C's standardization focus was, traditionally, on Web-scale integration of data
- Some basic principles:
  - use of URIs everywhere (to uniquely identify things)
  - relate resources to one another (to connect things on the Web)
  - discover new relationships through inferences
- This is what the Semantic Web technologies are all about
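The three principles above can be illustrated with a minimal sketch in plain Python. It is not a real RDF toolkit: the `http://example.org/` URIs and the `infer_types` helper are assumptions made up for this example, and only a single RDFS-style rule (type propagation along `rdfs:subClassOf`) is applied.

```python
# URIs identify things; the example.org URIs below are hypothetical.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Each (subject, predicate, object) triple relates two resources.
TRIPLES = {
    (EX + "marseille", RDF_TYPE, EX + "City"),
    (EX + "City", RDFS_SUBCLASS_OF, EX + "Place"),
}


def infer_types(graph):
    """Add (x, type, D) whenever (x, type, C) and (C, subClassOf, D) hold.

    The rule is repeated until no new triple appears.
    """
    inferred = set(graph)
    while True:
        subclass_of = {(s, o) for (s, p, o) in inferred if p == RDFS_SUBCLASS_OF}
        new = {
            (s, RDF_TYPE, parent)
            for (s, p, cls) in inferred if p == RDF_TYPE
            for (child, parent) in subclass_of if child == cls
        } - inferred
        if not new:
            return inferred
        inferred |= new
```

Calling `infer_types(TRIPLES)` yields the original triples plus one discovered relationship: the resource typed as a `City` is also a `Place`.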
(9) We have a number of standards
[Diagram labels: RDF 1.1, SPARQL 1.1, URI, JSON-LD, Turtle, RDFa, RDF/XML]
- RDF: data model, links, basic assertions; different serializations
- SPARQL: querying data
A fairly stable set of technologies by now!
(10) We have a number of standards
[Diagram labels: RDB2RDF, RDF 1.1, RDFS 1.1, SPARQL 1.1, OWL 2, URI, JSON-LD, Turtle, RDFa, RDF/XML]
- RDF: data model, links, basic assertions; different serializations
- SPARQL: querying data
- RDFS: simple vocabularies
- OWL: complex vocabularies, ontologies
- RDB2RDF: databases to RDF
A fairly stable set of technologies by now!
(12) Integration is done in different ways
- Very roughly:
  - data is accessed directly as RDF and turned into something useful
    - relies on data being "preprocessed" and published as RDF
  - data is collected from different sources, integrated internally
    - using, say, a triple store
(15) However…
- There is a price to pay: a relatively heavy ecosystem
  - many developers shy away from using RDF and related tools
- Not all applications need this!
  - data may be used directly, no need for integration concerns
  - the emphasis may be on easy production and manipulation of data with simple tools
(16) Typical situation on the Web
- Data published in CSV, JSON, XML
- An application uses only 1-2 datasets; integration done by direct programming is straightforward
  - e.g., in a Web application
- Data is often very large; direct manipulation is more efficient
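A minimal sketch of what "integration by direct programming" can look like with only the Python standard library. The two tiny datasets, their field names, and the `join_datasets` helper are made up for illustration; a real application would read the CSV and JSON from files or HTTP responses.

```python
import csv
import io
import json

# Hypothetical sample data: one dataset in CSV, one in JSON, sharing an "id" key.
CSV_TEXT = "id,value\na1,10\nb2,20\n"
JSON_TEXT = '{"a1": {"label": "Alpha"}, "b2": {"label": "Beta"}}'


def join_datasets(csv_text, json_text, key="id"):
    """Merge each CSV row with the JSON record that has the same key."""
    extra = json.loads(json_text)
    rows = csv.DictReader(io.StringIO(csv_text))
    # Rows without a matching JSON record are kept unchanged.
    return [{**row, **extra.get(row[key], {})} for row in rows]
```

With only one or two datasets, a dictionary lookup like this is often all the integration an application needs; note that `csv.DictReader` leaves every value as a string.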
(17) Non-RDF Data
- In some settings the data can be converted into RDF
- But, in many cases, this is not done
  - e.g., the CSV data is way too big
  - RDF tooling may not be adequate for the task at hand
  - integration is not a major issue
(19) What that application does…
- Gets the data published by the NHS
- Processes the data (e.g., through Hadoop)
- Integrates the result of the analysis with geographical data
I.e., the raw data is used without integration
(20) The reality of data on the Web…
- It is still a fairly messy space out there :-(
  - many different formats are used
  - data is difficult to find
  - published data are messy, erroneous, …
  - tools are complex, unfinished…
(21) How do developers perceive this?
"When transportation agencies consider data integration, one pervasive notion is that the analysis of existing information needs and infrastructure, much less the organization of data into viable channels for integration, requires a monumental initial commitment of resources and staff. Resource-scarce agencies identify this perceived major upfront overhaul as 'unachievable' and 'disruptive.'"
-- Data Integration Primer: Challenges to Data Integration, US Dept. of Transportation
(22) One may look at the problem through different goggles
- Two alternatives come to the fore:
  1. provide tools, environments, etc., to help outsiders publish Linked Data (in RDF) easily
     - a typical example is the Datalift project
  2. forget about RDF, Linked Data, etc., and concentrate on the raw data instead
(24) But religions and cultures can coexist… :-)
(25) Open Data on the Web Workshop
- Had a successful workshop in London, in April:
  - around 100 participants
  - coming from different horizons: publishers and users of Linked Data, CSV, PDF, …
(26) We also talked to our "stakeholders"
- Member organizations and companies
- Open Data Institute, Open Knowledge Foundation, Schema.org
- …
(27) Some takeaways
- The Semantic Web community needs stability of the technology
  - do not add yet another technology block :-)
  - existing technologies should be maintained
(28) Some takeaways
- Look at the more general space, too
  - importance of metadata
  - deal with non-RDF data formats
  - best practices are necessary to raise the quality of published data
(29) We need to meet app developers where they are!
(30) Metadata is of major importance
- Metadata describes the characteristics of the dataset
  - structure, datatypes used
  - access rights, licenses
  - provenance, authorship
  - etc.
- Vocabularies are also key for Linked Data
(31) Vocabulary Management Action
- Standard vocabularies are necessary to describe data
  - there are already some initiatives: W3C's Data Cube, Data Catalog, PROV, schema.org, DCMI, …
- At the moment, it is a fairly chaotic world…
  - many, possibly overlapping vocabularies
  - difficult to locate the one that is needed
  - vocabularies may not be properly managed, maintained, versioned, or given persistence…
(32) W3C's plan:
- Provide a space whereby
  - communities can develop
  - host vocabularies at W3C if requested
  - annotate vocabularies with a proper set of metadata terms
  - establish a vocabulary directory
- The exact structure is still being discussed: http://www.w3.org/2013/04/vocabs/
(34) CSV on the Web
- Planned work areas:
  - metadata vocabulary to describe CSV data
    - structure, reference to access rights, annotations, etc.
  - methods to find the metadata
    - part of an HTTP header, special rows and columns, packaging formats…
  - mapping content to RDF, JSON, XML
- Possibly at a later phase:
  - API standards to access CSV data
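To make the planned work areas more concrete, here is a rough sketch of how a metadata description could drive the mapping of CSV content to JSON and to RDF-like triples. Everything in it is an assumption for illustration: the `METADATA` layout is a made-up dictionary, not a W3C vocabulary, and the `http://example.org/` URIs and helper names are hypothetical.

```python
import csv
import io
import json

# Hypothetical metadata: a license reference plus the datatype of each column.
METADATA = {
    "license": "http://example.org/license",
    "columns": [
        {"name": "id", "datatype": "string"},
        {"name": "count", "datatype": "integer"},
    ],
}

# Python converters for the datatype names used above.
CASTS = {"string": str, "integer": int}


def typed_rows(csv_text, metadata):
    """Parse the CSV text and convert each cell using the declared datatype."""
    casts = {col["name"]: CASTS[col["datatype"]] for col in metadata["columns"]}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: cast(row[name]) for name, cast in casts.items()} for row in reader]


def to_json(csv_text, metadata):
    """Map the CSV content to a JSON array of typed objects."""
    return json.dumps(typed_rows(csv_text, metadata))


def to_triples(csv_text, metadata, key="id", base="http://example.org/"):
    """Map each non-key cell to a (subject, predicate, object) triple."""
    triples = []
    for row in typed_rows(csv_text, metadata):
        subject = base + "row/" + str(row[key])
        for name, value in row.items():
            if name != key:
                triples.append((subject, base + "column/" + name, value))
    return triples
```

Without the metadata every cell would stay a plain string; the column datatypes are what let both mappings emit `count` as a number.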
(36) Open Data Best Practices
- Document best practices for data publishers
  - management of persistence, versioning, URI design
  - use of core vocabularies (provenance, access control, ownership, annotations, …)
  - business models
- Specialized metadata vocabularies
  - quality description (quality of the data, update frequencies, correction policies, etc.)
  - description of data access APIs
  - …
(37) Summary
- Data on the Web has many different facets
- We have concentrated on the integration aspects in the past years
- We have to take a more general view and look at other types of data published on the Web
(38) In future…
- We should look at other formats, not only CSV
  - MARC, GIS, ABIF, …
- Better outreach to data publishing communities and organizations
  - WF, RDA, ODI, OKFN, …