Data Segmenting in Anzo

Brief look at data segmenting decisions and use of Semantic Web technologies within Anzo. Presented at the 2011 W3C Linked Enterprise Data Patterns workshop.

  1. Data Segmenting in Anzo
     Contact: Lee Feigenbaum, lee@cambridgesemantics.com
     ©2011 Cambridge Semantics Inc. All rights reserved.
  2. Simple Introduction to Cambridge Semantics & Anzo
     • Cambridge Semantics is a software startup founded in 2007 by a team of engineers from IBM’s Advanced Internet Technology group
     • We sell the Anzo platform and tools to (mainly) Fortune 500 companies
     • Anzo is Semantic Web middleware that often stores large amounts of data for diverse uses
  3. We Use Named Graphs
     • Primary tool for segmenting data in Anzo
     • Smallest unit of granularity for:
       – Versioning & provenance
       – Access control
       – Notifications
       – Replication
     • (Concretely: we use TriG extensively)
  4. Which Triples Go Into a Named Graph?
     • Everything
       – Effectively a triple store
     • Single triple
       – Gives per-statement access control, etc.
     • Whatever was in the source document
       – OK in some cases, but documents are often an artificial construct
       – What happens when doing a bulk load of hundreds of millions of triples?
     • All triples that share a subject
       – Decent compromise / default state in our experience
     • Closure of triples from a given subject, following predicates annotated as “internal”
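The last two options on this slide can be illustrated with a small sketch. It is not Anzo's implementation: triples are plain Python tuples, and the function names, the `ex:` identifiers, and the way "internal" predicates are passed in are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def segment_by_subject(triples: Iterable[Triple]) -> Dict[str, Set[Triple]]:
    """Put all triples that share a subject into one graph.

    The graph is keyed by the subject itself, which mirrors the
    "graph name is the same as the resource name" default.
    """
    graphs: Dict[str, Set[Triple]] = defaultdict(set)
    for s, p, o in triples:
        graphs[s].add((s, p, o))
    return dict(graphs)


def segment_by_closure(
    triples: Iterable[Triple], root: str, internal_predicates: Set[str]
) -> Set[Triple]:
    """Collect the triples about `root`, plus the triples about any resource
    reachable from it through a predicate marked as internal (hypothetical
    marking; here it is simply a set of predicate names)."""
    by_subject = segment_by_subject(triples)
    graph: Set[Triple] = set()
    seen = {root}
    pending = [root]
    while pending:
        subject = pending.pop()
        for s, p, o in by_subject.get(subject, ()):
            graph.add((s, p, o))
            if p in internal_predicates and o not in seen:
                seen.add(o)
                pending.append(o)
    return graph
```

With no internal predicates, the closure variant reduces to the per-subject default; each predicate added to the internal set pulls the triples of the referenced resources into the same graph.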
  5. Typical Anzo Data Segmenting
     [Diagram: example film data. Labels recovered from the slide: Pulp Fiction (debut showing: 10/14/1994; budget: $8,500,000; director: Tarantino); Tarantino (full name: Quentin Jerome Tarantino; birth date: 3/27/1963; directed: Reservoir Dogs).]
  6. Impact of Typical Anzo Data Segmenting
     • Many, many (millions of) small graphs
     • Often corresponds with the natural granularity at which you want to do things like permissions, versioning, alerting, etc.
     • Significant overhead for per-graph metadata
       – Sometimes encourages other partitioning schemes
  7. Finding the Graph for a Particular Resource
     • Default case: graph name is the same as the resource name
       – Not kosher, but works well
     • Fallback case: system-wide SPARQL query
     • General case: graph resolution framework that can identify appropriate graph(s) via:
       – SPARQL DESCRIBE query (just kicks the can down the road a bit)
       – Lookup (registry)
       – Pattern matching (similar to POWDER)
     • (Graphs do not have to be local; sometimes resolution ends up retrieving them via HTTP or from an RDB)
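One way to picture the resolution framework on this slide is as a chain of resolvers tried in order. The sketch below is an assumption about the general shape, not Anzo's API: the resolver names, the example URIs, and the ordering (registry, then pattern matching, then the default case) are all hypothetical, and the SPARQL-based strategies are left out because they need a live store.

```python
import re
from typing import Callable, Dict, List, Optional, Pattern, Sequence, Tuple

# A resolver maps a resource URI to graph URIs, or None if it has no answer.
Resolver = Callable[[str], Optional[List[str]]]


def registry_resolver(registry: Dict[str, List[str]]) -> Resolver:
    """Lookup strategy: an explicit resource -> graphs table."""
    return lambda resource: registry.get(resource)


def pattern_resolver(rules: Sequence[Tuple[Pattern[str], str]]) -> Resolver:
    """Pattern-matching strategy: the first matching regex wins, and its
    template is expanded with the match groups to produce the graph URI."""
    def resolve(resource: str) -> Optional[List[str]]:
        for pattern, template in rules:
            match = pattern.match(resource)
            if match:
                return [match.expand(template)]
        return None
    return resolve


def default_resolver(resource: str) -> Optional[List[str]]:
    """Default case: the graph is named after the resource itself."""
    return [resource]


def resolve_graphs(resource: str, resolvers: Sequence[Resolver]) -> List[str]:
    """Return the answer of the first resolver that finds any graphs."""
    for resolver in resolvers:
        graphs = resolver(resource)
        if graphs:
            return graphs
    return []
```

Placing `default_resolver` last means every resource resolves to something. A real system might instead end the chain with the system-wide SPARQL query that the slide lists as the fallback.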
  8. Accessing Graphs
     • Replication service
       – Chunked to handle large graphs gracefully
       – Client replicas kept up to date via JMS-driven notification service
       – Replicas are cached aggressively, which encourages smaller graphs to limit client memory footprint (e.g. in a Web browser)
  9. Linked Data in Anzo
     • Data in Anzo can be exposed as linked data
     • Anzo will dereference external URIs to get at data, but that’s of limited utility
       – Allows single-instance views, but not faceted browsing
     • Anzo does not use linked data internally for data access
     • Linked Data consumption/publication is a feature, not a core part of Anzo’s architecture
  10. Accessing Graphs
      • SPARQL queries
        – Clients (e.g. the Anzo on the Web faceted browser) target subsets of the server data with SPARQL queries
        – Impractical to enumerate millions of graphs in FROM or FROM NAMED clauses
        – Extend SPARQL with named datasets
          • Server-based lists of graphs that comprise an RDF dataset (default graph and named graphs)
          • Add FROM DATASET clause to reference named datasets from a query
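The meaning of a named dataset can be shown by expanding it back into standard SPARQL clauses. The sketch below is illustrative only: the slide names a FROM DATASET clause but does not give its syntax, so the `FROM DATASET <uri>` form, the `NamedDataset` type, and the example URIs are assumptions. A real server would presumably resolve the dataset internally rather than materialize millions of clauses.

```python
import re
from typing import Dict, List, NamedTuple


class NamedDataset(NamedTuple):
    """Server-side list of graphs that make up one RDF dataset (hypothetical type)."""
    default_graphs: List[str]
    named_graphs: List[str]


# Assumed syntax: FROM DATASET <dataset-uri>
_FROM_DATASET = re.compile(r"FROM\s+DATASET\s+<([^>]+)>", re.IGNORECASE)


def expand_named_datasets(query: str, datasets: Dict[str, NamedDataset]) -> str:
    """Rewrite each FROM DATASET clause into FROM / FROM NAMED clauses.

    Raises KeyError if the query references an unknown dataset. The regex is
    naive: it would also rewrite matching text inside a string literal.
    """
    def replace(match: "re.Match[str]") -> str:
        dataset = datasets[match.group(1)]
        clauses = [f"FROM <{g}>" for g in dataset.default_graphs]
        clauses += [f"FROM NAMED <{g}>" for g in dataset.named_graphs]
        return "\n".join(clauses)

    return _FROM_DATASET.sub(replace, query)
```

The expansion makes the semantics concrete: a query against a named dataset behaves as if the client had listed every member graph, but the list lives on the server and can change without the client rewriting its queries.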
  11. Anzo and other Sem Web Technologies
      • Everything is described in RDFS and OWL (used mostly as a rich data modeling language)
      • We publish RDFa
      • We use JSON serializations of SPARQL results and RDF
      • We implement SPARQL Update but don’t use it from our tools
      • SPARQL-based rules (used to be CONSTRUCT, now INSERT)
      • We use SPARQL ASK queries for transaction preconditions and validation
      • We have our own long-in-the-tooth implementation of the D2RQ mapping language that we don’t use often
  12. This is the full architecture that drives the Anzo Server and applications.
  13. These parts are driven primarily by SemWeb technologies.
  14. These parts are driven primarily by quality software engineering.
  15. We can’t & shouldn’t standardize everything.
      • Need to leave room for competitive differentiation that goes beyond simply who has the “best” implementation of a standard
      • For standardization work, take a disciplined approach to identifying what problems are both:
        – Costly (a.k.a. valuable to solve)
        – Impacting interoperability
  16. What we could use
      • We often get asked “can we use your tools against <insert arbitrary SPARQL endpoint or linked data source here>?”
        – “No.”
      • We need standards for & adoption of:
        – Richly advertising contents of linked data sources
          • cf. VoID
        – Richly advertising capabilities of SPARQL endpoints
          • cf. SPARQL 1.1 Service Description and Basic Federated Query
        – Named datasets
        – Various other SPARQL extensions (though we can work around many of these)