Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark

  1. 1. Today: Marc C. Hadfield, Founder
 Vital AI
 http://vital.ai marc@vital.ai 917.463.4776 MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
  2. 2. <intro> Marc C. Hadfield, Founder Vital AI
 http://vital.ai
 marc@vital.ai MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
 Quick Overview
  3. 3. agenda MetaQL Intro Motivation Domain Models (Schema) MetaQL DSL MetaQL Implementations Examples
  4. 4. MetaQL Leverage Domain Model (Schema) Compose Queries in Code: Typed Execute Queries on Databases, Interchangeably Minimize TCO:
 Separation of Concerns
 Developer Efficiency Query Framework Executable JVM Code! (Groovy Closure)
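The "compose queries in code: typed" idea can be sketched in plain Java. This is not the MetaQL API (which is a Groovy closure DSL); every class and method name below is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a typed, composable query object: constraints are
// checked against a domain class at compile time, and the same query
// value could be handed to different back ends (SQL, SPARQL, key/value).
public class TypedQuerySketch {

    record Constraint(String property, String op, Object value) {}

    static class Select<T> {
        final Class<T> type;
        final List<Constraint> constraints = new ArrayList<>();
        int limit = 100;

        Select(Class<T> type) { this.type = type; }

        Select<T> where(String property, String op, Object value) {
            constraints.add(new Constraint(property, op, value));
            return this;
        }

        Select<T> limit(int n) { this.limit = n; return this; }

        // A back end would walk this structure to generate its native query.
        String describe() {
            return "SELECT " + type.getSimpleName()
                 + " WHERE " + constraints + " LIMIT " + limit;
        }
    }

    static class Person {}

    public static void main(String[] args) {
        Select<Person> q = new Select<>(Person.class)
            .where("name", "=", "John")
            .limit(100);
        System.out.println(q.describe());
    }
}
```

Because the query is a data structure rather than a string, the IDE can offer code completion from the domain model and the back end can be swapped without touching application code, which is the separation of concerns this slide argues for.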
  5. 5. MetaQL Origin Across many data-driven application implementations, a desire for:
 Reusable Processes, Tools:
 Stop re-inventing the wheel. Tools to manage “schema” across an application & organization. Tools to combine Semantic Web, NOSQL, and Hadoop/Spark. Team Collaboration:
 Human labor is usually the limiting factor.
  6. 6. sample Recipient Sender EMail hasRecipient hasSender
  7. 7. sample Recipient Sender EMail hasRecipient hasSender ARC ARC
  8. 8. sample Recipient Sender EMail hasRecipient hasSender notEqual type:Person
 Address:john@example.org type:Person type:hasSender type:hasRecipient type:Email
  9. 9. sample MetaQL graph query
 GRAPH {
   value segments: ["mydata"]
   ARC {
     node_constraint { Email.class }
     constraint { "?person1 != ?person2" }
     ARC_AND {
       ARC {
         edge_constraint { Edge_hasSender.class }
         node_constraint { Person.props().emailAddress.equalTo("john@example.org") }
         node_constraint { Person.class }
         node_provides { "person1 = URI" }
       }
       ARC {
         edge_constraint { Edge_hasRecipient.class }
         node_constraint { Person.class }
         node_provides { "person2 = URI" }
       }
     }
   }
 }
  10. 10. Internet of Things Amazon Echo
  11. 11. Internet of Things Coffee
  12. 12. Internet of Things: Batch and Stream Processing Amazon Echo Amazon Echo Service haley-app webservice Vert.X Vital Prime Database DataScript Hadoop - HDFS Apache Spark Streaming, MLLIB, NLP, GraphX Aspen Datawarehouse Analytics Layer Serving Layer Haley Device Raspberry Pi Voice to Text API Cognitive Application NLP and Inference to process User request. Query Knowledge in DB Streaming Prediction Models: “Should I really have more Coffee?” External APIs…
  13. 13. Demo Examples Vital Prime Database Vert.X Vital-Vertx JavaScript WebApp VitalService-JS Prediction Models DataScript https://github.com/vital-ai/vital-examples
  14. 14. Demo Example https://demos.vital.ai/enron-js-app/index.html https://github.com/vital-ai/vital-examples/tree/master/enron-js-app
  15. 15. Demo Example
  16. 16. Demo Example
  17. 17. Demo Example Recipient EMail hasRecipient
  18. 18. Cytoscape Plugin https://github.com/vital-ai/vital-cytoscape http://cytoscape.org/
  19. 19. Cytoscape Plugin
  20. 20. Cytoscape Plugin
  21. 21. Cytoscape Plugin
  22. 22. Cytoscape Plugin
  23. 23. Cytoscape Plugin: Wordnet Data, “wine, vino”
  24. 24. where are we using MetaQL? Financial Services Healthcare Internet-of-Things Start-Ups, Recommendation Apps
  25. 25. motivation for MetaQL
  26. 26. application architecture Batch and Stream Processing Web / Mobile Application Application Server Transactional Database Hadoop - HDFS Apache Spark Streaming, MLLIB, GraphX Analytics Layer Serving Layer Key/Value Cache External APIs External API Services Multiple Databases + Analytics + External APIs
  27. 27. enterprise application architecture Dashboard Application Server Enterprise Datawarehouse Data Silo Data Silo Data Silo Data Silo Data Silo ∞ Many Many Many Data Models…
  28. 28. volume, velocity, variety polyglot persistence = multiple database technologies …but we also have very many data models. Many databases, many data models, changing rapidly: too many moving parts for a developer to reasonably manage! We need fewer APIs to learn!
  29. 29. what happens when changes occur? Task Infrastructure DevOps Data Scientists Business + Domain Experts Developers Roles
  30. 30. what changes? Data Model Changes New Data Sources Infrastructure Change Switch Databases New Prediction Models / Features New Service APIs… Many Interdependencies… Example: Change in the taxonomy of a categorization service breaks all the logic tied to the old categories.
  31. 31. total cost of ownership How much code changes when we modify our data model to include new sources? How to minimize by decoupling dependencies? When we switch database technologies?
  32. 32. Domain Model as “Contract” Infrastructure DevOps Data Scientists Business + Domain Experts Developers Domain Model Everyone agrees on (or is at least aware of) the definition of Domain Concepts. Use semantics to map “views”.
  33. 33. MetaQL Abstraction Infrastructure DevOps Data Scientists Business + Domain Experts Developers Domain Model MetaQL Abstraction to give breathing room to Infrastructure.
  34. 34. Infrastructure / DevOps Database Types: • Key/Value • Document • RDF Graph • NOSQL • Relational • Timeseries ACID vs. BASE Optimizing Query Generation Tuning Secondary Indices Update MetaQL DSL for new DB features CAP Theorem
  35. 35. Domain Model (Schema)
  36. 36. Domain Model Implementation Combine: SQL-style Schema with Hadoop Data Serialization Schema (Avro, Thrift, Protocol Buffers, Kyro, Parquet) add Semantics: the “Meaning” of objects Not a table “person”, but define the concept of Person to be used throughout an application. The implementation decides how to store “Person” data in its database.
  37. 37. Domain Model Implementation Domain Model definition resolves: RDF vs Property Graph model Object Relational Impedance Mismatch Use OWL to capture Domain Model: SubClasses SubProperties Multiple Inheritance Marginal technology performance gains are hugely outweighed by human productivity gains and a wider choice of tools. Compromise across modeling paradigms.
  38. 38. Domain Model Implementation Example: Healthcare Application:
 URI<Person123> IS_A: • Patient • BillableAccount • InsuredEntity Same URI across three domain concepts:
 Diagnostics Records, Billing System, Insurance System. Implementation Note: We generate code for the JVM using “traits” as a way to implement multiple inheritance (Groovy, Scala, Java8). The trait is used as a semantic marker to link to the Domain Model.
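The "traits as semantic markers" idea can be approximated in plain Java with marker interfaces, the Java 8 analogue the slide mentions. Interface and class names here are illustrative, not generated VitalSigns code:

```java
// One URI participating in three domain concepts: the same JVM object is
// a Patient to the diagnostics system, a BillableAccount to billing, and
// an InsuredEntity to insurance.
public class TraitSketch {

    interface Patient {}
    interface BillableAccount {}
    interface InsuredEntity {}

    static class PersonRecord implements Patient, BillableAccount, InsuredEntity {
        final String uri;
        PersonRecord(String uri) { this.uri = uri; }
    }

    public static void main(String[] args) {
        PersonRecord p = new PersonRecord("urn:Person123");
        // Each subsystem sees the same object through its own marker type.
        System.out.println(p instanceof Patient);         // true
        System.out.println(p instanceof BillableAccount); // true
        System.out.println(p instanceof InsuredEntity);   // true
    }
}
```

Groovy/Scala traits go further than plain marker interfaces (they can carry state and behavior), which is why the generated code targets those languages.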
  39. 39. Domain Model - Core Classes Node Edge HyperNode HyperEdge Properties: • URI • Primary Type • Types
 Edges/HyperEdges: • Source URI • Destination URI Edges: • Peer • Taxonomy Class Instances contain Properties.
  40. 40. Protege OWL Editor
  41. 41. VitalSigns: Domain Model Dev Kit
 $ vitalsigns generate -o ./domain-ontology/enron-dataset-1.0.0.owl
 $ ls domain-groovy-jar
 enron-dataset-groovy-1.0.0.jar
 $ ls domain-json-schema
 enron-dataset-1.0.0.js
 OWL can be compiled into JVM code statically (creating an artifact for Maven), or dynamically at runtime.
  42. 42. Development with the Domain Model Code Completion from Domain Model
  43. 43. Development with the Domain Model
 VitalSigns vs = VitalSigns.get()

 Musician john = new Musician().generateURI("john")
 john.name = "John Lennon"
 john.birthday = "October 9, 1940"^xsd.xdatetime("MMMM d, yyyy")

 MusicGroup thebeatles = new MusicGroup().generateURI("thebeatles")
 thebeatles.name = "The Beatles"

 // try to assign the wrong property, throws an exception
 try {
   thebeatles.birthday = "January 1, 1970"^xsd.xdatetime("MMMM d, yyyy")
 } catch(Exception ex) { println ex } // no such property exception

 vs.addToCache( thebeatles.addEdge_hasMember(john) )

 // use cache to resolve queries
 thebeatles.getMembers().each{ println it.name }

 // use database to resolve queries
 thebeatles.getMembers(ServiceWide).each{ println it.name }

 Implicit MetaQL Queries
  44. 44. VitalService API • Open/Close Endpoint • Create/Remove Segment • Create/Read/Update/Delete Object • Queries (MetaQL as input closure) • Service Operations (MetaQL as input closure) • callFunction (DataScript) • init Transaction/Commit/Rollback A “Segment” is a Database (container of objects)
  45. 45. MetaQL VitalSigns: Domain Model Manager • MetaQL DSL • Prediction Model DSL • Pipeline Transformation DSL (ETL) (in development) A tricky bit is finding the best way to express the DSL within the allowed grammar of the host language (Groovy).
 It’s an ongoing effort.
  46. 46. Query Types AGGREGATION PATH GRAPH SELECT
  47. 47. Query Elements • constraints: node_constraint, edge_constraint, … • comparators (equalTo, greaterThan, …) • provides, ?reference • AND, OR • OPTIONAL • Sort Criteria
  48. 48. SELECT query
 SELECT {
   value limit: 100
   value offset: 0
   value segments: ["mydata"]
   constraint { Person.class }
   constraint { Person.props().name.equalTo("John") }
 }
  49. 49. GRAPH query
 GRAPH {
   value segments: ["mydata"]
   ARC {
     node_constraint { Email.class }
     constraint { "?person1 != ?person2" }
     ARC_AND {
       ARC {
         edge_constraint { Edge_hasSender.class }
         node_constraint { Person.props().emailAddress.equalTo("john@example.org") }
         node_constraint { Person.class }
         node_provides { "person1 = URI" }
       }
       ARC {
         edge_constraint { Edge_hasRecipient.class }
         node_constraint { Person.class }
         node_provides { "person2 = URI" }
       }
     }
   }
 }
  50. 50. GRAPH query (2)
 GRAPH {
   value segments: [VitalSegment.withId('wordnet')]
   value inlineObjects: true   // <—- inline objects
   ARC {
     node_bind { "node1" }
     node_constraint { SynsetNode.expandSubclasses(true) }
     node_constraint { SynsetNode.props().name.contains_i("happy") }
     ARC {
       edge_bind { "edge" }
       node_bind { "node2" }
     }
   }
 }
 Code iterating over Results can use bind names to reference objects in each solution: node1, edge, node2.
  51. 51. PATH query
 def forward = true
 def reverse = false
 PATH {
   value segments: segments
   value maxdepth: 5
   value rootURIs: [URIProperty.withString(inputURI)]
   if( forward ) {
     ARC {
       value direction: 'forward'
       // accept any edge:
       edge_constraint { }
       // accept any node:
       node_constraint { }
     }
   }
   if( reverse ) {
     ARC {
       value direction: 'reverse'
       // accept any edge:
       edge_constraint { }
       // accept any node:
       node_constraint { }
     }
   }
 }
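What a PATH query evaluates can be sketched as a breadth-first expansion from a set of root URIs up to maxdepth, following edges in the requested direction. This plain-Java sketch uses an illustrative adjacency-map graph, not the actual MetaQL engine:

```java
import java.util.*;

// Sketch of PATH-query evaluation: BFS from root URIs to a maximum depth.
// The graph representation and method names are illustrative.
public class PathSketch {

    // edges: source URI -> destination URIs (the 'forward' direction)
    static Set<String> expand(Map<String, List<String>> edges,
                              Set<String> roots, int maxDepth) {
        Set<String> visited = new HashSet<>(roots);
        Set<String> frontier = new HashSet<>(roots);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Set<String> next = new HashSet<>();
            for (String uri : frontier) {
                for (String dest : edges.getOrDefault(uri, List.of())) {
                    if (visited.add(dest)) next.add(dest);
                }
            }
            frontier = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = Map.of(
            "a", List.of("b"),
            "b", List.of("c"),
            "c", List.of("d"));
        // depth 2 from "a" reaches a, b, c but not d
        System.out.println(expand(edges, Set.of("a"), 2));
    }
}
```

A 'reverse' ARC would walk the same edges from destination to source; an empty edge_constraint / node_constraint corresponds to accepting everything, as in the slide.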
  52. 52. AGGREGATION query
 SUM Product.props().cost
 AVERAGE Person.props().birthday
 COUNT_DISTINCT Document.props().active
 FIRST { DISTINCT Document.props().title, expandProperty: false, order: Order.ASC }
 Part of a SELECT query.
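The aggregation operators above can be sketched in plain Java streams; in MetaQL they are pushed down into the database rather than evaluated in memory, and the record type here is illustrative:

```java
import java.util.List;

// In-memory sketch of SUM, AVERAGE, and COUNT_DISTINCT over typed objects.
public class AggregationSketch {
    record Product(String name, double cost) {}

    public static void main(String[] args) {
        List<Product> products = List.of(
            new Product("a", 10.0),
            new Product("b", 20.0),
            new Product("a", 10.0));

        // SUM Product.props().cost
        double sum = products.stream().mapToDouble(Product::cost).sum();
        // AVERAGE over the same property
        double avg = products.stream().mapToDouble(Product::cost).average().orElse(0);
        // COUNT_DISTINCT over a property's values
        long countDistinct = products.stream().map(Product::name).distinct().count();

        System.out.println(sum);           // 40.0
        System.out.println(avg);
        System.out.println(countDistinct); // 2
    }
}
```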
  53. 53. Service Operations DSL Insert Update Delete
  54. 54. Service Operations
 INSERT {
   value segment: 'testing'
   insert(MusicGroup.class, provides: "thebeatles") { MusicGroup thebeatles ->
     thebeatles.name = "The Beatles"
     thebeatles.URI = "thebeatles"
   }
   insert(Musician.class, provides: "john") { Musician john ->
     john.name = "John"
     john.URI = "john"
   }
   insert(Edge_hasMember) { Edge_hasMember member ->
     member.sourceURI = ref("thebeatles").toString()   // <— using "provides" values
     member.destinationURI = ref("john").toString()
     member.URI = "edge1"
   }
 }
  55. 55. Transactions
 Implemented at the service level:
 def xid = service.startTransaction()
 service.save(xid, person123)
 service.commitTransaction(xid)
  56. 56. MetaQL Implementations MetaQL Executable Query Query Generator
  57. 57. Sparql/RDF Implementation G S P O Quad Store Franz Allegrograph
  58. 58. Sparql/RDF Implementation
 VitalGraphQuery q = builder.query {
   GRAPH {
     value segments: ["documents"]
     ARC {
       node_constraint { Person.class }
       node_constraint { Person.props().emailID.equalTo("k.lay@enron.com") }
       ARC {
         node_constraint { EMailMessage.class }
         edge_constraint { Edge_hasEMailMessage.class }
       }
     }
   }
 }.toQuery()
 println "Query: " + q.toSparql()
  59. 59. Sparql/RDF Implementation
 PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
 PREFIX vital-core: <http://vital.ai/ontology/vital-core#>
 PREFIX p0: <http://vital.ai/ontology/enron-emails#>
 SELECT DISTINCT ?s1 ?d2 ?e2
 FROM <segment:customer__app__documents>
 WHERE {
   {
     ?s1 p0:hasEmailID ?value1 .
     ?s1 rdf:type ?value2 .
     FILTER ( ?value2 = p0:Person && ?value1 = "k.lay@enron.com"^^xsd:string )
     {
       ?d2 rdf:type ?value3 .
       ?e2 rdf:type ?value4 .
       FILTER ( ?value3 = p0:EMailMessage && ?value4 = p0:Edge_hasEMailMessage )
       ?e2 vital-core:hasEdgeSource ?s1 .
       ?e2 vital-core:hasEdgeDestination ?d2 .
     }
   }
 }
  60. 60. Spark-SQL / Dataframe URI P V Segment RDD Property RDD K V Experimenting with: the new Dataframe optimizer (Catalyst), the new Dataframe DSL for query generation, and using GraphX for isolated Graph Query cases. We can generate “bad” queries and let the optimizer fix them and Spark partition the RDDs, as long as Spark is aware of the schema.
  61. 61. Key/Value Implementation K V URI —> Serialized Object
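The key/value implementation on this slide reduces a segment to a map from URI to serialized object, so get/insert/delete by URI are direct lookups and anything richer needs a scan or a secondary index (as the NoSQL slide notes). A minimal sketch, with serialization reduced to a string for brevity:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the key/value back end: one segment = one URI -> serialized-object map.
// Names are illustrative, not the actual VitalService implementation.
public class KeyValueSketch {
    private final Map<String, String> segment = new HashMap<>();

    void save(String uri, String serializedObject) {
        segment.put(uri, serializedObject);
    }

    String get(String uri) {
        return segment.get(uri);
    }

    void delete(String uri) {
        segment.remove(uri);
    }

    public static void main(String[] args) {
        KeyValueSketch kv = new KeyValueSketch();
        kv.save("urn:Person123", "{type: Person, name: John}");
        System.out.println(kv.get("urn:Person123"));
    }
}
```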
  62. 62. Lucene/SOLR Implementation DocID 1 2 3 P1 V1 V1 P2 V2 V2 P3 V3 V3 P4 V4 V4 Inverted Index of Property Values…
  63. 63. NoSQL BigTable Implementation DynamoDB (HBase, Cassandra, Accumulo, …) ROWID 1 2 3 C1 K1=V1 K1=V1 K1=V1 C2 K1=V1 K1=V1 K1=V1 C3 K1=V1 K1=V1 K1=V1 C4 K1=V1, K1=V1 K1=V1, K1=V1 K1=V1, K1=V1 URI P V Per Segment object table Per Segment property table + Secondary Indices + Secondary Indices
  64. 64. SQL Implementation SQL, Hive-SQL, Redshift, … G S P O Per Segment Table with Partitioning (Hive)
  65. 65. implementation
 DSL Documentation to be posted: http://www.metaql.org/
 VitalSigns, VitalService, MetaQL: https://dashboard.vital.ai/
 Vital AI github: https://github.com/vital-ai/
 Sample Code
 Spark Code: Aspen, Aspen-Datawarehouse
 Documentation Coming!
  66. 66. closing thoughts Context: Data-Driven / Cognitive Applications. Separation of Concerns yields the Agility needed to keep up with rapidly evolving Data. “Domain Model as Contract” provides a framework for consistent interpretation of Data across an application. MetaQL provides a framework for the consistent access and query of Data across an application.
  67. 67. Thank You! Marc C. Hadfield, Founder
 Vital AI
 http://vital.ai marc@vital.ai 917.463.4776
  68. 68. Pipeline DSL (ETL)
 PIPELINE {         // Workflow
   PIPE {           // a Workflow Component with dependencies
     TRANSFORM {    // Joins across Datasets
       IF (RULE { })     // Boolean, Query, Construct, …
       THEN { RULE { } }
       ELSE { RULE { } }
     }
     PIPE { … }     // dependent PIPE
   }
   // Output Dataset
   PIPE { … }
 }
 Influenced by Spark Pipeline and Google Dataflow Pipeline
  69. 69. Schema Upgrade/Downgrade
 UPGRADE {
   upgrade( oldClass: OLD_Person.class, newClass: NEW_Person.class ) { person_old, person_new ->
     person_new.newName = person_old.oldName
   }
 }
 DOWNGRADE {
   downgrade( newClass: NEW_Person.class, oldClass: OLD_Person.class ) { person_new, person_old ->
     person_old.oldName = person_new.newName
   }
 }
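The upgrade/downgrade pairing above is just a pair of mapping functions between old and new versions of a domain class. A plain-Java sketch of the same idea (class and field names are illustrative):

```java
import java.util.function.BiConsumer;

// Sketch of UPGRADE/DOWNGRADE: copy fields between schema versions of a class.
public class SchemaMigrationSketch {
    static class OldPerson { String oldName; }
    static class NewPerson { String newName; }

    // upgrade: populate the new version from the old
    static final BiConsumer<OldPerson, NewPerson> upgrade =
        (oldP, newP) -> newP.newName = oldP.oldName;

    // downgrade: populate the old version from the new
    static final BiConsumer<NewPerson, OldPerson> downgrade =
        (newP, oldP) -> oldP.oldName = newP.newName;

    public static void main(String[] args) {
        OldPerson o = new OldPerson();
        o.oldName = "John Lennon";
        NewPerson n = new NewPerson();
        upgrade.accept(o, n);
        System.out.println(n.newName); // John Lennon
    }
}
```

Keeping both directions defined lets older and newer application code read the same data through their own version of the domain model.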
  70. 70. Multiple Endpoints
 def service1 = VitalService.getService(profile: "kv-users")
 def service2 = VitalService.getService(profile: "posts-db")
 def service3 = VitalService.getService(profile: "friendgraph-db")

 // given user URI: user123@email.org
 // get user object from service1
 // find friends of user in friendgraph via service3
 // find posts of friends in posts-db
 // update service1 with cache of user-to-friends-postings
 // send postings of friends to user in UI
