Powers of Ten Redux

One of the first problems a developer encounters when evaluating a graph database is how to construct a graph efficiently. Recognizing this need in 2014, TinkerPop's Stephen Mallette penned a series of blog posts titled "Powers of Ten" which addressed several bulkload techniques for Titan. Since then Titan has gone away, and the open source graph database landscape has evolved significantly. Do the same approaches stand the test of time? In this session, we will take a deep dive into strategies for loading data of various sizes into modern Apache TinkerPop graph systems. We will discuss bulkloading with JanusGraph, the scalable graph database forked from Titan, to better understand how its architecture can be optimized for ingestion. Presented at Data Day Texas on January 27, 2018.


  2. OPEN SOURCE GRAPH TECH
     • Property Graph: Connected Data Model
     • Apache TinkerPop™: Graph Computing Framework
     • JanusGraph®: Scalable Graph Database
     Image credits: Apache TinkerPop (ALv2) and JanusGraph (CC-BY-4.0)
  3. POWERS OF TEN
     Stephen Mallette
     Image credit: spmallette on Twitter
  4. 10^1 TEN
     Dart Paper Airplane
     Image credit: Akkana on Wikimedia Commons, CC BY-SA 3.0
  5. GRAPH TRAVERSALS
     Vertex id: 0, label: person
       • name: Jason
     Vertex id: 2, label: airplane
       • name: Dart
       • type: paper
     Edge id: 5, outV: 0, inV: 2, label: throws
       • distance: 10
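As a rough illustration of the property graph model on this slide, the sketch below rebuilds the same two vertices and one edge with plain Python dataclasses. The `Vertex` and `Edge` classes are hypothetical stand-ins written for this example, not TinkerPop types; only the ids, labels, and property values come from the slide.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Vertex:
    id: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    id: int
    label: str
    out_v: int  # id of the vertex the edge leaves (outV)
    in_v: int   # id of the vertex the edge enters (inV)
    properties: Dict[str, Any] = field(default_factory=dict)


# Values taken from the slide.
jason = Vertex(0, "person", {"name": "Jason"})
dart = Vertex(2, "airplane", {"name": "Dart", "type": "paper"})
throws = Edge(5, "throws", out_v=jason.id, in_v=dart.id,
              properties={"distance": 10})

vertices = {v.id: v for v in (jason, dart)}

# A one-hop traversal: follow the "throws" edge from Jason to its target.
thrown = vertices[throws.in_v]
print(jason.properties["name"], throws.label, thrown.properties["name"])
```

In Gremlin the same hop would be written as a traversal over the graph rather than a dictionary lookup; the sketch only shows how vertices, edges, labels, and properties relate.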
  6. 10^3 ONE THOUSAND
     Wright Flyer
     Image credit: John T. Daniels on Wikimedia Commons, Public Domain
  7. GREMLIN CONSOLE
     • Read-Eval-Print Loop
     • Instant gratification
     • Help with reproducible scripts
     Image credit: Apache TinkerPop, ALv2
  8. AIR ROUTES DATA, CSV TO PROPERTY GRAPH
     Vertex id: 0, label: airport
       • code: AAE
       • desc: Annaba
     Vertex id: 2, label: airport
       • code: ALG
       • desc: Algiers
     Edge id: 5, outV: 0, inV: 2, label: route
       • distance: 254
     airports.csv (3,374), routes.csv (43,400)
  9. CSV LOADING
     • Leverage CSV libraries
     • Be aware of auto-iteration
     • Get-or-Create pattern with coalesce()
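The get-or-create idea from this slide can be sketched in Python with the standard `csv` module. This is a minimal sketch under stated assumptions: `TinyGraph` is a hypothetical in-memory stand-in for a graph, the column names `src`, `dst`, and `distance` are assumed (the slides do not show the real file layout), and the second sample row is simply the first route reversed for illustration.

```python
import csv
import io

# Illustrative sample only; the real routes.csv layout is an assumption.
ROUTES_CSV = """src,dst,distance
AAE,ALG,254
ALG,AAE,254
"""


class TinyGraph:
    """Hypothetical in-memory stand-in for a graph, used only in this sketch."""

    def __init__(self):
        self.vertices = {}  # airport code -> vertex id
        self.edges = []     # (out vertex id, in vertex id, distance)
        self.created = 0

    def get_or_create_airport(self, code):
        # Same idea as the Gremlin pattern:
        #   g.V().has('airport', 'code', code).fold().
        #     coalesce(unfold(), addV('airport').property('code', code))
        vertex_id = self.vertices.get(code)
        if vertex_id is None:
            vertex_id = len(self.vertices)
            self.vertices[code] = vertex_id
            self.created += 1
        return vertex_id


def load_routes(graph, csv_text):
    # Let the CSV library handle parsing and header mapping.
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        src = graph.get_or_create_airport(row["src"])
        dst = graph.get_or_create_airport(row["dst"])
        graph.edges.append((src, dst, int(row["distance"])))
    return graph


graph = load_routes(TinyGraph(), ROUTES_CSV)
print(len(graph.vertices), "airports,", len(graph.edges), "routes")
```

Each airport appears twice in the sample but is created only once. On auto-iteration: the Gremlin Console iterates a traversal for you when it prints the result, whereas a traversal inside a script or loop needs an explicit terminal step such as `iterate()` or `next()` before anything is written.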
  10. 10^5 ONE HUNDRED THOUSAND
      Spirit of St. Louis
      Image credit: Ad Meskens on Wikimedia Commons, CC BY-SA 3.0
  11. GREMLIN SERVER AND REMOTE GRAPHS
      • Gremlin Language Variants (GLV) for queries, not for bulkload
      • Gremlin Client Drivers enable efficient batch scripting
      • Use Script Parameterization. Period.
      Image credit: Apache TinkerPop, ALv2
  12. NO PARAMETERIZATION
      • Each script gets compiled and cached on the server – EXPENSIVE
      • Eventually will exceed the GC overhead limit
  13. BASIC PARAMETERIZATION
      • Script is compiled once and reused on future requests
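The difference between the two slides above can be simulated in a few lines of Python. This is only a sketch of the caching behavior, not Gremlin Server itself: `FakeScriptEngine` is a hypothetical stand-in whose cache is keyed by script text, and the "scripts" are Python expressions standing in for Gremlin-Groovy.

```python
class FakeScriptEngine:
    """Hypothetical stand-in for a server-side engine with a compile cache."""

    def __init__(self):
        self.compiled = {}      # script text -> compiled code object
        self.compile_count = 0

    def eval(self, script, bindings=None):
        code = self.compiled.get(script)
        if code is None:
            # Cache miss: every distinct script string is compiled and kept.
            code = compile(script, "<script>", "eval")
            self.compiled[script] = code
            self.compile_count += 1
        return eval(code, {"__builtins__": {}}, dict(bindings or {}))


codes = ["AAE", "ALG"]

# No parameterization: the value is baked into the script text,
# so each request produces a new cache entry.
unparameterized = FakeScriptEngine()
for code in codes:
    unparameterized.eval(f"'airport:' + '{code}'")

# Basic parameterization: one script text, values passed as bindings.
parameterized = FakeScriptEngine()
for code in codes:
    parameterized.eval("'airport:' + code", {"code": code})

print(unparameterized.compile_count, parameterized.compile_count)
```

With string-built scripts the cache grows with the number of distinct values loaded, which is the memory pressure the slide warns about; with bindings it stays at one entry regardless of data volume.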
  14. ADVANCED PARAMETERIZATION
      • Leverage Groovy script evaluation to handle more complex scripts
      Gremlin-Groovy script
      Parameters JSON
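One way to read "advanced parameterization" is to send a whole batch of rows as a single parameter and let a fixed Groovy script loop over it. The Python sketch below only builds such payloads; it does not talk to a server, and the embedded Gremlin-Groovy script has not been executed here. The binding name `batch`, the helper names, and the batch size are assumptions made for this example. The `gremlin` and `bindings` keys mirror the arguments of a Gremlin Server eval request, with the rest of the message envelope omitted.

```python
import json

# Fixed script text, identical on every request so the compiled form can be
# reused. `batch` is a hypothetical binding name chosen for this sketch.
SCRIPT = """
batch.each { row ->
  g.V().has('airport', 'code', row.code).fold().
    coalesce(__.unfold(),
             __.addV('airport').property('code', row.code).
                property('desc', row.desc)).
    iterate()
}
batch.size()
""".strip()


def chunked(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_payloads(rows, batch_size):
    """Pair the unchanged script with one batch of rows per request."""
    return [{"gremlin": SCRIPT, "bindings": {"batch": chunk}}
            for chunk in chunked(rows, batch_size)]


# Sample rows use the airport values shown on the air routes slide.
rows = [
    {"code": "AAE", "desc": "Annaba"},
    {"code": "ALG", "desc": "Algiers"},
]
payloads = build_payloads(rows, batch_size=1)
print(json.dumps(payloads[0]["bindings"]))
```

The script ends with `batch.size()` so the response carries a single number rather than serialized vertices. A real loader would pick a much larger batch size and hand each payload to a client driver.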
  15. STRUCTURED RETURN VALUES
      • Serializing all vertex properties and values can be expensive
      • Judiciously decide what to include in the response
      • Leverage Groovy scripting in combination with Gremlin traversals for maximum efficiency
      Image credit: Apache TinkerPop, ALv2
  16. 10^6 ONE MILLION
      Cessna 172 Skyhawk
      Image credit: Adrian Pingstone on Wikimedia Commons, Public Domain
  17. JANUSGRAPH
      • Open source project with open governance
      • Community driven development
      • Full implementation of Apache TinkerPop
      • Apache license
      • Broad adoption
      Image credits: The Linux Foundation® and JanusGraph (CC-BY-4.0)
  18. JANUSGRAPH STORAGE BACKENDS
      • In-Memory
      • Apache Cassandra, ScyllaDB
      • Apache HBase, Google Cloud Bigtable
      • Oracle Berkeley DB Java Edition
      • Amazon DynamoDB
      Image credit: Apache TinkerPop, ALv2
  19. JANUSGRAPH SCHEMA AND INDEXING
      • Graph schema
        • Vertex labels
        • Edge labels: multiplicity
        • Vertex properties: data types, cardinality
      • Indexing
        • Composite index: exact matches
        • Mixed index: full-text search, numerical range, geospatial
        • Vertex-centric index: local per vertex, a solution for supernodes
      Image credit: JanusGraph, CC-BY-4.0
  20. JANUSGRAPH QUICK-START DISTRIBUTION
      • Local server mode
        • Client, Storage, and Gremlin Server on a single machine
      • Great for testing out JanusGraph, but not recommended for production use
  21. JANUSGRAPH DEPLOYMENT OPTIONS
      • Remote server mode
        • Client on first machine
        • Storage on second machine
      • Remote server mode with Gremlin Server
        • Client on first machine
        • Gremlin Server on second machine
        • Storage on third machine
      Image credit: JanusGraph, CC-BY-4.0
  22. 10^7 TEN MILLION
      Bombardier CRJ700
      Image credit: Aero Icarus on Wikimedia Commons, CC BY-SA 2.0
  23. BATCHGRAPH FOR BOUTIQUE GRAPHS
      • Wrapper for a graph instance
      • Handle intermediate commits
      • Maintain vertex cache
      • For loading data only
      • Not in Apache TinkerPop 3 or JanusGraph
      • Moved away from graph wrapper approach
      Image credit: Apache TinkerPop, ALv2
  24. REPLACING BATCHGRAPH
      • Intermediate commits
        • Count the mutations and commit periodically
      • Vertex cache
        • Enable fast lookup of vertices to connect with edges
        • Composite index
        • LRU cache https://github.com/ben-manes/caffeine
        • Pre-sort the data to maximize cache hits
      Image credit: Apache TinkerPop, ALv2
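The two BatchGraph replacements listed on this slide, periodic commits and a vertex cache, can be sketched together in Python. This is a minimal sketch under stated assumptions: `FakeGraph` is a hypothetical stand-in for a transactional graph whose `index` dictionary plays the role of a composite index, and a small `OrderedDict`-based LRU cache is swapped in for Caffeine, which is a Java library. The keys, cache size, and commit interval are arbitrary illustration values.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used


class FakeGraph:
    """Hypothetical stand-in for a transactional graph."""

    def __init__(self):
        self.index = {}          # key -> vertex id (composite index role)
        self.edges = []
        self.commits = 0
        self.index_lookups = 0

    def get_or_create(self, key):
        self.index_lookups += 1
        return self.index.setdefault(key, len(self.index))

    def add_edge(self, out_v, in_v):
        self.edges.append((out_v, in_v))

    def commit(self):
        self.commits += 1


def load_edges(graph, pairs, cache_size=2, commit_every=2):
    cache = LRUCache(cache_size)
    mutations = 0

    def vertex(key):
        found = cache.get(key)
        if found is None:                    # fall back to the index lookup
            found = graph.get_or_create(key)
            cache.put(key, found)
        return found

    for src, dst in sorted(pairs):           # pre-sort to group repeated sources
        graph.add_edge(vertex(src), vertex(dst))
        mutations += 1
        if mutations % commit_every == 0:    # intermediate commit
            graph.commit()
    if mutations % commit_every != 0:
        graph.commit()                       # flush the final partial batch
    return cache


graph = FakeGraph()
cache = load_edges(graph, [("b", "a"), ("a", "b"), ("a", "c")])
print(graph.commits, cache.hits, cache.misses)
```

Sorting by source key keeps edges that share a source next to each other, so the source vertex tends to still be in the cache when it is needed again; with real data the cache would be sized far larger than two entries.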
  25. storage.batch-loading
      • Disables automatic schema
      • Disables transaction logging
      • Disables transactions on storage backend
      • Bigger dirty transaction cache size
      • Disables external vertex existence checks
      • Disables consistency checks (verify uniqueness, acquire locks)
      Image credit: Apache TinkerPop, ALv2
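For orientation, a bulk-load configuration might be assembled along the following lines. This is a sketch, not a recommended configuration: only `storage.batch-loading` is named on the slide. `schema.default` and `ids.block-size` are other JanusGraph options assumed here to be relevant to loading, and the block size value is purely illustrative. The `to_properties` helper is hypothetical and just renders the dictionary in Java properties syntax.

```python
# Hypothetical bulk-load settings; see the caveats in the text above.
bulk_load_settings = {
    # From the slide: turns on the batch-loading behavior listed there.
    "storage.batch-loading": "true",
    # Assumption: reject undefined labels and keys instead of creating them
    # automatically, so the schema must be defined before the load starts.
    "schema.default": "none",
    # Assumption: reserve ids in larger blocks; the value is illustrative only.
    "ids.block-size": "1000000",
}


def to_properties(settings):
    """Render settings as key=value lines (Java properties syntax)."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


print(to_properties(bulk_load_settings))
```

Because batch loading turns off uniqueness checks and locking, the loader itself has to guarantee that the input contains no conflicting or duplicate elements.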
  26. MULTI-MODEL APPROACHES
      • Only store the data you need for graph queries in the graph
      • Rehydrate non-graph properties from another store
      • Direct index queries
      Image credit: Apache TinkerPop, ALv2
  27. 10^8 ONE HUNDRED MILLION
      Boeing 737
      Image credit: JTOcchialini on Wikimedia Commons, CC BY-SA 2.0
  28. FAUNUS / TITAN-HADOOP
      • Faunus was the distributed graph analytics engine from Aurelius
      • Used Hadoop to do breadth-first traversals using MapReduce
      • OLAP abstraction was pulled into Apache TinkerPop 3
      Image credit: Apache TinkerPop, ALv2
  29. HADOOPGRAPH I/O FORMATS
      • TinkerPop formats pull from files
        • GraphSONInputFormat
        • GryoInputFormat
        • ScriptInputFormat
      • JanusGraph formats pull from storage
        • Cassandra3InputFormat
        • HBaseInputFormat
      Image credit: JanusGraph, CC-BY-4.0
  30. SPARKGRAPHCOMPUTER AND BULKLOADERVERTEXPROGRAM
      • Flexible Spark deployment options
        • Spark local with multiple threads
        • Spark master with multiple workers
      • Configure BLVP with ScriptInputFormat
        • Script and data shared across workers via HDFS
      • Assorted tips
        • Pre-define schema before loading
        • Define an index on “bulkLoader.vertex.id”
        • gremlin.spark.persistStorageLevel=DISK_ONLY
      Image credit: Apache TinkerPop, ALv2
  31. 10^9 ONE BILLION
      Airbus A380
      Image credit: Maarten Visser on Wikipedia, CC BY-SA 2.0
  32. FULLY-DISTRIBUTED CLUSTER COMPUTING
      • Same loading mechanics as pseudo-distributed
      • Consider a Hadoop distribution, like Apache Ambari or Hortonworks Data Platform
      • Be aware of differences between distributions, especially software versions
      Image credit: Apache TinkerPop, ALv2
  33. DON’T WHEELIE THE DUCATI
      Ducati Wheelie
      Image credit: David Hurt on Flickr, CC BY 2.0
  34. THANK YOU! @pluradj
      RESOURCES
      • Apache TinkerPop
        • @apachetinkerpop
        • https://tinkerpop.apache.org
      • JanusGraph
        • @janusgraph
        • https://janusgraph.org
      • Powers of Ten
        • Stephen Mallette @spmallette
        • https://www.datastax.com/dev/blog/powers-of-ten-part-i
        • https://www.datastax.com/dev/blog/powers-of-ten-part-ii
      • Practical Gremlin
        • Kelvin Lawrence @gfxman
        • https://github.com/krlawrence/graph
      • JanusGraph Code Patterns
        • IBM Code @ibmcode
        • https://github.com/IBM/janusgraph-utils
      • HadoopMarcʼs Blog
        • http://yaaics.blogspot.com
      • JanusGraph Nuts and Bolts
        • Ted Wilmes @trwilmes
        • https://www.experoinc.com/post/janusgraph-nuts-and-bolts-part-1-write-performance