Slides from a recent webinar by Objectivity showing how the ThingSpan platform applies graph analytics to uncover patterns and insights within large, complex data sets, supporting efficient decision-making.
Transaction events are read from a Kafka topic, but they could just as easily have been read from other streaming technologies such as Spark Streaming, Flume, or StreamSets. Each event is formed into vertices and edges. The edges are further decomposed into triples to reduce lock contention and allow edges to be processed in parallel. This lowers the latency of each operation and increases throughput.
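The decomposition step can be sketched in plain Python. The field names (`src`, `dst`, `event_id`, `type`) and the three-triple shape are illustrative assumptions, not ThingSpan's actual event schema:

```python
# Sketch: turn one transaction event into vertex upserts plus an edge
# decomposed into triples. One logical edge becomes three independent
# triples, so each can be written under its own lock instead of
# locking the whole edge. Field names are assumptions.

def event_to_graph_ops(event):
    """Return (vertices, triples) for a single transaction event."""
    src, dst = event["src"], event["dst"]
    vertices = [{"id": src}, {"id": dst}]
    edge_id = f'{src}->{dst}:{event["event_id"]}'
    triples = [
        (src, "out_edge", edge_id),      # source vertex -> edge
        (edge_id, "label", event["type"]),  # edge attributes
        (edge_id, "in_vertex", dst),     # edge -> target vertex
    ]
    return vertices, triples

vertices, triples = event_to_graph_ops(
    {"src": "acct1", "dst": "acct2", "event_id": 42, "type": "payment"}
)
print(len(vertices), len(triples))  # 2 3
```

Because the three triples share no lock, writers touching different parts of the same edge no longer contend with each other.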
The upsert of vertices and the insert of edges (decomposed to triples) are funneled into Samza tasks running on the cluster and managed by YARN. These upserts are consistent and idempotent.
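Idempotence means that replaying the same event, for example after a Samza task restart, leaves the store unchanged. A minimal in-memory sketch of the property, with a dict standing in for the ThingSpan store:

```python
# Sketch of an idempotent vertex upsert: applying the same operation
# twice yields the same state as applying it once. The dict stands in
# for the ThingSpan store; this is illustrative only.

def upsert_vertex(store, vertex_id, attrs):
    existing = store.get(vertex_id, {})
    existing.update(attrs)  # create if absent, merge if present
    store[vertex_id] = existing
    return store

store = {}
upsert_vertex(store, "acct1", {"balance": 100})
state_after_once = dict(store["acct1"])
upsert_vertex(store, "acct1", {"balance": 100})  # replayed event
assert store["acct1"] == state_after_once  # replay changed nothing
```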
ThingSpan runs queries in parallel. Each query is split into partitions, and each partition is sent to a machine where it is executed as a YARN job. Each partition returns multiple paths, and these are collated into a single result. In Spark terms, this is equivalent to transforming each input partition with mapPartitions into an RDD or DataFrame.
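The partition-then-collate pattern can be sketched without a cluster. Here a thread pool stands in for one YARN task per machine, and each partition holds a slice of the edges; the edge data is made up for illustration:

```python
# Sketch: run a query against each partition independently, then
# collate the per-partition path lists into one result. Thread pool
# stands in for per-machine YARN tasks.
from concurrent.futures import ThreadPoolExecutor

def run_partition(edges, start):
    """Return all one-hop paths from `start` within this partition."""
    return [(start, dst) for (src, dst) in edges if src == start]

partitions = [
    [("a", "b"), ("c", "d")],   # edges held by machine 1
    [("a", "e"), ("b", "f")],   # edges held by machine 2
]
with ThreadPoolExecutor() as pool:
    parts = pool.map(run_partition, partitions, ["a"] * len(partitions))

result = [path for part in parts for path in part]  # collate
print(sorted(result))  # [('a', 'b'), ('a', 'e')]
```

In Spark the same shape would be `rdd.mapPartitions(...)` followed by a collect, with the collation done by the driver.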
Using Spark DataFrames allows results from ThingSpan to be processed further. Spark SQL statements can join, aggregate, and select across multiple tables, and DataFrame operations are processed in parallel across the cluster.
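As an illustration of that post-processing, the sketch below uses plain Python standing in for Spark SQL; with PySpark the same shape would be a join followed by a grouped aggregation. The column names and data are invented:

```python
# Sketch: join query results (paths ending at an account) with a
# reference table, then aggregate -- the same shape as a SQL
# JOIN + GROUP BY over two DataFrames. Names are illustrative.
paths = [
    {"account": "acct1", "amount": 100},
    {"account": "acct2", "amount": 250},
    {"account": "acct1", "amount": 50},
]
accounts = {"acct1": "EMEA", "acct2": "APAC"}  # reference "table"

totals = {}
for row in paths:
    region = accounts[row["account"]]                       # the join
    totals[region] = totals.get(region, 0) + row["amount"]  # the aggregate

print(totals)  # {'EMEA': 150, 'APAC': 250}
```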
The parallelism of queries allows near linear scaling of query throughput by “scaling out” the cluster.
Graph size for a billion FIX transaction events
2. For a client basket, show all tasks that processed it and their timing.
Start point: Basket: m_Id
End point: Service
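The basket-to-service question above is a path query from a start vertex to an end vertex. A minimal breadth-first-search sketch over an invented processing graph (the vertex names are assumptions, and real data would come from ThingSpan):

```python
# Sketch: find the chain of tasks between a Basket vertex and a
# Service vertex with breadth-first search. Graph is illustrative.
from collections import deque

graph = {
    "Basket:42": ["TaskA"],
    "TaskA": ["TaskB"],
    "TaskB": ["Service:pricing"],
}

def find_path(graph, start, end):
    """Return the first path from start to end, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(find_path(graph, "Basket:42", "Service:pricing"))
# ['Basket:42', 'TaskA', 'TaskB', 'Service:pricing']
```

The intermediate vertices on the returned path are exactly the tasks that processed the basket; attaching timestamps to those vertices would answer the timing half of the question.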
DevOps
- Discovery of hotspots
- Resource usage
- Metrics