We explain how we use Grakn as part of a wider solution to deliver next-generation Data Operations (Data Ops) tooling, enabling sophisticated "Run Graph Analytics".
The Run Graph is a component that passively tracks and traces our data assets as they move across the organisation. It is used to quickly reverse engineer our global flows of data, to better plan change, and to surface hidden dependencies. When operational failures do arise, we demonstrate how Grakn lets us quickly assess the inferred downstream impacts, and prioritise and communicate the impacts of outages to stakeholders.
5. “Many hands” legacy:
Loosely connected yet critical data pipelines are remarkably complex in enterprises when viewed as a whole system. They are hard to manage, operate and improve as a group. “Entanglement” is a major risk.
Loosely connected, use-case focussed, data pipelines
6. How Complex?
Typically across an Enterprise:
100s of Production OLTP databases
Multiple Orchestration/scheduling tools
10s of ETL tools / instances
Many Kafka/Confluent installations
Multiple Logging/monitoring frameworks
10-100 OLAP reporting solutions
1000s of Reports
1000s of Web pages and/or microservices
Several Clouds and Data Centres
Several Data Warehouses
10+ Data Science sandboxes
Multiple Data Lakes
7. Challenges:
Data management is difficult:
● Managing change effectively
● Managing quality of service
● Delivering service oversight
● Attributing clear issue ownership
● Resolving complex failures
● Delivering trust: “Ground Truth”
8. A side issue, also surfacing
Top-down enterprise data architecture (methods/governance) is deeply unpopular, especially with engineers. Why? It is ineffective.
10. How complex? (a)
Example: here’s a “logical” summary of data flows in one enterprise, between production systems. It shows hundreds of logical data pipelines, made up of:
- Batch ETL
- Messaging
- Streaming
Complex Pipeline Dependencies
11. How complex? (b)
Complexity also exists in the content, not just in the pipes. Here’s a conceptual model, a “canonical data model,” for most of a global firm: 410 core entities, 14 subjects.
Complex Content Dependencies
12. How complex? (c)
Even our OLAP reporting architectures are now pipeline oriented, and are “inside out” rather than the older “star schemas”:
- Fact pipelines and sinks
- Core dimension pipelines and sinks
- Peripheral dimensions: “side inputs,” lookups, dictionaries, tags
Complex Information Dependencies
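The “side input” pattern above can be sketched as a small enrichment step: a fact pipeline joined against a peripheral dimension held as an in-memory lookup. The record and dictionary names below are illustrative assumptions, not taken from the deck:

```python
# Hypothetical sketch: enriching a fact pipeline with a peripheral
# dimension supplied as a "side input" (an in-memory lookup dictionary).
# All names (currency_dim, the fact fields) are illustrative.

currency_dim = {"GBP": "Pound Sterling", "EUR": "Euro"}  # side input

facts = [
    {"trade_id": 1, "ccy": "GBP", "amount": 100.0},
    {"trade_id": 2, "ccy": "EUR", "amount": 250.0},
]

def enrich(fact, dim):
    """Join one fact against the dimension; tag misses for data quality."""
    out = dict(fact)
    out["ccy_name"] = dim.get(fact["ccy"], "UNKNOWN")
    return out

enriched = [enrich(f, currency_dim) for f in facts]
```

Tagging lookup misses as `UNKNOWN` rather than dropping the record is what lets a Run Graph later surface them as data-quality signals.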
13. Notice the shape of this metadata?
Notice the amount of existing engineering that must sit behind these views?
14. The weight of legacy
There is a huge volume of legacy data pipelines, and migrating them requires retesting everything; the approach is heterogeneous.
“Can you stabilise my operation, while moving net-new functionality to the cloud?”
- Many legacy ETL systems
- Many orchestration/scheduling instances
- Many datacentres, not just cloud
- Many monolithic applications, still
- Many legacy flows undocumented and misunderstood
- Many hidden pipelines, in DB stored procedures
- New functionality in the cloud
Legacy pipelines + new pipelines: the combined service.
25. RunGraph
● We can register ANY pipeline on our estate, run using any orchestration tool or ETL scheduler
● We can retrofit legacy pipelines into the Run Graph, even legacy ETL tools
● We can build up complex enterprise architecture views, and establish ground truth
● We can determine “normal” pipeline behaviours, identify strange behaviours, and raise flags
● We can use Grakn ML facilities to start doing predictive analytics on operations
Data Ops Enabled Pipelines
We build pipeline intelligence via Grakn: registration (tool agnostic) + instrumentation (tool agnostic).
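Tool-agnostic registration can be sketched as a minimal call that any orchestrator or legacy scheduler makes when a pipeline runs. The registry shape and the `register_pipeline` function below are illustrative assumptions, not Grakn's API; a real deployment would persist the record into the Grakn-backed Run Graph:

```python
# Minimal sketch of tool-agnostic pipeline registration.
# RUN_GRAPH is a stand-in for the Grakn-backed registry; the record
# fields are illustrative assumptions.

RUN_GRAPH = {}  # keyed by pipeline id

def register_pipeline(pipeline_id, tool, sources, sinks):
    """Record a pipeline run by any orchestrator or ETL scheduler."""
    RUN_GRAPH[pipeline_id] = {
        "tool": tool,          # e.g. a modern orchestrator or a legacy ETL tool
        "sources": sources,    # upstream data assets
        "sinks": sinks,        # downstream data assets
    }
    return RUN_GRAPH[pipeline_id]

# A legacy ETL job and a new cloud job are registered the same way:
register_pipeline("fx_rates_load", "legacy-etl", ["fx_feed"], ["fx_table"])
register_pipeline("fx_report", "cloud-orchestrator", ["fx_table"], ["fx_dashboard"])
```

Because both calls emit the same record shape, legacy and new pipelines land in one graph, which is what makes the combined-service views possible.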
26. RunGraph
● Studies all pipeline instrumentation
● Tracks data flows / lineage
● Creates data quality expectations
● Does impact analysis of failures (usual ancestors) and prioritisation
● Identifies key and critical data assets (i.e. core dimensions)
● Tracks data lineage vs data quality
● Maps complex consumers to sources, bringing commercial line of sight
● Does change impact analysis
27. Hybrid Data Ops Console
Once we can instrument across legacy and new cloud environments, we can construct a combined Ops Console.
Legacy pipelines + new pipelines: the combined service, with consumer service dashboards and an operations console.
31. RunGraph: Registration + Job
Core registration entities: Policy, Source, Feeds, Jobs, Data.
We can summarise the core registration needs here. Registering these makes them addressable, actionable, and enriches the pipeline analytics.
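One way to sketch those five registration entities is as plain record types with the relationships the slide implies (a Feed originates from a Source, a Job consumes Feeds and produces Data, a Policy governs registered things). The field names are illustrative assumptions, not the deck's schema:

```python
# Sketch of the core registration entities named on the slide:
# Policy, Source, Feed, Job, Data. Field names are illustrative.
from dataclasses import dataclass


@dataclass
class Source:
    name: str           # e.g. a production OLTP database


@dataclass
class Feed:
    name: str
    source: Source      # where the feed originates


@dataclass
class DataAsset:
    name: str


@dataclass
class Job:
    name: str
    consumes: list      # Feeds read by this job
    produces: list      # DataAssets written by this job


@dataclass
class Policy:
    name: str
    applies_to: list    # registered things this policy governs


# Registering makes each item addressable and actionable:
src = Source("trades_db")
feed = Feed("trades_feed", src)
asset = DataAsset("trades_curated")
job = Job("trades_load", consumes=[feed], produces=[asset])
policy = Policy("retest_on_change", applies_to=[job])
```

In a Grakn model these would naturally become entity and relation types rather than flat records; the point is only that each registered thing gets an addressable identity the analytics can hang off.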
32. RunGraph: Analytics
Even simple use cases drive out value quickly.
On failure or unplanned change:
- Find descendants: remediation based on impact and contagion
- Find ancestors: apply pressure / corrections upstream
On planned change:
- Run analytic queries to show typical connections over 6 months, to reverse engineer your architectures
- Identify key risks in planned change
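The "find descendants / find ancestors" queries above amount to graph traversal over pipeline dependencies. A minimal sketch using a plain adjacency dict (edges point downstream; the toy graph is illustrative, not real lineage data):

```python
# Sketch of descendant/ancestor traversal over a pipeline dependency
# graph. Edges point downstream; the example graph is a toy.
from collections import deque

downstream = {
    "source_db": ["etl_job"],
    "etl_job": ["warehouse"],
    "warehouse": ["report_a", "report_b"],
}

def descendants(graph, node):
    """Breadth-first walk: everything impacted if `node` fails."""
    seen, queue = set(), deque(graph.get(node, []))
    while queue:
        n = queue.popleft()
        if n not in seen:
            seen.add(n)
            queue.extend(graph.get(n, []))
    return seen

def ancestors(graph, node):
    """Invert the edges, then reuse the same walk to go upstream."""
    inverted = {}
    for src, dsts in graph.items():
        for dst in dsts:
            inverted.setdefault(dst, []).append(src)
    return descendants(inverted, node)
```

So `descendants(downstream, "etl_job")` yields the blast radius of an ETL failure, and `ancestors(downstream, "warehouse")` yields the upstream systems to press for corrections. In Grakn the same queries fall out of the relation model (with rule inference handling transitivity) rather than hand-rolled traversal.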
34. Try it at home
There are some great open-source projects to check out.
35. Get in touch
Dr. Daniel A. Smith
Emerging Technology
dan.smith@6point6.co.uk
About 6point6
Integrating digital technology into your business can result in
fundamental changes to how you operate and deliver value to your
customers. To go digital is to reinvent yourself to the core, opening
yourself and your clients to a world of possibilities.
6point6 is a technology consultancy. We bring a wealth of hands-on
experience to help financial service providers, media houses and
government achieve more with digital. Using cutting edge technology
and agile delivery methods, we help you reinvent, transform and
secure a brighter digital future.
Visit us at www.6point6.co.uk
Twitter: @6point6ltd
LinkedIn: linkedin.com/company/6point6