At Gloo.us, we face the challenge of providing platform data to heterogeneous applications in a way that eliminates access contention, avoids high-latency ETLs, and ensures consistency across many teams. We're solving this problem by adopting Data Mesh principles and leveraging Kafka, Kafka Connect, and Kafka Streams to build an event-driven architecture that connects applications to the data they need. A domain-driven design keeps the boundaries between specialized process domains and singularly focused data domains clear, distinct, and disciplined. Applying the principles of a Data Mesh, process domains assume the responsibility of transforming, enriching, or aggregating data rather than pushing these changes onto the source of truth: the data domains. Architecturally, we've broken centralized big data lakes into smaller data stores that can be consumed into storage managed by process domains.
This session covers how we're applying Kafka tools to enable our data mesh architecture: how we interpret and apply the data mesh paradigm, the role of Kafka as the backbone of the mesh's connectivity, the role of Kafka Connect in producing and consuming data events, and the use of KSQL to perform light transformations for consumers.
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo
1. How the Data Mesh is Driving Our Platform
Trey Hicks
Director of Engineering
2. • Mentors
• Faith
• Recovery Centers
• Resources
Applications That Help People
Building Technologies To Connect People
3. • Diverse application types and purposes
• Serving several verticals
• Varying resource needs
• Apps are built internally by Gloo or with partners
• Common means of connectivity to data and services
Supporting The Mission
Common Platform Must Consider
4. Technical Landscape
• Microservices
• Datastores per service or application domain
• Domain-based services
• Event-driven
• Domain-driven
• Kubernetes
• AWS
• Confluent Cloud
◦ Kafka
◦ KsqlDB
• Kafka Connect cluster
• Docker
Our Approach Consists of
Architectural Infrastructure
5. • Heterogeneous apps
• Resource contention
• Gravitational pull to put application use-cases lower in the stack
• Tight coupling due to customization of shared services
• Blocking development due to cross-team dependencies
• Limits to our ability to scale the organization
Challenges
Challenges in Building the Platform
6. • Our value prop isn’t the applications, it’s the data
• Application-specific use-cases low in the stack cause problems
Platform Facts
7. Enter Data Mesh
Principles
• Domain-driven architecture
• Data as a product
• Self-serve architecture
• Governance

Zhamak Dehghani
https://martinfowler.com/articles/data-monolith-to-mesh.html

Perhaps the ideas have existed before:
• Data emphasis
• Domain-Driven Design
• Service-Oriented Architectures

Provides terminology to shift the conversation UPWARDS to form a BROAD data strategy, as opposed to being a technical concern.

Data Mesh Paradigm
8. Solving the Challenges

Principle – Appeal – Solves

• Domain-Driven Architecture
  Appeal: Microservice architecture
  Solves: Many apps; Resource contention; App requirements in core services
• Data As a Product
  Appeal: Primary value; Apps are transient
  Solves: Blocking development; Tight coupling
• Self-Serve Infrastructure
  Appeal: Easy connectivity to data and domains
  Solves: Blocking development; App requirements in stack
• Governance
  Appeal: Secure data ports; Community trust; Privacy
  Solves: Tight coupling; Blocking development
9. Adopting The Principles
• Establish common terminology and language
• Promote a data first philosophy
• Embrace democratized ownership and the associated responsibilities
• Accept eventual consistency
• In our case, embracing event streams
Culture Shift
10. Data As a Product
How We Define Data Products
• Our data is our unique value
• Foundation for apps and services that drive success
• Requires governance
◦ Security
◦ Availability
◦ Accessibility
◦ Change controls
• Free of application use-cases
• Integrity
11. • Person
• Organization
• Catalysts
• Relationships
Data Product Examples
Core Data Objects
Secondary Objects
• Cohorts/Collections
• Growth Intelligence
• Assessments
13. Sharing the Data
• Distributed Data Products
• Domain boundaries
• Process/application domains apply their use-cases
• Domains may use subsets or combinations
• Derived Data Products
Conceptual Architecture
17. Connecting to the Data Mesh
Sharing the Data Product
• Governed data made available
• Options for access
◦ Download with ETL or ELT
◦ Kafka
• Both have complications
◦ Manual processes
◦ Lack of a consuming process
◦ Skill sets not aligned
19. Enter Kafka Ecosystem
Data Mesh Platform Using Kafka
• Kafka is perfect for one-to-many
• Event streams/batches provide a means of keeping the consuming domains in sync with the data product
• Kafka Connect is perfect for turning datastores into event streams
• Kafka Connect is perfect for sinking the streams into a datastore
• KsqlDB is perfect for selecting subsets of data or combining streams to shape the data
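The KsqlDB point above can be sketched concretely. A minimal, hypothetical example (stream, topic, and field names are illustrative, not Gloo's actual data products) of deriving a consumer-shaped subset from a data product's event stream:

```sql
-- Register the data product's topic as a stream (names are hypothetical).
CREATE STREAM person_events (
  id VARCHAR,
  first_name VARCHAR,
  last_name VARCHAR,
  org_id VARCHAR
) WITH (KAFKA_TOPIC = 'person-events', VALUE_FORMAT = 'JSON');

-- Derive a narrowed stream for one consuming domain: a subset of fields,
-- filtered to the rows that domain cares about.
CREATE STREAM org_person_names AS
  SELECT id, first_name, last_name
  FROM person_events
  WHERE org_id = 'org-123'
  EMIT CHANGES;
```

The consuming domain then sinks `org_person_names` into its own datastore rather than asking the data domain to add its use-case at the source.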
20. Kafka Connect
Building the Mesh
• Connect the Data Product
◦ S3 Source Connector
• Connect the Consumers
◦ JDBC Sinks
◦ Elasticsearch Sink
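As a concrete sketch of the sink side, a JDBC sink definition might look like the following. This is a hypothetical example (connector name, topic, and connection details are invented; exact settings depend on the Confluent JDBC connector version in use):

```json
{
  "name": "person-events-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "person-events",
    "connection.url": "jdbc:postgresql://db.example.internal:5432/persons",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id",
    "auto.create": "true"
  }
}
```

Each consuming domain owns a config like this, which keeps data movement declarative rather than hand-built per team.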
26. • Bloated infrastructure
◦ Expensive footprint
◦ K8s is great, but maybe too easy to spin up new instances
• Experimentation leaves behind dead instances and other bones
• Complicated data model and APIs
Revisiting Technical Landscape
New Concerns
27. • Simplify the overall footprint
◦ Fewer and simpler services
◦ Smaller clusters
◦ Fewer instances
• Improve database schema
• Rethink our APIs
Going Forward In Reverse
Rethinking Parts of the Platform
28. Event Sourcing
• Major changes without interruption
◦ Tables restructured
◦ Elements combined or removed
• Existing streams via Connectors
• Need additional JDBC sinks
Changing the Schema
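A minimal sketch (not Gloo's actual code; event shapes and field names are hypothetical) of why event sourcing allows the restructuring above without interruption: the same event stream is replayed into a new projection, so an old table and a restructured one can coexist while consumers migrate.

```python
from dataclasses import dataclass, field

@dataclass
class PersonProjection:
    """A read model built by replaying person events into a new table shape."""
    rows: dict = field(default_factory=dict)

    def apply(self, event):
        kind = event["type"]
        if kind == "person_created":
            # The restructured schema combines first/last name into one field.
            self.rows[event["id"]] = {
                "name": f"{event['first_name']} {event['last_name']}",
            }
        elif kind == "person_renamed":
            self.rows[event["id"]]["name"] = event["name"]

def replay(events):
    """Rebuild the projection from scratch by applying every event in order."""
    projection = PersonProjection()
    for event in events:
        projection.apply(event)
    return projection

events = [
    {"type": "person_created", "id": "p1",
     "first_name": "Ada", "last_name": "Lovelace"},
    {"type": "person_renamed", "id": "p1", "name": "Ada King"},
]
print(replay(events).rows["p1"]["name"])  # replay yields the latest state
```

In the mesh, the "replay" is an additional JDBC sink consuming the existing stream into the new table, which is why the schema change needs new sinks but no downtime.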
30. More On Infrastructure
• Structured like other engineering “pods”
◦ Engineers
◦ Product
• Charter is to build the self-serve connectivity
• Responsible for Data Mesh infrastructure
• Create reference configs for all Kafka Connectors
• Make it super simple to define, add, and govern new data products
• One team responsible for connectivity and data movement
Creation of Data Mesh Engineering
31. Discovery
• Provide a catalog of all data products
◦ Documentation or manual catalogs are DOA
◦ Must be automatic
• The catalog tracks:
◦ All data products
◦ Communication channels
◦ Consuming domains
◦ Schemas
◦ Data ports
Keeping Track of All the Things
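One possible shape for an automatically generated catalog entry, purely illustrative (every field name here is an assumption, not an existing Gloo schema):

```json
{
  "dataProduct": "person",
  "owningDomain": "person-data-domain",
  "dataPorts": ["kafka:person-events", "s3://data-products/person/"],
  "schemaSubject": "person-events-value",
  "consumingDomains": ["search", "growth-intelligence"],
  "channel": "#data-product-person"
}
```

Because connectors, topics, and schemas are already declared in configuration, an entry like this could be generated rather than maintained by hand.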
32. Deployment
• Kafka Configs project
◦ A single project holding all Connector, KsqlDB, and topic configurations
◦ Updates trigger deployment
• Uses REST proxies to deploy updates
• Open source?
• Kafka JMX Exporter collects metrics used in Grafana dashboards
Continuous Deployment
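The deployment trigger can be sketched as follows, assuming the standard Kafka Connect REST API (the host, connector name, and file path are hypothetical; `connector-config.json` holds just the connector's config map):

```shell
# CI step: push an updated connector config to the Connect cluster.
# PUT /connectors/{name}/config creates the connector or updates it in place.
curl -X PUT \
  -H "Content-Type: application/json" \
  --data @connector-config.json \
  http://connect.example.internal:8083/connectors/person-events-jdbc-sink/config
```

Using idempotent PUTs means the configs project stays the source of truth: re-running the pipeline converges the cluster to whatever is committed.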
33. Closure
• Data-first organization
• The data mesh paradigm helps us solve problems
• The Kafka ecosystem is the core of the data mesh driving the platform
• Serving our application domains by using Kafka Connect and KsqlDB
• Future
◦ Improve self-serve
◦ Discovery app → If you have experienced this problem, let’s chat!
Summary
34. Acknowledgments
• Collin Shaafsma – Leadership
• Ken Griesi – Inspiration, guidance, and discovering the articles
• Alex Lauderbaugh – All things data and ghostwriter
• Scott Symmank – Technical lead
• Hannah Manry – Amazing engineer
• Mitch Ertle – Resident BA expert and principal consumer
• Chicken – Mascot
* We’re Hiring