Data Modelling is an important tool in the toolbox of a developer. By building and communicating a shared understanding of the domain they work in, developers make their applications and APIs more usable and maintainable. However, as you scale up your technical teams, how do you keep these benefits whilst avoiding time-consuming meetings every time something new comes along? This talk reminds us of key data modelling techniques and how our use of Kafka changes and informs them. It then examines how these patterns change as more teams join your organisation, and how Kafka comes into its own in this world.
2. WHO AM I?
• David Simons (@SwamWithTurtles)
• Data Architect at Ovo Energy
• Technical All-Rounder
• Kafka implementation & Cloud
integration at Citi
• Linking of the court and prison
services with the Ministry of Justice
• Organised our wedding seating plan
with Python.
3. A BRIEF (BUT IMPORTANT) ASIDE
Black Lives
Matter
(UK orgs, Worldwide Orgs)
Trans Rights are
Human Rights
(LGBT orgs, Trans specific orgs,
International orgs)
Open Source Projects that could use contributions
• Github Collection of Open Source Projects for Social Good
• Data Kind
• Kaggle (Data Science Volunteering that often has Social Good causes)
• Police Brutality Register
• Data Police Subreddit (Increasing Accessibility of Policing Data)
• Data for Black Lives
4. AGENDA
• What is Data Modelling & Why
should I care?
• Data Modelling with Kafka
• Scaling your Data Model
5. WHAT IS DATA MODELLING AND WHY SHOULD I CARE?
CHAPTER ONE
Recommended Reading:
Data & Reality
— William Kent
8. A COMPLEX MESS OF HUMANS AND JOKES AND EMOTIONS AND FICTION AND ILLOGIC AND FAKE NEWS AND TIME AND ENTROPY AND CONFUSION AND COLOUR AND WONDER AND FATE AND HATRED AND LOVE AND WAR AND NUANCE AND JEALOUSY AND DOGS (WHO ARE GOOD BOYS) AND CATS (WHO ARE NOT) AND COUNTRIES AND BORDERS AND MUSIC AND SOUND AND TIME AND TIDE AND DARKNESS AND BOOKS AND WORDS AND COMPLEXITIES AND ROLLERCOASTERS AND FAVOURITE FLAVOURS OF ICE CREAM AND JOHN LENNON AND GENDER (OR MAYBE NOT.)
THAT NO COMPUTER CAN EVER HOPE TO CAPTURE
9. does our software need to exist in the world?
10. MAYBE NOT THE WORLD…
BUT ANY SOFTWARE OF SUFFICIENT COMPLEXITY EXISTS WITHIN A {business | problem | world | domain | industry}
11. what subset of the world does our software exist in?
12. data model, n: an agreed set of assumptions and features that distill the world (in which our software exists) into something we can hope to capture programmatically
13. BUT I’M NOT A PHILOSOPHER… WHAT DOES THAT MEAN FOR TECHNICAL PEOPLE?
14. TYPES OF MODEL
THE REAL WORLD
• CONCEPTUAL MODEL: Expresses the subset of the domain in terms of concepts and relations, independent of design concerns.
• LOGICAL MODEL: Expresses the concepts in terms of data structures or underlying technologies.
• PHYSICAL MODEL: Explicitly expresses how we have stored our data in systems (column names, DB).
(Towards the conceptual end, the model is more accurate, generic and conceptual; towards the physical end, it is more usable, lower-level and requires technical expertise.)
15. THE QUESTIONS WE ASK…
• What kinds of things do we deal with?
• For each kind of thing, what aspects of it do we care about? What are the constraints on these aspects?
• When are two things the same thing?
• As something evolves, when does it stop being the same thing?
• How do two things relate to each other?
16. DATA MODELLING… ISN’T THIS EASY?
22. WHAT IS THE CORRECT DATA MODEL?
23. SIGNS OF A GOOD DATA MODEL
• It is simple. It models what you need
and nothing else.
• It is built with your technology and
software system in mind
• It does not contradict the actual world
• It is extensible
• Non-technical people understand it.
Domain experts even chip in.
24. DATA MODELLING WITH KAFKA
CHAPTER TWO
Recommended Reading:
Designing Event-Driven
Systems
— Ben Stopford
25. TYPES OF MODEL
THE REAL WORLD
• CONCEPTUAL MODEL: Expresses the subset of the domain in terms of concepts and relations, independent of design concerns.
• LOGICAL MODEL: Expresses the concepts in terms of data structures or underlying technologies.
• PHYSICAL MODEL: Explicitly expresses how we have stored our data in systems (column names, DB).
(Towards the conceptual end, the model is more accurate, generic and conceptual; towards the physical end, it is more usable, lower-level and requires technical expertise.)
26. TYPES OF MODEL: THE LOGICAL MODEL
• Expresses the concepts in terms of data structures or underlying technologies.
• SQL/RDBMS: What are the tables, and for which entities? What are the keys/constraints? How do we normalise everything?
• Neo4j/Graph DBs: Graph modelling. What are our entities? What are their properties? How do they relate (and what are the relations’ properties)? (more details here)
• Mongo/Document Stores: What are our entities? Which ones get top-level documents? What document validations should we enforce?
• Kafka: ???
28. WHAT IS KAFKA?
• Immutable log data store, with a multicast/pub-sub message interface
• It’s technically not these things, but they may be helpful abstractions:
• Message Queue with a DB store behind it
• Real-time Streaming with a catch-up facility
• ESB without the rules and bloatedness that make it bad.
29. WHY DOES THIS SHAPE OUR DATA MODELS?
• Easy Answer: It’s a different data store, and therefore our low-level models will be shaped by its implementation details.
• But Kafka has taken off not despite its implementation details but because of them.
30. A NEW DATA MODELLING PARADIGM: EVENT STREAMING
31. EVENT STREAMING
• Do not store the “state” of an object
as your primary model
• Instead store a sequence of events
that have transpired that will build up
that state.
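As a minimal sketch of that idea (all event names, fields and figures here are invented for illustration, not from the talk), state becomes a fold over the event log:

```python
# A minimal event-sourcing sketch: the log of events is the primary
# record; "state" is derived by folding over the events in order.
# All event names and fields are hypothetical.

events = [
    {"type": "AccountOpened", "account": "a1", "balance": 0},
    {"type": "MoneyDeposited", "account": "a1", "amount": 100},
    {"type": "MoneyWithdrawn", "account": "a1", "amount": 30},
]

def apply_event(state, event):
    """Derive the next state from the previous state and one event."""
    if event["type"] == "AccountOpened":
        state[event["account"]] = event["balance"]
    elif event["type"] == "MoneyDeposited":
        state[event["account"]] += event["amount"]
    elif event["type"] == "MoneyWithdrawn":
        state[event["account"]] -= event["amount"]
    return state

def current_state(events):
    """Build up the state by replaying every event in sequence."""
    state = {}
    for event in events:
        state = apply_event(state, event)
    return state

print(current_state(events))  # {'a1': 70}
```

Nothing ever overwrites the log; the familiar "current balance" is just one view derived from it.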
32. MOTIVATION
• You can construct this state in
different ways for different purposes
• Back-up/Restoration for free!
• You can reconstruct the state of the
system at a given moment in time
• Better support for distributed/highly
concurrent systems
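The "state at a given moment" benefit falls straight out of the same fold: replay only the events up to a cutoff. A sketch, with invented events and timestamps:

```python
# Reconstructing historical state by replaying events up to a cutoff.
# Each event carries a timestamp; state at time T is the fold of all
# events with ts <= T. Names and data are illustrative only.

events = [
    {"ts": 1, "type": "ItemAdded", "item": "book"},
    {"ts": 2, "type": "ItemAdded", "item": "pen"},
    {"ts": 3, "type": "ItemRemoved", "item": "book"},
]

def state_at(events, cutoff):
    """Replay events in time order, stopping after the cutoff."""
    basket = set()
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["ts"] > cutoff:
            break
        if e["type"] == "ItemAdded":
            basket.add(e["item"])
        elif e["type"] == "ItemRemoved":
            basket.discard(e["item"])
    return basket

print(sorted(state_at(events, 2)))  # ['book', 'pen']
print(sorted(state_at(events, 3)))  # ['pen']
```

A state-only store would need explicit history tables to answer the same question; here, history is the primary record.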
33. SOME FAMILIAR EXAMPLES
• Git
• Bank Statements
• Blockchain
• Accounting Ledgers
34. BUT… WHAT DOES THIS LOOK LIKE IN THE REAL WORLD?
35. EXAMPLE: DECRYPTO
https://github.com/SwamWithTurtles/decrypto-be
https://github.com/SwamWithTurtles/decrypto-fe
• A multi-player board game.
• Players can see a subset of words and
must communicate them to their team
mates without being intercepted (by
being too literal).
• Challenge: Make a web-app version
of this game for people to play over
hangouts during lockdown.
39. TYPES OF MODEL
THE REAL WORLD
• CONCEPTUAL MODEL: The list of events and their attributes; the constraints on when they can occur; the impact they have.
• LOGICAL MODEL: The classes/text keys for events; the names and types of their attributes.
• PHYSICAL MODEL: ?
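At the logical level, that might be a set of typed event classes whose class name doubles as the text key. A sketch with hypothetical game events (not the actual Decrypto implementation):

```python
# A sketch of the logical model for an event-sourced game: each event
# is a class with named, typed attributes, and the class name serves
# as its text key. The event names here are hypothetical.
from dataclasses import dataclass, asdict

@dataclass
class PlayerJoined:
    game_id: str
    player_name: str

@dataclass
class GameStarted:
    game_id: str

def serialise(event):
    """Turn an event object into a serialisable dict whose 'type'
    field is the class name - a shape suitable for a log or topic."""
    return {"type": type(event).__name__, **asdict(event)}

print(serialise(PlayerJoined("g1", "David")))
# {'type': 'PlayerJoined', 'game_id': 'g1', 'player_name': 'David'}
```

The physical model (the "?") then becomes a question of how these dicts are encoded and partitioned on the wire.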
40. BUT… WHAT GOOD IS A STREAM OF EVENTS?
41. WHAT GOOD IS A STREAM OF EVENTS?
• (Some) Domain Experts
• Front-end Applications
• Human Reasoning
• Data Science/Analytics Teams
42. CAN I STORE MY DATA ELSEWHERE?
• Yes!
• Pull into whatever high-fidelity data store you want, e.g. Neo4j, DynamoDB, ElasticSearch, Firebase…*
• * KSQLDB is attempting to solve this problem
• You can even use Kafka Connect, but be careful about the coupling of physical data models.
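The common shape here is a consumer that reads the event stream and maintains a query-friendly projection in another store. A sketch, with a plain dict standing in for the downstream database and invented event names:

```python
# Projecting an event stream into a separate, query-friendly store.
# In practice the "store" would be Neo4j, DynamoDB, Elasticsearch,
# etc.; here a dict stands in. Event names are illustrative.

store = {}  # stand-in for a real downstream database

def handle(event):
    """Consumer-side projection: keep only what this store needs,
    in the shape its queries want."""
    if event["type"] == "UserRegistered":
        store[event["user_id"]] = {"name": event["name"], "logins": 0}
    elif event["type"] == "UserLoggedIn":
        store[event["user_id"]]["logins"] += 1

for e in [
    {"type": "UserRegistered", "user_id": "u1", "name": "Ada"},
    {"type": "UserLoggedIn", "user_id": "u1"},
    {"type": "UserLoggedIn", "user_id": "u1"},
]:
    handle(e)

print(store)  # {'u1': {'name': 'Ada', 'logins': 2}}
```

Because the stream remains the source of truth, the projection is disposable: drop it and rebuild it by replaying.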
43. GIVE ME SOME… PRACTICAL TIPS
44. SHOULD I DO IT?
• There is an overhead involved (development, performance, resiliency). It is not right for every team. It suits:
• Highly stateful services
• Highly concurrent services
• High-throughput inputs
• Futureproofing
• Many different consumers of data. [SPOILERS!]
45. SHOULD I DO IT: PART II
• Event sourcing wants you to keep
events in an immutable log forever.
• GDPR frowns upon keeping personal
data forever.
46. HOW DO I DEFINE EVENTS?
• Events should be driven by domain
understanding from domain experts.
They should not be simple CRUD
statements (“UserAccountMapping
Created”)
• Events should correspond to actual,
definitive changes in state - not requests
to do so.
• Event Storming is the name given to a
domain modelling session in the event
sourcing world.
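The "definitive changes, not requests" point has a concrete consequence: validation happens before an event is written, so the log only ever contains facts. A sketch, with invented names and figures:

```python
# A domain event records a definitive state change, so a command is
# validated first and the event emitted only on success - the log
# never contains rejected requests. Names and amounts are invented.

log = []

def withdraw(balance, amount):
    """Command handler: validate, then record the fact as an event."""
    if amount > balance:
        raise ValueError("insufficient funds")  # no event emitted
    log.append({"type": "MoneyWithdrawn", "amount": amount})
    return balance - amount

balance = withdraw(100, 30)   # succeeds: MoneyWithdrawn is logged
try:
    withdraw(balance, 1000)   # rejected: nothing reaches the log
except ValueError:
    pass

print(len(log))  # 1
```

Contrast this with a "WithdrawalRequested" event, which would force every consumer to re-run the validation to know whether anything actually happened.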
47. THE TECHNICAL BITS
• Architectural Patterns: CQS/CQRS, Event-Driven or Reactive Programming
• Tooling to look into: RxJS (JS/front-end), Akka (backend), Kafka, Event Store (data layer)
• Recommended Videos:
• https://www.infoq.com/presentations/event-sourcing-jvm/
• https://www.infoq.com/presentations/event-driven-benefits-pitfalls/
• https://www.infoq.com/presentations/systems-event-driven/
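Of those patterns, CQRS is the most direct fit for event sourcing: commands change state only by appending events, and queries read from a separately maintained model. A minimal sketch (all names invented; in a real system the projection would be fed asynchronously, e.g. by a Kafka consumer):

```python
# Minimal CQRS sketch: the write side is an append-only event log;
# the read side is a separate model updated from those events.
# Names are invented for illustration.

event_log = []             # write model: append-only
read_model = {"count": 0}  # read model: optimised for queries

def handle_command(cmd):
    """Commands change state only by appending events."""
    if cmd == "increment":
        event = {"type": "Incremented"}
        event_log.append(event)
        project(event)  # here synchronous; usually via the stream

def project(event):
    """Keeps the read model up to date as events arrive."""
    if event["type"] == "Incremented":
        read_model["count"] += 1

def query_count():
    """Queries never touch the event log directly."""
    return read_model["count"]

handle_command("increment")
handle_command("increment")
print(query_count())  # 2
```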
48. SCALING YOUR DATA MODEL
CHAPTER THREE
Recommended Reading:
Domain-Driven Design
— Eric Evans
49. WHAT ARE THE PROBLEMS AS YOU GET BIGGER?
50. WHAT ARE THE PROBLEMS AS YOU GET BIGGER?
• As your scope grows, the complexity of your model increases (“user”, “person” or “account” will often be the worst offender.)
• As your software grows, discoverability, traceability and lineage grow harder
• As your team grows, you will either have many more meetings or will suffer from breaking changes and poor communication around your model.
52. CAUTION: OPINION PRESENTED AS FACT AHEAD
53. DATA SCOPING UTOPIA
The rules:
• Each piece of data must have exactly one point of truth on your system.
• Models within other contexts can duplicate concepts from other contexts as long as they know who is the boss.
• Models within a context should be encapsulated and should not be impacted by changes to other teams’ models*.
54. MASTERY OF DATA
• Each piece of data must have exactly one point of truth on your system
• Does everyone know who the point of truth is? Is it defined or documented?
• How do we ensure all changes are registered in this system?
55. DENORMALISED MODELS
• Models within other contexts can, perhaps even should, duplicate concepts from other contexts in the format they want, as long as they know who is the boss.
• This means they must stay in sync (including respecting alterations)
• They should only get the data they need, but they should feel free to transform it.
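Concretely, a consuming context subscribes to the master's events and keeps its own denormalised copy, taking only the fields it needs in the shape it wants. A sketch with an invented "billing" context downstream of an "accounts" master:

```python
# A consuming context keeps a denormalised copy of upstream data,
# in its own format, staying in sync by applying the master's
# events. Context, event and field names are invented.

# The hypothetical "billing" context only needs a name and an email,
# not everything the "accounts" master holds.
billing_view = {}

def on_accounts_event(event):
    """Apply the master's events to our local, denormalised model."""
    if event["type"] == "CustomerRegistered":
        billing_view[event["id"]] = {
            # transformed into the format this context wants
            "display_name": event["name"].upper(),
            "email": event["email"],
        }
    elif event["type"] == "EmailChanged":
        # respecting alterations: the master's changes propagate
        billing_view[event["id"]]["email"] = event["email"]

on_accounts_event({"type": "CustomerRegistered", "id": 1,
                   "name": "Ada", "email": "ada@old.example"})
on_accounts_event({"type": "EmailChanged", "id": 1,
                   "email": "ada@new.example"})

print(billing_view[1])
# {'display_name': 'ADA', 'email': 'ada@new.example'}
```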
56. MODEL ENCAPSULATION
• Models within a context should be encapsulated and should not be impacted by changes to other teams’ models*.
• This includes validation: it should only be applied where data is mastered.
• *Possible exception: Changes to translation layers may need to happen due to changes in physical model.
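A thin translation layer at the boundary is what makes that asterisk tolerable: when the other team's physical model changes, only this one function changes, and your internal model stays untouched. A sketch with invented payload shapes:

```python
# A translation (anti-corruption) layer: external payloads are
# mapped into this context's internal model at the boundary, so a
# change to the upstream team's physical model only touches this
# function. Payload shapes and names are invented.

def translate(external):
    """Map the upstream team's payload into our internal model."""
    return {
        "person_name": external["fullName"],       # their naming
        "crime_type": external["offenceCategory"],  # their naming
    }

internal = translate({"fullName": "J. Doe",
                      "offenceCategory": "theft"})
print(internal)  # {'person_name': 'J. Doe', 'crime_type': 'theft'}
```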
57. BOUNDED CONTEXT UTOPIA
• Court Scheduling: Defendant (Name, CrimeType, Availability, Special Needs); Hearing (Time, Court Room)
• Prisons: Inmate (Name, CrimeType)
• In-Court Transcription: Person (Name); Role (Type, e.g. Defense Barrister, Judge, Defendant)
58. THE PROBLEMS… (NO MORE!)
• Problem: As your scope grows, the complexity of your model increases. Solved by: Each piece of data must have exactly one point of truth on your system.
• Problem: As your software grows, discoverability, traceability and lineage grow harder. Solved by: Models within other contexts can duplicate concepts from other contexts as long as they know who is the boss.
• Problem: As your team grows, you will either have many more meetings or will suffer from breaking changes and poor communication around your model. Solved by: Models within a context should be encapsulated and should not be impacted by changes to other teams’ models.
59. BUT… WHERE ARE THE BOUNDARIES?
60. “A bounded context delimits the applicability of a particular model so that team members have a clear and shared understanding of what has to be consistent and how it relates to other contexts. Within that context, work to keep the model logically unified but do not worry about applicability outside those bounds.” – ERIC EVANS
61. “BOUNDED CONTEXT” SMELLS
• Too Big:
• Polysemes/False Cognates
• Duplicate Concepts
• Too Small:
• Data/Feature Envy
• Incomplete Model
62. WITHIN YOUR CONTEXT…
• See Chapters 1 and 2 of this talk.
63. AN ASIDE… THIS MAY BE A GOOD WAY TO STRUCTURE YOUR TEAMS
64. I CLAIM THAT… THIS IS MADE MUCH EASIER IN AN EVENT SOURCING WORLD.
65. IN THE OLD STATE-WORLD
• How would we notify and push
changes?
• How do we translate information
between the different services?
• How do we decouple the physical
models?
66. INSTEAD…
• All state changes are business-driven
events. Contexts can listen to these
and do what they want with them.
• New contexts can be spun up and
construct their state from past
events.
• Events are perfect candidates for
MQs or Kafka to unlock a push-based
system.
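The second bullet is the striking one: a brand-new context can be brought up to date simply by replaying the existing history from the start. A sketch, with invented events loosely themed on the court-scheduling example:

```python
# Spinning up a new context in an event-sourced world: it derives
# its state from scratch by replaying past events. Event and field
# names are invented for illustration.

history = [
    {"type": "HearingScheduled", "case": "c1", "room": "A"},
    {"type": "HearingScheduled", "case": "c2", "room": "B"},
    {"type": "HearingCancelled", "case": "c1"},
]

def bootstrap(events):
    """A new 'room usage' context builds its entire state by
    replaying the full event history."""
    rooms = {}
    for e in events:
        if e["type"] == "HearingScheduled":
            rooms[e["case"]] = e["room"]
        elif e["type"] == "HearingCancelled":
            rooms.pop(e["case"], None)
    return rooms

print(bootstrap(history))  # {'c2': 'B'}
```

In a state-world, the new context would instead need a one-off migration from every upstream database.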
68. Split your domain model up into
“bounded contexts”.
This may incorporate multiple
teams or systems but should be a
reasonable size.
69. All stakeholders of this bounded context should define the boundary and understand what the conceptual model is (state, entities, relationships). This should be documented and discoverable.
70. After that, they should event storm to drive understanding of their data model:
What (in the real world) can make the entities/relationships in the conceptual data model change? When can they happen? What info is needed to action them?
71. Enter… Kafka.
These events should be published on Kafka. They represent
your team’s (internally) public interface and should be
documented/publicised.
This should be the source of truth.
72. You and any other team that cares about this event can now
use it to update their readable/high-fidelity state (e.g. RDBMS,
Elastic, Neo4j).
74. IN SUMMARY
• Data Modelling is the crystallisation of the assumptions we have made about the real world within our domain. It is an imprecise science, but a good model will allow frictionless progress.
• Event Sourcing asks: what if we build our domain model around state changes instead of state? Kafka is a great backbone for this kind of architecture: reliable, futureproof and highly scalable.
• As we scale up, data modelling as a whole org is unsustainable. We break our model into independently changeable sections called bounded contexts. Kafka can act well as a central nervous system.