Jupyter Notebook may be one of the most controversial open source projects released in the last ten years! Love them or hate them, notebooks have become a mainstay of data science and machine learning, and a significant part of the Python ecosystem. While Jupyter can simplify experimentation, rapid prototyping, documentation, and visualization, it often impedes version control, code review, and test coverage. Dev teams must accept the good with the bad… but what if they didn’t have to? In this talk we introduce conflict-free replicated data types (CRDTs), a special kind of object that supports strong eventual consistency and that can be used to enhance Jupyter notebooks for a truly collaborative experience.
First proposed by Shapiro et al. in 2011, conflict-free replicated data types (CRDTs) evolved out of the distributed systems community as a way to replicate data across a network of replicas. CRDTs are objects that come with a special guarantee: any two copies of the object that have received the same updates converge to the same state (strong eventual consistency), meaning they can be kept in sync. While CRDTs have enjoyed a good amount of attention from academia in recent years, primarily among database and cloud researchers, they have not led to many practical applications for everyday developers. However, recent work by Kleppmann et al. shows CRDTs can be used for real-time rich-text collaboration, creating a “Google Docs”-type experience with any document in a networked file system.
In this talk, we’ll present the basics of CRDTs and demonstrate how they work with examples written in Python. Next, we’ll explain how CRDTs can enable more collaborative Jupyter notebooks, opening up features such as synchronous insertions, diffs, and auto-merges, even with multiple simultaneous contributors!
3. About Us
Rebecca Bilbro: Rebecca is a machine learning engineer & distributed systems researcher. She’s Founder/CTO at Rotational Labs. She prefers earth-bound activities.
Patrick Deziel: Patrick is a software engineer & machine learning specialist. He was employee #1 at Rotational Labs. He’s also a rock climbing enthusiast.
Rotational Labs rotational.io
4. “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” –Leslie Lamport
9. Context: A Single Server
A single server is always consistent (it responds predictably to requests) - that’s convenient!
But what if there’s a failure? The entire system becomes unavailable, and any data held only in volatile memory can be lost. This is why we need distributed systems!
10. Inconsistency & Concurrency
PUT(x, 42) → ok
GET(x) → not found
PUT(x, 42)
PUT(x, 27)
GET(x) → ?
Servers in a distributed system need to communicate to remain in the same state. Communication takes time (latency); more servers means more latency.
Delays in communication can allow two clients to perform operations concurrently. From the system’s perspective, they happen at the same time.
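The stale read above can be reproduced with a toy simulation. This is only a sketch: the `Replica` class and the `replicate` helper are hypothetical names introduced here, and "replication" is just a dictionary copy standing in for a network round trip.

```python
class Replica:
    """A toy in-memory key-value server (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def put(self, key, value):
        self.store[key] = value
        return "ok"

    def get(self, key):
        return self.store.get(key, "not found")


def replicate(source, target):
    """Copy every key from source to target; stands in for a network sync."""
    target.store.update(source.store)


server_a = Replica("a")
server_b = Replica("b")

put_result = server_a.put("x", 42)   # one client writes to server a
stale_read = server_b.get("x")       # another client reads from server b too early

replicate(server_a, server_b)        # the servers finally communicate
fresh_read = server_b.get("x")       # now server b has caught up
```

Until `replicate` runs, the two servers disagree about `x`, which is the inconsistency window the slide describes.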
11. Time in a Distributed System
Due to clock drift, we can’t expect any two nodes in a distributed system to have the same perception of physical time.
In the absence of specialized hardware (such as the synchronized clocks used by Google Spanner), logical clocks can be used to impose a partial ordering of events.
To obtain a total ordering of events, we need some arbitrary mechanism to break ties (e.g., the node name).
PUT(x, 42, t=Alice@1)
PUT(x, 27, t=Bob@2)
GET(x) → 27
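One way to read the example above is as a last-write-wins register: every write carries a (logical time, node name) pair, and the largest pair wins regardless of arrival order. The sketch below assumes that interpretation; `LWWRegister` and its method names are hypothetical and are not taken from the talk's code.

```python
class LWWRegister:
    """Minimal last-write-wins register keyed by (logical time, node name)."""

    def __init__(self):
        self.value = None
        self.stamp = (0, "")  # smaller than any real (time, node) stamp

    def put(self, value, time, node):
        stamp = (time, node)
        # Tuples compare by time first, then by node name to break ties.
        if stamp > self.stamp:
            self.value = value
            self.stamp = stamp

    def get(self):
        return self.value


# The writes arrive out of order, but the higher timestamp still wins.
x = LWWRegister()
x.put(27, time=2, node="Bob")
x.put(42, time=1, node="Alice")
result = x.get()

# Equal logical times are resolved by the node name.
tie = LWWRegister()
tie.put("from alice", time=3, node="alice")
tie.put("from bob", time=3, node="bob")
tie_result = tie.get()
```

Here `result` is 27 and `tie_result` is "from bob", because "bob" sorts after "alice".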
13. CRDT: A data structure designed for replication
CRDTs are a good alternative to more expensive, heavyweight coordination methods, such as:
● Locking (shared lock, x-lock, etc.): limits collaboration between users
● Consensus algorithms (Paxos, Raft, ePaxos): network-intensive, difficult to implement
A CRDT consists of:
● Some representation of mutable state.
● Some function M which merges two states and produces a deterministic value.
M is idempotent, commutative, and associative…
A = M(A, A)
M(A, B) = M(B, A)
M(A, M(B, C)) = M(M(A, B), C)
…not unlike the union of Python sets!
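The three merge properties on this slide can be checked directly with Python's built-in set union. This is a small sanity check rather than a CRDT implementation; the `merge` function name is chosen here only to mirror M.

```python
def merge(left, right):
    """Merge two states by set union, playing the role of M."""
    return left | right


a = {"x", "y"}
b = {"y", "z"}
c = {"w"}

idempotent = merge(a, a) == a
commutative = merge(a, b) == merge(b, a)
associative = merge(a, merge(b, c)) == merge(merge(a, b), c)
```

Because all three hold, replicas can merge in any order, any number of times, and still end up with the same state.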
14. Key Intuition:
We can combine multiple CRDTs
to make more complex CRDTs
15. Simple CRDTs
Grow-only Counter
● A monotonically increasing counter across all replicas, each of which is assigned a unique ID
● The counter value at any point in time is equal to the sum of all values across the replicas
● Can be implemented using a dict() in Python
Grow-only Set
● A set which only supports adding new items
● No way to “delete” an item
● Similar to Python’s set()
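A grow-only set can be sketched as a thin wrapper around Python's `set()`. The method names `add`, `merge`, and `get` are assumptions chosen to match how the operation log is used later in the deck; the body is illustrative, not the talk's implementation.

```python
class GSet:
    """Minimal grow-only set: items can be added but never removed."""

    def __init__(self, items=()):
        self.items = set(items)

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        # Union is idempotent, commutative, and associative.
        self.items |= other.items
        return self

    def get(self):
        # Return an immutable view so callers cannot delete items.
        return frozenset(self.items)


replica_a = GSet()
replica_b = GSet()
replica_a.add("op1")
replica_b.add("op2")

replica_a.merge(replica_b)
replica_b.merge(replica_a)
```

After both merges, the two replicas hold the same items.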
16. Compound CRDTs
Positive-Negative Counters
Combination of two Grow-only Counters, one for increments and one for decrements
Two-Phase Sets
Combination of two Grow-only Sets, one is a “tombstone” set to support deletion
Last-Write-Wins-Element-Set
Improvement on Two-Phase Set which includes a timestamp to allow for items to be “undeleted”
Observed-Remove Set
Similar to Last-Write-Wins-Element-Set but uses unique tags rather than timestamps
Sequence CRDTs
Implements an ordered set with familiar list operations such as append, insert, remove.
We can use this to build a collaborative editor!
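As an illustration of compounding, a Positive-Negative Counter can be sketched as two grow-only counters, one tracking increments and one tracking decrements. The `PNCounter` class below is a hypothetical, self-contained sketch that stores each grow-only counter as a plain dict keyed by node ID.

```python
class PNCounter:
    """Minimal positive-negative counter built from two grow-only counters."""

    def __init__(self, node_id):
        self.id = node_id
        self.increments = {node_id: 0}  # grow-only counter of increments
        self.decrements = {node_id: 0}  # grow-only counter of decrements

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("amount must be non-negative")
        self.increments[self.id] += amount

    def decrement(self, amount=1):
        if amount < 0:
            raise ValueError("amount must be non-negative")
        self.decrements[self.id] += amount

    @staticmethod
    def _merge_counts(mine, theirs):
        # Grow-only counter merge: keep the per-node maximum.
        for node, count in theirs.items():
            mine[node] = max(mine.get(node, 0), count)

    def merge(self, other):
        self._merge_counts(self.increments, other.increments)
        self._merge_counts(self.decrements, other.decrements)
        return self

    def get(self):
        return sum(self.increments.values()) - sum(self.decrements.values())


alice = PNCounter("alice")
bob = PNCounter("bob")
alice.increment(5)
bob.decrement(2)

alice.merge(bob)
bob.merge(alice)
```

Both replicas converge on 5 - 2 = 3, even though neither counter ever shrinks internally.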
18. Hypothesis
We can compound a few CRDTs together to create a collaborative “notebook” à la Jupyter.
Our composite CRDT needs to support the following operations:
● High level operations: Insert and Remove notebook “cells”
● Low level operations: Insert and Remove characters within each cell
● Support merging at both the notebook level and the cell level to enable consistency
Key understanding
● Individual cell data can be represented by Sequence CRDTs
● The list of “cells” in a notebook is also a Sequence!
A Practical Example…
19. To achieve eventual consistency, each peer needs to agree on:
1. The set of operations
2. The order of operations
To achieve a total ordering of operations:
1. Assign each operation a unique ID based on client name and timestamp, e.g. INSERT(0, “a”) ⇒ alice@1
2. Lower timestamp values always go first
3. Order by client name to break ties
alice@1 -> bob@2 -> alice@3 -> bob@3
Total Ordering of Operations
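The ordering rules on this slide map naturally onto tuple comparison. In the sketch below, `OpId` is a hypothetical dataclass (the talk's own `OpId` is not shown on the slides); field order matters because `order=True` compares the timestamp first and the client name second.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class OpId:
    """Operation ID compared by timestamp first, then by client name."""
    timestamp: int
    client: str

    def __str__(self):
        return f"{self.client}@{self.timestamp}"


# Operations as they might arrive at a peer, in no particular order.
received = [OpId(3, "bob"), OpId(1, "alice"), OpId(3, "alice"), OpId(2, "bob")]

ordered = " -> ".join(str(op) for op in sorted(received))
```

Sorting the received IDs yields `alice@1 -> bob@2 -> alice@3 -> bob@3`, the same total order shown on the slide.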
20. Realizing the Object Order
Note: Object payloads are generic, so we can nest Sequences within Sequences. This advantage comes from Python being dynamically typed!
[Figure: object order. Nodes: alice@1 “a”, alice@2 “c”, bob@5 “b”, alice@5 “d”, bob@3 “x”]
23. Sequence: Composite CRDT containing ordered set of items
Notebook: Contains a Sequence of Cells
Cell: Contains a Sequence of characters
GCounter: The shared logical clock
GSet: The entire history of operations
Operation: A single insert or delete performed by a node
OpId: Unique identifier for operations
Object: Represents an item in a sequence
24. GCounter
class GCounter:
    """Implements a grow-only counter CRDT. It must be instantiated with a network-unique ID."""
    ...

    def add(self, value):
        """Adds a non-negative value to the counter."""
        if value < 0:
            raise ValueError("Only non-negative values are allowed for add()")
        self.counts[self.id] += value

    def merge(self, other):
        """Merges another GCounter with this one."""
        if not isinstance(other, GCounter):
            raise ValueError("Incompatible CRDT for merge(), expected GCounter")
        for id, count in other.counts.items():
            self.counts[id] = max(self.counts.get(id, 0), count)
        return self

    def get(self):
        """Returns the current value of the counter."""
        return sum(self.counts.values())
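To see the GCounter converge, here is a self-contained usage sketch. The constructor is elided on the slide, so the `__init__` below is an assumption (a dict of per-node counts seeded with this node's ID); the `add`, `merge`, and `get` methods are condensed from the slide.

```python
class GCounter:
    """Grow-only counter; the constructor here is a hypothetical stand-in."""

    def __init__(self, id):
        # Assumed initial state: one zeroed entry for this replica.
        self.id = id
        self.counts = {id: 0}

    def add(self, value):
        if value < 0:
            raise ValueError("Only non-negative values are allowed for add()")
        self.counts[self.id] += value

    def merge(self, other):
        # Keep the largest count seen for each replica ID.
        for id, count in other.counts.items():
            self.counts[id] = max(self.counts.get(id, 0), count)
        return self

    def get(self):
        return sum(self.counts.values())


alice = GCounter("alice")
bob = GCounter("bob")
alice.add(3)
bob.add(4)

alice.merge(bob)
total_after_first_merge = alice.get()
alice.merge(bob)                      # merging again changes nothing
total_after_second_merge = alice.get()
bob.merge(alice)
```

Taking the per-replica maximum (rather than adding) is what makes repeated merges safe.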
25. Sequence.merge
from functools import cmp_to_key

def merge(self, other):
    # Merge the two Sequences
    self.merge_operations(other)
    other.merge_operations(self)
    ...
    # Recursive merge of the sub-sequences
    for i in range(len(this_sequence)):
        if isinstance(this_sequence[i], Sequence) and isinstance(other_sequence[i], Sequence):
            this_sequence[i].merge(other_sequence[i])
            this_sequence[i].id = self.id
    return self

def merge_operations(self, other):
    # Sync the local clock with the remote clock and apply the unseen operations
    self.clock = self.clock.merge(other.clock)
    patch_ops = other.operations.get().difference(self.operations.get())
    patch_log = sorted(patch_ops, key=cmp_to_key(self.compare_operations))
    for op in patch_log:
        op.do(self.objects)
    # Merge the two operation logs
    self.operations = self.operations.merge(other.operations)
26. ObjectTree
class ObjectTree:
    """Add-only data structure which stores a sequence of Objects."""

    def __init__(self):
        self.roots = []

    def find_insert(self, target, object, iter):
        for root, i, obj in iter:
            op = obj.operation
            if op == target:
                # We found the target
                return root, i
            elif op.target == target and object.operation < op:
                # Same target (conflicting operations), so order the operations
                return root, i
        return None, -1

    def insert_node(self, target, object):
        root, i = self.find_insert(target, object, self.enumerate_nodes)
        if root is None:
            self.roots[-1].nodes.append(object)
        else:
            root.nodes.insert(i, object)
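The idea of ordering conflicting inserts that share a target can be illustrated on a flat list. The sketch below is not the ObjectTree from the slide; it swaps in an RGA-style rule (skip past neighbors with a larger ID) and assumes each replica receives an item only after the item it targets. `Item` and `integrate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

OpId = Tuple[int, str]  # (logical timestamp, client name)


@dataclass(frozen=True)
class Item:
    id: OpId
    target: Optional[OpId]  # ID of the item to insert after; None means the start
    value: str


def integrate(sequence, item):
    """Insert item after its target, skipping neighbors with larger IDs."""
    if item.target is None:
        i = 0
    else:
        # Assumes the target was already delivered to this replica.
        i = next(k for k, existing in enumerate(sequence)
                 if existing.id == item.target) + 1
    while i < len(sequence) and sequence[i].id > item.id:
        i += 1
    sequence.insert(i, item)


def text(sequence):
    return "".join(item.value for item in sequence)


a = Item((1, "alice"), None, "a")
b = Item((2, "alice"), (1, "alice"), "b")  # alice types "b" after "a"
c = Item((2, "bob"), (1, "alice"), "c")    # bob concurrently types "c" after "a"

replica_1, replica_2 = [], []
for item in (a, b, c):
    integrate(replica_1, item)
for item in (a, c, b):  # same operations, different delivery order
    integrate(replica_2, item)
```

Both replicas end up with the text "acb", even though they applied the concurrent inserts in different orders.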
28. CRDT Limitations, Possibilities, and Resources
Limitations
● Strong eventual consistency, not strong consistency
● Append-only data type
● Buffer size limitations
● Increasing egress costs
● Need for compaction/pruning
Applications
● Testing
● Merging
● Branching
● Commenting
● Metadata Resolution
● Collaborative Editing
Resources
● eirene: a client for collaborative Python development with CRDT
● nbdime: tools for diffing and merging of Jupyter Notebooks
● peritext: a CRDT for rich-text collaboration
● Martin Kleppmann — CRDTs: The hard parts
● Michael Whittaker — Consistency in Distributed Systems