Hidden inside MongoDB is the WiredTiger data engine, an Open Source, pluggable storage engine that became the database's default in 3.2. Written in C, WiredTiger uses a variety of techniques to provide unmatched performance, low latency and scalability. This talk will explore data structures and techniques C/C++ programmers can use to support heavily threaded applications on modern hardware, using examples from the WiredTiger code base. Data structures and techniques to be covered include hazard pointers, skiplists, ticket locks, atomic instructions and memory barriers.
6. #MDBE16
WiredTiger
• From (some of) the folks that brought you Berkeley DB
• High performance data engine
• scalable throughput with low latency
• MongoDB’s default storage engine
• a general-purpose workhorse
7. Next
➤ Hardware (is the problem)
• Hazard pointers
• Skiplists
• Ticket locks
9. #MDBE16
Each core has multiple memory caches
[Diagram: cores 1, 2, 3 ... N, each with two or more caches]
10. #MDBE16
Cache coherence: cores “snoop” on writes
[Diagram: cores 1, 2, 3 ... N, each with two or more caches, all connected to Main Memory]
11. #MDBE16
Traditional data engines struggle with this architecture
• Writing “shared” memory is slow
• but databases exist to manage shared access to data!
• Snoopy cache-coherence scales poorly
12. #MDBE16
Programmers solve with locking
• Locks are complex objects
• get exclusive access to the lock state
• review and update the lock state
• “publish” (ensure every CPU sees the changes)
• release exclusive access
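The four steps above can be sketched with the simplest possible lock, a spinlock built on a C11 `atomic_flag`. This is only an illustration: the `locked_counter` type and `locked_add` function are hypothetical names, and a real POSIX lock does considerably more work (queues, sleeping, fairness).

```c
#include <stdatomic.h>

/* Hypothetical example type: a counter guarded by a one-bit spinlock. */
typedef struct {
    atomic_flag busy;   /* the lock state: set while a thread holds the lock */
    long value;         /* the shared data the lock protects */
} locked_counter;

static long locked_add(locked_counter *c, long n)
{
    /* Get exclusive access: spin until this thread is the one that sets the flag. */
    while (atomic_flag_test_and_set_explicit(&c->busy, memory_order_acquire))
        ;

    /* Review and update the protected state. */
    long result = (c->value += n);

    /* Publish and release: the release ordering makes the update visible
     * to whichever thread acquires the flag next. */
    atomic_flag_clear_explicit(&c->busy, memory_order_release);
    return result;
}
```

Even this minimal lock writes shared memory twice per operation, which is the cost the following slides are concerned with.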
13. #MDBE16
Locking is slow
• Every operation requires exclusive access
• even shared (“read”) locks require a lock/unlock cycle
• thread stall is inevitable
• Locks require notification of every CPU
• Locks require exclusive access to the memory bus
14. #MDBE16
Locking is expensive
• A lock per object is too much memory
• POSIX locks cache-aligned, up to 128B
• grouping objects under locks makes contention worse
• More complexity to make locks “fair” and avoid starvation
• add thread queues
• wake-up the next thread waiting for the lock
15. #MDBE16
We need to find something else
If we can’t use locks, what do we use instead?
Today we’re going to talk about ways to get rid of locks.
16. #MDBE16
WiredTiger is written in C
• Java or C++ are better choices for system programming
• automatic memory management vs. malloc/free
• exception handling vs. explicit error paths
• widespread availability of reusable components
• Giving up programmer productivity
17. #MDBE16
C is “portable assembler”
• Marshal typed values to/from unaligned memory
• streaming compression, encryption, checksums
• unstructured I/O to/from stable storage
• Light-weight access to shared data
• use the underlying machine primitives that make up locks
• algorithms/structures based on those primitives
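As a small illustration of marshalling typed values to and from unaligned memory, the sketch below copies a 32-bit integer through `memcpy`. The helper names `store_u32` and `load_u32` are hypothetical, and the bytes are written in host byte order, which is an assumption of this example.

```c
#include <stdint.h>
#include <string.h>

/* Write a 32-bit value at any byte address. memcpy avoids the undefined
 * behavior of dereferencing a misaligned uint32_t pointer. */
static void store_u32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof(v));
}

/* Read a 32-bit value back from any byte address. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}
```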
20. #MDBE16
Pages in the WiredTiger cache
[Diagram: cache pages 2, 6, 8 and 9]
Lots and lots (and lots) of pages
MongoDB worker threads read from disk
WiredTiger server threads evict to disk
21. #MDBE16
A reasonable page-locking implementation
• MongoDB worker threads read, modify pages
• WiredTiger server threads evict pages from the cache
• Allocate a lock per page
• MongoDB worker threads share pages
• WiredTiger eviction threads require exclusive access
22. #MDBE16
Page locking in the WiredTiger cache
[Diagram: pages 2, 6, 8 and 9, each with its own lock, accessed by a reader, a writer and eviction]
• thread stall on read locks!
• vulnerable to starvation
• too much memory
23. #MDBE16
Introducing memory barriers
• Memory barriers
• order reads, writes or both across a line of code
• compiler won’t cache values or reorder across a barrier
• Locks imply memory barriers
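A minimal sketch of barriers in C11, assuming one thread calls `publish` and another calls `try_consume` (both names are hypothetical). The release fence keeps the payload write ahead of the flag write; the acquire fence keeps the payload read behind the flag read.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* ordinary, non-atomic data */
static atomic_bool ready;    /* flag announcing that payload is valid */

static void publish(int value)
{
    payload = value;
    /* Write barrier: the payload store cannot move below this line. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    /* Read barrier: the payload load cannot move above this line. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```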
24. #MDBE16
Something faster
• Hazard pointers: a technique for avoiding locks
• MongoDB worker threads
• “log” intention to access a page
• publish: a memory barrier to ensure global CPU visibility
• Write to a per-thread memory location
• write won’t collide with other worker threads
25. #MDBE16
What about eviction starvation?
• Add a per-page “blocker”
• MongoDB worker won’t proceed if the page is blocked
• Cheap:
• it’s only a bit of information
• a read-only operation for workers
26. #MDBE16
Worker threads
• Publish intent to access the page
• Memory barrier to ensure global CPU visibility
• If the page is not blocked, it’s accessible
• Clear intent to access when done
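The worker-side steps might look like the following in C11. This is a sketch, not WiredTiger's code: the `page` and `worker` types and the function names are hypothetical, and each worker is assumed to hold a single hazard slot.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_bool blocked;     /* set by eviction to stop new readers */
    int data;
} page;

typedef struct {
    _Atomic(page *) hazard;  /* written only by the owning worker thread */
} worker;

/* Returns true if the worker may use the page. */
static bool page_acquire(worker *w, page *p)
{
    /* Publish intent to access the page. */
    atomic_store_explicit(&w->hazard, p, memory_order_relaxed);
    /* Full barrier: the hazard store must be visible before the flag is read. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&p->blocked, memory_order_relaxed)) {
        /* Page is blocked: withdraw the intent and let the caller retry. */
        atomic_store_explicit(&w->hazard, NULL, memory_order_release);
        return false;
    }
    return true;
}

/* Clear intent to access when done. */
static void page_release(worker *w)
{
    atomic_store_explicit(&w->hazard, NULL, memory_order_release);
}
```

The worker only writes its own slot and only reads the page flag, so workers never contend with one another on a shared cache line.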
27. #MDBE16
Hazard pointers for workers
[Diagram: pages 2, 6, 8 and 9, each with a flag; the writer and reader threads each keep a list of hazard pointers to the pages they are using (pages 2, 6 and 9)]
28. #MDBE16
Eviction server
• Block future worker thread access
• Memory barrier to ensure global CPU visibility
• Review worker thread access intentions
• can either wait or quit
• Unblock worker thread access when done
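The eviction side can be sketched the same way. The names here are hypothetical, the number of worker slots is fixed at four for brevity, and this version takes the "quit" option rather than waiting when a worker is using the page.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define WORKERS 4

typedef struct {
    atomic_bool blocked;     /* the per-page "blocker" */
} page;

/* One hazard slot per worker thread (an assumption of this sketch). */
static _Atomic(page *) hazard[WORKERS];

/* Returns true if eviction now has exclusive access to the page. */
static bool evict_try_lock(page *p)
{
    /* Block future worker thread access. */
    atomic_store_explicit(&p->blocked, true, memory_order_relaxed);
    /* Full barrier: the block must be visible before the slots are read. */
    atomic_thread_fence(memory_order_seq_cst);
    /* Review worker thread access intentions. */
    for (size_t i = 0; i < WORKERS; i++) {
        if (atomic_load_explicit(&hazard[i], memory_order_acquire) == p) {
            /* A worker is using the page: quit and unblock. */
            atomic_store_explicit(&p->blocked, false, memory_order_release);
            return false;
        }
    }
    return true;
}

/* Unblock worker thread access when done. */
static void evict_unlock(page *p)
{
    atomic_store_explicit(&p->blocked, false, memory_order_release);
}
```

Both sides write first, issue a full barrier, then read. That ordering guarantees that either the worker sees the block or eviction sees the hazard pointer, so the two can never both proceed.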
29. #MDBE16
Hazard pointers for workers and eviction
[Diagram: pages 2, 6, 8 and 9, each with a flag; the writer and reader threads hold hazard pointers to pages 2, 6 and 9 while eviction checks a page's flag and the threads' hazard pointers]
30. #MDBE16
Something faster: hazard pointers
Replaces two lock/unlock pairs for each page access
... with a single memory barrier instruction.
• Transfers work to the eviction server
• MongoDB worker latency is what we care about
• Memory costs
• per-worker-thread list
• per-page blocking flag
32. #MDBE16
Introducing atomic instructions
• Atomic increment or decrement
• read a value
• change it and store it back without the possibility of racing
• Based on compare-and-swap (CAS) instruction
• read value
• update value if the value is unchanged
• but fail if the value has changed
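A compare-and-swap retry loop for an atomic increment, as a minimal C11 sketch (`cas_increment` is a hypothetical name). Many processors also offer a direct fetch-and-add, exposed in C11 as `atomic_fetch_add`, but the loop shows the general pattern that works for any update.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Increment *v atomically and return the new value. */
static uint64_t cas_increment(_Atomic uint64_t *v)
{
    /* Read the value. */
    uint64_t old = atomic_load(v);
    /* Update only if the value is unchanged. On failure the call stores
     * the current value into old, so the loop retries with fresh data. */
    while (!atomic_compare_exchange_weak(v, &old, old + 1))
        ;
    return old + 1;
}
```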
33. #MDBE16
Atomic prepend to singly-linked list
Update head if (and only if), head’s value is unchanged
head
NEW
new.next = head
compare_and_swap(head, new.next, new)
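The same prepend written with C11 atomics, as a sketch with hypothetical names (`node`, `list_prepend`). The pseudocode's `compare_and_swap` becomes `atomic_compare_exchange_weak_explicit`, which refreshes `new_node->next` automatically when another thread wins the race.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    int value;
    struct node *next;
} node;

static void list_prepend(_Atomic(node *) *head, node *new_node)
{
    /* new.next = head */
    new_node->next = atomic_load_explicit(head, memory_order_relaxed);
    /* Swing head to the new node only if head still equals new.next.
     * On failure, new.next is updated to the current head and we retry. */
    while (!atomic_compare_exchange_weak_explicit(
        head, &new_node->next, new_node,
        memory_order_release, memory_order_relaxed))
        ;
}
```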
34. #MDBE16
How WiredTiger uses skiplists
• WiredTiger pages start with a disk image
• a compact representation we don’t want to modify
• Inserts and updates for the disk image stored in skiplists
35. #MDBE16
Skiplists start with a linked list
Singly-linked list with sorted values: 7, 10, 13, 18, 21, 24
7 → 10 → 13 → 18 → 21 → 24
36. #MDBE16
Skiplists: add additional linked lists
Each higher level “skips” over more of the list
Level 3: 7 → 21
Level 2: 7 → 13 → 21 → 24
Level 1: 7 → 10 → 13 → 18 → 21 → 24
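A sketch of searching the three-level example list. The type and function names are hypothetical, levels are numbered from 0 in the code (so `next[0]` is the slide's level 1), and the node heights are assigned by hand to match the example; real skiplists usually choose heights at random.

```c
#include <stddef.h>

#define MAX_LEVEL 3

typedef struct skip_node {
    int key;
    struct skip_node *next[MAX_LEVEL];  /* next[0] links every node */
} skip_node;

/* Start at the top level, move right while keys are smaller, then drop down. */
static skip_node *skip_search(skip_node *head, int key)
{
    skip_node *n = head;
    for (int lvl = MAX_LEVEL - 1; lvl >= 0; lvl--)
        while (n->next[lvl] != NULL && n->next[lvl]->key < key)
            n = n->next[lvl];
    n = n->next[0];
    return (n != NULL && n->key == key) ? n : NULL;
}

/* Build the example: keys 7, 10, 13, 18, 21, 24 with heights 3, 1, 2, 1, 3, 2. */
static void skip_build_example(skip_node *head, skip_node nodes[6])
{
    static const int keys[6] = { 7, 10, 13, 18, 21, 24 };
    static const int heights[6] = { 3, 1, 2, 1, 3, 2 };
    skip_node *tail[MAX_LEVEL];

    for (int l = 0; l < MAX_LEVEL; l++) {
        head->next[l] = NULL;
        tail[l] = head;
    }
    for (int i = 0; i < 6; i++) {
        nodes[i].key = keys[i];
        for (int l = 0; l < MAX_LEVEL; l++)
            nodes[i].next[l] = NULL;
        /* Append the node to each level it takes part in. */
        for (int l = 0; l < heights[i]; l++) {
            tail[l]->next[l] = &nodes[i];
            tail[l] = &nodes[i];
        }
    }
}
```

Searching for 18 follows 7 at the top level, 13 at the middle level, and reaches 18 at the bottom level, skipping 10 entirely.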
43. #MDBE16
Skiplists, the great
Replaces a lock/unlock pair over the entire skiplist
with one atomic memory instruction per object level
• Insert without locking
• Search without locking, while inserting
• Forward & backward traversal without locking, while inserting
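One way to insert with a single compare-and-swap per level is sketched below. It is an illustration under stated assumptions, not WiredTiger's insert path: names are hypothetical, nodes are never removed, duplicate keys are rejected, and the caller picks each node's height.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LEVEL 3

typedef struct skip_node {
    int key;
    int height;                                  /* levels this node joins */
    _Atomic(struct skip_node *) next[MAX_LEVEL];
} skip_node;

/* Find the last node at `level` with a smaller key; *succ is the node after it. */
static skip_node *skip_pred(skip_node *head, int key, int level, skip_node **succ)
{
    skip_node *n = head, *nx = NULL;
    for (int l = MAX_LEVEL - 1; l >= level; l--)
        while ((nx = atomic_load_explicit(&n->next[l], memory_order_acquire)) != NULL &&
            nx->key < key)
            n = nx;
    *succ = nx;
    return n;
}

/* Link the node bottom-up, one CAS per level. Returns false if the key exists. */
static bool skip_insert(skip_node *head, skip_node *new_node)
{
    for (int l = 0; l < new_node->height; l++) {
        for (;;) {
            skip_node *succ;
            skip_node *pred = skip_pred(head, new_node->key, l, &succ);
            if (l == 0 && succ != NULL && succ->key == new_node->key)
                return false;
            atomic_store_explicit(&new_node->next[l], succ, memory_order_relaxed);
            /* Succeeds only if nothing was inserted after pred meanwhile;
             * otherwise search this level again. */
            if (atomic_compare_exchange_strong_explicit(&pred->next[l], &succ, new_node,
                memory_order_release, memory_order_relaxed))
                break;
        }
    }
    return true;
}
```

Because a node is linked at the bottom level first, a concurrent search always finds it once the first CAS succeeds; the upper levels only make later searches faster.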
44. #MDBE16
Skiplists, the good
• Simpler code than a Btree
• WiredTiger binary search ~200 lines of code
• a typical skiplist search < 20
• Fast search
• a Btree guarantees search in logarithmic time
• skiplists don’t offer a guarantee, but are usually close
45. #MDBE16
Skiplists, the not-so-good
• Cache-unfriendly
• every indirection a CPU cache miss
• Memory-unfriendly
• needs more memory for a data set than a Btree
• Removal requires locking
• WiredTiger is an MVCC engine (multiple values per key)
• removal less important to WiredTiger
47. #MDBE16
Ticket locks
• WiredTiger still needs to lock objects
• but we can make locks faster
• Ticket locks
• customers take a unique ticket number
• customers served in ticket order
49. #MDBE16
Ticket locks
• Two incrementing counters:
ticket: the next available ticket number
serving: the ticket number now being served
• Thread takes a ticket number
• Thread increments “next available”
• Thread waits for “serving” to match its ticket number
• When thread finishes, increments “serving”
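The two-counter scheme above, as a minimal C11 sketch. The names are hypothetical and the wait loop simply spins; production code would normally pause or yield while waiting.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t ticket;   /* the next available ticket number */
    _Atomic uint32_t serving;  /* the ticket number now being served */
} ticket_lock;

/* Take a ticket and wait for it to be served. Returns the ticket taken. */
static uint32_t ticket_lock_acquire(ticket_lock *l)
{
    /* One atomic instruction both takes a number and increments "next available". */
    uint32_t mine = atomic_fetch_add_explicit(&l->ticket, 1, memory_order_relaxed);
    while (atomic_load_explicit(&l->serving, memory_order_acquire) != mine)
        ;
    return mine;
}

/* Finish: serve the next ticket in line. */
static void ticket_lock_release(ticket_lock *l)
{
    atomic_fetch_add_explicit(&l->serving, 1, memory_order_release);
}
```

Threads are served strictly in the order they took tickets, which is what makes the lock fair.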
51. #MDBE16
Ticket locks are almost what we need
• Ticket locks avoid starvation and are “fair”
• Smaller memory footprint
• Can be made significantly faster than POSIX locks
• remember our compare-and-swap instructions!
• But POSIX read/write locks can be shared between readers; a basic ticket lock is always exclusive
52. #MDBE16
Ticket locks: shared vs. exclusive
• Three incrementing counters:
ticket: the next available ticket number
readers: the next reader to be served
writers: the next writer to be served
53. #MDBE16
Readers run in parallel
[Diagram: the Writers and Readers counters advancing through tickets 39 to 42 as Thread A, Thread B and Thread C take tickets and are served]
54. #MDBE16
Multiple variable updates without locking
• Updating multiple counters would require locking
... but we can write the bus width atomically
• Encode the entire lock state in a single 8B value
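#include <stdint.h>
struct lock {
    uint16_t readers;   // the next reader to be served
    uint16_t writers;   // the next writer to be served
    uint16_t ticket;    // the next available ticket; 16 bits allow 64K simultaneous threads
    uint16_t unused;    // pads the lock state to 8B
};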
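A sketch of a shared/exclusive ticket lock built on that single 8B value. It is an illustration rather than WiredTiger's implementation: the names are hypothetical, the fields are packed into one `uint64_t` with shifts (readers in bits 0-15, writers in bits 16-31, ticket in bits 32-47) instead of a struct, and waiting threads simply spin.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct { _Atomic uint64_t state; } rw_ticket_lock;

enum { READERS = 0, WRITERS = 16, TICKET = 32 };   /* bit offsets of the fields */
#define FIELD(v, shift) ((uint16_t)((v) >> (shift)))

/* Add 1 to one 16-bit field, wrapping inside the field. */
static uint64_t bump(uint64_t v, int shift)
{
    uint64_t mask = (uint64_t)0xffff << shift;
    return (v & ~mask) | ((v + ((uint64_t)1 << shift)) & mask);
}

/* Update one or both counters with a single CAS on the whole word. */
static void bump_fields(rw_ticket_lock *l, int readers, int writers)
{
    uint64_t old = atomic_load_explicit(&l->state, memory_order_relaxed), new_v;
    do {
        new_v = old;
        if (readers) new_v = bump(new_v, READERS);
        if (writers) new_v = bump(new_v, WRITERS);
    } while (!atomic_compare_exchange_weak_explicit(&l->state, &old, new_v,
        memory_order_acq_rel, memory_order_relaxed));
}

/* Take the next ticket; any carry lands harmlessly in the unused bits. */
static uint16_t take_ticket(rw_ticket_lock *l)
{
    uint64_t old = atomic_fetch_add_explicit(&l->state, (uint64_t)1 << TICKET,
        memory_order_relaxed);
    return FIELD(old, TICKET);
}

static void rw_read_lock(rw_ticket_lock *l)
{
    uint16_t mine = take_ticket(l);
    while (FIELD(atomic_load_explicit(&l->state, memory_order_acquire), READERS) != mine)
        ;
    bump_fields(l, 1, 0);   /* let the next reader in immediately */
}

static void rw_read_unlock(rw_ticket_lock *l) { bump_fields(l, 0, 1); }

static void rw_write_lock(rw_ticket_lock *l)
{
    uint16_t mine = take_ticket(l);
    while (FIELD(atomic_load_explicit(&l->state, memory_order_acquire), WRITERS) != mine)
        ;
}

static void rw_write_unlock(rw_ticket_lock *l) { bump_fields(l, 1, 1); }
```

A reader advances `readers` as soon as it gets in, so consecutive readers run in parallel. A writer waits on `writers`, which only advances when an earlier holder unlocks, so it runs alone.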
56. #MDBE16
That’s a (very) fast introduction....
• Hazard pointers
• Skiplists
• Ticket locks
Open Source implementations are available in WiredTiger, including Public Domain ticket locks.
57. #MDBE16
WiredTiger distribution
• Standalone application database toolkit library
• key-value store (NoSQL)
• row-store, column-store and LSM engines
• schema layer includes data types and indexes
• Another MongoDB Open Source contribution
• WiredTiger available for other applications
• https://github.com/wiredtiger