These are the slides from the EuroSys'18 talk for the paper
"Scale-Out ccNUMA: Exploiting Skew with Strongly Consistent Caching"
paper link: http://homepages.inf.ed.ac.uk/vnagaraj/papers/eurosys18.pdf
Scale-out ccNUMA - Eurosys'18
1. Scale-Out ccNUMA:
Exploiting Skew with Strongly Consistent Caching
Antonios Katsarakis*, Vasilis Gavrielatos*,
A. Joshi, N. Oswald, B. Grot, V. Nagarajan
The University of Edinburgh
This work was supported by EPSRC, ARM and Microsoft through their PhD Fellowship Programs
*The first two authors contributed equally to this work
7. KVS Performance 101
In-memory storage:
Avoid slow disk access
Partitioning:
• Shard the dataset across multiple nodes
• Enables high capacity in-memory storage
Remote Direct Memory Access (RDMA):
Avoid costly TCP/IP processing via
• Kernel bypass
• H/w network stack processing
Good start, but there is a problem…
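The sharding idea above can be sketched in a few lines. This is a toy Python model (the `ShardedKVS` and `home_node` names are mine, not the paper's): a stable hash maps each key to exactly one node's in-memory shard, so capacity grows with the number of nodes.

```python
import hashlib

class ShardedKVS:
    """Toy hash-partitioned in-memory KVS; an illustrative sketch, not the paper's code."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        # One in-memory dict stands in for each server's shard.
        self.shards = [dict() for _ in range(num_nodes)]

    def home_node(self, key):
        # Stable hash, so every client maps a key to the same home node.
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.num_nodes

    def put(self, key, value):
        self.shards[self.home_node(key)][key] = value

    def get(self, key):
        return self.shards[self.home_node(key)].get(key)

kvs = ShardedKVS(num_nodes=4)
kvs.put("user:42", "alice")
print(kvs.get("user:42"))  # alice
```

In a real deployment the hash would feed a consistent-hashing ring so nodes can join and leave without reshuffling every key.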
9. Skewed Access Distribution
Real-world datasets → mixed popularity
• Popularity follows a power-law distribution
• Small number of objects hot; most are not
Mixed popularity → load imbalance
• Node(s) storing hottest objects
get highly loaded
• Majority of nodes are under-utilized
[Figure: per-node load across 128 servers; the node(s) holding the hottest objects are overloaded. YCSB, skew exponent = 0.99]
Skew-induced load imbalance limits system throughput
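A quick simulation makes the imbalance concrete. This stdlib-only sketch (constants chosen to mirror the slide's YCSB setup; the modulo placement is an illustrative stand-in) samples requests from a power-law distribution with exponent 0.99 and tallies per-node load.

```python
import random
from collections import Counter

random.seed(1)
NUM_KEYS, NUM_NODES, NUM_REQS = 100_000, 128, 200_000
SKEW = 0.99  # YCSB-style Zipf exponent

# Power-law popularity: the i-th most popular key has weight proportional to 1/i^0.99.
weights = [1.0 / (i ** SKEW) for i in range(1, NUM_KEYS + 1)]
requests = random.choices(range(NUM_KEYS), weights=weights, k=NUM_REQS)

# Each key has one home node (toy modulo placement); tally the load per node.
load = Counter(key % NUM_NODES for key in requests)
avg = NUM_REQS / NUM_NODES
print(f"hottest node gets {max(load.values()) / avg:.1f}x the average load")
```

The node holding the single hottest key absorbs many times the average load, which is exactly the throughput ceiling the slide describes.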
12. Existing Skew Mitigation Techniques
Centralized cache [SOCC'11, SOSP'17]
• A dedicated node resides in front of the KVS, caching hot objects
◦ Filters the skew with a small cache
◦ Throughput is limited by the single cache node
NUMA abstraction [NSDI'14, SOCC'16]
• Uniformly distribute requests to all servers
• Remote objects are RDMA'ed from their home node
◦ Load-balances the client requests
◦ No locality → excessive network b/w (most requests require remote access)
Can we get the best of both worlds?
14. Caching + NUMA → Scale-Out ccNUMA!
(the best of both, via distributed caching)
What are the challenges?
16. Scale-Out ccNUMA Challenges
Challenge 1: Distributed cache architecture design
• Which items to cache and where?
• How to steer traffic for maximum load balance & hit rate?
Challenge 2: Keeping the caches consistent
(i.e. what happens on a write)
• How to locate replicas?
• How to execute writes efficiently?
Solving Challenge 1 with Symmetric Caching
20. Symmetric Caching
Which items to cache and where?
• Insight: the hottest objects see the most hits
• Idea: all nodes cache the hottest objects →
Implication: all caches have the same content
• Symmetric caching: a small cache with the hottest objects at each node
How to steer traffic for maximum load balance and hit rate?
• Insight: symmetric caching → all caches have the same (highest) hit rate
• Idea: uniformly spread requests
• Requests for the hottest objects → served locally on any node
• Cache misses served as in the NUMA abstraction
Benefits:
• Load-balances and filters the skew
• Throughput scales with the number of servers
• Less network b/w: most requests are served locally
Challenge 2: How to keep the caches consistent?
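The steering logic above can be sketched as follows. This is a toy model (the `HOT_KEYS` set, `home()` placement, and dict-based shards are illustrative stand-ins, not the paper's implementation): a hot read hits the local cache on any node, while a miss falls back to the object's home node, as in the NUMA abstraction.

```python
NUM_NODES = 4
HOT_KEYS = {"k_hot"}   # hottest objects, assumed known (e.g. from access statistics)

shards = [dict() for _ in range(NUM_NODES)]   # primary copy of each object
caches = [dict() for _ in range(NUM_NODES)]   # identical small hot-object cache per node

def home(key):
    return hash(key) % NUM_NODES   # toy placement

def write(key, value):
    shards[home(key)][key] = value
    if key in HOT_KEYS:            # hot object → refresh every node's cache copy
        for cache in caches:
            cache[key] = value

def read(node_id, key):
    if key in caches[node_id]:     # hit: served locally on ANY node
        return caches[node_id][key], "local"
    return shards[home(key)].get(key), "remote"   # miss: fetch from home (RDMA in the paper)

write("k_hot", "v1")
write("k_cold", "v2")
print(read(3, "k_hot"))   # ('v1', 'local'): hot reads never cross the network
print(read(3, "k_cold"))  # cold reads are served from the home node
```

Because every cache holds the same content, a load balancer can spray requests uniformly and still get the maximum hit rate on each node.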
24. Keeping the caches consistent
Requirement:
On a write, inform all replicas of the new value
How to locate replicas?
• Easy with symmetric caching!
If an object is in the local cache → all nodes cache it
How to execute writes efficiently?
• Typical protocols:
◦ Ensure global write ordering via a primary
◦ Primary executes all writes → hot-spot
• Fully distributed writes:
◦ Guarantee ordering via logical clocks
◦ Avoid hot-spots
◦ Evenly spread write propagation costs
[Figure: primary-based writes funnel through a single primary; fully distributed writes let any node coordinate its own writes]
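A minimal sketch of fully distributed, logical-clock-ordered writes (the `Replica` class and last-writer-wins rule are my illustration, not the paper's protocol): every node stamps its own writes with a (Lamport clock, node id) pair, so replicas converge to the same value regardless of message delivery order and no single node becomes a hot-spot.

```python
class Replica:
    """Toy replica with fully distributed, logical-clock-ordered writes."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = 0
        self.data = {}   # key -> (value, (lamport_clock, node_id))

    def local_write(self, key, value):
        # Any node coordinates its own writes: no primary, no hot-spot.
        self.clock += 1
        ts = (self.clock, self.node_id)   # node_id breaks clock ties deterministically
        self.apply(key, value, ts)
        return key, value, ts             # message broadcast to the other replicas

    def apply(self, key, value, ts):
        # Last-writer-wins on the (clock, node_id) timestamp: replicas
        # converge regardless of the order messages arrive in.
        self.clock = max(self.clock, ts[0])
        current = self.data.get(key)
        if current is None or ts > current[1]:
            self.data[key] = (value, ts)

# Two replicas write the same key concurrently...
a, b = Replica(0), Replica(1)
msg_a = a.local_write("x", "from_a")   # stamped (1, 0)
msg_b = b.local_write("x", "from_b")   # stamped (1, 1)
# ...then deliver each other's messages; both converge to the same winner.
a.apply(*msg_b)
b.apply(*msg_a)
print(a.data["x"] == b.data["x"])  # True: both hold "from_b", the higher timestamp
```

The tie-break on node id gives a total order over writes without any coordination round.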
29. Protocols in Scale-out ccNUMA
Efficient RDMA implementation
Fully distributed writes via logical clocks
Two (per-key) strongly consistent flavours:
◦ Linearizability (Lin): 2 RTTs
  Broadcast Invalidations* (invalidate all caches), then Broadcast Updates*
◦ Sequential Consistency (SC): 1 RTT
  Broadcast Updates*
* along with logical (Lamport) clocks
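The 2-RTT Lin flavour can be sketched as two broadcast rounds. This toy Python model is synchronous and failure-free (the real protocol pipelines these rounds over RDMA and handles conflicting concurrent writes): round one invalidates every cached copy so no reader can return a stale value, round two broadcasts the new value and re-validates the caches.

```python
INVALID = object()   # sentinel: the cached copy is stale and must not be read

class CacheNode:
    def __init__(self):
        self.cache = {}   # key -> (value_or_INVALID, (lamport_clock, writer_id))

def lin_write(writer_id, clock, key, value, nodes):
    """Sketch of the invalidation-based (Lin) write flavour: 2 rounds."""
    ts = (clock, writer_id)

    # RTT 1: broadcast Invalidations. While a key is INVALID, a reader
    # must wait for the pending write instead of serving a stale value.
    for node in nodes:
        node.cache[key] = (INVALID, ts)

    # RTT 2: after all invalidation acks, broadcast Updates with the
    # new value, re-validating every cache.
    for node in nodes:
        node.cache[key] = (value, ts)
    return ts

nodes = [CacheNode() for _ in range(4)]
lin_write(writer_id=0, clock=1, key="k", value="v1", nodes=nodes)
print(all(n.cache["k"] == ("v1", (1, 0)) for n in nodes))  # True
```

The SC flavour skips the invalidation round and broadcasts timestamped updates directly, trading one RTT for the weaker (per-key sequential) guarantee.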
34. Performance
Both systems are network-bound
Lin: >3x throughput at low write ratios, 1.6x at 5% writes
SC: higher throughput at higher write ratios: 2.2x at 5% writes
[Figure: throughput vs. write ratio against the state-of-the-art, showing the >3x, 1.6x, and 2.2x speedups]
35. Conclusion
Scale-Out ccNUMA:
Distributed cache → best of Caching + NUMA
• Symmetric Caching:
◦ Load balances and filters skew
◦ Throughput scales with number of servers
◦ Less network b/w: most requests are local
• Fully distributed protocols:
◦ Efficient RDMA Implementation
◦ Fully distributed writes
◦ Two strong consistency guarantees
Up to 3x the performance of the state-of-the-art
while guaranteeing per-key linearizability