1. Latency Trumps All
Chris Saari
twitter.com/chrissaari
blog.chrissaari.com
saari@yahoo-inc.com
Thursday, November 19, 2009
2. Packet Latency
Time for a packet to get between points A and B
Physical distance + time queued in devices along the way
~60ms
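Not from the slides: a rough way to observe this number yourself. Timing a TCP connect approximates one network round trip; example.com and port 80 are arbitrary stand-ins.

    import socket
    import time

    # Time a TCP handshake as a rough proxy for one round trip.
    t0 = time.perf_counter()
    socket.create_connection(("example.com", 80), timeout=5).close()
    print(f"~{(time.perf_counter() - t0) * 1000:.0f} ms round trip")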
4. Anytime...
... the system is waiting for data
The system is end to end
- Human response time
- Network card buffering
- System bus/interconnect speed
- Interrupt handling
- Network stacks
- Process scheduling delays
- Application process waiting for data from memory to get
to CPU, or from disk to memory to CPU
- Routers, modems, last mile speeds
- Backbone speed and operating condition
- Inter-cluster/colo performance
5. Big Picture
(Diagram: User, CPU, Memory, Disk, Network)
7. Latency vs. Bandwidth
Bandwidth: bits / second
Latency: time
8. Bandwidth of a Truck Full of Tape
9. Latency Lags Bandwidth - David Patterson
(excerpt)
Given the record of advances in bandwidth versus latency, the logical question is why? Here are five technical reasons and one marketing reason.
1. Moore's Law helps bandwidth more than latency. The scaling of semiconductor processes provides both faster transistors and many more on a chip. Moore's Law predicts a periodic doubling in the number of transistors per chip, due to scaling and in part to larger chips; recently, that rate has been 22-24 months [6]. Bandwidth is helped by faster transistors, more transistors, and more pins operating in parallel. The faster transistors help latency, but the larger number of transistors and the relatively longer distances on the actually larger chips limit the benefits of scaling to latency. For example, ...
... Ethernet, no matter which actually provides better value. One can argue that greater advances in bandwidth led to marketing techniques to sell bandwidth that in turn trained customers to desire it. No matter what the real chain of events, unquestionably higher bandwidth for processors, memories, or the networks is easier to sell today than latency. Since bandwidth sells, engineering resources tend to be thrown at bandwidth, which further tips the balance.
4. Latency helps bandwidth. Technology improvements that help latency usually also help bandwidth, but not vice versa. For example, DRAM latency determines the number of accesses per second, so lower latency means more accesses per second and hence higher bandwidth. Also, spinning disks faster reduces the rotational latency, but the read head must read data at the new faster rate as well.
(Figure 1. Log-log plot of bandwidth and latency milestones from Table 1 relative to the first milestone.)
10. The Problem
Relative Data Access Latencies, Fastest to Slowest
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-100)
--- don’t cross this line, don’t go off the motherboard! ---
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)
13. Relative Data Access Latency
Lower to Higher: Register, L1, L2, RAM, Hard Disk, LAN, Floppy/CD-ROM, WAN
14. CPU Register
CPU register latency = average human height (the scale reference for the slides that follow)
18. Hard Drive
x 10M: 0.4 x equatorial circumference of the Earth
19. WAN
x 100M: 0.42 x Earth-to-Moon distance
20. To experience pain...
Mobile phone network latency is 2-10x that of wired
- iPhone 3G: 500ms ping
x 500M: 2 x Earth-to-Moon distance
22. Google SPDY
“It is designed specifically for
minimizing latency through features
such as multiplexed streams, request
prioritization and HTTP header
compression.”
23. Strategy Pattern: Move Data Up
Relative Data Access Latencies
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-50)
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)
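A minimal sketch of the Move Data Up idea from the table above, assuming a generic read-through cache; the class and the slow_fetch stand-in are illustrative, not from the talk.

    class ReadThroughCache:
        def __init__(self, slow_fetch):
            self.ram = {}                 # fast tier: main memory
            self.slow_fetch = slow_fetch  # slow tier: disk, LAN, or WAN

        def get(self, key):
            if key not in self.ram:       # miss: pay the slow-tier latency once
                self.ram[key] = self.slow_fetch(key)
            return self.ram[key]          # hit: RAM speed from here on

    cache = ReadThroughCache(slow_fetch=lambda k: k.upper())  # stand-in for a slow tier
    cache.get("profile:42")  # first call goes down the hierarchy
    cache.get("profile:42")  # second call never leaves RAM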
26. Let’s Dig In
Relative Data Access Latencies, Fastest to Slowest
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-100)
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)
30. Network
If you can’t Move Data Up, minimize accesses
Souders Performance Rules
1) Make fewer HTTP requests
- Avoid going halfway to the moon whenever possible
2) Use a content delivery network
- Edge caching gets data physically closer to the user
3) Add an expires header (sketch below)
- Instead of going halfway to the moon (network), climb Godzilla (RAM) or go 40% of the way around the Earth (disk)
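A sketch of rule 3, assuming Python's standard http.server; real deployments would set this in the web server config. A Cache-Control header with a far-future max-age plays the same role as an Expires header.

    from http.server import HTTPServer, SimpleHTTPRequestHandler

    class FarFutureHandler(SimpleHTTPRequestHandler):
        def end_headers(self):
            # Far-future caching header: repeat views hit the browser's
            # RAM/disk cache instead of the network.
            self.send_header("Cache-Control", "public, max-age=31536000")
            super().end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8000), FarFutureHandler).serve_forever()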
31. Network: Packets and Latency
Less data = fewer packets = less packet loss = lower latency
32. Network
1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components
34. Jim Gray, Microsoft 2006
Tape is Dead
Disk is Tape
Flash is Disk
RAM Locality is King
35. Strategy: Move Up: Disk to RAM
RAM gets you above the exponential latency line
- Linear cost and power consumption = $$$
Main memory (25-50)
Hard drive (1e7)
36. Strategy: Avoidance: Bloom Filters
- Probabilistic answer to whether an element is in a set
- Constant time via multiple hashes
- Constant-space bit string
- Used in BigTable, Cassandra, Squid (sketch below)
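A minimal Bloom filter sketch, assuming k hash functions derived from salted sha256 digests; the sizes are illustrative. A "no" answer is definite, while a "yes" may be a false positive, which is why it works as a cheap hint before paying disk or network latency.

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=1 << 20, k=5):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8)   # constant-space bit string

        def _hashes(self, item: bytes):
            # Derive k bit positions from salted sha256 digests.
            for i in range(self.k):
                h = hashlib.sha256(bytes([i]) + item).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, item: bytes):
            for pos in self._hashes(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item: bytes):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._hashes(item))

    bf = BloomFilter()
    bf.add(b"row-key-123")
    assert b"row-key-123" in bf   # always true once added
    print(b"missing-key" in bf)   # almost always False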
37. In Memory Indexes
Haystack keeps file system indexes in RAM (sketch below)
- Cut disk accesses per image from 3 to 1
Search index compression
GFS master node prefix compression of names
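A sketch of the in-RAM index idea, loosely after Haystack (the layout here is assumed, not Facebook's actual format): keeping {key: (offset, size)} in memory turns a read into one seek plus one read.

    import os

    index = {}  # key -> (offset, length), kept in RAM

    def append(f, key, blob):
        offset = f.seek(0, os.SEEK_END)   # sequential append
        f.write(blob)
        index[key] = (offset, len(blob))

    def read(f, key):
        offset, length = index[key]       # RAM lookup: no disk access
        f.seek(offset)                    # single seek...
        return f.read(length)             # ...single read

    with open("store.dat", "w+b") as f:
        append(f, "img1", b"\x89PNG...")
        print(read(f, "img1"))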
42. 1TB Random Read
(Calendar graphic: days 1 through 15, "Done!" on day 15. Randomly reading 1TB takes about two weeks.)
44. Strategy: Batching and Streaming
Fewer reads/writes, each of a large contiguous chunk of data (sketch below)
- GFS 64MB chunks
Requires data locality
- BigTable: app-specified data layout and compression
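A sketch of batched sequential reads; the 4MB chunk size and the file path are illustrative (GFS's 64MB chunks serve the same purpose at scale). A few large sequential transfers amortize the seek latency that many small random reads would each pay.

    CHUNK = 4 * 1024 * 1024   # 4MB here; GFS uses 64MB chunks

    def stream(path):
        # Few large contiguous reads instead of many small random ones.
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                yield chunk

    # usage: total = sum(len(c) for c in stream("some_large_file"))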
49. Multicore Makes It Worse!
More cores accelerate the rate of divergence
- CPU performance doubled 3x over the past 5 years
- Memory performance doubled once
50. Evolving CPU Memory Access Designs
Intel Nehalem: integrated memory controller and new high-speed interconnect
- 40 percent shorter latency and increased bandwidth; 4-6x faster system
51. More CPU evolution
Intel Nehalem-EX
- 8 cores, 24MB of cache, 2 integrated memory controllers
- Ring interconnect: an on-die network designed to speed the movement of data among the caches used by each of the cores
IBM POWER7
- 32MB Level 3 cache
AMD Magny-Cours
- 12 cores, 12MB of Level 3 cache
57. Cache Line Awareness
Linked list
- Each node as a separate allocation is bad
Hash table
- Reprobe on collision with a stride of 1 (sketch below)
Stack allocation
- The top of the stack is usually in cache; the top of the heap usually is not
Pipeline processing
- Do all stages of operations on a piece of data at once vs. each stage separately
Optimize for size
- Might execute faster than code optimized for speed
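A sketch of stride-1 reprobing (linear probing). Python hides real memory layout, so this only illustrates the access pattern; in C, a flat array of slots would keep successive probes inside the cache line already fetched.

    class LinearProbeTable:
        def __init__(self, capacity=1024):
            self.slots = [None] * capacity   # one flat, contiguous array

        def put(self, key, value):
            i = hash(key) % len(self.slots)
            while self.slots[i] is not None and self.slots[i][0] != key:
                i = (i + 1) % len(self.slots)   # stride-1 reprobe: next slot over
            self.slots[i] = (key, value)

        def get(self, key):
            i = hash(key) % len(self.slots)
            while self.slots[i] is not None:
                if self.slots[i][0] == key:
                    return self.slots[i][1]
                i = (i + 1) % len(self.slots)
            raise KeyError(key)

    t = LinearProbeTable()
    t.put("a", 1)
    print(t.get("a"))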
58. Cycles to Burn
1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components
- Use excess compute for compression (sketch below)
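A sketch of rule 4 with Python's standard gzip module; the payload is a stand-in. Spending spare CPU cycles here shrinks the bytes on the wire, and fewer bytes means fewer packets.

    import gzip

    body = ("<html>" + "hello world " * 1000 + "</html>").encode("utf-8")
    squeezed = gzip.compress(body, compresslevel=6)  # burn CPU, save packets
    print(f"{len(body)} -> {len(squeezed)} bytes")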
60. Datacenter Storage Hierarchy
Storage hierarchy: a different view
- Jeff Dean, Google
A bumpy ride that has been getting bumpier over time
67. Memcached Facebook Optimizations
- UDP to reduce network traffic - Fewer Packets
- One core saturated with network interrupt handling
- Opportunistic polling of the network interfaces and setting interrupt coalescing thresholds aggressively - Batching
- Contention on the network device transmit queue lock; packets added/removed from the queue one at a time
- Changed the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets
- More lock contention fixes
- Result: 200,000 UDP requests/second with an average latency of 173 microseconds
73. Google BigTable
Table contains a sequence of blocks
- Block index loaded into memory - Move Up
Table can be completely mapped into memory - Move Up
Bloom filters hint for data - Move Up
Locality groups loaded in memory - Move Up, Batching
- Clients can control compression of locality groups
2 levels of caching - Move Up
- Scan cache of key/value pairs and block cache
Clients cache tablet server locations (sketch below)
- 3 to 6 network trips if the cache is invalid - Move Up
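A hedged sketch of the client-side location cache (not Google's actual client; the names and error type are illustrative): the common-case read costs zero extra network trips, and a stale entry is dropped and refetched through the slow lookup path.

    class Stale(Exception):
        pass

    def slow_lookup(tablet):
        # stand-in for the 3-6 network round trips of a full location lookup
        return f"server-for-{tablet}"

    location_cache = {}  # tablet -> server address, cached in the client

    def read_row(tablet, key, rpc):
        if tablet not in location_cache:              # slow path, paid rarely
            location_cache[tablet] = slow_lookup(tablet)
        try:
            return rpc(location_cache[tablet], key)   # fast path: cached location
        except Stale:
            del location_cache[tablet]                # stale entry: drop and retry
            return read_row(tablet, key, rpc)

    print(read_row("t1", "row7", rpc=lambda server, k: f"{k}@{server}"))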
78. Facebook Cassandra
Bloom filters used for keys in files on disk - Move Up
Sequential disk access only - Batching
- Append without read-ahead
Log to memory and write to a commit log on a dedicated disk - Batching (sketch below)
Programmer-controlled data layout for locality - Batching
Result: 2 orders of magnitude better performance than MySQL
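A sketch of the memtable-plus-commit-log write path (a simplified version of the pattern, not Cassandra's code): every write is a RAM update plus a sequential append, so the write path never performs a random disk read.

    memtable = {}  # reads served from RAM

    def write(key, value, log):
        log.write(f"{key}\t{value}\n".encode())   # sequential append only
        log.flush()                               # durable before acking
        memtable[key] = value

    with open("commit.log", "ab") as log:
        write("user:1", "alice", log)
    print(memtable["user:1"])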
79. Move the Compute to the Data: YQL Execute
83. From the Browser Perspective
Performance bounded by 3 things:
- Fetch time
- Unless you’re bundling everything, it is a cascade of interdependent requests, at least 2 phases’ worth
- Parse time
- HTML
- CSS
- JavaScript
- Execution time
- JavaScript execution
- DOM construction and layout
- Style application
84. Recap
Move Data Up
- Caching
- Compression
- If You Can’t Move All The Data Up
- Indexes
- Bloom filters
Batching and Streaming
- Maximize locality
85. Take 2 And Call Me In The Morning
An Engineer’s Guide to Bandwidth
- http://developer.yahoo.net/blog/archives/2009/10/a_engineers_gui.html
High Performance Web Sites
- Steve Souders
Even Faster Web Sites
- Steve Souders
Managing Gigabytes: Compressing and Indexing
Documents and Images
- Witten, Moffat, Bell
Yahoo Query Language (YQL)
- http://developer.yahoo.com/yql/