1. Latency Trumps All
Chris Saari
twitter.com/chrissaari
blog.chrissaari.com
saari@yahoo-inc.com
Thursday, November 19, 2009
2. Packet Latency
Time for a packet to get between points A and B
Physical distance + time queued in devices along the way
~60ms
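Not from the slides: a rough way to observe this number yourself. Timing a TCP connect approximates one network round trip; example.com and port 80 are arbitrary stand-ins.

    import socket
    import time

    # Time a TCP handshake as a rough proxy for one round trip.
    t0 = time.perf_counter()
    socket.create_connection(("example.com", 80), timeout=5).close()
    print(f"~{(time.perf_counter() - t0) * 1000:.0f} ms round trip")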
4. Anytime...
... the system is waiting for data
The system is end to end
- Human response time
- Network card buffering
- System bus/interconnect speed
- Interrupt handling
- Network stacks
- Process scheduling delays
- Application process waiting for data from memory to get
to CPU, or from disk to memory to CPU
- Routers, modems, last mile speeds
- Backbone speed and operating condition
- Inter-cluster/colo performance
5. Big Picture
(Diagram: User, CPU, Memory, Disk, Network)
7. Latency vs. Bandwidth
Bandwidth: bits / second
Latency: time
8. Bandwidth of a Truck Full of Tape
9. Latency Lags Bandwidth - David Patterson
(excerpt)
Given the record of advances in bandwidth versus latency, the logical question is why? Here are five technical reasons and one marketing reason.
1. Moore's Law helps bandwidth more than latency. The scaling of semiconductor processes provides both faster transistors and many more on a chip. Moore's Law predicts a periodic doubling in the number of transistors per chip, due to scaling and in part to larger chips; recently, that rate has been 22-24 months [6]. Bandwidth is helped by faster transistors, more transistors, and more pins operating in parallel. The faster transistors help latency, but the larger number of transistors and the relatively longer distances on the actually larger chips limit the benefits of scaling to latency. For example, ...
... Ethernet, no matter which actually provides better value. One can argue that greater advances in bandwidth led to marketing techniques to sell bandwidth that in turn trained customers to desire it. No matter what the real chain of events, unquestionably higher bandwidth for processors, memories, or the networks is easier to sell today than latency. Since bandwidth sells, engineering resources tend to be thrown at bandwidth, which further tips the balance.
4. Latency helps bandwidth. Technology improvements that help latency usually also help bandwidth, but not vice versa. For example, DRAM latency determines the number of accesses per second, so lower latency means more accesses per second and hence higher bandwidth. Also, spinning disks faster reduces the rotational latency, but the read head must read data at the new faster rate as well.
(Figure 1. Log-log plot of bandwidth and latency milestones from Table 1 relative to the first milestone.)
10. The Problem
Relative Data Access Latencies, Fastest to Slowest
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-100)
--- don’t cross this line, don’t go off the motherboard! ---
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)
13. Relative Data Access Latency
Lower to Higher: Register, L1, L2, RAM, Hard Disk, LAN, Floppy/CD-ROM, WAN
14. CPU Register
CPU register latency = average human height (the scale reference for the slides that follow)
18. Hard Drive
x 10M: 0.4 x equatorial circumference of the Earth
19. WAN
x 100M: 0.42 x Earth-to-Moon distance
20. To experience pain...
Mobile phone network latency is 2-10x that of wired
- iPhone 3G: 500ms ping
x 500M: 2 x Earth-to-Moon distance
22. Google SPDY
“It is designed specifically for
minimizing latency through features
such as multiplexed streams, request
prioritization and HTTP header
compression.”
23. Strategy Pattern: Move Data Up
Relative Data Access Latencies
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-50)
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)
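A minimal sketch of the Move Data Up idea from the table above, assuming a generic read-through cache; the class and the slow_fetch stand-in are illustrative, not from the talk.

    class ReadThroughCache:
        def __init__(self, slow_fetch):
            self.ram = {}                 # fast tier: main memory
            self.slow_fetch = slow_fetch  # slow tier: disk, LAN, or WAN

        def get(self, key):
            if key not in self.ram:       # miss: pay the slow-tier latency once
                self.ram[key] = self.slow_fetch(key)
            return self.ram[key]          # hit: RAM speed from here on

    cache = ReadThroughCache(slow_fetch=lambda k: k.upper())  # stand-in for a slow tier
    cache.get("profile:42")  # first call goes down the hierarchy
    cache.get("profile:42")  # second call never leaves RAM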
26. Let’s Dig In
Relative Data Access Latencies, Fastest to Slowest
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-100)
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)
30. Network
If you can’t Move Data Up, minimize accesses
Souders Performance Rules
1) Make fewer HTTP requests
- Avoid going halfway to the moon whenever possible
2) Use a content delivery network
- Edge caching gets data physically closer to the user
3) Add an expires header (sketch below)
- Instead of going halfway to the moon (network), climb Godzilla (RAM) or go 40% of the way around the Earth (disk)
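A sketch of rule 3, assuming Python's standard http.server; real deployments would set this in the web server config. A Cache-Control header with a far-future max-age plays the same role as an Expires header.

    from http.server import HTTPServer, SimpleHTTPRequestHandler

    class FarFutureHandler(SimpleHTTPRequestHandler):
        def end_headers(self):
            # Far-future caching header: repeat views hit the browser's
            # RAM/disk cache instead of the network.
            self.send_header("Cache-Control", "public, max-age=31536000")
            super().end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8000), FarFutureHandler).serve_forever()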
31. Network: Packets and Latency
Less data = fewer packets = less packet loss = lower latency
32. Network
1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components
34. Jim Gray, Microsoft 2006
Tape is Dead
Disk is Tape
Flash is Disk
RAM Locality is King
35. Strategy: Move Up: Disk to RAM
RAM gets you above the exponential latency line
- Linear cost and power consumption = $$$
Main memory (25-50)
Hard drive (1e7)
36. Strategy: Avoidance: Bloom Filters
- Probabilistic answer to whether an element is in a set
- Constant time via multiple hashes
- Constant-space bit string
- Used in BigTable, Cassandra, Squid (sketch below)
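A minimal Bloom filter sketch, assuming k hash functions derived from salted sha256 digests; the sizes are illustrative. A "no" answer is definite, while a "yes" may be a false positive, which is why it works as a cheap hint before paying disk or network latency.

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits=1 << 20, k=5):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8)   # constant-space bit string

        def _hashes(self, item: bytes):
            # Derive k bit positions from salted sha256 digests.
            for i in range(self.k):
                h = hashlib.sha256(bytes([i]) + item).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, item: bytes):
            for pos in self._hashes(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item: bytes):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._hashes(item))

    bf = BloomFilter()
    bf.add(b"row-key-123")
    assert b"row-key-123" in bf   # always true once added
    print(b"missing-key" in bf)   # almost always False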
37. In Memory Indexes
Haystack keeps file system indexes in RAM (sketch below)
- Cut disk accesses per image from 3 to 1
Search index compression
GFS master node prefix compression of names
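A sketch of the in-RAM index idea, loosely after Haystack (the layout here is assumed, not Facebook's actual format): keeping {key: (offset, size)} in memory turns a read into one seek plus one read.

    import os

    index = {}  # key -> (offset, length), kept in RAM

    def append(f, key, blob):
        offset = f.seek(0, os.SEEK_END)   # sequential append
        f.write(blob)
        index[key] = (offset, len(blob))

    def read(f, key):
        offset, length = index[key]       # RAM lookup: no disk access
        f.seek(offset)                    # single seek...
        return f.read(length)             # ...single read

    with open("store.dat", "w+b") as f:
        append(f, "img1", b"\x89PNG...")
        print(read(f, "img1"))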
42. 1TB Random Read
(Calendar graphic: days 1 through 15, "Done!" on day 15. Randomly reading 1TB takes about two weeks.)
44. Strategy: Batching and Streaming
Fewer reads/writes, each of a large contiguous chunk of data (sketch below)
- GFS 64MB chunks
Requires data locality
- BigTable: app-specified data layout and compression
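A sketch of batched sequential reads; the 4MB chunk size and the file path are illustrative (GFS's 64MB chunks serve the same purpose at scale). A few large sequential transfers amortize the seek latency that many small random reads would each pay.

    CHUNK = 4 * 1024 * 1024   # 4MB here; GFS uses 64MB chunks

    def stream(path):
        # Few large contiguous reads instead of many small random ones.
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                yield chunk

    # usage: total = sum(len(c) for c in stream("some_large_file"))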
49. Multicore Makes It Worse!
More cores accelerate the rate of divergence
- CPU performance doubled 3x over the past 5 years
- Memory performance doubled once
50. Evolving CPU Memory Access Designs
Intel Nehalem: integrated memory controller and new high-speed interconnect
- 40 percent shorter latency and increased bandwidth; 4-6x faster system
51. More CPU evolution
Intel Nehalem-EX
- 8 cores, 24MB of cache, 2 integrated memory controllers
- Ring interconnect: an on-die network designed to speed the movement of data among the caches used by each of the cores
IBM POWER7
- 32MB Level 3 cache
AMD Magny-Cours
- 12 cores, 12MB of Level 3 cache
57. Cache Line Awareness
Linked list
- Each node as a separate allocation is bad
Hash table
- Reprobe on collision with a stride of 1 (sketch below)
Stack allocation
- The top of the stack is usually in cache; the top of the heap usually is not
Pipeline processing
- Do all stages of operations on a piece of data at once vs. each stage separately
Optimize for size
- Might execute faster than code optimized for speed
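A sketch of stride-1 reprobing (linear probing). Python hides real memory layout, so this only illustrates the access pattern; in C, a flat array of slots would keep successive probes inside the cache line already fetched.

    class LinearProbeTable:
        def __init__(self, capacity=1024):
            self.slots = [None] * capacity   # one flat, contiguous array

        def put(self, key, value):
            i = hash(key) % len(self.slots)
            while self.slots[i] is not None and self.slots[i][0] != key:
                i = (i + 1) % len(self.slots)   # stride-1 reprobe: next slot over
            self.slots[i] = (key, value)

        def get(self, key):
            i = hash(key) % len(self.slots)
            while self.slots[i] is not None:
                if self.slots[i][0] == key:
                    return self.slots[i][1]
                i = (i + 1) % len(self.slots)
            raise KeyError(key)

    t = LinearProbeTable()
    t.put("a", 1)
    print(t.get("a"))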
58. Cycles to Burn
1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components
- Use excess compute for compression (sketch below)
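A sketch of rule 4 with Python's standard gzip module; the payload is a stand-in. Spending spare CPU cycles here shrinks the bytes on the wire, and fewer bytes means fewer packets.

    import gzip

    body = ("<html>" + "hello world " * 1000 + "</html>").encode("utf-8")
    squeezed = gzip.compress(body, compresslevel=6)  # burn CPU, save packets
    print(f"{len(body)} -> {len(squeezed)} bytes")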
60. Datacenter Storage Hierarchy
Storage hierarchy: a different view
- Jeff Dean, Google
A bumpy ride that has been getting bumpier over time
67. Memcached Facebook Optimizations
- UDP to reduce network traffic - Fewer Packets
- One core saturated with network interrupt handling
- Opportunistic polling of the network interfaces and setting interrupt coalescing thresholds aggressively - Batching
- Contention on the network device transmit queue lock; packets added/removed from the queue one at a time
- Changed the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets
- More lock contention fixes
- Result: 200,000 UDP requests/second with an average latency of 173 microseconds
73. Google BigTable
Table contains a sequence of blocks
- Block index loaded into memory - Move Up
Table can be completely mapped into memory - Move Up
Bloom filters hint for data - Move Up
Locality groups loaded in memory - Move Up, Batching
- Clients can control compression of locality groups
2 levels of caching - Move Up
- Scan cache of key/value pairs and block cache
Clients cache tablet server locations (sketch below)
- 3 to 6 network trips if the cache is invalid - Move Up
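A hedged sketch of the client-side location cache (not Google's actual client; the names and error type are illustrative): the common-case read costs zero extra network trips, and a stale entry is dropped and refetched through the slow lookup path.

    class Stale(Exception):
        pass

    def slow_lookup(tablet):
        # stand-in for the 3-6 network round trips of a full location lookup
        return f"server-for-{tablet}"

    location_cache = {}  # tablet -> server address, cached in the client

    def read_row(tablet, key, rpc):
        if tablet not in location_cache:              # slow path, paid rarely
            location_cache[tablet] = slow_lookup(tablet)
        try:
            return rpc(location_cache[tablet], key)   # fast path: cached location
        except Stale:
            del location_cache[tablet]                # stale entry: drop and retry
            return read_row(tablet, key, rpc)

    print(read_row("t1", "row7", rpc=lambda server, k: f"{k}@{server}"))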
78. Facebook Cassandra
Bloom filters used for keys in files on disk - Move Up
Sequential disk access only - Batching
- Append without read-ahead
Log to memory and write to a commit log on a dedicated disk - Batching (sketch below)
Programmer-controlled data layout for locality - Batching
Result: 2 orders of magnitude better performance than MySQL
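A sketch of the memtable-plus-commit-log write path (a simplified version of the pattern, not Cassandra's code): every write is a RAM update plus a sequential append, so the write path never performs a random disk read.

    memtable = {}  # reads served from RAM

    def write(key, value, log):
        log.write(f"{key}\t{value}\n".encode())   # sequential append only
        log.flush()                               # durable before acking
        memtable[key] = value

    with open("commit.log", "ab") as log:
        write("user:1", "alice", log)
    print(memtable["user:1"])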
79. Move the Compute to the Data: YQL Execute
83. From the Browser Perspective
Performance bounded by 3 things:
- Fetch time
- Unless you’re bundling everything, it is a cascade of interdependent requests, at least 2 phases’ worth
- Parse time
- HTML
- CSS
- JavaScript
- Execution time
- JavaScript execution
- DOM construction and layout
- Style application
84. Recap
Move Data Up
- Caching
- Compression
- If You Can’t Move All The Data Up
- Indexes
- Bloom filters
Batching and Streaming
- Maximize locality
85. Take 2 And Call Me In The Morning
An Engineer’s Guide to Bandwidth
- http://developer.yahoo.net/blog/archives/2009/10/a_engineers_gui.html
High Performance Web Sites
- Steve Souders
Even Faster Web Sites
- Steve Souders
Managing Gigabytes: Compressing and Indexing
Documents and Images
- Witten, Moffat, Bell
Yahoo Query Language (YQL)
- http://developer.yahoo.com/yql/