Latency Trumps All
                              Chris Saari
                              twitter.com/chrissaari
                              blog.chrissaari.com
                              saari@yahoo-inc.com




Thursday, November 19, 2009
Packet Latency
   Time for a packet to get between points A and B
   Physical distance + time queued in devices along the way




                                    ~60ms




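A rough way to see where a number like ~60ms comes from (a sketch with my own assumed figures, not numbers from the deck):

```python
# Back-of-the-envelope propagation delay. The distance and fiber speed are
# assumptions for illustration.
distance_km = 4_000          # e.g. a US coast-to-coast fiber path (assumed)
fiber_km_per_s = 200_000     # light in fiber travels at roughly 2/3 of c

one_way_ms = distance_km / fiber_km_per_s * 1000
print(f"one way: {one_way_ms:.0f} ms, round trip: {2 * one_way_ms:.0f} ms")
# ~20 ms one way / ~40 ms round trip from physics alone; the rest of an
# observed ~60 ms ping is time queued in routers, modems, and end hosts.
```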
...




Anytime...
   ... the system is waiting for data
   The system is end to end
    - Human response time
    - Network card buffering
    - System bus/interconnect speed
    - Interrupt handling
    - Network stacks
    - Process scheduling delays
    - Application process waiting for data from memory to get
        to CPU, or from disk to memory to CPU
    - Routers, modems, last mile speeds
    - Backbone speed and operating condition
    - Inter-cluster/colo performance
Big Picture




                              (Diagram: User, Network, CPU, Memory, Disk)
Tubes?




Latency vs. Bandwidth




                       Bandwidth: bits / second
                       Latency:   time




Bandwidth of a Truck Full of Tape




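The classic way to put numbers on this slide's point. All figures below are my assumptions, not the presenter's:

```python
# "Never underestimate the bandwidth of a truck full of tapes."
tapes = 10_000               # cartridges per truck (assumed)
tb_per_tape = 0.8            # ~800 GB per 2009-era LTO-4 cartridge (assumed)
drive_hours = 48             # coast-to-coast drive (assumed)

bits = tapes * tb_per_tape * 1e12 * 8
throughput_gbps = bits / (drive_hours * 3600) / 1e9
print(f"throughput ~ {throughput_gbps:.0f} Gbit/s, "
      f"latency ~ {drive_hours} hours to the first byte")
# Huge bandwidth, dreadful latency: the two are not the same thing.
```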
Latency Lags Bandwidth -David Patterson

   "Given the record of advances in bandwidth versus latency, the logical
   question is why? Here are five technical reasons and one marketing reason.
      1. Moore's Law helps bandwidth more than latency. The scaling of
   semiconductor processes provides both faster transistors and many more on a
   chip. Moore's Law predicts a periodic doubling in the number of transistors
   per chip, due to scaling and in part to larger chips; recently, that rate
   has been 22-24 months [6]. Bandwidth is helped by faster transistors, more
   transistors, and more pins operating in parallel. The faster transistors
   help latency, but the larger number of transistors and the relatively longer
   distances on the actually larger chips limit the benefits of scaling to
   latency. For example, [...]

   [...] Ethernet, no matter which actually provides better value. One can
   argue that greater advances in bandwidth led to marketing techniques to sell
   bandwidth that in turn trained customers to desire it. No matter what the
   real chain of events, unquestionably higher bandwidth for processors,
   memories, or the networks is easier to sell today than latency. Since
   bandwidth sells, engineering resources tend to be thrown at bandwidth, which
   further tips the balance.
      4. Latency helps bandwidth. Technology improvements that help latency
   usually also help bandwidth, but not vice versa. For example, DRAM latency
   determines the number of accesses per second, so lower latency means more
   accesses per second and hence higher bandwidth. Also, spinning disks faster
   reduces the rotational latency, but the read head must read data at the new
   faster rate as well."

   (Figure 1. Log-log plot of bandwidth and latency milestones from Table 1
   relative to the first milestone.)
The Problem


         Relative Data Access Latencies, Fastest to Slowest
           - CPU Registers (1)
           - L1 Cache (1-2)
           - L2 Cache (6-10)
           - Main memory (25-100)
        --- don’t cross this line, don’t go off mother board! ---
           - Hard drive (1e7)
           - LAN (1e7-1e8)
           - WAN (1e9-2e9)



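A sketch of the scaling the next few slides use: if a register access were average human height, everything else stretches out like this (the ~1.7 m figure is my assumption; the WAN multiplier uses the "x 100 M" from the WAN slide later in the deck):

```python
# Scale the slide's relative latencies so a CPU register access (1x) is ~1.7 m.
human_height_m = 1.7
relative = {
    "CPU register": 1,
    "L1 cache":     2,
    "L2 cache":     10,
    "Main memory":  100,
    "Hard drive":   1e7,
    "WAN":          1e8,
}
for name, r in relative.items():
    print(f"{name:13s} {r * human_height_m:>15,.0f} m")
# Hard drive ~ 17,000 km (~0.4x Earth's equatorial circumference);
# WAN ~ 170,000 km (~0.4x the Earth-Moon distance).
```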
Relative Data Access Latency
                               Fast           Slow




              CPU Register        L1         L2      RAM



Relative Data Access Latency
                                   Fast        Slow




              CPU Register    L1          L2      RAM   Hard Disk



Relative Data Access Latency
                                   Lower                     Higher




                 Register     L1   L2      RAM   Hard Disk   LAN   Floppy/CD-ROM   WAN



CPU Register
   CPU Register Latency - Average Human Height




L1 Cache




L2 Cache




               x 6    to    x 10




RAM




                              x 25   to   x 100




Hard Drive

                                0.4 x equatorial
                                circumference of
                                Earth




                     x 10 M




WAN




                                      x 100 M

                              0.42 x Earth to Moon Distance




To experience pain...
   Mobile phone network latency 2-10x that of wired
    - iPhone 3G 500ms ping


                                     x 500 M

                              2 x Earth to Moon Distance




500ms isn’t that long...




Google SPDY




                              “It is designed specifically for
                              minimizing latency through features
                              such as multiplexed streams, request
                              prioritization and HTTP header
                              compression.”




Strategy Pattern: Move Data Up

    Relative Data Access Latencies
         -     CPU Registers (1)
         -     L1 Cache (1-2)
         -     L2 Cache (6-10)
         -     Main memory (25-50)

         -     Hard drive (1e7)
         -     LAN (1e7-1e8)
         -     WAN (1e9-2e9)




Batching: Do it Once




Batching: Maximize Data Locality




Let’s Dig In


                Relative Data Access Latencies, Fastest to Slowest
                 - CPU Registers (1)
                 - L1 Cache (1-2)
                 - L2 Cache (6-10)
                 - Main memory (25-100)

                     -        Hard drive (1e7)
                     -        LAN (1e7-1e8)
                     -        WAN (1e9-2e9)




Network
   If you can’t Move Data Up, minimize accesses

   Souders Performance Rules
   1) Make fewer HTTP requests
    - Avoid going halfway to the moon whenever possible
   2) Use a content delivery network
    - Edge caching gets data physically closer to the user
   3) Add an expires header
    - Instead of going halfway to the moon (Network),
      climb Godzilla (RAM) or go 40% of the way around
      the Earth (Disk)




Network: Packets and Latency




       Less data = fewer packets = less packet loss = less latency




Network
       1) Make fewer HTTP requests
       2) Use a content delivery network
       3) Add an expires header
       4) Gzip components




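A quick sketch of rule 4's effect on packet count (the payload is made up; a 1,500-byte Ethernet MTU is assumed):

```python
import gzip
import json

# Compress a typical repetitive JSON payload and count ~1,500-byte packets.
payload = json.dumps(
    [{"id": i, "name": f"user{i}", "active": True} for i in range(500)]
).encode()
compressed = gzip.compress(payload)

MTU = 1500
for label, data in [("plain", payload), ("gzip", compressed)]:
    packets = -(-len(data) // MTU)   # ceiling division
    print(f"{label}: {len(data)} bytes -> {packets} packets")
# Fewer bytes means fewer packets, fewer chances of loss, and less time
# spent waiting on retransmits.
```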
Disk: Falling off the Latency Cliff




Jim Gray, Microsoft 2006




                               Tape is Dead
                               Disk is Tape
                               Flash is Disk
                               RAM Locality is King




Strategy: Move Up: Disk to RAM
   RAM gets you above the exponential latency line
    - Linear cost and power consumption = $$$




                              Main memory (25-50)
                              Hard drive (1e7)




Strategy: Avoidance: Bloom Filters
         - Probabilistic answer to whether a member is in a set
          - Constant time via multiple hashes
          - Constant space bit string
        - Used in BigTable, Cassandra, Squid




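A minimal sketch of the idea, not the BigTable/Cassandra/Squid implementations:

```python
import hashlib

class BloomFilter:
    """Constant-space, constant-time membership hint: "maybe present" can be
    a false positive; "absent" is definite, so the disk read can be skipped."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # k independent hash positions derived from one cryptographic hash
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
bf.add("row:12345")
print(bf.might_contain("row:12345"))   # True
print(bf.might_contain("row:99999"))   # almost certainly False -> skip the disk
```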
In Memory Indexes
   Haystack keeps file system indexes in RAM
     - Cut disk accesses per image from 3 to 1
   Search index compression
   GFS master node prefix compression of names




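A toy sketch of the prefix-compression trick for sorted names; this is my own illustration of the idea, not GFS's actual format:

```python
# Front coding: store (shared prefix length, suffix) for each name in a
# sorted list, so long common path prefixes cost almost nothing in RAM.
def prefix_compress(sorted_names):
    prev, out = "", []
    for name in sorted_names:
        common = 0
        while common < min(len(prev), len(name)) and prev[common] == name[common]:
            common += 1
        out.append((common, name[common:]))
        prev = name
    return out

names = ["/logs/2009/11/18/part-0000",
         "/logs/2009/11/18/part-0001",
         "/logs/2009/11/19/part-0000"]
print(prefix_compress(names))
```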
Managing Gigabytes -Witten, Moffat, and Bell




SSDs




                                   Disk                         SSD

       I/O Ops / Sec               ~ 70 - 100                   ~ 10K - 100K
                                   (~ 180 - 200 at 15K RPM)

       Seek times                  ~ 7 - 3.2 ms                 ~ 0.085 - 0.05 ms


       SSDs < 1/5th power consumption of spinning disk




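Using roughly the figures in the table above (the device labels are my reading of them), here is what they mean for a million random IOs:

```python
# Time for 1,000,000 random IOs at the rough rates from the table above.
ios = 1_000_000
for device, iops in [("commodity disk", 85), ("15K RPM disk", 190), ("SSD", 50_000)]:
    hours = ios / iops / 3600
    print(f"{device:15s} {hours:6.2f} hours")
# ~3 hours on a commodity disk, ~1.5 hours at 15K RPM, ~20 seconds on an SSD.
```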
Sequential vs. Random Disk Access




                                                   - James Hamilton



1TB Sequential Read




1TB Random Read

            (Calendar graphic: days 1 through 15, with "Done!" on day 15 -
            the random read of the same 1TB takes roughly two weeks.)




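The arithmetic behind the two slides above, with assumed figures (not James Hamilton's exact numbers):

```python
# Sequential vs. random read of 1TB, back of the envelope.
TB = 1e12
seq_bytes_per_s = 100e6          # ~100 MB/s sustained sequential read (assumed)
random_iops, page = 200, 4096    # ~200 random IOs/s, 4 KB per read (assumed)

print(f"sequential: {TB / seq_bytes_per_s / 3600:.1f} hours")
print(f"random:     {TB / page / random_iops / 86400:.1f} days")
# ~3 hours sequentially vs. roughly two weeks of 4 KB random reads - the
# calendar on the previous slide.
```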
Strategy: Batching and Streaming
   Fewer reads/writes of large contiguous chunks of data
    - GFS 64MB chunks




Strategy: Batching and Streaming
   Fewer reads/writes of large contiguous chunks of data
    - GFS 64MB chunks
   Requires data locality
    - BigTable app specified data layout and compression




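A small sketch of reading in large contiguous chunks; the 64 MB figure is the GFS chunk size from the slide, and the file path is hypothetical:

```python
# Stream a large file in big contiguous chunks instead of many small reads.
CHUNK = 64 * 1024 * 1024   # 64 MB

def read_chunks(path, chunk_size=CHUNK):
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            yield block

# Usage (hypothetical path and processing function):
# for block in read_chunks("/data/access.log"):
#     process(block)
```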
The CPU




“CPU Bound”




                              Data in RAM        CPU access to that data

The Memory Wall




Latency Lags Bandwidth




                                            -Dave Patterson




Multicore Makes It Worse!
   More cores accelerate the rate of divergence
    - CPU performance doubled 3x over the past 5 years
    - Memory performance doubled once




Evolving CPU Memory Access Designs
   Intel Nehalem: integrated memory controller and new high-speed interconnect
   40 percent shorter latency and increased bandwidth,
    4-6x faster system




More CPU evolution
   Intel Nehalem-EX
    - 8 core, 24MB of cache, 2 integrated memory controllers
       - ring interconnect on-die network designed to speed
         the movement of data among the caches used by
         each of the cores
   IBM Power 7
    - 32MB Level 3 cache
   AMD Magny-Cours
    - 12 cores, 12MB of Level 3 cache




Cache Hit Ratio




Cache Line Awareness

   Linked list
    - Each node as a separate allocation is Bad
   Hash table
    - Reprobe on collision with stride of 1
   Stack allocation
    - Top of stack is usually in cache, top of the heap is
      usually not in cache
   Pipeline processing
    - Do all the stages of operations on a piece of data at
      once vs. running each stage separately over all the data
   Optimize for size
    - Might be faster execution than code optimized for speed


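As a concrete example of the hash-table point above, a toy open-addressing table that reprobes with a stride of 1. In a systems language the probe walk stays within a few adjacent cache lines; Python hides the memory effect, but the structure is the same:

```python
class LinearProbingTable:
    """Open-addressing hash table with stride-1 reprobing (no resizing; demo only)."""

    def __init__(self, capacity=64):
        self.slots = [None] * capacity

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while True:
            yield i
            i = (i + 1) % len(self.slots)   # stride-1 reprobe: the next slot over

    def put(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]

t = LinearProbingTable()
t.put("a", 1)
t.put("b", 2)
print(t.get("a"), t.get("b"), t.get("missing"))
```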
Cycles to Burn
       1) Make fewer HTTP requests
       2) Use a content delivery network
       3) Add an expires header
       4) Gzip components
        - Use excess compute for compression




Datacenter




Datacenter Storage Hierarchy
            Storage hierarchy: a different view




                                                                 - Jeff Dean, Google
                              A bumpy ride that has been getting bumpier over time

Intra-Datacenter Round Trip

                                  ~500 miles
                                  ~NYC to Columbus, OH




                      x 500,000




Datacenter Level Systems



                 RethinkDB, Facebook Haystack, HBase, memcached,
                 Google File System, Yahoo Sherpa, Facebook Cassandra,
                 Sawzall / Pig, Redis, Project Voldemort, MonetDB,
                 Google BigTable



Memcached Facebook Optimizations
        -     UDP to reduce network traffic - Fewer Packets
        -     One core saturated with network interrupt handling
              - opportunistic polling of the network interfaces and
                setting interrupt coalescing thresholds aggressively -
                Batching
        -     Contention on network device transmit queue lock,
              packets added/removed from the queue one at a time
              - Change dequeue algorithm to batch dequeues for
                transmit, drop the queue lock, and then transmit the
                batched packets
        -     More lock contention fixes

        -     Result: 200,000 UDP requests/second with average
              latency of 173 microseconds


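A sketch of the "batch the dequeue, then transmit outside the lock" idea from the third bullet; the names and the send stub are mine, not Facebook's code:

```python
import threading
from collections import deque

tx_queue = deque()
tx_lock = threading.Lock()

def send(packet):
    """Stand-in for the real network transmit."""
    pass

def transmit_batched():
    # One short critical section drains everything queued so far...
    with tx_lock:
        batch = list(tx_queue)
        tx_queue.clear()
    # ...then the slow transmits happen without holding the lock.
    for packet in batch:
        send(packet)
```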
Google BigTable
   Table contains a sequence of blocks
    - block index loaded into memory - Move Up
   Table can be completely mapped into memory - Move Up
   Bloom filters hint for data - Move Up
   Locality groups loaded in memory - Move Up, Batching
    - Clients can control compression of locality groups
   2 levels of caching - Move Up
    - Scan cache of key/value pairs and block cache
   Clients cache tablet server locations
    - 3 to 6 network trips if cache is invalid - Move Up



Facebook Cassandra
   Bloom filters used for keys in files on disk - Move Up
   Sequential disk access only - Batching
   Append w/o read ahead
   Log to memory and write to commit log on dedicated disk -
    Batching
   Programmer controlled data layout for locality - Batching

   Result: 2 orders of magnitude better performance than
    MySQL




Move the Compute to the Data: YQL Execute




From the Browser Perspective
   Performance bounded by 3 things:
    - Fetch time
      - Unless you’re bundling everything it is a cascade of
        interdependent requests, at least 2 phases worth
    - Parse time
      - HTML
      - CSS
      - Javascript
    - Execution time
      - Javascript execution
      - DOM construction and layout
      - Style application

Recap
   Move Data Up
    - Caching
    - Compression
    - If You Can’t Move All The Data Up
      - Indexes
      - Bloom filters
   Batching and Streaming
    - Maximize locality




Take 2 And Call Me In The Morning
   An Engineer’s Guide to Bandwidth
     - http://developer.yahoo.net/blog/archives/2009/10/a_engineers_gui.html
   High Performance Web Sites
    - Steve Souders
   Even Faster Web Sites
    - Steve Souders
   Managing Gigabytes: Compressing and Indexing
    Documents and Images
    - Witten, Moffat, Bell
   Yahoo Query Language (YQL)
    - http://developer.yahoo.com/yql/


