
Solve the colocation conundrum: Performance and density at scale with Kubernetes

As we move from monolithic applications to microservices, the ability to colocate workloads offers a tremendous opportunity to realize greater development velocity, robustness, and resource utilization. But workload colocation can also introduce performance variability and affect service levels. Google describes the problem as the “tail at scale”—the amplification of negative results observed at the tail of the latency curve when many systems are involved.

Intel has built an experiments framework to quantify the trade-offs between low latency and higher density. Niklas Nielsen discusses the challenges and complexities of workload colocation, why solving them matters to your business no matter its size, and how Intel intends to enable smarter resource allocation with its latest tooling capabilities and Kubernetes.


Solve the colocation conundrum: Performance and density at scale with Kubernetes

  1. Solve the colocation conundrum: Performance and density at scale with Kubernetes – Niklas Nielsen, Intel Corp
  2. Legal Notices and Disclaimers • Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. • No computer system can be absolutely secure. • Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. • This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. • The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. • No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. • Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. • Intel, Xeon, Atom, Core, and the Intel logo are trademarks of Intel Corporation in the United States and other countries. • *Other names and brands may be claimed as the property of others. • © 2017 Intel Corporation.
  3. Two Google searches
  4. Notice a difference?
  5. [Chart: breakdown of the two searches’ response times, from typing the first ‘O’ to the OSCON result being found: roughly 2 seconds versus more than 5 seconds]
  6. Let’s talk about microservices
  7. Everyone is pursuing microservice architectures
  8. Single outliers have a big impact at scale
  9. [Diagram: a monolithic service decomposed into µService A through µService E]
  10. Developer velocity, resiliency, scale
  11. The number of components increases linearly [chart: component count vs. number of services]
  12. The number of internal requests grows super-linearly
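  One hedged way to read slide 12’s claim: if each of n services may call any of the others, the number of distinct service-to-service connections is n(n−1)/2, so going from 5 to 10 services doubles the component count but grows the possible internal call paths from 10 to 45, roughly 4.5×.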
  13. A short experiment…
  14. With one hundred services involved…
  15. …one out of a hundred requests takes over one second… (1/100)
  16. It takes just one late response for the entire request to be slow. Come on, hurry up!
  17. How many users overall will experience a latency above one second? A: <30%, B: 30–60%, C: 60–100%
  18. C: 63% experience one second or worse! 28% of customers will not return to a slow site [1]. [1] 2016 Holiday Retail Insights Report
  19. P(>1s) = 1 – (1 – R)^N. With R = 1/100 and N = 3, P(>1s) = 2.9701%; with R = 1/100 and N = 100, P(>1s) = 63.3%.
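  To make slide 19 concrete, here is a minimal sketch in plain Go (no assumptions beyond the slide’s own numbers) that evaluates P(>1s) = 1 – (1 – R)^N for the two cases shown:

      package main

      import (
      	"fmt"
      	"math"
      )

      // Probability that at least one of n calls exceeds the latency threshold,
      // assuming each call does so independently with probability r.
      func tailProbability(r float64, n int) float64 {
      	return 1 - math.Pow(1-r, float64(n))
      }

      func main() {
      	fmt.Printf("N=3:   %.4f%%\n", 100*tailProbability(0.01, 3))   // 2.9701%
      	fmt.Printf("N=100: %.1f%%\n", 100*tailProbability(0.01, 100)) // ~63.4%, rounded to 63% on the slides
      }

  The jump from about 3% to about 63% is exactly the amplification that slide 20’s reference, “The tail at scale”, describes.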
  20. Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale. Commun. ACM 56, 2 (February 2013), 74–80.
  21. Variability accumulates when more than one system serves a request
  22. [Chart: latency frequency distribution; 99% of requests sit at low latency, while the slowest 1% form the “tail”]
  23. With microservices, scaling is easy, but tail latency is hard to control
  24. You will have to deal with this
  25. What causes variability?
  26. Resource sharing
  27. Global / Local
  28. Aggressors (also called antagonizers or noisy neighbors) in best-effort tasks cause interference and contention, which in turn causes variability for high-priority tasks
  29. How have large infrastructure operators dealt with variability? Hedge your bets
  30.–34. [Diagram sequence across five slides: Server 1, Server 2, Server 3, Server 4]
  35. We built a tool to help you gain insight into the causes of variability: Swan
  36. [Chart: latency vs. load up to 100%, from the best case to the worst, measured against a latency objective]
  37. [Charts: latency at 10% and 100% load for the best case and under interference #1 and #2]
  38. experiment.go (pseudocode):
          for load := 10% 20% ... 100%
            for aggressor := A ... C
              for repetition := 1 ... 3
                start_kubernetes()
                start_memcached()
                sustain_QPS(load)
                record_metrics()
                start(aggressor)
          import swan
          experiment = Experiment(‘9F2DE9AF-177E-4E6F-A994-2FF59075448B’)
          experiment.profile()
      [Diagram: results flow through Snap into Cassandra]
  39. Why didn’t Kubernetes’ usual performance isolation protect the workload? It is not (only) a Kubernetes issue
  40. [Diagram: processes 1–3 time-sharing logical cores 1 and 2]
  41. cgroups CPU shares are the de facto CPU isolation in container schedulers [diagram: example cpu.shares weights such as 1024 and 2048]
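  A minimal sketch of the mechanism behind slide 41, using hypothetical share values rather than anything taken from the deck: under contention, the Linux CFS gives each runnable cgroup a slice of CPU time proportional to its cpu.shares weight, and that is all the isolation shares provide.

      package main

      import "fmt"

      // Under CPU contention, each runnable cgroup receives CPU time in
      // proportion to its cpu.shares weight relative to the sum of all weights.
      func cpuFractions(shares map[string]int) map[string]float64 {
      	total := 0
      	for _, s := range shares {
      		total += s
      	}
      	fractions := make(map[string]float64)
      	for name, s := range shares {
      		fractions[name] = float64(s) / float64(total)
      	}
      	return fractions
      }

      func main() {
      	// Hypothetical weights: one high-priority pod and two best-effort pods.
      	weights := map[string]int{"high-priority": 2048, "best-effort-1": 1024, "best-effort-2": 1024}
      	for name, f := range cpuFractions(weights) {
      		fmt.Printf("%-14s ~%2.0f%% of CPU time under contention\n", name, 100*f)
      	}
      }

  Note that shares only divide CPU time; they say nothing about the shared last-level cache or memory bandwidth, which is why slide 42 still holds: even a tiny slice of CPU time lets a best-effort task interfere with a latency-sensitive neighbor.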
  42. A tiny fraction of CPU time is enough to cause severe performance issues
  43. Modern CPUs are helping reduce the causes of this interference
  44. [Diagram: cores sharing the interconnect, the last-level cache, and memory bandwidth]
  45. Intel® Resource Director Technology is an umbrella: cache occupancy, memory bandwidth, cache allocation, code and data prioritization
  46. Results by scenario (baseline vs. experiment) at 10–100% load:
      Kubernetes QoS – Baseline: 49%, 46%, 53%, 48%, 64%, 73%, 98%, 108%, 131%, 113%; Experiment: 876%, 945%, 946%, 893%, 953%, 898%, 887%, 921%, 851%, 901%
      Core isolation – Baseline: 52%, 51%, 45%, 54%, 60%, 69%, 89%, 100%, 101%, 111%; Experiment: 167%, 504%, 458%, 521%, 545%, 917%, 948%, 878%, 886%, 971%
      Intel RDT – Baseline: 36%, 34%, 29%, 40%, 34%, 42%, 50%, 67%, 77%, 98%; Experiment: 31%, 31%, 30%, 37%, 47%, 50%, 65%, 84%, 346%, 353%
  47. Cache Allocation / Code Data Prioritization:
          # mount -t resctrl resctrl /sys/fs/resctrl
          # cd /sys/fs/resctrl
          # mkdir p0 p1
          # echo "L3:0=3" > /sys/fs/resctrl/p0/schemata
          # echo "L3:0=c" > /sys/fs/resctrl/p1/schemata
      [Diagram: the full L3 cache mask 0xf split into 0x3 for P0 and 0xc for P1]
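  To unpack the bitmasks on slide 47: each bit in the schemata capacity mask covers a portion (roughly a group of ways) of the L3 cache on cache ID 0, so 0x3 (binary 0011) gives resource group p0 the low portion of the cache and 0xc (binary 1100) gives p1 the high portion. Because the masks do not overlap, the two groups can no longer evict each other’s cache lines; together they cover 0x3 | 0xc = 0xf, the full cache in this simplified 4-bit example (real CPUs usually expose a wider mask).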
  48. Cache Allocation / Code Data Prioritization:
          # echo 1234 > /sys/fs/resctrl/p0/tasks
          # echo C0 > /sys/fs/resctrl/p1/cpus
      [Diagram: Core 0 through Core 3, partitions P0 and P1]
  49. Cache Allocation / Code Data Prioritization [diagram: a process’s code (fetched via the program counter) and data (heap, stack) flowing through the core’s L1 instruction/data caches and L2 into the shared L3]
  50. Cache Allocation / Code Data Prioritization:
          # mount -t resctrl resctrl -o cdp /sys/fs/resctrl
          # mkdir -p /sys/fs/resctrl/p0
          # echo "L3data:0=3" >> /sys/fs/resctrl/p0/schemata
          # echo "L3code:0=c" >> /sys/fs/resctrl/p0/schemata
      [Diagram: cores with private L1/L2 caches sharing the L3]
  51. Available in Linux 4.10: Cache Allocation and Code/Data Prioritization
  52. Cache occupancy / Memory bandwidth:
          # perf stat -e intel_cqm/llc_occupancy/ -I 1000 dd if=/dev/zero of=/dev/null
          #        time    counts  unit   events
           1.000128952   229,376  Bytes  intel_cqm/llc_occupancy/
           2.000280860   327,680  Bytes  intel_cqm/llc_occupancy/
           3.000444894   360,448  Bytes  intel_cqm/llc_occupancy/
           4.000580058   360,448  Bytes  intel_cqm/llc_occupancy/
  53. How do you use this number?
          $ lscpu
          ...
          L1d cache: 32K
          L1i cache: 32K
          L2 cache:  256K
          L3 cache:  12288K
      [Diagram: a process’s occupancy within the last-level cache]
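  Putting slides 52 and 53 together: the dd process occupies roughly 360,448 bytes of a 12288K (12,582,912-byte) L3 cache, about 3% of the LLC, so comparing llc_occupancy against the L3 size reported by lscpu tells you how much of the shared cache a workload is actually claiming and how much is left for its neighbors.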
  54. Cache occupancy / Memory bandwidth:
          # perf stat -e intel_cqm/local_bytes/ -I 1000 dd if=/dev/zero of=/dev/null
          #        time  counts  unit  events
           1.000129604    0.20    MB   intel_cqm/local_bytes/
           2.000284311    0.00    MB   intel_cqm/local_bytes/
           3.000426805    0.00    MB   intel_cqm/local_bytes/
           4.000560934    0.07    MB   intel_cqm/local_bytes/
  55. How do you use this number? [Diagram: a process’s share of memory bandwidth through the interconnect and last-level cache]
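  A hedged rule of thumb for slide 55, which the deck itself does not spell out: local_bytes per sampling interval is the workload’s DRAM traffic, and you can compare it against the socket’s theoretical peak memory bandwidth, roughly populated channels × transfer rate (MT/s) × 8 bytes per transfer (for example, two DDR4-2400 channels give about 2 × 2400 × 8 MB/s ≈ 38 GB/s), to judge how close the machine is to saturation and how much headroom remains for colocated tasks.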
  56. Cache occupancy / Memory bandwidth: Cache Monitoring Technology (CMT) is available in Linux 4.1; Memory Bandwidth Monitoring in Linux 4.6
  57. What’s next?
  58. Leaving you with four points
  59. The number of services involved in a request is increasing super-linearly
  60. The largest cluster users have dealt with accumulated variability for years
  61. Intel® helps by using priority to reduce the sources of variability through Intel® RDT
  62. Swan is a tool to understand the effects of interference and how to avoid it
  63. Swan is under the Apache 2.0 License and available for download today: https://github.com/intelsdi-x/swan. Read more about how to use Intel® RDT: https://github.com/01org/intel-cmt-cat/
  64. Thanks to all involved in this project: Maciej Iwanowski, Pawel Palucki, Szymon Konefal, Maciej Patelczyk, Michal Stachowski, Arek Chylinski and the rest of the Swan team; Andrew Herdich and the Intel RDT teams; Tony Luck, Fenghua Yu and the Intel Linux kernel teams
  65. Thank you!

Editor's Notes

  • How is everyone feeling?
    Seen some good talks by now?
    Just getting started?

    This is a not-so-gentle introduction to Kubernetes performance.
    The most important thing for me is that you understand and that I don’t lose you midway.
    So as we all have different levels of experience, feel free to shout out if something doesn’t make sense.
  • First off, since I am an Intel employee and this is a sponsor talk slot, I have to remind you of our legal notice.
    Mentions of our brand and legal protection, in general and for the contents of this talk
  • That aside, I want to conduct a small experiment
    I’m going to show you two Google searches and see if you can tell the difference
    Be aware, each one is only a few seconds. So I need you to pay close attention
  • An artificial 100ms delay per connection raised the response time from 2 to 5 seconds.

    I’ve tried to break down the response time here.

    Few seconds at each graph to slowly explain what the axes mean before diving into interpretation.
  • It might seem surprising, but 2.4 seconds is the sweet spot for users
    Another way to interpret this: in online retail, customers start to turn away after this amount of time
    User patience is steadily decreasing
    They expect an instantaneous response to even the most complicated queries
  • Consider graphic here
  • Consider graphic here
  • Maybe get some numbers
  • To give you an example of the interconnectedness, Netflix built a tool called Vizceral which samples network requests
  • Give options
  • Need to tie back to initial experiment
  • Every request is like flipping a coin

    Too information dense
    Include highlight

    Don’t explain the equation. Hard to talk to.
  • Insert reference

    Few seconds at each graph to slowly explain what the axes mean before diving into interpretation.

    Highlights?
    At Google scale this matters.
  • The reason this is called the tail at scale
  • Not only a problem for the largest companies in the world.
  • Similar to how these fellas are probably dragging their owner in different directions, each user and system is competing for access to resources in modern data centers.



  • Global
    Network oversubscription
    Queueing in leaf and spine switches

    Local
    Issue slots, L1 and L2, power budgets per core during SMT
    L3, memory bandwidth and power budget per socket
    I/O bandwidth
    Network links
    Kernel caches
  • Talk about what makes an application perform as desired and when it isn’t performing like we expect
  • Few seconds at each graph to slowly explain what the axes mean before diving into interpretation.

  • Sensitivity profiles have been used in academia to show how sensitive a workload is to co-location.
    Used to demonstrate performance isolation in research from Stanford and Google[2]

    Greener profiles indicate more resilience to interference
  • Networks in data centers have become so fast that memory access over the network can outperform disk access

    Have ‘cache clusters’ either of spare capacity or, more likely, dedicated to speed up the requests

    Normal pattern used by the largest sites
    Twitter, Facebook, Wikipedia

    We chose memcached as a high priority workload as it is notoriously hard to place anything next to.
  • Kubernetes co-location
  • Now, why is that?
  • Compute the fractions
    What the process scheduler does is find the process that is furthest away from its fair share and schedule it next.
  • What we call interference
  • Explain caches in a modern server CPU
  • These are done on a Xeon D 1541 platform with a single socket
    Linux is the operating system

    Highlights
    Core isolation alone is not enough
    CAT reduces the interference and keeps the SLA up to 80% load
  • Explain axis
  • Some applications are extremely sensitive to these kinds of workloads
    Online web search is one
  • Why does CDP matter?
  • Maybe a more realistic example
    Show what contention looks like
  • Maybe a more realistic example
    Show what contention looks like
  • TODO Split into 4 slides
  • TODO Split into 4 slides
  • TODO Split into 4 slides
  • How do you know how much to give to each partition?
  • Tying things together
  • Tying things together
  • Tying things together
  • Besides this
