Load balancing is impossible

Load balancing is something most of us assume is a solved problem. But the idea that load balancing is “solved” could not be further from the truth. If you use multiple load balancers, the problem is even worse. Most of us use “random” or “round-robin” techniques, which have certain advantages but are highly inefficient. Others use more complex algorithms like “least-conns,” which can be more efficient but have horrific edge cases. “Consistent hashing” is a very useful technique but only applies to certain problems.

Several factors, both theoretical and practical, make efficient load balancing an exceptionally hard problem, including Poisson request arrival times, exponentially distributed response latency, and oscillations when data is shared between multiple load balancers. Luckily, techniques and algorithms have been developed that can make life better. Tyler McMullen explains some of the ways we can do better than “random,” “round-robin,” and naive “least-conns,” even with distributed load balancers.
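As a point of reference for the terms above, here is a minimal sketch of the selection policies being compared. It is an illustration only: the function names, the servers list, and the outstanding-request counter are assumptions made for this example, not code from the talk.

    import itertools
    import random

    servers = ["s0", "s1", "s2", "s3"]      # hypothetical backend pool
    outstanding = {s: 0 for s in servers}   # illustrative: open requests per server
    _rr = itertools.cycle(servers)          # cursor for round-robin

    def pick_random():
        # "random": stateless, trivially distributed, but ignores load entirely
        return random.choice(servers)

    def pick_round_robin():
        # "round-robin": rotate through the pool in a fixed order
        return next(_rr)

    def pick_least_conns():
        # naive "least-conns" (join-shortest-queue): needs an accurate global view of load
        return min(servers, key=lambda s: outstanding[s])

    def pick_two_choices():
        # randomized refinement: sample two servers and keep the less loaded one
        a, b = random.sample(servers, 2)
        return a if outstanding[a] <= outstanding[b] else b

The simulations in the slides below make the trade-off concrete: the stateless policies are easy to distribute but hurt tail latency and usable capacity, while the load-aware ones behave well only when their view of the load is accurate.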

  1. LOAD BALANCING IS IMPOSSIBLE
  2. LOAD BALANCING IS IMPOSSIBLE. Tyler McMullen, tyler@fastly.com, @tbmcmullen
  3. WHAT IS LOAD BALANCING?
  4. [DIAGRAM DESCRIBING LOAD BALANCING]
  5. [ALLEGORY DESCRIBING LOAD BALANCING]
  6. Why load balance? Three major reasons, the least of which is balancing load. Abstraction: treat many servers as one, a single entry point, simplification. Failure: transparent failover, recover seamlessly, simplification. Balancing load: spread the load efficiently across servers.
  7. RANDOM: THE INGLORIOUS DEFAULT AND BANE OF MY EXISTENCE
  8. What’s good about random? • Simplicity • Few edge cases • Easy failover • Works identically when distributed
  9. What’s bad about random? • Latency • Especially long-tail latency • Useable capacity
  10. BALLS-INTO-BINS
  11. If you throw m balls into n bins, what is the maximum load of any one bin?
  12. import numpy as np
      import numpy.random as nr
      n = 8     # number of servers
      m = 1000  # number of requests
      bins = [0] * n
      for chosen_bin in nr.randint(0, n, m):
          bins[chosen_bin] += 1
      print(bins)
      # => [129, 100, 134, 113, 117, 136, 148, 123]
  13. import numpy as np
      import numpy.random as nr
      n = 8     # number of servers
      m = 1000  # number of requests
      bins = [0] * n
      for weight in nr.uniform(0, 2, m):
          chosen_bin = nr.randint(0, n)
          bins[chosen_bin] += weight
      print(bins)
      # => [133.1, 133.9, 144.7, 124.1, 102.9, 125.4, 114.2, 121.3]
  14. How do you model request latency?
  15. Log-normal distribution: mean 1.0; 50th percentile 0.6; 75th 1.2; 95th 3.1; 99th 6.0; 99.9th 14.1
  16. import math
      # n, m, bins, and nr carry over from the earlier slides
      mu = 0.0
      sigma = 1.15
      lognorm_mean = math.e ** (mu + sigma ** 2 / 2)
      desired_mean = 1.0
      def normalize(value):
          return value / lognorm_mean * desired_mean
      for weight in nr.lognormal(mu, sigma, m):
          chosen_bin = nr.randint(0, n)
          bins[chosen_bin] += normalize(weight)
      # => [128.7, 116.7, 136.1, 153.1, 98.2, 89.1, 125.4, 130.4]
  17. mu = 0.0
      sigma = 1.15
      lognorm_mean = math.e ** (mu + sigma ** 2 / 2)
      desired_mean = 1.0
      baseline = 0.05
      def normalize(value):
          return (value / lognorm_mean * (desired_mean - baseline) + baseline)
      for weight in nr.lognormal(mu, sigma, m):
          chosen_bin = nr.randint(0, n)
          bins[chosen_bin] += normalize(weight)
      # => [100.7, 137.5, 134.3, 126.2, 113.5, 175.7, 101.6, 113.7]
  18. THIS IS WHY PERFECTION IS IMPOSSIBLE
  19. WHY IS THAT A PROBLEM?
  20. WHAT EFFECT DOES IT HAVE?
  21. [CHARTS: Random simulation vs. actual distribution]
  22. The probability of a single resource request avoiding the 99th percentile is 99%. The probability of all N resource requests in a page avoiding the 99th percentile is 0.99^N. For example, 0.99^69 = 49.9%.
  23. SO WHAT DO WE DO ABOUT IT?
  24. [CHARTS: Random simulation vs. join-shortest-queue (JSQ) simulation]
  25. LET’S THROW A WRENCH INTO THIS... DISTRIBUTED LOAD BALANCING AND WHY IT MAKES EVERYTHING HARDER
  26. DISTRIBUTED RANDOM IS EXACTLY THE SAME
  27. DISTRIBUTED JOIN-SHORTEST-QUEUE IS A NIGHTMARE
  28. mu = 0.0
      sigma = 1.15
      lognorm_mean = math.e ** (mu + sigma ** 2 / 2)
      desired_mean = 1.0
      baseline = 0.05
      def normalize(value):
          return (value / lognorm_mean * (desired_mean - baseline) + baseline)
      for weight in nr.lognormal(mu, sigma, m):
          chosen_bin = nr.randint(0, n)
          bins[chosen_bin] += normalize(weight)
      # => [100.7, 137.5, 134.3, 126.2, 113.5, 175.7, 101.6, 113.7]
  29. mu = 0.0
      sigma = 1.15
      lognorm_mean = math.e ** (mu + sigma ** 2 / 2)
      desired_mean = 1.0
      baseline = 0.05
      def normalize(value):
          return (value / lognorm_mean * (desired_mean - baseline) + baseline)
      for weight in nr.lognormal(mu, sigma, m):
          a = nr.randint(0, n)
          b = nr.randint(0, n)
          chosen_bin = a if bins[a] < bins[b] else b
          bins[chosen_bin] += normalize(weight)
      # => [130.5, 131.7, 129.7, 132.0, 131.3, 133.2, 129.9, 132.6]
  30. Random choice (slide 28): [100.7, 137.5, 134.3, 126.2, 113.5, 175.7, 101.6, 113.7], standard deviation 22.9. Two random choices (slide 29): [130.5, 131.7, 129.7, 132.0, 131.3, 133.2, 129.9, 132.6], standard deviation 1.18.
  31. [CHARTS: Random simulation vs. JSQ simulation vs. randomized JSQ simulation; a runnable assembly of these simulations follows the transcript]
  32. ANOTHER CRAZY IDEA
  33. MY PROBLEM WITH THIS APPROACH... A DIFFERENCE OF PERSPECTIVE
  34. WRAP UP
  35. THANKS. BYE. tyler@fastly.com @tbmcmullen
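For readers who want to rerun the deck’s experiment end to end (referenced from slide 31 above), here is one way to assemble the snippets from slides 12, 17, 28, and 29 into a single script. This is a sketch under stated assumptions: the strategy functions and the standard-deviation summary are my packaging, and pick_jsq is a stand-in for the JSQ simulation whose code the slides do not show (it assigns each request to the bin with the least accumulated load, assuming a perfect global view).

    import math
    import numpy as np
    import numpy.random as nr

    n = 8         # number of servers
    m = 1000      # number of requests
    mu, sigma = 0.0, 1.15
    lognorm_mean = math.e ** (mu + sigma ** 2 / 2)
    desired_mean, baseline = 1.0, 0.05

    def normalize(value):
        # scale log-normal samples so the mean cost is ~1.0, with a small per-request baseline
        return value / lognorm_mean * (desired_mean - baseline) + baseline

    def pick_random(bins):
        return nr.randint(0, n)

    def pick_jsq(bins):
        # assumed JSQ proxy: least accumulated load, perfect global view
        return min(range(n), key=lambda i: bins[i])

    def pick_two_choices(bins):
        a, b = nr.randint(0, n), nr.randint(0, n)
        return a if bins[a] < bins[b] else b

    def simulate(choose):
        bins = [0.0] * n
        for weight in nr.lognormal(mu, sigma, m):
            bins[choose(bins)] += normalize(weight)
        return bins

    for name, choose in [("random", pick_random),
                         ("jsq", pick_jsq),
                         ("two choices", pick_two_choices)]:
        result = simulate(choose)
        print(name, [round(x, 1) for x in result],
              "stddev:", round(float(np.std(result)), 2))

On a typical run, the random policy’s per-bin totals vary far more (standard deviation around 20) than the load-aware ones (around 1), matching the numbers on slide 30; the appeal of two random choices is that it gets close to JSQ without the global coordination that makes distributed JSQ a nightmare.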
