High Performance Erlang - Pitfalls and Solutions

Presented at Erlang Factory 2016, San Francisco, CA.

Erlang is widely used for building concurrent applications. However, when we push the performance of our Erlang-based application to handle millions of concurrent clients, some Erlang scalability issues begin to show and some conventional Erlang programming paradigms no longer hold. We would like to share some of these issues and how we address them. In addition, we share some of our experience in profiling an Erlang application to identify bottlenecks.

We will take a deep look at some of the basic mechanisms of Erlang and show how they behave under high load and parallelism, including message delivery, process management, and shared data structures such as maps and ETS tables. We will demonstrate their limitations and propose techniques to alleviate the issues.

We will also share profiling techniques for finding those bottlenecks in Erlang applications at different levels, along with techniques for writing highly performant Erlang applications.

High Performance Erlang - Pitfalls and Solutions

  1. High Performance Erlang: Pitfalls and Solutions. MZSPEED Team, Machine Zone, Inc.
  2. Agenda
     • Erlang at Machine Zone
     • Issues and Roadblocks
     • Lessons Learned
     • Profiling Techniques
     • Case Study: Counters
  3. Erlang at Machine Zone
     • High performance data processing system
     • How to make it fast?
  4. Typical Approach: Pipelined tasks
     • Execute different logic in different processes
     [Diagram: receiver, processor, sender]
  5. How do we exchange data between processes? How do we make it fast?
  6. ETS
  7. ETS: update_counter, 10 Keys: Expected vs Observed Performance
     [Chart: update operations (total) and number of processes updating, over a timeline; series: expected, processes updating, update operations]
     • Observed performance is far worse than expected
  8. ETS: update_counter, Multiple Keys: Observed Performance
     [Chart: update operations (total) and number of processes updating, over a timeline; series: 1 key, 10 keys, 100 keys, 1000 keys]
     • Performance is degraded even with multiple keys
  9. ETS has Locks
     • Fine-grained locks
     • 64 locks per table by default
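The benchmark code behind slides 7 to 9 is not part of the transcript. The following is only a minimal sketch of the kind of workload they describe: several processes calling `ets:update_counter/3` on a shared table with a small number of keys. The module name, function names, and the fixed 1000 updates per worker are assumptions made for illustration. The `write_concurrency` option is a standard ETS option, but nothing here claims it restores the "expected" curve from the slides.

```erlang
-module(ets_counter_sketch).
-export([run/2]).

%% Hypothetical workload: NumProcs workers each bump one of NumKeys
%% counters 1000 times. Returns the sum of all counters at the end.
run(NumProcs, NumKeys) ->
    UpdatesPerProc = 1000,
    %% public: any process may write to the table.
    %% write_concurrency: ask ERTS for finer-grained write locking.
    Tab = ets:new(counters, [set, public, {write_concurrency, true}]),
    [ets:insert(Tab, {K, 0}) || K <- lists:seq(1, NumKeys)],
    Parent = self(),
    Pids = [spawn_link(fun() ->
                worker(Tab, NumKeys, UpdatesPerProc),
                Parent ! {done, self()}
            end) || _ <- lists:seq(1, NumProcs)],
    %% Wait for every worker before reading the totals.
    [receive {done, Pid} -> ok end || Pid <- Pids],
    Total = lists:sum([V || {_K, V} <- ets:tab2list(Tab)]),
    ets:delete(Tab),
    Total.

worker(_Tab, _NumKeys, 0) ->
    ok;
worker(Tab, NumKeys, N) ->
    %% Spread updates across the keys 1..NumKeys.
    Key = (N rem NumKeys) + 1,
    ets:update_counter(Tab, Key, 1),
    worker(Tab, NumKeys, N - 1).
```

Wrapping `run/2` in `timer:tc/3` for a growing `NumProcs` is one way to check on your own hardware whether total throughput flattens as the slides report.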
  10. Message Passing
  11. Message Passing: M Processes to N Processes
     [Chart: axes N (1 to 1000) and M (1 to 100); highlighted point 10:100]
  12. Message Passing: Scheduling Locks
     • Add to runq with locks
  13. Message Passing: Scheduler Communication Locks
     • Task stealing with locks
  14. Message Passing
     • Limit on #messages delivered per second
     • < 10M msgs/sec on MacBook Pro
     • < 22M msgs/sec on a powerful 56-core server
     • Independent of #processes
  15. What if we want to achieve more?
  16. Solution 1: Minimize Message Exchange
     • Do everything in the same process and shard
     [Diagram: three parallel shards, each labelled receiver, processor, sender]
  17. Solution 1 Speed
     • 100x the #ETS updates
     • 20x–30x the #message exchanges
     • No decrease with increase in #processes
     [Chart: iterations per sec (0 to 100 M) against processes (0 to 100); series: iterations, processes]
  18. Solution 2: Message Batching
     • Batch messages: 5x–30x more msgs/sec
     [Chart: axes N (1 to 1000) and M (1 to 100); highlighted point 10:100]
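The slide does not show how the batching was implemented. One simple reading of it is to send a list of items per message instead of one message per item. The sketch below illustrates that idea only. The module name, the `{batch, List}` message shape, and both function names are hypothetical.

```erlang
-module(batch_sketch).
-export([send_batched/3, collect/1]).

%% Send Items to Dest in chunks of at most BatchSize, one
%% {batch, List} message per chunk. Returns the number of messages sent.
send_batched(Dest, Items, BatchSize) when BatchSize > 0 ->
    send_batched(Dest, Items, BatchSize, 0).

send_batched(_Dest, [], _BatchSize, Sent) ->
    Sent;
send_batched(Dest, Items, BatchSize, Sent) ->
    {Batch, Rest} = take(BatchSize, Items, []),
    Dest ! {batch, Batch},
    send_batched(Dest, Rest, BatchSize, Sent + 1).

%% Split off up to N leading elements, preserving their order.
take(0, Rest, Acc) -> {lists:reverse(Acc), Rest};
take(_N, [], Acc) -> {lists:reverse(Acc), []};
take(N, [H | T], Acc) -> take(N - 1, T, [H | Acc]).

%% Receive NumBatches batch messages and return all items in arrival order.
collect(NumBatches) ->
    collect(NumBatches, []).

collect(0, Acc) ->
    lists:append(lists:reverse(Acc));
collect(N, Acc) ->
    receive
        {batch, Items} -> collect(N - 1, [Items | Acc])
    end.
```

With ten items and a batch size of three, the receiver's mailbox sees four messages instead of ten. The trade-off is added latency for items that wait for a batch to fill.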
  19. Solution 3: Domain-Specific Optimizations
     • Deliver one message to multiple recipients:
       1. Receive, process, store to shared memory using NIF
       2. Read from the shared memory, send using push notification
     • Total throughput of 600M msgs/sec to many recipients
     [Diagram: receiver and processor write to shared memory; sender reads from shared memory]
  20. High Performance Erlang: Lessons Learned
  21. Lesson 1: Use Sharding
     • Avoid having single process
     • Use a pool of non-communicating shards
     • erlang:phash2(Key, NumberOfShards)
     [Diagram: Shard 1 and Shard 2, each with its own receive and send; no communication between shards]
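As a minimal sketch of the sharding lesson, the module below routes each key to one of a fixed set of independent worker processes using `erlang:phash2/2`, the call named on the slide. Everything else (module name, message shapes, the per-shard map of counters) is an assumption for illustration and not the authors' code.

```erlang
-module(shard_sketch).
-export([start/1, update/3, read/2, stop/1]).

%% Start NumShards independent workers. The pids are kept in a tuple
%% so that looking up a shard is a constant-time element/2 call.
start(NumShards) when NumShards > 0 ->
    list_to_tuple([spawn(fun() -> loop(#{}) end)
                   || _ <- lists:seq(1, NumShards)]).

%% phash2/2 returns a value in 0..Range-1; element/2 is 1-based.
shard_for(Key, Shards) ->
    element(erlang:phash2(Key, tuple_size(Shards)) + 1, Shards).

%% Fire-and-forget increment on the shard that owns Key.
update(Shards, Key, Incr) ->
    shard_for(Key, Shards) ! {update, Key, Incr},
    ok.

%% Synchronous read from the shard that owns Key.
read(Shards, Key) ->
    Ref = make_ref(),
    shard_for(Key, Shards) ! {read, Key, self(), Ref},
    receive
        {Ref, Value} -> Value
    end.

stop(Shards) ->
    [Pid ! stop || Pid <- tuple_to_list(Shards)],
    ok.

%% Each shard owns its keys outright and never talks to other shards.
loop(State) ->
    receive
        {update, Key, Incr} ->
            loop(State#{Key => maps:get(Key, State, 0) + Incr});
        {read, Key, From, Ref} ->
            From ! {Ref, maps:get(Key, State, 0)},
            loop(State);
        stop ->
            ok
    end.
```

Because a given key always hashes to the same shard, updates and reads for that key are serialized by one process, and no locking or cross-shard messaging is needed.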
  22. Lesson 2: Use Process Dictionary
     [Chart: time taken (us) against number of insertions (0 to 5 M); series: ets set, maps, process dictionary]
     • Quicker insertions in process dictionary compared to maps
  23. Lesson 2: Use Process Dictionary
     [Chart: time for 100k updates (ms) against number of elements (1 to 100000); series: R18 maps, R18 process dictionary]
     • Quicker update operations in process dictionary compared to maps (Erlang 18)
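To make the comparison in Lesson 2 concrete, here is a small sketch of the same counting task written both ways: once with the process dictionary (`put/2`, `get/1`) and once with a map. The module and function names are hypothetical. The sketch shows the two styles only and is not the benchmark from the slides.

```erlang
-module(pdict_sketch).
-export([count_pdict/1, count_map/1]).

%% Count occurrences of each key using the process dictionary.
%% Runs in a throwaway process so the caller's dictionary is untouched.
count_pdict(Keys) ->
    Parent = self(),
    Ref = make_ref(),
    spawn_link(fun() ->
        lists:foreach(fun bump/1, Keys),
        %% get/0 returns every {Key, Value} pair in the dictionary.
        Counts = [{K, V} || {{count, K}, V} <- get()],
        Parent ! {Ref, lists:sort(Counts)}
    end),
    receive
        {Ref, Result} -> Result
    end.

%% In-place update: no new data structure is threaded through the loop.
bump(Key) ->
    case get({count, Key}) of
        undefined -> put({count, Key}, 1);
        N -> put({count, Key}, N + 1)
    end.

%% The same task with a map, which is rebuilt on every update.
count_map(Keys) ->
    Map = lists:foldl(
            fun(K, M) -> M#{K => maps:get(K, M, 0) + 1} end,
            #{}, Keys),
    lists:sort(maps:to_list(Map)).
```

The process dictionary is mutable, per-process state, so it fits the single-owner shard design of Lesson 1. The cost is that the state is implicit, which makes code harder to test and reason about.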
  24. Lesson 3: Use Limited #Processes
     • Don’t create too many processes
       • pid2proc lock for process resolution
       • Scheduler has to manage the processes
       • Fixed number of schedulers
     • Don’t spawn/stop processes frequently
       • Process resolution again
  25. Profiling Techniques
  26. pstack: Stack Sampling
     • pstack (gdb): sampling call stack over multiple threads
     • $ pstack 1234
     • $ gdb -p 1234
     • (gdb) thread apply all bt
  27. EEP: Erlang Easy Profiling
     • https://github.com/virtan/eep
     • Based on dbg module, low overhead trace ports
     • Easy to use
     • Kcachegrind visualization
     Basic Usage
     • Collect runtime data to local file
       > eep:start_file_tracing("abc"), timer:sleep(10000), eep:stop_tracing().
     • Convert to callgrind format
       > eep:convert_tracing("abc").
     • Visualize with kcachegrind
       $ kcachegrind callgrind.out.abc
  28. EEP: Example Visualization
  29. oprofile: System Level Profiling
     • Initialization
       $ opcontrol --deinit
       $ echo 0 > /proc/sys/kernel/nmi_watchdog
     • Profiling
       $ operf --vmlinux /usr/lib/debug/vmlinux --callgraph ./rebar eunit apps=perf_test
     • Reporting and Visualization
       $ opreport -cgf | ./gprof2dot.py -f oprofile --show-samples | dot -Tpng -o output.png
  30. oprofile: Example
  31. Case Study: Counters
  32. Need for High Performance Counters
     • Thousands of global counters
     • Updated by several hundred processes every second
  33. Libraries Used
     • Exometer
     • ETS
     • Folsom
  34. Benchmark Environment
     • Platform
       • 2 CPU sockets
       • CPU - Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
       • 14 hardware threads per socket
       • 2 hyper-threaded threads for each hardware thread
     • Erlang Runtime
       Erlang/OTP 18 [erts-7.1] [source] [64-bit] [smp:56:56] [async-threads:10] [kernel-poll:false]
  35. Benchmark Results
     [Chart: num counter updates per process (0 to 7 M) against num processes (10 to 60); series: ets counter, exometer counter, exometer fast counter, folsom counter]
     • As number of processes increases, number of counter updates per process decreases
  36. Limitations
     • Exometer:
       • counters – scales well but slow
       • fast counters – doesn’t scale
     • Lock contention in ets:update_counter()
  37. Our Solution: MZMETRICS
  38. Refresher on Cacheline
     Source: http://www.slideshare.net/yzt/modern-cpus-and-caches-a-starting-point-for-programmers/52
  39. False Sharing (SMP)
     • Occurs when two CPUs access different data structures on the same cache line
     [Diagram: CPU0 and CPU1 accessing struct foo and struct bar, which sit on the same cache line]
  40. False Sharing (SMP)
     [Diagram: CPU0 and CPU1, each with an L2 cache, and main memory, all holding foo and bar; labels: FETCH foo, FETCH bar]
  41. False Sharing (SMP)
     [Diagram: CPU0 and CPU1, each with an L2 cache, and main memory, all holding foo and bar; labels: WRITE foo, INVL bar, ACK]
  42. False Sharing (SMP)
     [Diagram: CPU0 and CPU1, each with an L2 cache, and main memory holding foo and bar; labels: WRITE bar, INVL foo, ACK]
  43. Benefits of Per-CPU Data
     Source: http://lwn.net/Articles/256433/
     [Chart: time taken for 500 million counter increment operations on Intel Core 2 QX 6700]
  44. Benefits of Per-CPU Data: Reasons
     • Less overhead of cache coherency protocol
     • Significantly higher benefits with QPI (QuickPath Interconnect)
  45. Transactional Memory
     • Intel TSX (Transactional Synchronization Extensions)
     • Helps with locks in small critical sections
     • Not yet available in x86 (disabled due to hardware bugs)
  46. MZMETRICS
     • Design based on per-CPU counters
     • Counter Operations:
       • Create a new counter: requires lock
       • Read/update counter: no lock required
       • Delete counter: requires lock
     • Implemented using NIF
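MZMETRICS itself is a NIF and its source is not shown in the slides, so the following is not its implementation. It is a pure-Erlang approximation of the per-CPU idea, under the assumption that striping a counter by scheduler id is an acceptable stand-in: each scheduler increments its own ETS row, and a read sums the rows. The module and function names are hypothetical. Unlike the design on the slide, this still goes through ETS locking and gives no control over cache-line placement, so it only reduces contention on a single key.

```erlang
-module(striped_counter_sketch).
-export([new/0, incr/2, read/2]).

%% One shared table; rows are keyed by {CounterName, SchedulerId}.
new() ->
    ets:new(striped_counters, [set, public, {write_concurrency, true}]).

%% Increment the stripe that belongs to the scheduler currently running
%% the caller. update_counter/4 creates the row on first use.
incr(Tab, Name) ->
    Stripe = erlang:system_info(scheduler_id),
    Key = {Name, Stripe},
    ets:update_counter(Tab, Key, 1, {Key, 0}).

%% A read is the sum over all stripes of the counter. It is not an
%% atomic snapshot: concurrent increments may or may not be included.
read(Tab, Name) ->
    lists:sum(ets:select(Tab, [{{{Name, '_'}, '$1'}, [], ['$1']}])).
```

The design choice mirrors the slide's operation list: updates touch only the caller's stripe, while the less frequent read pays the cost of visiting every stripe.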
  47. MZMETRICS Evaluation
     [Chart: num counter updates per process (0 to 20 M) against num processes (10 to 50); series: exometer, mzmetrics]
     • 10X faster than Exometer counters
  48. MZMETRICS Evaluation: Results
     • 1 million counter updates with 48 processes
     • Exometer takes ~1 second
     • MZMETRICS takes ~0.1 seconds
  49. Takeaways
     • Pitfalls
       • Message passing becomes expensive at scale
       • Large number of processes needing pid resolution from a global table causes contention
       • Existing libraries not amenable to fast counting
     • Proposed Solutions
       • Use sharding and limit message passing
       • Limit number of active processes
       • Use MZMETRICS for fast counters
  50. We are Hiring: http://www.machinezone.com/careers