
Scaling Slack


Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2qFyFKh.

Bing Wei examines the limitations that Slack's backend ran into and how they overcame them to scale from supporting small teams to serving large organizations of hundreds of thousands of users. She tells stories about the edge cache service, real-time messaging system and how they evolved for major product efforts including Grid and Shared Channels. Filmed at qconsf.com.

Bing Wei is a software engineer on the infrastructure team at Slack, working on its edge cache service. Before Slack, she was at Twitter, where she contributed to the open source RPC library Finagle, worked on core services for Tweets and Timelines, and led the migration of Tweet writes from the monolithic Rails application to the JVM-based microservices.


Scaling Slack

  1. Scaling Slack. Bing Wei, Infrastructure@Slack
  2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture, and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week. Watch the video with slide synchronization on InfoQ.com: https://www.infoq.com/presentations/slack-scalability
  3. Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation. Strategy: a practitioner-driven conference designed for YOU, influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators. Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide. Presented at QCon San Francisco, www.qconsf.com
  4. [Image slide]
  5. [Image slide]
  6. Our Mission: To make people’s working lives simpler, more pleasant, and more productive.
  7. From supporting small teams to serving gigantic organizations of hundreds of thousands of users.
  8. Slack Scale ◈ 6M+ DAU, 9M+ WAU; 5M+ peak simultaneously connected ◈ Avg 10+ hrs/weekday connected; avg 2+ hrs/weekday in active use ◈ 55% of DAU outside the US
  9. Cartoon Architecture: WebApp (PHP/Hack), sharded MySQL, Messaging Server (Java), Job Queue (Redis/Kafka); clients talk HTTP and WebSocket.
  10. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client Pub/Sub ○ Messaging Server
  11. Challenge: Slowness Connecting to Slack
  12. Login Flow in 2015: 1. HTTP POST with user’s token. 2. HTTP response: a snapshot of the team & a WebSocket URL. (User ↔ WebApp ↔ MySQL)
  13. Some examples (users / channels → response size): 30 / 10 → 200 KB; 500 / 200 → 2.5 MB; 3K / 7K → 20 MB; 30K / 1K → 60 MB
  14. Login Flow in 2015: 1. HTTP POST with user’s token. 2. HTTP response: a snapshot of the team & a WebSocket URL. 3. WebSocket: real-time events. (User ↔ WebApp/MySQL; User ↔ Messaging Server)
  15. Real-time Events on WebSocket: 100+ types of events, e.g. chat messages, typing indicators, file uploads, file comments, thread replies, user presence changes, user profile changes, reactions, pins, stars, channel creations, app installations, etc.
  16. Login Flow in 2015 ◈ Client architecture ○ Download a snapshot of the entire team ○ Updates trickle in through the WebSocket ○ Eventually consistent snapshot of the whole team
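The 2015 flow above maps directly onto client code: one authenticated POST that returns the full team snapshot plus a WebSocket URL, then a long-lived socket over which updates trickle in. A minimal Go sketch of that shape, assuming a hypothetical endpoint and payload fields (not Slack’s actual API):

```go
// Sketch of the 2015-style login flow: POST the token, receive a full
// team snapshot plus a WebSocket URL, then connect for real-time events.
// Endpoint and field names are hypothetical stand-ins.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"net/url"

	"golang.org/x/net/websocket"
)

type Snapshot struct {
	URL      string            `json:"url"`      // WebSocket URL to connect to
	Users    []json.RawMessage `json:"users"`    // the entire team roster
	Channels []json.RawMessage `json:"channels"` // all channels and memberships
}

func login(token string) (*Snapshot, error) {
	// Step 1: HTTP POST with the user's token.
	resp, err := http.PostForm("https://example.com/api/login",
		url.Values{"token": {token}})
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Step 2: the response is a snapshot of the whole team.
	var snap Snapshot
	if err := json.NewDecoder(resp.Body).Decode(&snap); err != nil {
		return nil, err
	}
	return &snap, nil
}

func main() {
	snap, err := login("xoxp-example-token")
	if err != nil {
		log.Fatal(err)
	}
	// Step 3: connect the WebSocket; events now keep the local snapshot
	// eventually consistent.
	ws, err := websocket.Dial(snap.URL, "", "https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer ws.Close()
	for {
		var event map[string]any
		if err := websocket.JSON.Receive(ws, &event); err != nil {
			log.Fatal(err)
		}
		log.Printf("event: %v", event["type"])
	}
}
```

As the table on slide 13 shows, the step-2 snapshot is what balloons to tens of megabytes on large teams, which is why the rest of the talk attacks it.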
  17. Problems: initial team snapshot takes time.
  18. Problems: initial team snapshot takes time; large client memory footprint.
  19. Problems: initial team snapshot takes time; large client memory footprint; expensive reconnections.
  20. Problems: initial team snapshot takes time; large client memory footprint; expensive reconnections; reconnect storms.
  21. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client Pub/Sub ○ Messaging Server
  22. Improvements ◈ Smaller team snapshot ○ Client local storage + deltas ○ Remove objects from the snapshot + load in parallel ○ Simplified objects, e.g. canonical avatars
  23. Improvements ◈ Incremental boot ○ Load one channel first
  24. Improvements ◈ Rate limits ◈ POPs ◈ Load testing framework
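Rate limiting pairs with the reconnect-storm problem from slide 20: after a mass disconnect, naive clients all retry at once. The standard client-side complement is jittered exponential backoff, sketched below with illustrative constants (the talk does not spell out Slack’s exact retry policy):

```go
// Jittered exponential backoff for reconnects: spreads a thundering herd
// of clients out over time after a mass disconnect. Constants are
// illustrative, not Slack's actual policy.
package main

import (
	"errors"
	"log"
	"math/rand"
	"time"
)

// connect is a placeholder for dialing the WebSocket.
func connect() error { return errors.New("still down") }

func reconnectWithBackoff() {
	backoff := time.Second
	const maxBackoff = 2 * time.Minute
	for {
		if err := connect(); err == nil {
			return // connected
		}
		// Full jitter: sleep a random duration in [0, backoff) so that
		// thousands of clients don't retry in lockstep.
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		log.Printf("reconnect failed; retrying in %v", sleep)
		time.Sleep(sleep)
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}

func main() { reconnectWithBackoff() }
```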
  25. Support New Product Features: product launch.
  26. Cope with New Product Features: product launch.
  27. Still... Limitations ◈ What if team sizes keep growing? ◈ Outages when clients dump their local storage
  28. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client Pub/Sub ○ Messaging Server
  29. Client Lazy Loading: download less data upfront ◈ fetch more on demand.
  30. Flannel: Edge Cache Service. A query engine backed by a cache in edge locations.
  31. What is in Flannel’s cache ◈ Support big objects first ○ Users ○ Channel membership ○ Channels
  32. Login and Message Flow with Flannel: 1. WebSocket connection. 2. HTTP POST: download a snapshot of the team. 3. WebSocket: stream JSON events. (User ↔ Flannel ↔ WebApp/MySQL and Messaging Server)
  33. A Man in the Middle: Flannel sits on the WebSocket between the user and the Messaging Server, using real-time events to update its cache, e.g. user creation, user profile changes, channel creation, a user joining a channel, a channel converting to private.
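The man-in-the-middle position means the same event stream Flannel forwards to clients can also keep its cache fresh. A toy Go sketch of that idea; the event types and cache layout are invented for illustration, not Slack’s schema:

```go
// Toy edge cache that folds the real-time events it proxies into its own
// state, in the spirit of Flannel. Event shapes are invented.
package edge

import "sync"

type Event struct {
	Type    string // e.g. "user_profile_changed", "channel_created"
	ID      string // ID of the object the event refers to
	Payload []byte // serialized object
}

type Cache struct {
	mu       sync.RWMutex
	users    map[string][]byte
	channels map[string][]byte
}

func New() *Cache {
	return &Cache{users: map[string][]byte{}, channels: map[string][]byte{}}
}

// Apply folds one real-time event into the cache before forwarding it on
// to connected clients.
func (c *Cache) Apply(ev Event) {
	c.mu.Lock()
	defer c.mu.Unlock()
	switch ev.Type {
	case "user_created", "user_profile_changed":
		c.users[ev.ID] = ev.Payload
	case "channel_created", "channel_converted_to_private", "user_joined_channel":
		c.channels[ev.ID] = ev.Payload
	}
}

// Query answers client lookups (quick switcher, mention suggestions, ...)
// from the edge instead of the main region.
func (c *Cache) Query(kind, id string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if kind == "user" {
		b, ok := c.users[id]
		return b, ok
	}
	b, ok := c.channels[id]
	return b, ok
}
```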
  34. Edge Locations: a mix of AWS & Google Cloud; main region in us-east-1.
  35. Examples Powered by Flannel: Quick Switcher
  36. Examples Powered by Flannel: Mention Suggestion
  37. Examples Powered by Flannel: Channel Header
  38. Examples Powered by Flannel: Channel Sidebar
  39. Examples Powered by Flannel: Team Directory
  40. Flannel Results ◈ Launched Jan 2017 ○ Loads a 200K-user team ◈ 5M+ simultaneous connections at peak ◈ 1M+ client queries/sec
  41. Flannel Results [chart]
  42. This is not the end of the story.
  43. Evolution of Flannel
  44. Web Client Iterations: Flannel Just-In-Time Annotation. Right before web clients are about to access an object, Flannel pushes that object to them.
  45. A Closer Look: why does Flannel sit on the WebSocket path?
  46. Old Way of Cache Updates: Messaging Server → Flannel → users, with LOTS of duplicated JSON events.
  47. Publish/Subscribe (Pub/Sub) to Update Cache: Messaging Server → Pub/Sub → Flannel → users, with Thrift events.
  48. Pub/Sub Benefits: less Flannel CPU; simpler Flannel code; schematized data; flexibility for cache management.
  49. Flexibility for Cache Management. Previously: ◈ load when the first user connects ◈ unload when the last user disconnects.
  50. Flexibility for Cache Management. With Pub/Sub: ◈ isolate received events from user connections.
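The flexibility comes from decoupling: under the old model, cache lifetime was welded to connection lifetime, while a topic subscription makes keeping a team’s cache warm an independent policy decision. A sketch of that separation, where the Bus interface is a hypothetical stand-in for the real transport (the talk only says the events are Thrift-encoded):

```go
// With pub/sub, cache updates arrive on team topics independently of user
// WebSocket connections, so load/unload is no longer forced to track the
// first connect and last disconnect. Bus is a hypothetical transport.
package cache

type Bus interface {
	// Subscribe returns a stream of events for a topic and an unsubscribe
	// function that closes the stream.
	Subscribe(topic string) (events <-chan []byte, unsubscribe func())
}

type TeamCache struct{ /* users, channels, membership, ... */ }

func (t *TeamCache) apply(event []byte) { /* decode Thrift event, update state */ }

// followTeam keeps a team's cache warm by consuming its topic until the
// returned stop function is called; that policy is now independent of
// whether any user on the team is connected.
func followTeam(bus Bus, teamID string, c *TeamCache) (stop func()) {
	events, unsubscribe := bus.Subscribe("team." + teamID)
	done := make(chan struct{})
	go func() {
		defer close(done)
		for ev := range events {
			c.apply(ev)
		}
	}()
	return func() { unsubscribe(); <-done }
}
```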
  51. Another Closer Look ◈ With Pub/Sub, does Flannel need to be on the WebSocket path?
  52. Next Step: move Flannel out of the WebSocket path.
  53. Next Step: move Flannel out of the WebSocket path. Why? Separation & flexibility.
  54. Evolution with Product Requirements: Grid, for big enterprises.
  55. Team Affinity for Cache Efficiency: before Grid.
  56. Team Affinity, Grid-aware: now.
  57. Grid Awareness Improvements: Flannel memory. Saves 22 GB per host, 1.1 TB total.
  58. Grid Awareness Improvements: DB shard CPU idle 25% → 90%; P99 user connect latency 40s → 4s, for our biggest customer.
  59. Team Affinity, Grid-aware, Scatter & Gather: the future.
  60. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client Pub/Sub ○ Messaging Server
  61. Expand Pub/Sub to the Client Side: client-side Pub/Sub reduces the events clients have to handle.
  62. Presence Events: 60% of all events; O(N²). A 1000-user team ⇒ 1000 × 1000 = 1M events.
  63. Presence Pub/Sub: clients ◈ track who is in the current view ◈ subscribe/unsubscribe to the Messaging Server when the view changes.
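The heart of that scheme is a diff between the users visible in the current view and the set already subscribed: on a view change the client sends only the delta. A Go sketch with illustrative wire-message names:

```go
// Client-side presence pub/sub: instead of receiving O(N^2) presence
// events for the whole team, subscribe only to users in the current view.
// The wire-message names are illustrative.
package presence

type Conn interface {
	Send(msg string, userIDs []string) error // e.g. over the WebSocket
}

type Tracker struct {
	conn       Conn
	subscribed map[string]bool // users we currently receive presence for
}

func NewTracker(c Conn) *Tracker {
	return &Tracker{conn: c, subscribed: map[string]bool{}}
}

// ViewChanged diffs the visible users against current subscriptions and
// sends only the delta to the Messaging Server.
func (t *Tracker) ViewChanged(visible []string) error {
	want := make(map[string]bool, len(visible))
	for _, id := range visible {
		want[id] = true
	}
	var sub, unsub []string
	for id := range want {
		if !t.subscribed[id] {
			sub = append(sub, id)
		}
	}
	for id := range t.subscribed {
		if !want[id] {
			unsub = append(unsub, id)
		}
	}
	if len(sub) > 0 {
		if err := t.conn.Send("presence_sub", sub); err != nil {
			return err
		}
	}
	if len(unsub) > 0 {
		if err := t.conn.Send("presence_unsub", unsub); err != nil {
			return err
		}
	}
	t.subscribed = want
	return nil
}
```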
  64. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client Pub/Sub ○ Messaging Server
  65. What is the Messaging Server?
  66. A Message Router.
  67. Events Routing and Fanout: 1. events happen on a team (WebApp/DB); 2. events fan out (Messaging Server).
  68. Limitations ◈ Sharded by team: single point of failure.
  69. Limitations ◈ Sharded by team: single point of failure ◈ Shared Channels: shared state among teams.
  70. [Image slide]
  71. Topic Sharding: everything is a topic (public/private channels, DMs, group DMs, users, teams, grids).
  72. Topic Sharding: a natural fit for shared channels.
  73. Topic Sharding: a natural fit for shared channels; reduces user-perceived failures.
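One plausible way to realize “everything is a topic” is consistent hashing: each topic ID maps onto a ring of messaging-server shards, so a shared channel is a single topic with a single home shard no matter how many teams share it, and a shard failure affects only the topics it owns rather than entire teams. A generic sketch, not Slack’s implementation:

```go
// Topic sharding sketch: route every topic (channel, DM, group DM, user,
// team, grid) to a shard by hashing its ID onto a ring. Generic
// illustration, not Slack's implementation.
package sharding

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> shard address
}

// NewRing places each shard at `replicas` points for smoother balance.
func NewRing(shards []string, replicas int) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, s := range shards {
		for i := 0; i < replicas; i++ {
			p := hash(fmt.Sprintf("%s#%d", s, i))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// ShardFor returns the shard owning a topic such as "channel:C123",
// "dm:D456", or "team:T789".
func (r *Ring) ShardFor(topic string) string {
	h := hash(topic)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}
```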
  74. Other Improvements: auto failure recovery.
  75. Other Improvements: auto failure recovery; publish/subscribe.
  76. Other Improvements: auto failure recovery; publish/subscribe; fanout at the edge.
  77. Our Journey: Problem → Incremental Change → Architectural Change → Ongoing Evolution.
  78. More to Come: Journey Ahead. Get in touch: https://slack.com/jobs
  79. Thanks! Any questions? @bingw11
  80. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/slack-scalability
