
Scaling Slack


Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2qFyFKh.

Bing Wei examines the limitations that Slack's backend ran into and how they overcame them to scale from supporting small teams to serving large organizations of hundreds of thousands of users. She tells stories about the edge cache service, real-time messaging system and how they evolved for major product efforts including Grid and Shared Channels. Filmed at qconsf.com.

Bing Wei is a software engineer on the infrastructure team at Slack, working on its edge cache service. Before Slack, she was at Twitter, where she contributed to the open source RPC library Finagle, worked on core services for Tweets and Timelines, and led the migration of Tweet writes from the monolithic Rails application to the JVM-based microservices.


Scaling Slack

  1. Scaling Slack. Bing Wei, Infrastructure@Slack
  2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture, and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week. Watch the video with slide synchronization on InfoQ.com: https://www.infoq.com/presentations/slack-scalability
  3. Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation. Strategy: a practitioner-driven conference designed for YOU, influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators. Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide. Presented at QCon San Francisco, www.qconsf.com
  4. [Image slide]
  5. [Image slide]
  6. Our Mission: To make people’s working lives simpler, more pleasant, and more productive.
  7. From supporting small teams to serving gigantic organizations of hundreds of thousands of users.
  8. Slack Scale ◈ 6M+ DAU, 9M+ WAU; 5M+ peak simultaneously connected ◈ Avg 10+ hrs/weekday connected; avg 2+ hrs/weekday in active use ◈ 55% of DAU outside the US
  9. Cartoon Architecture: WebApp (PHP/Hack), sharded MySQL, Messaging Server (Java), Job Queue (Redis/Kafka); clients talk HTTP and WebSocket.
  10. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client Pub/Sub ○ Messaging Server
  11. Challenge: Slowness Connecting to Slack
  12. Login Flow in 2015: 1. HTTP POST with user’s token. 2. HTTP response: a snapshot of the team & a WebSocket URL. (User ↔ WebApp ↔ MySQL)
  13. Some examples (users / channels → response size): 30 / 10 → 200 KB; 500 / 200 → 2.5 MB; 3K / 7K → 20 MB; 30K / 1K → 60 MB
  14. Login Flow in 2015: 1. HTTP POST with user’s token. 2. HTTP response: a snapshot of the team & a WebSocket URL. 3. WebSocket: real-time events. (User ↔ WebApp/MySQL; User ↔ Messaging Server)
  15. Real-time Events on WebSocket: 100+ types of events, e.g. chat messages, typing indicators, file uploads, file comments, thread replies, user presence changes, user profile changes, reactions, pins, stars, channel creations, app installations, etc.
  16. Login Flow in 2015 ◈ Client architecture ○ Download a snapshot of the entire team ○ Updates trickle in through the WebSocket ○ Eventually consistent snapshot of the whole team
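The 2015 flow above maps directly onto client code: one authenticated POST that returns the full team snapshot plus a WebSocket URL, then a long-lived socket over which updates trickle in. A minimal Go sketch of that shape, assuming a hypothetical endpoint and payload fields (not Slack’s actual API):

```go
// Sketch of the 2015-style login flow: POST the token, receive a full
// team snapshot plus a WebSocket URL, then connect for real-time events.
// Endpoint and field names are hypothetical stand-ins.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"net/url"

	"golang.org/x/net/websocket"
)

type Snapshot struct {
	URL      string            `json:"url"`      // WebSocket URL to connect to
	Users    []json.RawMessage `json:"users"`    // the entire team roster
	Channels []json.RawMessage `json:"channels"` // all channels and memberships
}

func login(token string) (*Snapshot, error) {
	// Step 1: HTTP POST with the user's token.
	resp, err := http.PostForm("https://example.com/api/login",
		url.Values{"token": {token}})
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Step 2: the response is a snapshot of the whole team.
	var snap Snapshot
	if err := json.NewDecoder(resp.Body).Decode(&snap); err != nil {
		return nil, err
	}
	return &snap, nil
}

func main() {
	snap, err := login("xoxp-example-token")
	if err != nil {
		log.Fatal(err)
	}
	// Step 3: connect the WebSocket; events now keep the local snapshot
	// eventually consistent.
	ws, err := websocket.Dial(snap.URL, "", "https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer ws.Close()
	for {
		var event map[string]any
		if err := websocket.JSON.Receive(ws, &event); err != nil {
			log.Fatal(err)
		}
		log.Printf("event: %v", event["type"])
	}
}
```

As the table on slide 13 shows, the step-2 snapshot is what balloons to tens of megabytes on large teams, which is why the rest of the talk attacks it.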
  17. Problems: initial team snapshot takes time.
  18. Problems: initial team snapshot takes time; large client memory footprint.
  19. Problems: initial team snapshot takes time; large client memory footprint; expensive reconnections.
  20. Problems: initial team snapshot takes time; large client memory footprint; expensive reconnections; reconnect storms.
  21. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client Pub/Sub ○ Messaging Server
  22. Improvements ◈ Smaller team snapshot ○ Client local storage + deltas ○ Remove objects from the snapshot + load in parallel ○ Simplified objects, e.g. canonical avatars
  23. Improvements ◈ Incremental boot ○ Load one channel first
  24. Improvements ◈ Rate limits ◈ POPs ◈ Load testing framework
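Rate limiting pairs with the reconnect-storm problem from slide 20: after a mass disconnect, naive clients all retry at once. The standard client-side complement is jittered exponential backoff, sketched below with illustrative constants (the talk does not spell out Slack’s exact retry policy):

```go
// Jittered exponential backoff for reconnects: spreads a thundering herd
// of clients out over time after a mass disconnect. Constants are
// illustrative, not Slack's actual policy.
package main

import (
	"errors"
	"log"
	"math/rand"
	"time"
)

// connect is a placeholder for dialing the WebSocket.
func connect() error { return errors.New("still down") }

func reconnectWithBackoff() {
	backoff := time.Second
	const maxBackoff = 2 * time.Minute
	for {
		if err := connect(); err == nil {
			return // connected
		}
		// Full jitter: sleep a random duration in [0, backoff) so that
		// thousands of clients don't retry in lockstep.
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		log.Printf("reconnect failed; retrying in %v", sleep)
		time.Sleep(sleep)
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}

func main() { reconnectWithBackoff() }
```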
  25. Support New Product Features: product launch.
  26. Cope with New Product Features: product launch.
  27. Still... Limitations ◈ What if team sizes keep growing? ◈ Outages when clients dump their local storage
  28. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client Pub/Sub ○ Messaging Server
  29. Client Lazy Loading: download less data upfront ◈ fetch more on demand.
  30. Flannel: Edge Cache Service. A query engine backed by a cache in edge locations.
  31. What is in Flannel’s cache ◈ Support big objects first ○ Users ○ Channel membership ○ Channels
  32. Login and Message Flow with Flannel: 1. WebSocket connection. 2. HTTP POST: download a snapshot of the team. 3. WebSocket: stream JSON events. (User ↔ Flannel ↔ WebApp/MySQL and Messaging Server)
  33. A Man in the Middle: Flannel sits on the WebSocket between the user and the Messaging Server, using real-time events to update its cache, e.g. user creation, user profile changes, channel creation, a user joining a channel, a channel converting to private.
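The man-in-the-middle position means the same event stream Flannel forwards to clients can also keep its cache fresh. A toy Go sketch of that idea; the event types and cache layout are invented for illustration, not Slack’s schema:

```go
// Toy edge cache that folds the real-time events it proxies into its own
// state, in the spirit of Flannel. Event shapes are invented.
package edge

import "sync"

type Event struct {
	Type    string // e.g. "user_profile_changed", "channel_created"
	ID      string // ID of the object the event refers to
	Payload []byte // serialized object
}

type Cache struct {
	mu       sync.RWMutex
	users    map[string][]byte
	channels map[string][]byte
}

func New() *Cache {
	return &Cache{users: map[string][]byte{}, channels: map[string][]byte{}}
}

// Apply folds one real-time event into the cache before forwarding it on
// to connected clients.
func (c *Cache) Apply(ev Event) {
	c.mu.Lock()
	defer c.mu.Unlock()
	switch ev.Type {
	case "user_created", "user_profile_changed":
		c.users[ev.ID] = ev.Payload
	case "channel_created", "channel_converted_to_private", "user_joined_channel":
		c.channels[ev.ID] = ev.Payload
	}
}

// Query answers client lookups (quick switcher, mention suggestions, ...)
// from the edge instead of the main region.
func (c *Cache) Query(kind, id string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if kind == "user" {
		b, ok := c.users[id]
		return b, ok
	}
	b, ok := c.channels[id]
	return b, ok
}
```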
  34. Edge Locations: a mix of AWS & Google Cloud; main region in us-east-1.
  35. Examples Powered by Flannel: Quick Switcher
  36. Examples Powered by Flannel: Mention Suggestion
  37. Examples Powered by Flannel: Channel Header
  38. Examples Powered by Flannel: Channel Sidebar
  39. Examples Powered by Flannel: Team Directory
  40. Flannel Results ◈ Launched Jan 2017 ○ Loads a 200K-user team ◈ 5M+ simultaneous connections at peak ◈ 1M+ client queries/sec
  41. Flannel Results [chart]
  42. This is not the end of the story.
  43. Evolution of Flannel
  44. Web Client Iterations: Flannel Just-In-Time Annotation. Right before web clients are about to access an object, Flannel pushes that object to them.
  45. A Closer Look: why does Flannel sit on the WebSocket path?
  46. Old Way of Cache Updates: Messaging Server → Flannel → users, with LOTS of duplicated JSON events.
  47. Publish/Subscribe (Pub/Sub) to Update Cache: Messaging Server → Pub/Sub → Flannel → users, with Thrift events.
  48. Pub/Sub Benefits: less Flannel CPU; simpler Flannel code; schematized data; flexibility for cache management.
  49. Flexibility for Cache Management. Previously: ◈ load when the first user connects ◈ unload when the last user disconnects.
  50. Flexibility for Cache Management. With Pub/Sub: ◈ isolate received events from user connections.
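The flexibility comes from decoupling: under the old model, cache lifetime was welded to connection lifetime, while a topic subscription makes keeping a team’s cache warm an independent policy decision. A sketch of that separation, where the Bus interface is a hypothetical stand-in for the real transport (the talk only says the events are Thrift-encoded):

```go
// With pub/sub, cache updates arrive on team topics independently of user
// WebSocket connections, so load/unload is no longer forced to track the
// first connect and last disconnect. Bus is a hypothetical transport.
package cache

type Bus interface {
	// Subscribe returns a stream of events for a topic and an unsubscribe
	// function that closes the stream.
	Subscribe(topic string) (events <-chan []byte, unsubscribe func())
}

type TeamCache struct{ /* users, channels, membership, ... */ }

func (t *TeamCache) apply(event []byte) { /* decode Thrift event, update state */ }

// followTeam keeps a team's cache warm by consuming its topic until the
// returned stop function is called; that policy is now independent of
// whether any user on the team is connected.
func followTeam(bus Bus, teamID string, c *TeamCache) (stop func()) {
	events, unsubscribe := bus.Subscribe("team." + teamID)
	done := make(chan struct{})
	go func() {
		defer close(done)
		for ev := range events {
			c.apply(ev)
		}
	}()
	return func() { unsubscribe(); <-done }
}
```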
  51. Another Closer Look ◈ With Pub/Sub, does Flannel need to be on the WebSocket path?
  52. Next Step: move Flannel out of the WebSocket path.
  53. Next Step: move Flannel out of the WebSocket path. Why? Separation & flexibility.
  54. Evolution with Product Requirements: Grid, for big enterprises.
  55. Team Affinity for Cache Efficiency: before Grid.
  56. Team Affinity, Grid-aware: now.
  57. Grid Awareness Improvements: Flannel memory. Saves 22 GB per host, 1.1 TB total.
  58. Grid Awareness Improvements: DB shard CPU idle 25% → 90%; P99 user connect latency 40s → 4s, for our biggest customer.
  59. Team Affinity, Grid-aware, Scatter & Gather: the future.
  60. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client Pub/Sub ○ Messaging Server
  61. Expand Pub/Sub to the Client Side: client-side Pub/Sub reduces the events clients have to handle.
  62. Presence Events: 60% of all events; O(N²). A 1000-user team ⇒ 1000 × 1000 = 1M events.
  63. Presence Pub/Sub: clients ◈ track who is in the current view ◈ subscribe/unsubscribe to the Messaging Server when the view changes.
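The heart of that scheme is a diff between the users visible in the current view and the set already subscribed: on a view change the client sends only the delta. A Go sketch with illustrative wire-message names:

```go
// Client-side presence pub/sub: instead of receiving O(N^2) presence
// events for the whole team, subscribe only to users in the current view.
// The wire-message names are illustrative.
package presence

type Conn interface {
	Send(msg string, userIDs []string) error // e.g. over the WebSocket
}

type Tracker struct {
	conn       Conn
	subscribed map[string]bool // users we currently receive presence for
}

func NewTracker(c Conn) *Tracker {
	return &Tracker{conn: c, subscribed: map[string]bool{}}
}

// ViewChanged diffs the visible users against current subscriptions and
// sends only the delta to the Messaging Server.
func (t *Tracker) ViewChanged(visible []string) error {
	want := make(map[string]bool, len(visible))
	for _, id := range visible {
		want[id] = true
	}
	var sub, unsub []string
	for id := range want {
		if !t.subscribed[id] {
			sub = append(sub, id)
		}
	}
	for id := range t.subscribed {
		if !want[id] {
			unsub = append(unsub, id)
		}
	}
	if len(sub) > 0 {
		if err := t.conn.Send("presence_sub", sub); err != nil {
			return err
		}
	}
	if len(unsub) > 0 {
		if err := t.conn.Send("presence_unsub", unsub); err != nil {
			return err
		}
	}
	t.subscribed = want
	return nil
}
```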
  64. Outline ◈ The slowness problem ◈ Incremental improvements ◈ Architecture changes ○ Flannel ○ Client Pub/Sub ○ Messaging Server
  65. What is the Messaging Server?
  66. A Message Router.
  67. Events Routing and Fanout: 1. events happen on a team (WebApp/DB); 2. events fan out (Messaging Server).
  68. Limitations ◈ Sharded by team: single point of failure.
  69. Limitations ◈ Sharded by team: single point of failure ◈ Shared Channels: shared state among teams.
  70. [Image slide]
  71. Topic Sharding: everything is a topic (public/private channels, DMs, group DMs, users, teams, grids).
  72. Topic Sharding: a natural fit for shared channels.
  73. Topic Sharding: a natural fit for shared channels; reduces user-perceived failures.
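One plausible way to realize “everything is a topic” is consistent hashing: each topic ID maps onto a ring of messaging-server shards, so a shared channel is a single topic with a single home shard no matter how many teams share it, and a shard failure affects only the topics it owns rather than entire teams. A generic sketch, not Slack’s implementation:

```go
// Topic sharding sketch: route every topic (channel, DM, group DM, user,
// team, grid) to a shard by hashing its ID onto a ring. Generic
// illustration, not Slack's implementation.
package sharding

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> shard address
}

// NewRing places each shard at `replicas` points for smoother balance.
func NewRing(shards []string, replicas int) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, s := range shards {
		for i := 0; i < replicas; i++ {
			p := hash(fmt.Sprintf("%s#%d", s, i))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// ShardFor returns the shard owning a topic such as "channel:C123",
// "dm:D456", or "team:T789".
func (r *Ring) ShardFor(topic string) string {
	h := hash(topic)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}
```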
  74. Other Improvements: auto failure recovery.
  75. Other Improvements: auto failure recovery; publish/subscribe.
  76. Other Improvements: auto failure recovery; publish/subscribe; fanout at the edge.
  77. Our Journey: Problem → Incremental Change → Architectural Change → Ongoing Evolution.
  78. More to Come: Journey Ahead. Get in touch: https://slack.com/jobs
  79. Thanks! Any questions? @bingw11
  80. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/slack-scalability
