You have an application that works well on a single node, and you’ve heard that Erlang lets you scale out in a cluster. How do you go about doing that?
We’ll walk through the steps I took to turn ExVenture (a multiplayer game server) into a distributed application.
We’ll start by connecting nodes in development and production, then pick a cluster leader via the Raft protocol, and use process groups to fan calls out across the cluster.
Finally we’ll see some of the hurdles I encountered when spanning multiple nodes.
16. Everything Assumed One Node
● Heavily geared toward a single node
● Most processes registered in a local Registry
● Data stored in node-local ETS tables
● Lots of in-process state (a virtual world)
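That single-node shape looks roughly like this sketch (module and names are illustrative, not ExVenture’s actual code): processes named via a node-local Registry, with a node-local ETS cache on the side.

```elixir
defmodule Game.Zone do
  use GenServer

  # Named via a node-local Registry: lookups only resolve on the
  # node where the process was started.
  def start_link(zone_id) do
    GenServer.start_link(__MODULE__, zone_id,
      name: {:via, Registry, {Game.Registry, {:zone, zone_id}}}
    )
  end

  @impl true
  def init(zone_id) do
    # Node-local ETS cache: invisible to every other node.
    :ets.new(:zone_cache, [:named_table, :public, :set])
    {:ok, %{zone_id: zone_id}}
  end
end
```

Nothing here knows the cluster exists, which is exactly the problem the rest of the talk works through.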
31. Picking a Leader
● Each node waits a random amount of time and nominates itself as leader
● Every other node votes for the first candidate it hears from
● Once a majority is reached, the leader starts the world
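The election above can be sketched as a GenServer with a randomized timeout. This is a hedged sketch, not ExVenture’s actual Raft implementation; the timeout range and message shapes are assumptions.

```elixir
defmodule Cluster.Election do
  use GenServer

  # Randomized timeout so nodes rarely nominate themselves at the
  # same instant -- the trick Raft uses to avoid split votes.
  @timeout_range_ms 150..300

  def start_link(opts \\ []),
    do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def leader?, do: GenServer.call(__MODULE__, :leader?)

  @impl true
  def init(_opts) do
    Process.send_after(self(), :election_timeout, Enum.random(@timeout_range_ms))
    {:ok, %{voted_for: nil, votes: 0, leader?: false}}
  end

  @impl true
  def handle_call(:leader?, _from, state), do: {:reply, state.leader?, state}

  @impl true
  def handle_info(:election_timeout, state) do
    # Nominate ourselves and ask every other node for its vote.
    for other <- Node.list(), do: send({__MODULE__, other}, {:request_vote, node()})
    {:noreply, maybe_lead(%{state | voted_for: node(), votes: 1})}
  end

  def handle_info({:request_vote, candidate}, %{voted_for: nil} = state) do
    # Vote for the first candidate we hear from.
    send({__MODULE__, candidate}, {:vote, node()})
    {:noreply, %{state | voted_for: candidate}}
  end

  def handle_info({:request_vote, _candidate}, state), do: {:noreply, state}

  def handle_info({:vote, _from}, state),
    do: {:noreply, maybe_lead(%{state | votes: state.votes + 1})}

  # We lead once a majority of the cluster has voted for us.
  defp maybe_lead(state) do
    majority = div(length([node() | Node.list()]), 2) + 1
    %{state | leader?: state.leader? or state.votes >= majority}
  end
end
```

On a single node the majority is 1, so the node elects itself as soon as its own timeout fires.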
32. What the leader does
● Pushes zones out across the cluster
● When a node dies
○ Looks at zones that should be online
○ Spins them up across the cluster
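A hedged sketch of that push (names are illustrative): round-robin the zones over the live nodes, and re-run the same assignment over the surviving nodes when one dies.

```elixir
defmodule Cluster.ZoneDistributor do
  @doc """
  Assign each zone to a node, round-robin. The leader runs this at
  world start, and again (over the surviving nodes) when a node dies.
  """
  def assign(zone_ids, nodes \\ [node() | Node.list()]) do
    zone_ids
    |> Enum.with_index()
    |> Enum.map(fn {zone_id, i} ->
      {zone_id, Enum.at(nodes, rem(i, length(nodes)))}
    end)
  end
end
```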
37. Spanning the Cluster
● Cache updates need to be applied on each node
● Process groups handle this
● There might be a nicer way to handle it
38. Join the pg2 group on cache start
defmodule Game.Items do
  @ets_key :items

  def init(_) do
    :ok = :pg2.create(@ets_key)
    :ok = :pg2.join(@ets_key, self())
    # ...
  end
end
39. Client API
def insert(item) do
  members = :pg2.get_members(@ets_key)

  Enum.map(members, fn member ->
    GenServer.call(member, {:insert, item})
  end)
end
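One caveat with the fan-out above: `GenServer.call/2` exits if a member died between the `get_members` lookup and the call. A defensive variant might catch the exit (a sketch; the error-handling strategy is an assumption, and note that `:pg2` itself was removed in OTP 24 in favour of `:pg`):

```elixir
defmodule Game.Items.SafeClient do
  # Fan a call out to every group member, tolerating members that
  # died between lookup and call.
  def insert(members, item) do
    Enum.map(members, fn member ->
      try do
        GenServer.call(member, {:insert, item})
      catch
        :exit, reason -> {:error, reason}
      end
    end)
  end
end
```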
45. Things that should happen at most once
● Attached to Gossip, a chat service
● Each node connects as a websocket to Gossip
● New messages posted at most once to the local channel
● Only the leader node should handle these actions
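A minimal sketch of that leader guard (the `leader?` flag would come from the Raft state; passing it in explicitly here is an assumption for illustration):

```elixir
defmodule Gossip.Handler do
  # Only the elected leader posts an incoming Gossip message to the
  # local channel, so each message is handled at most once cluster-wide.
  def handle_message(message, leader?) do
    if leader? do
      {:broadcast, message}
    else
      :ignored
    end
  end
end
```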
47. Single process being overloaded by messages
● Room processes became a bottleneck
● Create a side process that handles notifications
● PR #72
● ~230 -> ~600
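A sketch of the idea behind PR #72 (module and message shapes are assumptions, not the PR’s actual code): the room keeps a sibling process whose only job is fanning notifications out, so the room’s own mailbox stays short.

```elixir
defmodule Game.Room.Notifier do
  use GenServer

  def start_link(listeners), do: GenServer.start_link(__MODULE__, listeners)

  # Cast, not call: the room never blocks on notification delivery.
  def notify(pid, event), do: GenServer.cast(pid, {:notify, event})

  @impl true
  def init(listeners), do: {:ok, listeners}

  @impl true
  def handle_cast({:notify, event}, listeners) do
    Enum.each(listeners, &send(&1, {:room_event, event}))
    {:noreply, listeners}
  end
end
```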
48. Single process being overloaded by data size
● Session registry was overloaded by the size of user data
● Pushing a large preloaded Ecto struct around
● Massively simplify what is stored
● PR #73
● ~600 -> ~1200
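The fix in PR #73 amounted to storing a slimmed-down map instead of the fully preloaded struct. A hedged sketch (the field list is hypothetical):

```elixir
defmodule Game.Session.Metadata do
  # Keep only the handful of fields the registry actually needs,
  # instead of a fully preloaded Ecto struct.
  @fields [:id, :name]

  def from_user(user), do: Map.take(user, @fields)
end
```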
49. Too large messages
● The data being passed around was too large
● Same huge User structs
● Ran out of RAM at 50GB
● Use same simplification in messages
● PR #74
● ~1200 -> ~3500
51. Specs
- Intel Core i7-6700K
- Quad-core, with Hyper-Threading
- 64GB of RAM