Matteo Merli, the tech lead for Cloud Messaging Service (CMS) at Yahoo, walks through the service's design decisions, how the team arrived at them, and how they leverage Apache BookKeeper to implement a multi-tenant messaging service.
3. What is CMS
• Hosted Pub / Sub
• Multi tenant (Auth / Quotas / Load Balancer)
• Horizontally scalable
• Highly available, durable and consistent storage
• Geo Replication
• In production since 2013
CMS - Technical Overview
[Figure: CMS cluster with Producer, Broker, Consumer, Bookie, ZK, Global ZK and Replication components]
4. CMS key features
• Multi-tenancy / hosted
• Operating a system at scale is hard and requires deep understanding of internals
• Authentication / Self service provisioning / Quotas
• SLAs (write latency: 2 ms average, 5 ms at the 99th percentile)
• Maintain the same latencies and throughput under backlog draining scenarios
• Simple high level API with clear ordering, durability and consistency semantics
• Geo-replication
• Single API call to configure regions to replicate to
• Load balancer: Dynamically optimize topics assignment to brokers
• Support large number of topics
• Store subscription position
• Apps don’t need to store it
• Able to delete data as soon as it's consumed
• Support round-robin distribution across multiple consumers
5. Work load examples
Challenge             # Topics   # Producers / topic   # Subscriptions / topic   Produced msg rate / s / topic
Fan-out               1          1                     1 K                       1 K
Throughput & latency  1          1                     1                         100 K
# Topics & latency    1 M        1                     10                        10
Fan-in                1          1 K                   1                         > 100 K
• Designed to support a wide range of use cases
• Needs to be cost effective in every case
7. Messaging model
• Producers can attach to a topic and send messages to it
• A subscription is a durable resource that receives all messages sent to
the topic after the subscription's creation
• Subscriptions do have a type:
• “Exclusive” means that only one consumer is allowed to attach to this subscription. First
consumer decides the type.
• “Shared” allows multiple consumers. Messages are sent in round-robin distribution. No
ordering guarantees.
• “Failover” allows multiple consumers, though only one is receiving messages at a given
point, while others are in standby mode.
[Figure: Producer-X and Producer-Y publish to a Topic. Subscription-A (Exclusive) has Consumer-1; Subscription-B (Shared) has Consumer-2 and Consumer-3; Subscription-C (Failover) has Consumer-4 and Consumer-5.]
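The three subscription types can be illustrated with a small in-memory model. This is only a sketch: the class `SubscriptionSketch` and its methods are hypothetical names, not part of the CMS API, and picking the earliest-attached consumer as the active one for "Failover" is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the three subscription types; not the CMS implementation.
final class SubscriptionSketch {
    enum Type { EXCLUSIVE, SHARED, FAILOVER }

    private final Type type;
    private final List<String> consumers = new ArrayList<>();
    private int next = 0; // round-robin cursor, used only by SHARED

    SubscriptionSketch(Type type) {
        this.type = type;
    }

    // EXCLUSIVE rejects a second consumer; SHARED and FAILOVER accept it.
    void attach(String consumer) {
        if (type == Type.EXCLUSIVE && !consumers.isEmpty()) {
            throw new IllegalStateException("subscription already has a consumer");
        }
        consumers.add(consumer);
    }

    // Removing the first consumer of a FAILOVER subscription promotes the next one.
    void detach(String consumer) {
        consumers.remove(consumer);
    }

    // Returns the name of the consumer that receives the next message.
    String dispatch() {
        if (consumers.isEmpty()) {
            throw new IllegalStateException("no consumer attached");
        }
        if (type == Type.SHARED) {
            String target = consumers.get(next % consumers.size());
            next++;
            return target;
        }
        // EXCLUSIVE has a single consumer; FAILOVER (assumption) uses the earliest attached.
        return consumers.get(0);
    }

    public static void main(String[] args) {
        SubscriptionSketch shared = new SubscriptionSketch(Type.SHARED);
        shared.attach("consumer-2");
        shared.attach("consumer-3");
        for (int i = 0; i < 4; i++) {
            System.out.println("message " + i + " goes to " + shared.dispatch());
        }
    }
}
```

With a shared subscription the messages alternate between the two consumers, which is why no ordering guarantee can be given across them.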
8. Client API
▪ Expose messaging model concepts (producer/consumer)
▪ C++ and Java
▪ Connection pooling
▪ Handle recoverable failures transparently (reconnect / resend
messages) without compromising ordering guarantees
▪ Sync / async version of every operation
9. Java producer example
CmsClient client = CmsClient.create("http://<broker vip>:4080");
Producer producer = client.createProducer("my-topic");
// handles retries in case of failure
producer.send("my-message".getBytes());
// Async version:
producer.sendAsync("my-message".getBytes()).thenRun(() -> {
// Message was persisted
});
10. Java consumer example
CmsClient client = CmsClient.create("http://<broker vip>:4080");
Consumer consumer = client.subscribe(
"my-topic",
"my-subscription-name",
SubscriptionType.Exclusive);
// Blocks until message available
Message msg = consumer.receive();
// Do something...
consumer.acknowledge(msg);
11. System overview
Broker
• State-less
• Maintain in memory cache of
messages
• Read from Bookkeeper when
cache miss
Bookkeeper
• Distributed write-ahead log
• Create many ledgers
• Append entries
• Read entries
• Delete ledger
• Consistent reads
• Single writer (the broker)
[Figure: CMS cluster. Producer and consumer apps use the CMS client; the broker contains the native dispatcher, managed ledger, BK client, cache, load balancer and global replicators; the cluster also includes Bookie, ZK, Global ZK and Replication.]
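The broker's "read from Bookkeeper when cache miss" behaviour is a read-through cache. The sketch below shows the idea under stated assumptions: `ReadThroughCacheSketch` is a hypothetical class, the `store` function merely stands in for a BookKeeper read, and eviction is left out entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical read-through cache; the store function stands in for a BookKeeper read.
final class ReadThroughCacheSketch {
    private final Map<Long, byte[]> cache = new HashMap<>();
    private final LongFunction<byte[]> store;
    int storeReads = 0; // counts how often the backing store was consulted

    ReadThroughCacheSketch(LongFunction<byte[]> store) {
        this.store = store;
    }

    // On publish the entry is kept in memory (it is persisted elsewhere).
    void onPublish(long entryId, byte[] payload) {
        cache.put(entryId, payload);
    }

    // Serve from memory when possible, otherwise fall back to the store.
    byte[] read(long entryId) {
        byte[] cached = cache.get(entryId);
        if (cached != null) {
            return cached;
        }
        storeReads++;
        return store.apply(entryId);
    }

    public static void main(String[] args) {
        ReadThroughCacheSketch broker =
                new ReadThroughCacheSketch(id -> ("stored-" + id).getBytes());
        broker.onPublish(1L, "hot".getBytes());
        System.out.println(new String(broker.read(1L))); // served from memory
        System.out.println(new String(broker.read(2L))); // falls back to the store
        System.out.println("store reads: " + broker.storeReads);
    }
}
```

Consumers that keep up with producers are served from memory, so only consumers draining a backlog generate reads against the bookies.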
12. System overview
Native dispatcher
• Async Netty server
Global replicators
• If topic is global, republish
messages in other regions
Global Zookeeper
• ZK instance with participants in
multiple US regions
• Consistent data store for
customer configuration
• Accept writes with one region
down
[Figure: same CMS cluster diagram as on the previous slide]
13. Partitioned topics
▪ Client lib has a wrapper producer/
consumer implementation
▪ No API changes
▪ Producers can decide how to
assign messages to partitions:
▪ Single partition
▪ Round robin
▪ Provide a key on the message
▪ Hash of the key determines the
partition
▪ Custom routing
[Figure: an app's producer for topic T1 holds per-partition producers P0 to P4, which publish to partitions T1-P0 to T1-P4 spread across Broker 1, Broker 2 and Broker 3 in the CMS cluster.]
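The round-robin and key-hash routing modes can be sketched as follows. `PartitionRouterSketch` is a hypothetical helper, and using `String.hashCode()` as the hash function is an assumption; the slides do not say which hash the client library uses.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical partition router; the hash function is an illustrative choice.
final class PartitionRouterSketch {
    private final int numPartitions;
    private final AtomicInteger counter = new AtomicInteger();

    PartitionRouterSketch(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // Round robin: each call moves on to the next partition.
    int roundRobin() {
        return Math.floorMod(counter.getAndIncrement(), numPartitions);
    }

    // Keyed: the same key always maps to the same partition.
    int byKey(String key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        PartitionRouterSketch router = new PartitionRouterSketch(5);
        for (int i = 0; i < 6; i++) {
            System.out.println("message " + i + " -> T1-P" + router.roundRobin());
        }
        System.out.println("key user-42 -> T1-P" + router.byKey("user-42"));
    }
}
```

`Math.floorMod` keeps the result non-negative even when the hash code is negative or the counter wraps around.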
14. Partitioned topics
▪ Consumers can use all
subscription types with the same
semantics
▪ In “Failover” subscription type, the
election is done per partition
▪ Partition assignments are spread
evenly across all available
consumers
▪ No need for ZK coordination
[Figure: two apps, Consumer-1 and Consumer-2, each hold per-partition consumers C0 to C4 for topic T1, attached to partitions T1-P0 to T1-P4 spread across Broker 1, Broker 2 and Broker 3 in the CMS cluster.]
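One way a per-partition election can spread partitions evenly without external coordination is a deterministic rule that every party can evaluate from the consumer list alone. The sketch below is an assumption about how such a rule might look, not the CMS election algorithm; `FailoverAssignmentSketch` and `activeConsumer` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical deterministic election: sort the consumers, then index by partition.
final class FailoverAssignmentSketch {
    // Returns the consumer that would be active for the given partition.
    static String activeConsumer(int partition, List<String> consumers) {
        if (consumers.isEmpty()) {
            throw new IllegalArgumentException("no consumers attached");
        }
        List<String> sorted = new ArrayList<>(consumers);
        Collections.sort(sorted); // same order regardless of attach order
        return sorted.get(partition % sorted.size());
    }

    public static void main(String[] args) {
        List<String> consumers = Arrays.asList("consumer-2", "consumer-1");
        for (int p = 0; p < 5; p++) {
            System.out.println("T1-P" + p + " -> " + activeConsumer(p, consumers));
        }
    }
}
```

Because the rule depends only on the partition index and the sorted consumer names, each partition can be resolved independently and the standby consumers know which partitions they would take over.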
16. CMS Bookkeeper usage
▪ CMS uses Bookkeeper through a higher-level interface,
ManagedLedger:
› A single managed ledger represents the storage of a single topic
› Maintains list of currently active BK ledgers
› Maintains the subscription positions using an additional ledger to checkpoint the last
acknowledged message in the stream
› Cache data
› Deletes ledgers when all cursors are done with them
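The rule "delete ledgers when all cursors are done with them" can be sketched by tracking, for each subscription, the ledger that holds its next unacknowledged message. `LedgerTrimSketch` is a hypothetical class, and reducing a cursor position to a single ledger id is a simplification made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical trimming logic; a cursor is reduced to the ledger id it currently reads from.
final class LedgerTrimSketch {
    private final TreeSet<Long> ledgers = new TreeSet<>();
    private final Map<String, Long> cursorLedger = new HashMap<>();

    void addLedger(long ledgerId) {
        ledgers.add(ledgerId);
    }

    // Records the ledger that holds the subscription's next unacknowledged message.
    void updateCursor(String subscription, long ledgerId) {
        cursorLedger.put(subscription, ledgerId);
    }

    // Removes and returns every ledger that lies before the slowest cursor.
    List<Long> trim() {
        if (cursorLedger.isEmpty()) {
            return new ArrayList<>(); // simplification: keep everything without cursors
        }
        long slowest = Collections.min(cursorLedger.values());
        List<Long> deleted = new ArrayList<>(ledgers.headSet(slowest));
        ledgers.removeAll(deleted);
        return deleted;
    }

    public static void main(String[] args) {
        LedgerTrimSketch topic = new LedgerTrimSketch();
        topic.addLedger(1);
        topic.addLedger(2);
        topic.addLedger(3);
        topic.updateCursor("sub-a", 3);
        topic.updateCursor("sub-b", 2);
        System.out.println("deleted: " + topic.trim()); // only ledgers before sub-b's position
    }
}
```

The slowest subscription determines how much data is retained, which is why a single stalled consumer can grow the backlog of a topic.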
17. Bookie internal structure
• Writes go both to the
journal and to ledger storage
(on different devices)
• Ledger storage writes are
fsynced periodically
• Reads are only coming from
ledger storage
• Entries are interleaved in entry
log files
• Ledger indexes are used to
find entry offsets
18. Bookkeeper issues
▪ Performance degrades when writing to many ledgers at the same time
▪ Under heavy reads, the ledger storage device slows down and
impacts writes
▪ Ledger storage flushes need to fsync many ledger index files each time
19. Bookie storage improvements
• Writes are written both to
journal and to in memory write
cache
• Entries are periodically flushed
• Entries are sorted by ledger to
be sequential on disk (per
flush period)
• Since entries are sequential,
we added read-ahead cache
• Location index is mostly kept
in memory and only updated
during flush
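The flush path described above can be sketched as: buffer entries in arrival order, sort them by ledger at flush time, append them to the log, and update the location index in the same pass. `WriteCacheSketch` is a hypothetical class; the in-memory list stands in for the on-disk entry log and the string-keyed map for the location index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical write cache; lists and maps stand in for disk files and the location index.
final class WriteCacheSketch {
    static final class Entry {
        final long ledgerId;
        final long entryId;
        final byte[] data;

        Entry(long ledgerId, long entryId, byte[] data) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.data = data;
        }
    }

    private final List<Entry> writeCache = new ArrayList<>();
    private final List<Entry> entryLog = new ArrayList<>();
    private final Map<String, Integer> index = new HashMap<>(); // "ledger:entry" -> offset

    // Entries from many ledgers arrive interleaved.
    void add(long ledgerId, long entryId, byte[] data) {
        writeCache.add(new Entry(ledgerId, entryId, data));
    }

    // Sort by ledger, then by entry, so each ledger is contiguous within this flush.
    void flush() {
        writeCache.sort(Comparator.comparingLong((Entry e) -> e.ledgerId)
                .thenComparingLong(e -> e.entryId));
        for (Entry e : writeCache) {
            index.put(e.ledgerId + ":" + e.entryId, entryLog.size());
            entryLog.add(e);
        }
        writeCache.clear();
    }

    // Returns the offset in the log, or null if the entry has not been flushed yet.
    Integer offsetOf(long ledgerId, long entryId) {
        return index.get(ledgerId + ":" + entryId);
    }

    public static void main(String[] args) {
        WriteCacheSketch bookie = new WriteCacheSketch();
        bookie.add(2, 0, new byte[0]);
        bookie.add(1, 0, new byte[0]);
        bookie.add(2, 1, new byte[0]);
        bookie.add(1, 1, new byte[0]);
        bookie.flush();
        System.out.println("ledger 2, entry 0 is at offset " + bookie.offsetOf(2, 0));
    }
}
```

Since consecutive entries of a ledger end up next to each other within a flush, reading one entry makes it cheap to read ahead the following ones.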
20. Bookkeeper write latency
▪ After hardware, the next limit to achieving low latency is JVM GC
▪ GC pauses are unavoidable. Try to keep them around 50 ms and as
infrequent as possible
› Switched BK client and servers to use Netty pooled ref-counted buffers and direct
memory to hide it from GC and eliminate payload copies
› Extensively profiled allocations and substantially reduced per-entry object allocations
• Use Recycler pattern to pool objects (very efficient for same thread allocate/release)
• Primitive collections
• Array queue instead of linked queues in executors
• Open hash maps instead of linked hash maps
• BTree instead of ConcurrentSkipList
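The Recycler pattern mentioned above can be illustrated with a minimal thread-local pool. This is a hand-written sketch, not Netty's `io.netty.util.Recycler`; `RecyclerSketch` and `PendingEntry` are hypothetical names.

```java
import java.util.ArrayDeque;

// Hypothetical thread-local object pool illustrating the Recycler pattern.
final class RecyclerSketch {
    static final class PendingEntry {
        long ledgerId;
        long entryId;

        // Clears the state and hands the object back to the current thread's pool.
        void recycle() {
            ledgerId = -1;
            entryId = -1;
            POOL.get().push(this);
        }
    }

    // One pool per thread, so acquire and recycle on the same thread need no locking.
    private static final ThreadLocal<ArrayDeque<PendingEntry>> POOL =
            ThreadLocal.withInitial(ArrayDeque::new);

    // Reuses a pooled object when one is available, otherwise allocates a new one.
    static PendingEntry acquire(long ledgerId, long entryId) {
        PendingEntry entry = POOL.get().poll();
        if (entry == null) {
            entry = new PendingEntry();
        }
        entry.ledgerId = ledgerId;
        entry.entryId = entryId;
        return entry;
    }

    public static void main(String[] args) {
        PendingEntry first = acquire(1, 1);
        first.recycle();
        PendingEntry second = acquire(2, 5);
        System.out.println("same instance reused: " + (first == second));
    }
}
```

Reusing the same instance for every entry keeps short-lived per-entry objects out of the young generation, which is the effect the allocation work above is after.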
23. Auto batching
▪ Send messages in batches throughout the system
▪ Transparent to application
▪ Configurable grouping delay and size, e.g. 1 ms / 128 KB
▪ For the same byte/s throughput, fewer txn/s go through the system
› Less CPU usage in broker/bookies
› Lower GC pressure
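Grouping by time and size can be sketched as a batcher that flushes when either limit is reached. `BatcherSketch` is a hypothetical class, the `sent` list stands in for network writes, and timestamps are passed in explicitly so the behaviour is deterministic; a real client would drive `tick` from a timer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batcher: flushes when the size limit or the maximum delay is reached.
final class BatcherSketch {
    private final int maxBytes;
    private final long maxDelayNanos;
    private final List<byte[]> pending = new ArrayList<>();
    private int pendingBytes = 0;
    private long firstAddNanos = 0;
    final List<List<byte[]>> sent = new ArrayList<>(); // stands in for network writes

    BatcherSketch(int maxBytes, long maxDelayNanos) {
        this.maxBytes = maxBytes;
        this.maxDelayNanos = maxDelayNanos;
    }

    // Adds a message and flushes immediately once the size limit is reached.
    void add(byte[] message, long nowNanos) {
        if (pending.isEmpty()) {
            firstAddNanos = nowNanos;
        }
        pending.add(message);
        pendingBytes += message.length;
        if (pendingBytes >= maxBytes) {
            flush();
        }
    }

    // Called periodically: flushes a batch whose oldest message has waited long enough.
    void tick(long nowNanos) {
        if (!pending.isEmpty() && nowNanos - firstAddNanos >= maxDelayNanos) {
            flush();
        }
    }

    private void flush() {
        sent.add(new ArrayList<>(pending));
        pending.clear();
        pendingBytes = 0;
    }

    public static void main(String[] args) {
        // 128 KB size limit and 1 ms delay, matching the example values on the slide.
        BatcherSketch batcher = new BatcherSketch(128 * 1024, 1_000_000L);
        batcher.add(new byte[1024], 0L);
        batcher.add(new byte[1024], 200_000L);
        batcher.tick(1_000_000L);
        System.out.println("batches sent: " + batcher.sent.size());
    }
}
```

The delay bounds the extra latency a message can incur, while the size limit bounds the memory held per batch.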
24. Low durability
▪ Current throughput bottleneck for bookie writes is journal syncs
▪ Could add more bookies, but at a higher cost
▪ Some use cases are ok with losing data on rare occasions
▪ Solution
› Store data in bookies
• No memory limitation, can build big backlog
› Don’t write to bookie journal
• Data is stored in write cache in 2 bookies + broker cache
› Can lose < 1 min of data if 1 broker and 2 bookies crash
▪ Higher throughput with fewer bookies
▪ Lower publish latency