Presentation given at the GoSF meetup on July 20, 2016. It was also recorded on BigMarker here: https://www.bigmarker.com/remote-meetup-go/GoSF-EVCache-Peripheral-I-O-Building-Origin-Cache-for-Images
5. Why Optimize for AWS
● Instances disappear
● Zones disappear
● Regions can "disappear" (Chaos Kong)
● These do happen (and we test all the time)
● Network can be lossy
○ Throttling
○ Dropped packets
● Customer requests move between regions
6. EVCache Use @ Netflix
● 70+ distinct EVCache clusters
● Used by 500+ Microservices
● Data replicated over 3 AWS regions
● Over 1.5 Million replications per second
● 65+ Billion objects
● Tens of Millions ops/second (Trillions per day)
● 170+ Terabytes of data stored
● Clusters from 3 to hundreds of instances
● 11000+ memcached instances of varying size
8. Architecture
● Complete bipartite graph between clients and servers
● Sets fan out, gets prefer closer servers
● Multiple full copies of data
[Diagram: an EVCache client connected to every memcached node across us-west-2a, us-west-2b, and us-west-2c]
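As a rough illustration of that topology, here is a minimal Go sketch (the names are illustrative, not the EVCache client API): a write fans out to a replica in every zone so each zone holds a full copy, and a read tries the local zone first before falling back to the others.

```go
package main

import (
	"errors"
	"fmt"
)

// replica stands in for one memcached node holding a full copy of the data.
type replica struct {
	zone string
	data map[string]string
}

type client struct {
	localZone string
	replicas  []*replica // one per availability zone in this sketch
}

// Set fans out the write to every zone so each zone holds a full copy.
func (c *client) Set(key, value string) {
	for _, r := range c.replicas {
		r.data[key] = value
	}
}

// Get prefers the replica in the caller's zone and falls back to the others.
func (c *client) Get(key string) (string, error) {
	for _, r := range c.replicas {
		if r.zone == c.localZone {
			if v, ok := r.data[key]; ok {
				return v, nil
			}
		}
	}
	for _, r := range c.replicas {
		if v, ok := r.data[key]; ok {
			return v, nil
		}
	}
	return "", errors.New("miss")
}

func main() {
	c := &client{
		localZone: "us-west-2a",
		replicas: []*replica{
			{zone: "us-west-2a", data: map[string]string{}},
			{zone: "us-west-2b", data: map[string]string{}},
			{zone: "us-west-2c", data: map[string]string{}},
		},
	}
	c.Set("movie:42", "queued")
	v, _ := c.Get("movie:42")
	fmt.Println(v)
}
```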
11. Use Case: Lookaside cache
[Diagram: an application (microservice) containing a service client library, an EVCache client, and a Ribbon client; data flows between the service instances (S) and the cache instances (C)]
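The lookaside pattern itself fits in a few lines of Go. This is a hedged sketch, `cache` and `loadFromService` are stand-ins rather than the real EVCache or service client APIs: check the cache first, call the backing service only on a miss, then write the result back for the next reader.

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a stand-in for an EVCache client.
type cache struct {
	mu   sync.RWMutex
	data map[string]string
}

func (c *cache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

func (c *cache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

// loadFromService is a placeholder for the expensive call the cache protects.
func loadFromService(key string) string {
	return "value-for-" + key
}

// lookup implements the lookaside pattern: try the cache, fall back to the
// service on a miss, then populate the cache for subsequent readers.
func lookup(c *cache, key string) string {
	if v, ok := c.Get(key); ok {
		return v
	}
	v := loadFromService(key)
	c.Set(key, v)
	return v
}

func main() {
	c := &cache{data: map[string]string{}}
	fmt.Println(lookup(c, "profile:123")) // miss: hits the service, then caches
	fmt.Println(lookup(c, "profile:123")) // hit: served from the cache
}
```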
12. Use Case: Primary Store
[Diagram: offline/nearline services precompute recommendations and write them into EVCache; online client applications read them through the client library. Data flows from the offline services into the cache and out to the online services.]
13. Use Case: Transient Data Store
[Diagram: many online client applications, each with the client library and EVCache client, reading and writing the same cache cluster as a shared transient data store]
14. Additional Features
● Global data replication
● Secondary indexing (debugging)
● Cache warming (faster deployments)
● Consistency checking
All powered by metadata flowing through Kafka
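To make "metadata flowing through Kafka" concrete, here is a small Go sketch. The event fields, topic name, and `producer` interface are assumptions for illustration, not EVCache's actual schema or client: after a local mutation, only metadata is published, and the value itself is fetched later by the replication relay.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// MutationEvent is an illustrative shape for the metadata published to Kafka
// after a local mutation; the real schema is not part of this sketch.
type MutationEvent struct {
	Key       string    `json:"key"`
	Op        string    `json:"op"`  // "set", "delete", "touch"
	TTL       int       `json:"ttl"` // seconds
	Region    string    `json:"region"`
	Timestamp time.Time `json:"ts"`
}

// producer abstracts whatever Kafka client is in use.
type producer interface {
	Publish(topic string, payload []byte) error
}

// publishMutation sends only metadata; the replica side fetches the value
// itself before applying the set (see the replication flow on the next slide).
func publishMutation(p producer, e MutationEvent) error {
	payload, err := json.Marshal(e)
	if err != nil {
		return err
	}
	return p.Publish("evcache-replication", payload)
}

// stdoutProducer is a trivial stand-in so the sketch runs on its own.
type stdoutProducer struct{}

func (stdoutProducer) Publish(topic string, payload []byte) error {
	fmt.Printf("%s <- %s\n", topic, payload)
	return nil
}

func main() {
	_ = publishMutation(stdoutProducer{}, MutationEvent{
		Key: "profile:123", Op: "set", TTL: 3600,
		Region: "us-west-2", Timestamp: time.Now(),
	})
}
```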
15. Cross-Region Replication
[Diagram: replication flow from Region A to Region B via Kafka, a Replication Relay, and a Replication Proxy]
1. An app mutates data in the Region A cache
2. The client sends mutation metadata to Kafka
3. The Replication Relay polls the message from Kafka
4. For sets, the Relay gets the data from the local cache
5. The Relay sends the message over HTTPS to the Replication Proxy in Region B
6. The Proxy mutates the Region B cache
7. Apps in Region B read the replicated data
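A minimal Go sketch of the relay side of that flow (steps 3 to 5), assuming stand-in `consumer` and `localCache` interfaces rather than any specific Kafka or EVCache client: poll a metadata message, fetch the value for sets, and push the payload to the remote region's proxy over HTTPS.

```go
package replication

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

// ReplMessage mirrors the metadata-only event the relay polls from Kafka.
type ReplMessage struct {
	Key string `json:"key"`
	Op  string `json:"op"`
	TTL int    `json:"ttl"`
}

// ReplPayload is what gets pushed to the remote region's Repl Proxy: the
// metadata plus the value fetched from the local cache.
type ReplPayload struct {
	ReplMessage
	Value []byte `json:"value,omitempty"`
}

// consumer and localCache abstract the Kafka client and the EVCache client.
type consumer interface {
	Poll() (ReplMessage, error)
}
type localCache interface {
	Get(key string) ([]byte, error)
}

// relay runs the steps from the diagram: 3) poll, 4) get data for set,
// 5) HTTPS send to the Repl Proxy in the other region.
func relay(c consumer, cache localCache, proxyURL string) {
	for {
		msg, err := c.Poll()
		if err != nil {
			log.Printf("poll: %v", err)
			continue
		}
		payload := ReplPayload{ReplMessage: msg}
		if msg.Op == "set" {
			if v, err := cache.Get(msg.Key); err == nil {
				payload.Value = v
			}
		}
		body, _ := json.Marshal(payload)
		resp, err := http.Post(proxyURL, "application/json", bytes.NewReader(body))
		if err != nil {
			log.Printf("send to %s: %v", proxyURL, err)
			continue
		}
		resp.Body.Close()
	}
}
```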
20. Moneta
Moneta: The Goddess of Memory
Juno Moneta: The Protectress of Funds for Juno
● Evolution of the EVCache server
● EVCache on SSD
● Cost optimization
● Ongoing effort to lower EVCache cost per stream
● Takes advantage of global request patterns
21. Old Server
● Stock Memcached and Prana (Netflix sidecar)
● Solid, worked for years
● All data stored in RAM in Memcached
● Became more expensive with expansion / N+1 architecture
[Diagram: an instance running stock Memcached alongside Prana and metrics & other processes]
22. Optimization
● Global data means many copies
● Access patterns are heavily region-oriented
● In one region:
○ Hot data is used often
○ Cold data is almost never touched
● Keep hot data in RAM, cold data on SSD
● Size RAM for working set, SSD for overall dataset
23. New Server
● Adds Rend and Mnemonic
● Still looks like Memcached
● Unlocks cost-efficient storage & server-side intelligence
[Diagram: Rend exposed externally, proxying internally to Memcached (RAM) and Mnemonic (SSD), with Prana and metrics & other processes on the same instance]
25. Rend
● High-performance Memcached proxy & server
● Written in Go
○ Powerful concurrency primitives
○ Productive and fast
● Manages the L1/L2 relationship
● Server-side data chunking
● Tens of thousands of connections
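A minimal sketch of an L1/L2 policy like the one described above, assuming a generic `store` interface rather than Rend's actual internals: sets write through both levels, gets fall back from RAM to SSD and promote hits back into L1.

```go
package rend

import "errors"

// ErrMiss is returned when neither cache level has the key.
var ErrMiss = errors.New("cache miss")

// store abstracts a backing store speaking the memcached protocol,
// e.g. Memcached (L1, RAM) or Mnemonic (L2, SSD).
type store interface {
	Get(key []byte) ([]byte, error)
	Set(key, value []byte, ttl int) error
}

// orchestrator is an illustrative (not Rend's actual) L1/L2 policy:
// sets go to both levels, gets fall back to L2 and re-populate L1.
type orchestrator struct {
	l1, l2 store
}

func (o *orchestrator) Set(key, value []byte, ttl int) error {
	if err := o.l2.Set(key, value, ttl); err != nil {
		return err
	}
	return o.l1.Set(key, value, ttl)
}

func (o *orchestrator) Get(key []byte) ([]byte, error) {
	if v, err := o.l1.Get(key); err == nil {
		return v, nil
	}
	v, err := o.l2.Get(key)
	if err != nil {
		return nil, ErrMiss
	}
	// Promote the hot record back into RAM for subsequent reads.
	_ = o.l1.Set(key, v, 0)
	return v, nil
}
```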
26. Rend
● Modular to allow future changes / expansion of scope
○ Set of libraries and a default main()
● Manages connections, request orchestration, and backing stores (sketched below)
● Low-overhead metrics library
● Multiple orchestrators
● Parallel locking for data integrity
[Diagram: Rend's internal layers, connection management, server loop, protocol, request orchestration, and backend handlers, with metrics instrumented alongside every layer]
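Because Rend is a set of libraries plus a default main(), wiring it up looks roughly like the following sketch (the interface and function names are illustrative, not Rend's real API): connection management accepts sockets, a per-connection server loop reads requests, and a pluggable handler decides what to do with each one.

```go
package main

import (
	"bufio"
	"log"
	"net"
)

// handler is the per-request hook an orchestrator implements; the real Rend
// interfaces are richer, this only illustrates the layering.
type handler interface {
	HandleLine(line string) string
}

type echoHandler struct{}

func (echoHandler) HandleLine(line string) string { return line }

// listenAndServe owns connection management and runs one server loop per
// connection, delegating each request to the pluggable handler.
func listenAndServe(addr string, h handler) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Printf("accept: %v", err)
			continue
		}
		go serveConn(conn, h)
	}
}

func serveConn(conn net.Conn, h handler) {
	defer conn.Close()
	r := bufio.NewScanner(conn)
	for r.Scan() {
		if _, err := conn.Write([]byte(h.HandleLine(r.Text()) + "\r\n")); err != nil {
			return
		}
	}
}

func main() {
	// A "default main()" just wires the libraries together.
	log.Fatal(listenAndServe(":11211", echoHandler{}))
}
```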
27. Moneta in Production
● Serving some of our most important personalization data
● Rend runs with two ports
○ One for regular users (read heavy or active management)
○ Another for "batch" uses: Replication and Precompute
● Maintains working set in RAM
● Optimized for precomputes
○ Smartly replaces data in L1
[Diagram: Rend exposing a standard port and a batch port externally, backed by Memcached (RAM) and Mnemonic (SSD), with Prana and metrics & other processes on the same instance]
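One plausible reading of "optimized for precomputes" is sketched below; this is an assumption about the policy, not Rend's documented behavior: writes arriving on the batch port always land in L2, but only refresh L1 when the record is already resident, so replication and precompute traffic cannot evict the online working set.

```go
package moneta

// store abstracts a cache level (Memcached in RAM, Mnemonic on SSD).
type store interface {
	Get(key []byte) ([]byte, error)
	Set(key, value []byte, ttl int) error
}

// policy captures the difference between the two ports in this sketch: batch
// traffic (replication, precompute) is kept out of the RAM working set.
type policy struct {
	l1, l2 store
	batch  bool
}

func (p *policy) Set(key, value []byte, ttl int) error {
	if err := p.l2.Set(key, value, ttl); err != nil {
		return err
	}
	if p.batch {
		// Assumed behavior: only refresh records that are already hot in L1,
		// so batch writes never evict online data.
		if _, err := p.l1.Get(key); err != nil {
			return nil
		}
	}
	return p.l1.Set(key, value, ttl)
}
```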
29. Mnemonic
● Manages data storage to SSD
● Reuses Rend server libraries
○ Handles Memcached protocol
● Core logic maps Memcached operations onto RocksDB (sketched below)
[Diagram: the Mnemonic stack, Rend Server Core Lib (Go) on top of the Mnemonic Op Handler (Go), the Mnemonic Core (C++), and RocksDB]
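A hedged Go sketch of what an op handler in that stack has to do, with `kv` standing in for the C++ core around RocksDB: memcached TTL semantics are layered on top of a flat key-value store by prefixing each value with an absolute expiry and checking it on read.

```go
package mnemonic

import (
	"encoding/binary"
	"errors"
	"time"
)

// kv is a stand-in for the C++ Mnemonic core wrapping RocksDB; only a flat
// byte-oriented Put/Get/Delete surface matters for this sketch.
type kv interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, error)
	Delete(key []byte) error
}

var ErrNotFound = errors.New("not found")

// OpHandler maps memcached-style operations onto the flat key-value store.
type OpHandler struct {
	db kv
}

// Set prepends an absolute expiry (unix seconds) so Get can honor memcached
// TTL semantics even though the flat store has no per-record expiry here.
func (h *OpHandler) Set(key, value []byte, ttlSeconds int) error {
	buf := make([]byte, 8+len(value))
	var exp uint64
	if ttlSeconds > 0 {
		exp = uint64(time.Now().Unix()) + uint64(ttlSeconds)
	}
	binary.BigEndian.PutUint64(buf[:8], exp)
	copy(buf[8:], value)
	return h.db.Put(key, buf)
}

func (h *OpHandler) Get(key []byte) ([]byte, error) {
	raw, err := h.db.Get(key)
	if err != nil || len(raw) < 8 {
		return nil, ErrNotFound
	}
	exp := binary.BigEndian.Uint64(raw[:8])
	if exp != 0 && uint64(time.Now().Unix()) > exp {
		_ = h.db.Delete(key) // lazily expire on read, as memcached does
		return nil, ErrNotFound
	}
	return raw[8:], nil
}
```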
30. Why RocksDB for Moneta
● Fast at medium to high write load
○ Goal: 99% read latency ~20-25ms
● LSM Tree Design minimizes random writes to SSD
○ Data writes are buffered
● SST: Static Sorted Table
[Diagram: incoming records are buffered in memtables and then flushed to SST files on the SSD]
31. How we use RocksDB
● FIFO "Compaction"
○ More suitable for our precompute use cases
○ Level compaction generated too much traffic to SSD
● Bloom filters and indices kept in-memory
● Records sharded across many RocksDBs per instance
○ Reduces number of SST files checked, decreasing latency
[Diagram: the Mnemonic core lib hashes each key (e.g. ABC, XYZ) to one of many RocksDB instances on the same node]
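A minimal sketch of that sharding, with `db` standing in for a single RocksDB instance behind the Mnemonic core: each key is hashed to exactly one shard, so a lookup only touches that shard's memtable, index, and SST files.

```go
package mnemonic

import "hash/fnv"

// db is the flat surface needed from one RocksDB instance; it mirrors the kv
// interface in the op-handler sketch above.
type db interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, error)
	Delete(key []byte) error
}

// shardedDB spreads keys across many RocksDB instances on one box so a read
// only has to consult one database's SST files.
type shardedDB struct {
	shards []db
}

// pick hashes the key to a shard; FNV-1a is just an easy, stable choice here.
func (s *shardedDB) pick(key []byte) db {
	h := fnv.New32a()
	h.Write(key)
	return s.shards[h.Sum32()%uint32(len(s.shards))]
}

func (s *shardedDB) Put(key, value []byte) error    { return s.pick(key).Put(key, value) }
func (s *shardedDB) Get(key []byte) ([]byte, error) { return s.pick(key).Get(key) }
func (s *shardedDB) Delete(key []byte) error        { return s.pick(key).Delete(key) }
```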
32. FIFO Limitation
● FIFO compaction not suitable for all use cases
○ Very frequently updated records may prematurely push out other valid records
● Future: custom compaction or level compaction
[Diagram: three SSTs over time; repeated updates to records A and B (A1-A3, B1-B3) fill the newer SSTs while still-valid records C through H sit in the oldest SST, which FIFO compaction drops first]
33. Moneta Performance Benchmark
● 1.7ms 99th percentile read latency
○ Server-side latency
○ Not using batch port
● Load: 1K writes/sec, 3K reads/sec
○ Reads have 10% misses
● Instance type: i2.xlarge