3. Real Time Bidding (RTB)
• Real-time bidding is a dynamic auction process where each impression is bid for in (near) real time, as opposed to a static auction
• Kenshoo is engaged in Facebook Exchange (FBX)
• In FBX, each bid has a lifetime of 120 ms. All transactions have to complete within that period, and the winning ad is presented to the user (see the sketch after this list)
• Kenshoo employs ad re-targeting, where search engine campaigns are extended to the social network, giving a much higher ROI for our customers
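As a minimal sketch of how a bidder can budget the 120 ms window described above, the following Python snippet reserves part of the window for network latency; the network allowance, function names, and parameters are illustrative assumptions, not Kenshoo's actual code:

    import time

    BID_WINDOW_MS = 120        # FBX bid lifetime, from the bullet above
    NETWORK_BUDGET_MS = 40     # assumed allowance for network latency (hypothetical value)

    def handle_bid_request(request, compute_bid):
        # Compute a bid only if the response can still reach FBX inside the window.
        deadline = time.monotonic() + (BID_WINDOW_MS - NETWORK_BUDGET_MS) / 1000.0
        bid = compute_bid(request)          # user lookup, campaign matching, pricing
        if time.monotonic() > deadline:
            return None                     # too late: no-bid rather than answer past the window
        return bid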
8. Requirements
• Handle 25K+ requests within the 120 ms bid time-frame, including network latencies
• Ability to scale up to 1M requests per minute while keeping the current latency (see the back-of-envelope below)
• Handle ~10K writes/second with low latency
• Multi-DC configuration; all nodes must be synced in real time
• Seamless operations: compactions and repairs
• High security
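A quick back-of-envelope conversion of the targets above to a common per-second rate (only the numbers from the bullets are used):

    # 1M requests per minute, expressed per second, next to the write target.
    requests_per_second = 1_000_000 / 60    # ~16,700 requests per second
    writes_per_second = 10_000
    print(round(requests_per_second), writes_per_second)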
9. C* Physical Architecture
[Architecture diagram: App servers in the (US) West and (US) East regions, each region connected over the Internet to FBX WEST / FBX EAST, with the two regions linked by a GRE VPN]
10. C* Cluster Information
• Cassandra version 1.2.6
• Oracle Java 7
• Manual tokens; vnodes are coming soon
• Multi-DC Configuration (see the sketch after this list)
• Network Topology
• DC connectivity between VPCs via Linux GRE
• Amazon c3.2xlarge instance type
• Ubuntu 13.10 with EXT4
• SSD (ephemeral)
[Diagram: the ring]
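To make the multi-DC / network-topology bullets concrete, here is a minimal sketch using the DataStax Python driver; the DC names ('us-east', 'us-west'), keyspace, table, replication factors, and contact points are assumptions rather than values from the deck:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
    from cassandra.query import SimpleStatement

    # Pin the client to its local DC so the cross-region link is used only for replication.
    cluster = Cluster(
        contact_points=['10.0.1.10', '10.0.1.11'],   # local seed nodes (placeholders)
        load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc='us-east')))
    session = cluster.connect()

    # NetworkTopologyStrategy replicates the keyspace to both regions (RF values assumed).
    session.execute(
        "CREATE KEYSPACE rtb WITH replication = "
        "{'class': 'NetworkTopologyStrategy', 'us-east': 2, 'us-west': 2}")

    # LOCAL_QUORUM keeps request latency inside the local region while both DCs stay in sync.
    stmt = SimpleStatement(
        "INSERT INTO rtb.user_segments (user_id, segment) VALUES (%s, %s)",  # hypothetical table
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    session.execute(stmt, ('user-123', 'retargeting'))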
11. C* Cluster Network Between Sites
• For security reasons we:
  • Do not use EC2Snitch or EC2MultiRegionSnitch
  • Connect the nodes via VPN (Linux GRE); see the sketch after this list
  • Linux GRE is fast, reliable, and provides high throughput (~1 Gb/s)
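A minimal sketch of setting up one side of such a Linux GRE tunnel (run as root; all addresses and the remote subnet are placeholders, not the real VPC values):

    import subprocess

    LOCAL_PUBLIC_IP  = '203.0.113.10'    # this node's public/elastic IP (placeholder)
    REMOTE_PUBLIC_IP = '198.51.100.20'   # peer region's endpoint (placeholder)
    TUNNEL_ADDRESS   = '10.255.0.1/30'   # point-to-point address inside the tunnel
    REMOTE_SUBNET    = '10.1.0.0/16'     # the other region's node subnet (placeholder)

    def sh(cmd):
        subprocess.check_call(cmd, shell=True)

    sh('ip tunnel add gre1 mode gre local {} remote {} ttl 255'
       .format(LOCAL_PUBLIC_IP, REMOTE_PUBLIC_IP))
    sh('ip addr add {} dev gre1'.format(TUNNEL_ADDRESS))
    sh('ip link set gre1 up')
    sh('ip route add {} dev gre1'.format(REMOTE_SUBNET))   # route the peer DC through the tunnel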
12. C* Cluster Storage
• We started with Amazon EBS:
  • With a small number of nodes (up to 4): you want persistent storage and to avoid running repairs if you lose a node
  • 4x EBS devices in a RAID10 configuration provide up to 1,000 IOPS, with bursts of up to 2,000 IOPS
  • Cheap in AWS
• 8 nodes with ephemeral devices:
  • Lower risk: if you lose a node, recovery isn't as heavy on the whole cluster
  • We used RAID0 (see the sketch after this list)
  • Higher performance (double that of EBS)
  • Free, bundled with the instances
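A sketch of assembling the ephemeral devices into RAID0 for the Cassandra data directory (device names, array name, and mount point are typical EC2/Cassandra defaults, assumed rather than quoted from the deck):

    import subprocess

    DEVICES = ['/dev/xvdb', '/dev/xvdc']   # instance-store devices (assumed)

    def sh(cmd):
        subprocess.check_call(cmd, shell=True)

    sh('mdadm --create /dev/md0 --level=0 --raid-devices={} {}'
       .format(len(DEVICES), ' '.join(DEVICES)))
    sh('mkfs.ext4 /dev/md0')                     # EXT4, as listed on the cluster slide
    sh('mkdir -p /var/lib/cassandra')
    sh('mount /dev/md0 /var/lib/cassandra')      # Cassandra data directory (default path)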
13. C* Cluster Storage continued
• 16 nodes with ephemeral devices:
  • When the load became heavy we grew to 16 nodes
  • Compactions and repairs harmed the cluster latency
  • We had to use Provisioned IOPS devices for C* maintenance
• C3 instance type with SSD:
  • Came just in time, providing ephemeral SSD storage
  • Solved our performance problems and enabled seamless compactions and repairs
  • Amazon currently has scarce deployment of this hardware and nodes are not stable
  • Not yet available in all regions
  • C3 node deployment is not always a possibility due to AWS capacity issues
  • Amazon promised to resolve the C3 issues next month
15. Monitoring
• We rely heavily on DataStax OpsCenter
• We pull OpsCenter metrics out for graphing
• We wrote our own read/write speed test, run against a separate dedicated keyspace on each node, to detect bottlenecks and problematic nodes (see the sketch after this list)
• We sample the data separately from the application to detect whether the problem originates in C* or in the application
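A minimal sketch of such a per-node speed test against a dedicated keyspace (keyspace/table names, sample count, and the use of the DataStax Python driver are assumptions; the actual Kenshoo tool is not shown in the deck):

    import time
    from cassandra import AlreadyExists
    from cassandra.cluster import Cluster

    NODE = '10.0.1.10'      # probe a single, specific node (placeholder IP)
    SAMPLES = 100

    cluster = Cluster(contact_points=[NODE])
    session = cluster.connect()
    try:
        session.execute("CREATE KEYSPACE speedtest WITH replication = "
                        "{'class': 'SimpleStrategy', 'replication_factor': 1}")
        session.execute("CREATE TABLE speedtest.probe (id int PRIMARY KEY, payload text)")
    except AlreadyExists:
        pass

    writes, reads = [], []
    for i in range(SAMPLES):
        t0 = time.time()
        session.execute("INSERT INTO speedtest.probe (id, payload) VALUES (%s, %s)",
                        (i, 'x' * 256))
        writes.append(time.time() - t0)
        t0 = time.time()
        session.execute("SELECT payload FROM speedtest.probe WHERE id = %s", (i,))
        reads.append(time.time() - t0)

    # Report per-node tail latency; a slow or problematic node stands out immediately.
    print('write p95: %.1f ms   read p95: %.1f ms' % (
        sorted(writes)[int(SAMPLES * 0.95)] * 1000,
        sorted(reads)[int(SAMPLES * 0.95)] * 1000))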
16. What have we learned
• Storage:
  • Use SSD:
    • It provides high and stable disk performance
    • It neutralizes the effect of compactions and repairs on the cluster
    • Worth the money
• Network:
  • Use the highest-bandwidth VPN possible
  • GRE is great (it lacks encryption, but provides the best bandwidth)
• Maintenance:
  • Run compact daily: it does miracles for performance under heavy load (see the sketch after this list)
  • If you are not on SSD, disable thrift on the node before running compaction
  • Do compactions in sequence, node by node
  • On high-load systems, avoid repair where possible; it's better to decommission and recommission a node than to run repair!
  • If you have to repair, always use the '-pr' flag and, if possible, use the incremental repair option (requires heavy scripting)
• Monitoring:
  • Write a sampler and speed tester for each node to detect bottlenecks and the sources of performance issues
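A sketch of the node-by-node maintenance routine described above (the host list and SSH access are assumptions; the nodetool subcommands are standard):

    import subprocess

    NODES = ['10.0.1.10', '10.0.1.11', '10.0.2.10', '10.0.2.11']   # placeholder node IPs
    ON_SSD = True          # on non-SSD nodes, stop serving thrift clients during compaction

    def run(host, cmd):
        subprocess.check_call(['ssh', host, cmd])

    for host in NODES:                                 # strictly one node at a time
        if not ON_SSD:
            run(host, 'nodetool disablethrift')        # per the advice above for non-SSD nodes
        run(host, 'nodetool compact')                  # daily major compaction
        if not ON_SSD:
            run(host, 'nodetool enablethrift')
    # For repairs, the slide recommends 'nodetool repair -pr', run the same way, node by node.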