1. Utilizing Redis in a high-traffic adtech stack
Rahul Babbar
Arjun Satya
Times Internet Ltd
2. About me
• Rahul Babbar
• Chief Manager – Technology, Adtech Colombia
• Times Internet Ltd
• Technology, soccer, philosophy, travel enthusiast.
3. Agenda
• About Times Internet
• About the Colombia adtech stack
• Where we use Redis
• Load testing, design decisions, cluster setup and configuration
• Monitoring and more
• Good practices
• Challenges
4. Times Internet Ltd
• Digital arm of Times Group
• 240+ million unique visitors per month.
• Evolved from a digital media company to a digital products company.
6. Colombia
• Complete adtech stack
• Ad server
• Data Management Platform (DMP)
• Demand mediation
• Recommendation Service
• Billing, automation and self service
• Powers ads on ~150 publishers; monetizes ~55% of news traffic in India.
• ~9 billion ad impressions per month.
11. Central Caching Layer
• Implements the JSR 107 (JCache) specification
• Write-through cache
• Helps keep the metadata in all ad components in sync
• Uses Redis pub/sub
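The write-through + pub/sub pattern can be sketched as follows. This is an illustrative Python sketch, not the production JSR 107 Java implementation; an in-memory `Bus` class stands in for Redis pub/sub, and a plain dict for the backing store:

```python
from collections import defaultdict

class Bus:
    """In-memory stand-in for Redis pub/sub (illustration only)."""
    def __init__(self):
        self.subscribers = defaultdict(list)
    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)
    def publish(self, channel, message):
        for cb in self.subscribers[channel]:
            cb(message)

class WriteThroughCache:
    """Each ad component holds one of these: writes go through to the
    backing store, and an invalidation message keeps every peer in sync."""
    def __init__(self, store, bus):
        self.local, self.store, self.bus = {}, store, bus
        bus.subscribe("metadata", self._on_message)
    def put(self, key, value):
        self.store[key] = value            # write through to backing store
        self.bus.publish("metadata", key)  # tell peers to refresh this key
    def get(self, key):
        if key not in self.local:          # miss -> load from backing store
            self.local[key] = self.store.get(key)
        return self.local[key]
    def _on_message(self, key):
        self.local[key] = self.store.get(key)  # refresh from source of truth

# Two components sharing one store: a write in one is visible in the other.
store, bus = {}, Bus()
a, b = WriteThroughCache(store, bus), WriteThroughCache(store, bus)
b.get("cmp:1")            # warm b's local copy (still None)
a.put("cmp:1", "active")  # write-through + pub/sub notification
print(b.get("cmp:1"))     # -> active
```

In production the publish would go over a Redis channel, so every JVM holding a local copy of the metadata refreshes the key.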
13. Data Management Platform
• User : Category : Date => frequency
• Analytics / HyperLogLog (HLL)
• Co-location of per-site data using Redis hash tagging
• Lua scripting
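Hash tagging can be made concrete with the cluster's slot function. Below is a minimal Python sketch of CRC16 (XModem) plus the hash-tag rule; the `{site42}` key names are hypothetical examples:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), the checksum Redis Cluster uses for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Redis Cluster slot: hash only the {...} hash tag if one is present."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:  # tag must be non-empty
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

# Keys sharing a hash tag land in the same slot (and hence the same master).
print(key_slot("{site42}:user:1") == key_slot("{site42}:user:2"))  # True
```

Because all keys for one site hash to the same slot, multi-key operations and Lua scripts over a site's data stay on a single node.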
14. Load testing (~2016)
• Customized for the cluster and our use case
• Tests the network
• Tests the Java clients as well
• ~20K requests per second
15. Load testing design
• Redis cluster of 15 master nodes across 3 machines (5 nodes/machine)
• Java client
• A Java client packaged as a jar file
• Use case => get the user profile and set an attribute in the user profile
• while(true){
•   execute the use case
•   print the current time as HHMMSS : average time to get : average time to set
• }
• 3 client machines
• Each client machine ran 4 instances of the Java client
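The client loop above can be sketched as follows. The real test was a Java client hitting the 15-node cluster; here dict-backed `get`/`set` stubs stand in for the cluster calls so the sketch is self-contained:

```python
import time
from statistics import mean

def load_loop(get_fn, set_fn, iterations=10_000):
    """Repeatedly run the use case (get a user profile, then set an
    attribute) and report average per-op latencies, as in the loop above."""
    get_ms, set_ms = [], []
    for i in range(iterations):
        key = f"user:{i % 1000}"
        t0 = time.perf_counter()
        get_fn(key)
        get_ms.append((time.perf_counter() - t0) * 1000)
        t0 = time.perf_counter()
        set_fn(key, "segment=sports")
        set_ms.append((time.perf_counter() - t0) * 1000)
    stamp = time.strftime("%H%M%S")
    print(f"{stamp} : avg get {mean(get_ms):.4f} ms : avg set {mean(set_ms):.4f} ms")
    return mean(get_ms), mean(set_ms)

# In-memory stubs in place of real cluster calls (assumption for the sketch).
profiles = {}
avg_get, avg_set = load_loop(profiles.get, profiles.__setitem__)
```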
17. Load test continued
• 3 client machines (4 Java client instances per machine)
• ~15,000 operations per second
• 6 client machines (4 Java client instances per machine)
• ~28,000 operations per second
• ~Linear increase (gave us confidence that a Redis cluster could work for our use case)
18. Decisions
• 512 GB memory per machine
• How many Redis nodes?
• How much memory per node?
• Number of slaves per master
• Appropriate Java client (Jedis vs. Lettuce)
19. Memory per node / nodes per machine (512 GB)
• Fewer nodes/machine => more memory/node
• 5 nodes => ~100 GB/node
• Easy to manage
• Utilizes only 5 cores
• Slow startup of all nodes on the machine
• More nodes/machine => less memory/node
• 20 nodes => ~25 GB/node
• Fast startup
• More core utilization
• Harder to manage
20. Our configuration (cluster)
• Each machine (512 GB)
• 20 nodes/machine
• 10 masters + 10 slaves per machine
• ~24 GB/node
• 7 such machines for the runtime cluster
• 6 such machines for the operational cluster
• 1 slave per master
• Jedis + Lettuce (async calls)
22. Monitoring
• All software systems will fail at some point because they depend on other systems. What matters is how fast we can detect/predict such a failure and, where possible, auto-heal it.
23. Node level monitoring
• A script runs on every machine that hosts Redis nodes.
• Every 30 seconds it checks that one Redis instance is running on each of ports 7000–7019.
• If not, it starts the instance and raises an alert.
• Limitation: if the machine itself is down, the script cannot run, so no alert is raised.
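A minimal sketch of the per-machine liveness check (Python here for brevity; the port range 7000–7019 comes from the slide, everything else is illustrative):

```python
import socket

def port_is_up(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds (cheap liveness probe)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

def nodes_down(host: str = "127.0.0.1", ports=range(7000, 7020)):
    """Ports with no listener -- these nodes get restarted, and an alert
    is raised (restart/alert hooks omitted: they are environment-specific)."""
    return [p for p in ports if not port_is_up(host, p)]

# Self-check against a local listener bound to an ephemeral port.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
live_port = srv.getsockname()[1]
print(port_is_up("127.0.0.1", live_port))  # True: this "node" is up
srv.close()
```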
24. Stack Level(Global) Monitoring
• A script runs on 2 machines.
• It tries to “set” a key and “get” a key in each stack.
• If either fails, it raises an alert.
• So if 2+ machines are down and the Redis stack fails as a whole, the “set” fails and an alert is generated.
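The global probe can be sketched as a set/get round-trip. `FakeClient` is a dict-backed stand-in so the sketch is self-contained; the real probe would use an actual cluster client:

```python
import uuid

def probe(client) -> bool:
    """Round-trip a canary key through the stack: set, then get it back.
    Any object exposing set/get (e.g. a cluster client) works here."""
    key, value = f"healthcheck:{uuid.uuid4()}", "ok"
    try:
        client.set(key, value)
        return client.get(key) == value
    except Exception:
        return False  # any failure along the way counts as unhealthy

class FakeClient:
    """Dict-backed stand-in for a Redis client (assumption for the sketch)."""
    def __init__(self): self.d = {}
    def set(self, k, v): self.d[k] = v
    def get(self, k): return self.d.get(k)

print(probe(FakeClient()))  # True -> healthy; False would raise an alert
```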
25. Hourly health stats check
• Every hour, check the following per node per stack:
• Used memory
• Number of keys
• Number of connections
• Memory fragmentation ratio
• Slow queries
• Whether the last background save succeeded
• Whether slaves are online and not lagging behind masters
• Raise an alert if any of these is abnormal.
• Email the report twice a day (10 AM, 6 PM) to confirm the script itself is still running.
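A sketch of the hourly check built on `INFO` output. `SAMPLE_INFO` is an illustrative subset (real output has many more fields), and the thresholds are made-up examples, not our production values:

```python
SAMPLE_INFO = """\
used_memory:21474836480
connected_clients:480
mem_fragmentation_ratio:1.45
rdb_last_bgsave_status:ok
"""  # illustrative subset of Redis INFO output

def parse_info(raw: str) -> dict:
    """Parse the key:value lines returned by Redis INFO."""
    out = {}
    for line in raw.splitlines():
        if ":" in line and not line.startswith("#"):
            k, _, v = line.partition(":")
            out[k] = v
    return out

def abnormal(info: dict, max_frag=1.5, max_clients=10_000) -> list:
    """Return the list of checks that should raise an alert."""
    alerts = []
    if float(info.get("mem_fragmentation_ratio", 0)) > max_frag:
        alerts.append("fragmentation")
    if int(info.get("connected_clients", 0)) > max_clients:
        alerts.append("connections")
    if info.get("rdb_last_bgsave_status") != "ok":
        alerts.append("bgsave")
    return alerts

print(abnormal(parse_info(SAMPLE_INFO)))  # [] -> healthy this hour
```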
27. Cluster masters distribution script
• A script checks whether each machine has an equal number of masters (10 in our case).
• Raises an alert if not.
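The distribution check can be driven by `CLUSTER NODES` output. A Python sketch over illustrative lines (node ids shortened; the real command prints 40-character ids):

```python
from collections import Counter

SAMPLE = """\
a1 10.0.0.1:7000@17000 myself,master - 0 0 1 connected 0-5460
b2 10.0.0.1:7001@17001 slave a1 0 0 1 connected
c3 10.0.0.2:7000@17000 master - 0 0 2 connected 5461-10922
"""  # illustrative CLUSTER NODES output

def masters_per_machine(cluster_nodes: str) -> Counter:
    """Count master nodes per host from CLUSTER NODES output.
    The third field is a comma-separated flag list (e.g. 'myself,master')."""
    counts = Counter()
    for line in cluster_nodes.splitlines():
        parts = line.split()
        if len(parts) >= 3 and "master" in parts[2].split(","):
            host = parts[1].split(":")[0]
            counts[host] += 1
    return counts

counts = masters_per_machine(SAMPLE)
print(dict(counts))  # {'10.0.0.1': 1, '10.0.0.2': 1}
# Alert whenever a machine deviates from the expected count (10 per slide):
unbalanced = [host for host, n in counts.items() if n != 10]
```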
28. Graphs and more!!!
• Stats from the “info all” command are pushed to Graphite, and graphs are created in Grafana.
• Stats pushed:
• Memory
• Number of keys
• For each type of command:
• Number of calls
• CPU time
• Connected clients
• New keys
• Persistent keys
• Input/output bytes
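Pushing to Graphite means emitting Carbon's plaintext protocol, one `<path> <value> <timestamp>` line per metric (sent to Carbon, default port 2003). A sketch; the `redis.<stack>.<node>` naming scheme is an assumption, not necessarily the scheme used here:

```python
import time

def graphite_lines(stack: str, node: str, stats: dict, now=None) -> list:
    """Render stats as Graphite plaintext-protocol lines."""
    ts = int(now if now is not None else time.time())
    prefix = f"redis.{stack}.{node}"
    return [f"{prefix}.{name} {value} {ts}" for name, value in stats.items()]

# One sample per node per push interval; values here are made up.
lines = graphite_lines("runtime", "7000",
                       {"used_memory": 21474836480, "keys": 1200000},
                       now=1500000000)
print(lines[0])  # redis.runtime.7000.used_memory 21474836480 1500000000
```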
34. Good practices
• Disabled automatic saves; nightly background saves run one node after the other.
• Ensure a TTL on keys
• Renamed (dangerous) commands
• Set a timeout for idle connections
• Tuned the ‘hz’ parameter
• Defined an application strategy for Redis slowdown/failure
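Several of these practices map directly to redis.conf directives. An illustrative fragment (the values are examples, not our production settings):

```
# Disable automatic RDB saves; trigger BGSAVE nightly, one node at a time
save ""

# Rename dangerous commands (an empty string disables a command outright)
rename-command FLUSHALL ""
rename-command CONFIG some-obscure-name

# Close idle client connections after 300 s
timeout 300

# Background task frequency: higher values expire keys more aggressively
# at the cost of CPU
hz 10
```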
36. Overall stats
• 4 clusters, plus 1 master–slave–Sentinel setup
• 160+ nodes, 2+ TB of master data
• 1 slave per master node
• 99+% of requests served under 2 ms
• The DMP stack serves more than 2 million QPS with pipelining
37. Challenges
• Tracking rogue clients
• Who deleted my data?
• Who executed this slow query?
• Is SCAN instead of KEYS helpful? Not when a client runs SCAN 0 MATCH * COUNT 1000000.
• Who modified my cluster?
• What we did for security:
• Private IPs
• iptables
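The SCAN-vs-KEYS point can be illustrated with a toy model of the cursor contract (pure Python, not the real server-side implementation): SCAN spreads the work across many bounded calls, but a rogue `SCAN 0 MATCH * COUNT 1000000` collapses back into one KEYS-sized blocking pass.

```python
def scan_like(keys, cursor=0, count=10):
    """Toy model of SCAN's cursor contract over a stable key list:
    returns (next_cursor, batch); a next_cursor of 0 means iteration is done."""
    batch = keys[cursor:cursor + count]
    next_cursor = cursor + count
    return (0 if next_cursor >= len(keys) else next_cursor, batch)

keys = [f"user:{i}" for i in range(100_000)]

# Well-behaved client: many small calls, each doing bounded work.
cursor, calls = 0, 0
while True:
    cursor, batch = scan_like(keys, cursor, count=1000)
    calls += 1
    if cursor == 0:
        break
print(calls)  # 100 calls of 1000 keys each

# Rogue client: a huge COUNT fetches everything in one call -- the same
# blocking behaviour that KEYS * has.
_, batch = scan_like(keys, 0, count=1_000_000)
print(len(batch))  # 100000
```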