1. Utilizing Redis in a high-traffic adtech stack
Rahul Babbar
Arjun Satya
Times Internet Ltd
2. About me
• Rahul Babbar
• Chief Manager – Technology, Adtech Colombia
• Times Internet Ltd
• Technology, soccer, philosophy, travel enthusiast.
3. Agenda
• About Times Internet
• About the Colombia adtech stack
• Where we use Redis
• Load testing, design decisions, cluster setup and configuration
• Monitoring and more
• Good practices
• Challenges
4. Times Internet Ltd
• Digital arm of Times Group
• 240+ million unique visitors per month.
• Evolved from a digital media company to a digital products company.
6. Colombia
• Complete adtech stack
• Ad server
• Data Management Platform (DMP)
• Demand mediation
• Recommendation Service
• Billing, automation and self service
• Powers ads on ~150 publishers; monetizes ~55% of news traffic in India.
• ~9 billion ad impressions per month.
11. Central Caching Layer
• Implements the JSR 107 (JCache) specification
• Write-through cache
• Helps keep the metadata in all ad components in sync
• Uses Redis pub/sub
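The write-through + pub/sub pattern can be sketched as follows. This is an illustrative Python sketch, not the production JSR 107 Java implementation; an in-memory `Bus` class stands in for Redis pub/sub, and a plain dict for the backing store:

```python
from collections import defaultdict

class Bus:
    """In-memory stand-in for Redis pub/sub (illustration only)."""
    def __init__(self):
        self.subscribers = defaultdict(list)
    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)
    def publish(self, channel, message):
        for cb in self.subscribers[channel]:
            cb(message)

class WriteThroughCache:
    """Each ad component holds one of these: writes go through to the
    backing store, and an invalidation message keeps every peer in sync."""
    def __init__(self, store, bus):
        self.local, self.store, self.bus = {}, store, bus
        bus.subscribe("metadata", self._on_message)
    def put(self, key, value):
        self.store[key] = value            # write through to backing store
        self.bus.publish("metadata", key)  # tell peers to refresh this key
    def get(self, key):
        if key not in self.local:          # miss -> load from backing store
            self.local[key] = self.store.get(key)
        return self.local[key]
    def _on_message(self, key):
        self.local[key] = self.store.get(key)  # refresh from source of truth

# Two components sharing one store: a write in one is visible in the other.
store, bus = {}, Bus()
a, b = WriteThroughCache(store, bus), WriteThroughCache(store, bus)
b.get("cmp:1")            # warm b's local copy (still None)
a.put("cmp:1", "active")  # write-through + pub/sub notification
print(b.get("cmp:1"))     # -> active
```

In production the publish would go over a Redis channel, so every JVM holding a local copy of the metadata refreshes the key.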
13. Data Management Platform
• User : Category : Date => frequency
• Analytics / HyperLogLog (HLL)
• Co-location of per-site data using Redis hash tagging
• Lua scripting
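Hash tagging can be made concrete with the cluster's slot function. Below is a minimal Python sketch of CRC16 (XModem) plus the hash-tag rule; the `{site42}` key names are hypothetical examples:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), the checksum Redis Cluster uses for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Redis Cluster slot: hash only the {...} hash tag if one is present."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:  # tag must be non-empty
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

# Keys sharing a hash tag land in the same slot (and hence the same master).
print(key_slot("{site42}:user:1") == key_slot("{site42}:user:2"))  # True
```

Because all keys for one site hash to the same slot, multi-key operations and Lua scripts over a site's data stay on a single node.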
14. Load testing (~2016)
• Customized for the cluster and our use case
• Tests the network
• Tests the Java clients as well
• ~20K requests per second
15. Load testing design
• Redis cluster of 15 master nodes across 3 machines (5 nodes/machine)
• Java client
• A Java client packaged as a jar file
• Use case => get the user profile and set an attribute in the user profile
• while(true){
•   execute the use case
•   print the current time as HHMMSS : average time to get : average time to set
• }
• 3 client machines
• Each client machine ran 4 instances of the Java client
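The client loop above can be sketched as follows. The real test was a Java client hitting the 15-node cluster; here dict-backed `get`/`set` stubs stand in for the cluster calls so the sketch is self-contained:

```python
import time
from statistics import mean

def load_loop(get_fn, set_fn, iterations=10_000):
    """Repeatedly run the use case (get a user profile, then set an
    attribute) and report average per-op latencies, as in the loop above."""
    get_ms, set_ms = [], []
    for i in range(iterations):
        key = f"user:{i % 1000}"
        t0 = time.perf_counter()
        get_fn(key)
        get_ms.append((time.perf_counter() - t0) * 1000)
        t0 = time.perf_counter()
        set_fn(key, "segment=sports")
        set_ms.append((time.perf_counter() - t0) * 1000)
    stamp = time.strftime("%H%M%S")
    print(f"{stamp} : avg get {mean(get_ms):.4f} ms : avg set {mean(set_ms):.4f} ms")
    return mean(get_ms), mean(set_ms)

# In-memory stubs in place of real cluster calls (assumption for the sketch).
profiles = {}
avg_get, avg_set = load_loop(profiles.get, profiles.__setitem__)
```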
17. Load test continued
• 3 client machines (4 Java client instances per machine)
• ~15,000 operations per second
• 6 client machines (4 Java client instances per machine)
• ~28,000 operations per second
• ~Linear increase (gave us confidence that a Redis cluster could work for our use case)
18. Decisions
• 512 GB memory per machine
• How many Redis nodes?
• How much memory per node?
• Number of slaves per master
• Appropriate Java client (Jedis vs. Lettuce)
19. Memory per node / nodes per machine (512 GB)
• Fewer nodes/machine => more memory/node
• 5 nodes => ~100 GB/node
• Easy to manage
• Utilizes only 5 cores
• Slow startup of all nodes on the machine
• More nodes/machine => less memory/node
• 20 nodes => ~25 GB/node
• Fast startup
• More core utilization
• Harder to manage
20. Our configuration (cluster)
• Each machine (512 GB)
• 20 nodes/machine
• 10 masters + 10 slaves per machine
• ~24 GB/node
• 7 such machines for the runtime cluster
• 6 such machines for the operational cluster
• 1 slave per master
• Jedis + Lettuce (async calls)
22. Monitoring
• All software systems will fail at some point because they depend on other systems. What matters is how fast we can detect/predict such a failure and, where possible, auto-heal it.
23. Node level monitoring
• A script runs on every machine that hosts Redis nodes.
• Every 30 seconds it checks that one Redis instance is running on each of ports 7000–7019.
• If not, it starts the instance and raises an alert.
• Limitation: if the machine itself is down, the script cannot run, so no alert is raised.
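A minimal sketch of the per-machine liveness check (Python here for brevity; the port range 7000–7019 comes from the slide, everything else is illustrative):

```python
import socket

def port_is_up(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds (cheap liveness probe)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

def nodes_down(host: str = "127.0.0.1", ports=range(7000, 7020)):
    """Ports with no listener -- these nodes get restarted, and an alert
    is raised (restart/alert hooks omitted: they are environment-specific)."""
    return [p for p in ports if not port_is_up(host, p)]

# Self-check against a local listener bound to an ephemeral port.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
live_port = srv.getsockname()[1]
print(port_is_up("127.0.0.1", live_port))  # True: this "node" is up
srv.close()
```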
24. Stack Level(Global) Monitoring
• A script runs on 2 machines.
• It tries to “set” a key and “get” a key in each stack.
• If either fails, it raises an alert.
• So if 2+ machines are down and the Redis stack fails as a whole, the “set” fails and an alert is generated.
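The global probe can be sketched as a set/get round-trip. `FakeClient` is a dict-backed stand-in so the sketch is self-contained; the real probe would use an actual cluster client:

```python
import uuid

def probe(client) -> bool:
    """Round-trip a canary key through the stack: set, then get it back.
    Any object exposing set/get (e.g. a cluster client) works here."""
    key, value = f"healthcheck:{uuid.uuid4()}", "ok"
    try:
        client.set(key, value)
        return client.get(key) == value
    except Exception:
        return False  # any failure along the way counts as unhealthy

class FakeClient:
    """Dict-backed stand-in for a Redis client (assumption for the sketch)."""
    def __init__(self): self.d = {}
    def set(self, k, v): self.d[k] = v
    def get(self, k): return self.d.get(k)

print(probe(FakeClient()))  # True -> healthy; False would raise an alert
```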
25. Hourly health stats check
• Every hour, check the following per node per stack:
• Used memory
• Number of keys
• Number of connections
• Memory fragmentation ratio
• Slow queries
• Whether the last background save succeeded
• Whether slaves are online and not lagging behind masters
• Raise an alert if any of these is abnormal.
• Email the report twice a day (10 AM, 6 PM) to confirm the script itself is still running.
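A sketch of the hourly check built on `INFO` output. `SAMPLE_INFO` is an illustrative subset (real output has many more fields), and the thresholds are made-up examples, not our production values:

```python
SAMPLE_INFO = """\
used_memory:21474836480
connected_clients:480
mem_fragmentation_ratio:1.45
rdb_last_bgsave_status:ok
"""  # illustrative subset of Redis INFO output

def parse_info(raw: str) -> dict:
    """Parse the key:value lines returned by Redis INFO."""
    out = {}
    for line in raw.splitlines():
        if ":" in line and not line.startswith("#"):
            k, _, v = line.partition(":")
            out[k] = v
    return out

def abnormal(info: dict, max_frag=1.5, max_clients=10_000) -> list:
    """Return the list of checks that should raise an alert."""
    alerts = []
    if float(info.get("mem_fragmentation_ratio", 0)) > max_frag:
        alerts.append("fragmentation")
    if int(info.get("connected_clients", 0)) > max_clients:
        alerts.append("connections")
    if info.get("rdb_last_bgsave_status") != "ok":
        alerts.append("bgsave")
    return alerts

print(abnormal(parse_info(SAMPLE_INFO)))  # [] -> healthy this hour
```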
27. Cluster masters distribution script
• A script checks whether each machine has an equal number of masters (10 in our case).
• Raises an alert if not.
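The distribution check can be driven by `CLUSTER NODES` output. A Python sketch over illustrative lines (node ids shortened; the real command prints 40-character ids):

```python
from collections import Counter

SAMPLE = """\
a1 10.0.0.1:7000@17000 myself,master - 0 0 1 connected 0-5460
b2 10.0.0.1:7001@17001 slave a1 0 0 1 connected
c3 10.0.0.2:7000@17000 master - 0 0 2 connected 5461-10922
"""  # illustrative CLUSTER NODES output

def masters_per_machine(cluster_nodes: str) -> Counter:
    """Count master nodes per host from CLUSTER NODES output.
    The third field is a comma-separated flag list (e.g. 'myself,master')."""
    counts = Counter()
    for line in cluster_nodes.splitlines():
        parts = line.split()
        if len(parts) >= 3 and "master" in parts[2].split(","):
            host = parts[1].split(":")[0]
            counts[host] += 1
    return counts

counts = masters_per_machine(SAMPLE)
print(dict(counts))  # {'10.0.0.1': 1, '10.0.0.2': 1}
# Alert whenever a machine deviates from the expected count (10 per slide):
unbalanced = [host for host, n in counts.items() if n != 10]
```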
28. Graphs and more!!!
• Stats from the “info all” command are pushed to Graphite, and graphs are created in Grafana.
• Stats pushed:
• Memory
• Number of keys
• For each type of command:
• Number of calls
• CPU time
• Connected clients
• New keys
• Persistent keys
• Input/output bytes
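Pushing to Graphite means emitting Carbon's plaintext protocol, one `<path> <value> <timestamp>` line per metric (sent to Carbon, default port 2003). A sketch; the `redis.<stack>.<node>` naming scheme is an assumption, not necessarily the scheme used here:

```python
import time

def graphite_lines(stack: str, node: str, stats: dict, now=None) -> list:
    """Render stats as Graphite plaintext-protocol lines."""
    ts = int(now if now is not None else time.time())
    prefix = f"redis.{stack}.{node}"
    return [f"{prefix}.{name} {value} {ts}" for name, value in stats.items()]

# One sample per node per push interval; values here are made up.
lines = graphite_lines("runtime", "7000",
                       {"used_memory": 21474836480, "keys": 1200000},
                       now=1500000000)
print(lines[0])  # redis.runtime.7000.used_memory 21474836480 1500000000
```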
34. Good practices
• Disabled automatic saves; nightly background saves run one node after the other.
• Ensure a TTL on keys
• Renamed (dangerous) commands
• Set a timeout for idle connections
• Tuned the ‘hz’ parameter
• Defined an application strategy for Redis slowdown/failure
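Several of these practices map directly to redis.conf directives. An illustrative fragment (the values are examples, not our production settings):

```
# Disable automatic RDB saves; trigger BGSAVE nightly, one node at a time
save ""

# Rename dangerous commands (an empty string disables a command outright)
rename-command FLUSHALL ""
rename-command CONFIG some-obscure-name

# Close idle client connections after 300 s
timeout 300

# Background task frequency: higher values expire keys more aggressively
# at the cost of CPU
hz 10
```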
36. Overall stats
• 4 clusters, plus 1 master–slave–Sentinel setup
• 160+ nodes, 2+ TB of master data
• 1 slave per master node
• 99+% of requests served under 2 ms
• The DMP stack serves more than 2 million QPS with pipelining
37. Challenges
• Tracking rogue clients
• Who deleted my data?
• Who executed this slow query?
• Is SCAN instead of KEYS helpful? Not when a client runs SCAN 0 MATCH * COUNT 1000000.
• Who modified my cluster?
• What we did for security:
• Private IPs
• iptables
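The SCAN-vs-KEYS point can be illustrated with a toy model of the cursor contract (pure Python, not the real server-side implementation): SCAN spreads the work across many bounded calls, but a rogue `SCAN 0 MATCH * COUNT 1000000` collapses back into one KEYS-sized blocking pass.

```python
def scan_like(keys, cursor=0, count=10):
    """Toy model of SCAN's cursor contract over a stable key list:
    returns (next_cursor, batch); a next_cursor of 0 means iteration is done."""
    batch = keys[cursor:cursor + count]
    next_cursor = cursor + count
    return (0 if next_cursor >= len(keys) else next_cursor, batch)

keys = [f"user:{i}" for i in range(100_000)]

# Well-behaved client: many small calls, each doing bounded work.
cursor, calls = 0, 0
while True:
    cursor, batch = scan_like(keys, cursor, count=1000)
    calls += 1
    if cursor == 0:
        break
print(calls)  # 100 calls of 1000 keys each

# Rogue client: a huge COUNT fetches everything in one call -- the same
# blocking behaviour that KEYS * has.
_, batch = scan_like(keys, 0, count=1_000_000)
print(len(batch))  # 100000
```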