The document discusses techniques for building resiliency into real-time bidding systems. It describes monitoring the system through logging, heartbeats, and metrics collection. It also covers detecting and recovering from errors through techniques like circuit breakers and bulkheads. Rollbacks and retries are suggested for data errors, while circuit breakers and failovers can help handle system integration errors.
1. No bid left behind
My day-to-day handling of a resilient real-time bidding platform in a JVM environment.
Marc de Palol
Trovit
2. Hey hi,
• Studied here (good to be back)
• Some research on supercomputing
• Moved to London, discovered Hadoop & data-intensive systems.
• Came back, still in the ‘Data Engineering’ stuff.
3. A classified search engine for property, jobs, cars, products and holiday rentals
• 180 Million ads,
• 170 TB in the cluster
• 65 Million uniques / 170 Million visits
• 10 apps (iOS, Android)
• Cool office in Barcelona.
have a look at http://www.trovit.es
4. Real Time Bidding
It’s about selling ads.
• Per impression basis.
• Programmatic instantaneous auction
5. We are using ‘DoubleClick Ad Exchange’ (Google)
• Response under 100 ms.
• If 15% of our responses are invalid or timed out, we progressively stop getting bid requests.
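One way to stay inside the response budget is to answer with a valid "no bid" whenever the real bid is not ready in time. The following is a minimal sketch, not the setup described in the talk: `DeadlineBidder`, `NO_BID`, and the 80 ms internal budget are hypothetical names and values, and the response is a plain string rather than a real exchange message.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: reply with a "no bid" when the real bid misses the deadline.
class DeadlineBidder {
    // Placeholder for an empty but valid response (assumption, not an exchange constant).
    static final String NO_BID = "NO_BID";

    // Runs computeBid asynchronously and gives up after budgetMillis.
    // The late computation is not cancelled; its result is simply ignored.
    static String bidWithin(Supplier<String> computeBid, long budgetMillis) {
        return CompletableFuture.supplyAsync(computeBid)
                .completeOnTimeout(NO_BID, budgetMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> NO_BID) // a failed bid is also answered with a no-bid
                .join();
    }

    public static void main(String[] args) {
        // 80 ms leaves some headroom under the 100 ms limit (assumed value).
        System.out.println(bidWithin(() -> "bid:0.42", 80));
    }
}
```

`completeOnTimeout` requires Java 9 or later. A late or failed bid costs one impression, while an invalid or timed-out response counts toward the 15% threshold.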
16. • Logging with ‘mailAppender’
log4j.appender.mail=org.apache.log4j.net.SMTPAppender
log4j.appender.mail.SMTPHost=localhost
log4j.appender.mail.From=Error <error-bla@trovit.com>
log4j.appender.mail.To=tech@trovit.com, ceo@trovit.com
log4j.appender.mail.Subject=[ERROR] WE ARE GOING TO DIE
log4j.appender.mail.layout=org.apache.log4j.PatternLayout
log4j.appender.mail.threshold=ERROR
17. • Logging with ‘mailAppender’
Probably, no e-mail when you’ve got an OOM.
24. • Logging with ‘mailAppender’
• Bad when OOM.
• Heartbeat
• Doing some real work
• Supervision with actors
• If you’re using Akka
• control flow != data flow
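One reading of "heartbeat, doing some real work" is that the heartbeat should only count when a real probe succeeds, for example a synthetic request pushed through the same path as live traffic. A minimal sketch under that assumption; `WorkingHeartbeat` and its methods are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: the heartbeat is recorded only when a real probe succeeds.
class WorkingHeartbeat {
    private final BooleanSupplier probe; // e.g. a synthetic request through the real pipeline
    private final AtomicLong lastBeatMillis = new AtomicLong(-1); // -1 means "never beat"

    WorkingHeartbeat(BooleanSupplier probe) {
        this.probe = probe;
    }

    // Call periodically, for example from a ScheduledExecutorService.
    void beat(long nowMillis) {
        boolean ok;
        try {
            ok = probe.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false; // a throwing probe counts as a missed beat
        }
        if (ok) {
            lastBeatMillis.set(nowMillis);
        }
    }

    // Monitor side: alive only if a successful beat happened recently enough.
    boolean isAlive(long nowMillis, long maxSilenceMillis) {
        long last = lastBeatMillis.get();
        return last >= 0 && nowMillis - last <= maxSilenceMillis;
    }
}
```

A heartbeat that only proves the process is running can stay green while the bidder itself is stuck; tying it to a probe avoids that.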
30. Bad data in the system
and / or
Errors in the system
31. Data errors.
Roll back (when possible)
• Keeping different versions in the DB.
• Keep the old version around.
• Know how to do a rollback.
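Keeping the old version around can be sketched in memory as a small version stack. This is an illustration only: `VersionedStore` is a hypothetical class, and a real system would keep the versions in the DB as the slide says.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: keep previous versions of a data set so a bad load can be undone.
class VersionedStore<T> {
    private final Deque<T> versions = new ArrayDeque<>(); // newest version on top
    private final int maxVersions;

    VersionedStore(int maxVersions) {
        if (maxVersions < 2) {
            throw new IllegalArgumentException("need room for at least one old version");
        }
        this.maxVersions = maxVersions;
    }

    // Makes newVersion the current one and drops the oldest versions beyond the limit.
    synchronized void publish(T newVersion) {
        versions.push(newVersion);
        while (versions.size() > maxVersions) {
            versions.removeLast();
        }
    }

    // Returns the current version, or null if nothing was published yet.
    synchronized T current() {
        return versions.peek();
    }

    // Discards the current version and returns the previous one.
    synchronized T rollback() {
        if (versions.size() < 2) {
            throw new IllegalStateException("no older version to roll back to");
        }
        versions.pop();
        return versions.peek();
    }
}
```

The explicit failure in `rollback()` reflects the "when possible" caveat: with only one version left there is nothing to go back to.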
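33. Checks & asserts with Google Guava.
import static com.google.common.base.Preconditions.*;
// Throws IllegalArgumentException when an argument is invalid.
checkArgument(i >= 0,
    "Argument was %s but expected nonnegative", i);
checkArgument(i < j,
    "Expected i < j, but %s >= %s", i, j);
// Throws NullPointerException when the reference is null.
checkNotNull(myList,
    "List should not be null");
// Throws IllegalStateException when the object is in a bad state.
checkState(object.isValid(),
    "Object is not valid");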
34. System errors
These happen mostly at the integration points between systems.
• Your code and the DB.
• Your code and the 3rd party library.
• Your code and the queue.
35. DBs, a necessary supervillain
• Lost connection.
• Timeouts
• Can give you corrupted data.
• Can give you 0 data.
• Can give you too much data.
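A defensive wrapper around a query can cover several of these cases at once. A minimal sketch, assuming that serving the last known good data is acceptable; `GuardedQuery`, `fetchOrKeep`, and the row bounds are hypothetical, and the query is a plain `Supplier` rather than a real DB call.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: accept a query result only if it looks plausible.
class GuardedQuery {
    // Returns the fresh rows when their count is within [minRows, maxRows];
    // otherwise, or when the query throws, keeps the last known good data.
    static <T> List<T> fetchOrKeep(Supplier<List<T>> query, List<T> lastGood,
                                   int minRows, int maxRows) {
        try {
            List<T> rows = query.get();
            if (rows == null || rows.size() < minRows || rows.size() > maxRows) {
                return lastGood; // "0 data" or "too much data"
            }
            return rows;
        } catch (RuntimeException e) {
            return lastGood; // lost connection, timeout, and similar failures
        }
    }
}
```

Row counts do not catch corrupted values; those still need per-field checks such as the Guava preconditions.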
41. Once the circuit breaker is open,
• Notify
• Try again! maybe.
• Try to avoid DoS-ing your own system.
• Exponential retry.
• Failover
• Restart
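The points above can be combined in a small breaker that fails fast while open, probes the dependency again only after a delay, and doubles that delay after every failed probe. This is a sketch, not the implementation used in the talk: `BackoffCircuitBreaker`, its thresholds, and the explicit `nowMillis` parameter are all assumptions made for illustration.

```java
import java.util.function.Supplier;

// Hypothetical sketch: circuit breaker whose retry delay doubles after each failed probe.
class BackoffCircuitBreaker {
    private final int failureThreshold;
    private final long baseDelayMillis;
    private final long maxDelayMillis;
    private int consecutiveFailures = 0;
    private long currentDelayMillis;
    private long retryAtMillis = 0;
    private boolean open = false;

    BackoffCircuitBreaker(int failureThreshold, long baseDelayMillis, long maxDelayMillis) {
        this.failureThreshold = failureThreshold;
        this.baseDelayMillis = baseDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
        this.currentDelayMillis = baseDelayMillis;
    }

    synchronized boolean isOpen() {
        return open;
    }

    // Simplification: the lock is held during the call, so calls are serialized.
    synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback, long nowMillis) {
        if (open && nowMillis < retryAtMillis) {
            return fallback.get(); // fail fast: do not hammer a broken dependency
        }
        try {
            T result = primary.get();
            open = false; // success closes the breaker and resets the backoff
            consecutiveFailures = 0;
            currentDelayMillis = baseDelayMillis;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open) {
                // Failed probe: wait twice as long before trying again, up to the cap.
                currentDelayMillis = Math.min(currentDelayMillis * 2, maxDelayMillis);
                retryAtMillis = nowMillis + currentDelayMillis;
            } else if (consecutiveFailures >= failureThreshold) {
                open = true; // a notification would be sent here
                retryAtMillis = nowMillis + currentDelayMillis;
            }
            return fallback.get(); // failover path
        }
    }
}
```

The `fallback` supplier stands in for the failover bullet: it could read from a replica, serve cached data, or return a no-bid.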
42. Some other bits and pieces:
• Tight coupling leads to fast propagation of errors.
• Event driven stuff
• Complete parameter checking
• Avoid SPOFs (single points of failure). Pretty please.
• Stateless is better.
• Bounded queues!
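On the last bullet: an unbounded queue keeps growing under overload until memory runs out, while a bounded one forces a decision when it is full. A minimal sketch using `ArrayBlockingQueue` from the JDK; `BoundedInbox` and `tryAccept` are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: a request inbox that sheds load instead of growing without limit.
class BoundedInbox {
    private final BlockingQueue<String> queue;

    BoundedInbox(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Returns false immediately when the queue is full, so the caller can
    // drop the request or answer with a no-bid instead of blocking.
    boolean tryAccept(String request) {
        return queue.offer(request);
    }

    // Returns the oldest queued request, or null when the queue is empty.
    String poll() {
        return queue.poll();
    }
}
```

Rejecting at the door keeps latency predictable for the requests that are accepted, which matters when every response has a hard deadline.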