O slideshow foi denunciado.
Utilizamos seu perfil e dados de atividades no LinkedIn para personalizar e exibir anúncios mais relevantes. Altere suas preferências de anúncios quando desejar.

Nonconformist Resilience: DB-backed Job Queues

129 visualizações

Publicada em

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2xNzh3d.

John Mileham presents how they use, deploy, and even love Delayed::Job (a database-backed job queue) at Betterment for its transactional enqueue semantics, safe retry with exponential backoff, and its storage model, which lends itself to simple but powerful SLA-based monitoring and alerting. DJ enables engineers to pour their creativity into their features and get resilience by default. Filmed at qconnewyork.com.

John Mileham is VP of Architecture at Betterment. After a decade building industry-defining products like Berklee Online, he decided to switch to FinTech where he could build industry-defining products that must also protect people's life savings and most sensitive personal information, first at ImpulseSave, and now Betterment.

Publicada em: Tecnologia
  • Seja o primeiro a comentar

  • Seja a primeira pessoa a gostar disto

Nonconformist Resilience: DB-backed Job Queues

  1. 1. Nonconformist Resilience: Database-backed Job Queues John Mileham | @jmileham
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ betterment-delayed-job
  3. 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. User Signup with Email Confirmation
  5. 5. User Signup with Email Confirmation A feature so easy we’re still fighting about how to do it in 2017
  6. 6. Requirements: Validate the user’s profile information Store the user record to the database Email a link When the link is clicked, mark the user as verified
  7. 7. Requirements: Validate the user’s profile information Store the user record to the database Email a link When the link is clicked, mark the user as verified
  8. 8. Take 1: Inline the email delivery
  9. 9. Take 1: Inline the email delivery … but it’s slow
  10. 10. Take 2: Spin off a thread or use a thread pool
  11. 11. Take 2: Spin off a thread or use a thread pool … but it’s unreliable
  12. 12. Take 3: Use a grown-up message bus
  13. 13. Take 3: Use a grown-up message bus … but it’s unreliable?
  14. 14. Commit-then-Enqueue
  15. 15. Commit to DB App Timeline Enqueue to bus Customer Timeline Deliver to ESP Request Response Email
  16. 16. Commit to DB App Timeline Enqueue to bus Customer Timeline Deliver to ESP Request Response Email
  17. 17. Enqueue-then-Commit
  18. 18. Enqueue to bus App Timeline Commit to DB Customer Timeline Deliver to ESP Request Response Email
  19. 19. Enqueue to bus App Timeline Commit to DB Customer Timeline Deliver to ESP Request Email Response
  20. 20. Enqueue to bus App Timeline Commit to DB Customer Timeline Deliver to ESP Request Email Build Email App Timeline Response
  21. 21. You could make the enqueue and the database commit atomic via a distributed transaction manager, but: ● Mature, robust, distributed transaction managers aren’t available for all platforms ● They’re usually proprietary ● Even where they exist, these tools have nuanced configuration, can have operational warts and are another subsystem that requires care and feeding ● The additional network pings necessary to coordinate the commit between the datastores can cause write performance problems Distributed Transactions
  22. 22. Take 4: Use the database as a queue
  23. 23. Take 4: Use the database as a queue … but it won’t scale
  24. 24. Commit & Enqueue App Timeline Customer Timeline Deliver to ESP Request Response Email
  25. 25. Commit & Enqueue App Timeline Customer Timeline Deliver to ESP Request Response Email
  26. 26. Commit & Enqueue App Timeline Customer Timeline Deliver to ESP Request Response Email
  27. 27. Robust By Default
  28. 28. Addressing the pitfalls Because everything is a tradeoff
  29. 29. DJ: Retry with Exponential Backoff Two key columns: run_at, and attempts. ● Jobs are picked up oldest first ● Only jobs with a run_at in the past are workable ● When a job fails, a new future run_at is calculated from the previous run_at and number of previous attempts ● After too many failures (days later), a job will stop being attempted
  30. 30. Message Bus Solution: DLQs Messages don’t have a desired delivery time in a message bus, so exponential backoff isn’t feasible. Message delivery will be attempted a preconfigured number of times, and then transferred to a dead-letter queue, or a cascading set of queues to approximate exponential backoff.
  31. 31. DJ: Priority Delayed::Job will work off the highest priority first. Pickup is simply a matter of sorting on priority and then run_at. We use priority to establish different service level objectives for different kinds of work. Allows developer not to worry about resourcing their jobs, leaning into DJ. Allows DJ to fully utilize its worker capacity.
  32. 32. Message Bus Solution: Topics Message busses can’t as easily support priority. To assure resource availability for important work, work is shunted to a specific topic or queue with its own resource pool. Strong assurance that one job type won’t exhaust resources of another type. But you must resource each topic individually.
  33. 33. DJ’s got topics too ;) Even though it’s not the only way to organize work, if you have a mission critical work stream that must be processed no matter what, you can use a specialized queue to keep its workers separate. Opt in for as much control as you need, only when you need it.
  34. 34. More Featureful, Not Less
  35. 35. Betterment’s Schema Users Deposits Bank Accounts Goals Investing Accounts Auto Deposits State- ments
  36. 36. Betterment’s Schema Users Deposits Bank Accounts Goals Investing Accounts Auto Deposits State- ments
  37. 37. Power Law Distribution
  38. 38. The Message Bus Isn’t a Silver Bullet
  39. 39. Coordinated Polling ● Your application chooses a global polling interval, say a half second. ● Every active worker process inserts itself into an active_workers table with a last_active_at timestamp and maintains it every 30 seconds or so. ● Every few seconds, each worker queries the number of recently active workers. ● It then multiplies the global polling interval by the number of workers and adds random jitter to prevent thundering herds ● Your app converges on the desired polling interval at arbitrary worker scale
  40. 40. When is a DB-backed queue the right tool?
  41. 41. 1. Should your app use a DB at all? You should be using an ACID SQL DB if: ● You have a read-heavy usage pattern ● You value agility in supporting new use cases ● You aren’t launching directly into #webscale ● Or even if you are, your app doesn’t exist primarily to solve a graph problem ○ if you’re going big and still want to use SQL, your dataset must inherently shardable
  42. 42. 2. Are your clients human? If clients are interacting with your app like humans, i.e.: ● They do individual operations at a reasonable pace ● The don’t generate batches of 10,000 operations at once Then you’re looking still looking good.
  43. 43. 3. Are Your Bulk Operations Cool? ● Are there relatively few of them? ● Are they customer experience-impacting? ● Are they no more than daily?
  44. 44. All Yes? All Set.
  45. 45. Operating a DB-backed Queue
  46. 46. Alerting Needs Two key alerts: 1. Max attempt count 2. Max age Both metrics are partitioned by job priority.
  47. 47. Max Attempt Count Total backoff time function: n == 0 ? 0 : n ** 4 + 5 + backoff(n-1) ● First retry in 6 seconds ● Third retry in 2 minutes ● Fifth retry in 16 minutes ● Tenth retry in 7 hours ● Twentieth retry in 8 days Our thresholds: ● INTERACTIVE errors after 2 attempts (~30 seconds) ● EVENTUAL errors after 8 attempts (~2.5 hours)
  48. 48. Max Age Age is defined as now() - run_at.
  49. 49. This is your brain on DJ
  50. 50. (your message may vary)
  51. 51. Why not just use DJ?
  52. 52. Why not just use DJ?
  53. 53. Date Author June 27th, 2017 John Mileham | @jmileham (is hiring)
  54. 54. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ betterment-delayed-job

×