Building an event system on top of MongoDB


  1. BUILDING A MISSION CRITICAL EVENT SYSTEM ON TOP OF MONGODB, by @shahar_kedar
  2. BIGPANDA: a SaaS platform that lets companies aggregate alerts from all their monitoring systems into one place for faster incident discovery and response.
  3. HOW IT WORKS
     Events:
     • High CPU on prod-srv-1, 18/06/14 16:05, CRITICAL
     • High CPU on prod-srv-1, 18/06/14 16:07, WARNING
     • Memory usage on prod-srv-1, 18/06/14 16:08, CRITICAL
     Entities:
     • High CPU on prod-srv-1: WARNING
     • Memory usage on prod-srv-1: CRITICAL
     Incidents:
     • 2 Alerts on prod-srv-1
  4. PRODUCT REQUIREMENTS
     • Events need to be processed into incidents and streamed to the user’s browser as fast as possible
     • Incidents need to reliably reflect the state as it is in the monitoring system
     • The service has to be up and running 24x7
  5. MISSION CRITICAL
     • It’s not rocket science, it’s not Google, but:
       • It has to be super fast
       • It has to be extremely reliable
       • It has to always be available
  6. OUR #1 COMPETITOR
  7. WHY MONGO? BECAUSE IT’S WEB SCALE!
  8. WHY MONGO?
     At first:
     • NodeJS shop
     • Schemaless
     • Easy to master
     Later on:
     • Reliable
     • Easy to evolve
     • Partial and atomic updates
     • Powerful query language
     BECAUSE IT’S WEB SCALE!
  9. SUPER FAST: Hardware, Schema Design, Lean & Stream
  10. HARDWARE
      Timeline:
      • 03/13: 3 x m1.medium
      • 02/14: 1 x i2.xlarge + 2 x m1.medium
      • 06/14: 2 x i2.xlarge + 1 x m3.xlarge
      Instance types:
      • m1.medium: 1 vCPU, 3.75GB RAM, EBS drive
      • m3.xlarge: 4 vCPUs, 15GB RAM, EBS drive
      • i2.xlarge: 4 vCPUs, 30.5GB RAM, SSD 800GB
      x3 reads, x4 writes
  11. “Schema design is … the largest factor when it comes to performance and scalability … more important than hardware, how you shard, or anything else, schema is by far the most important thing.” (Eliot Horowitz)
  12. SCHEMA DESIGN
      Event {
        timestamp: Date,
        status: String,
        description: String
      }
      Entity {
        start: Date,
        end: Date,
        status: String,
        description: String,
        events: [ <embedded> ],
        source_system: String
      }
      Incident {
        start: Date,
        end: Date,
        is_active: Boolean,
        description: String,
        entities: [ { entityId: ObjectId, status: String } ]
      }
  13. DENORMALIZATION
      • Go over the checklist (http://bit.ly/1vUdz2T)
      • Incidents => Entities: partially embedded + ref
        • Cardinality: one-to-few
        • Direct access to Entities
        • Entities are frequently updated
      • Entities => Events: embedded
        • Events are not directly accessed
        • Events are immutable
        • Cardinality: one-to-many ~ one-to-gazillion
  14. INDEXES
      • Optimized indexes, checking query plans with db.collection.find({..}).explain()
      • Removed redundant indexes
      • Truncated events collections (TTL index)
  15. LEAN QUERIES
      • Use projections to limit the fields returned by a query:
        Model.find().select('-events')
      • Mongoose users: use .lean() when possible for a performance boost of more than 50%:
        Model.find().lean()
      • Stream results:
        Model.find().stream().on('data', function(doc){})
  16. RESULTS
      • Average latency of all API calls went from 500ms to under 20ms
      • Average latency of the full pipeline went from 2s to under 500ms
      • Peak-time latency of the full pipeline went down from 5m (!!) to less than 30s
  17. EXTREMELY RELIABLE: Atomic & Partial Updates
  18. ATOMIC & PARTIAL UPDATES
      • Several services might try to update the same document at the same time, but:
        • Different systems update different parts of the document
        • Updates to the same document are sharded and ordered at the application level (read our awesome blog post: http://bit.ly/1nQVcbS)
  19. IMPOSSIBLE TO KILL: Replica Set, Disaster Recovery
  20. REPLICA SET
      • 3-node replica set
      • Using priorities to enforce master election of stronger nodes
      • Deployed across different availability zones
  21. DISASTER RECOVERY
      • Cold backup using MMS Backup
      • Full production replication in another EC2 region: using MongoDB’s replication mechanism to continuously sync data to the backup region
  22. THANK YOU!
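
The two points on slide 18 (services touch different parts of a document, and updates to the same document are sharded and ordered at the application level) can be illustrated with a minimal sketch. This is not BigPanda's implementation: the worker count, the hash function, the `enqueueUpdate` helper, and the in-memory queues are all assumptions made for illustration. The only MongoDB feature relied on is the `$set` update operator, which modifies just the named fields.

```javascript
// Hypothetical worker count; the slides do not say how many workers exist.
const WORKER_COUNT = 4;

// One FIFO queue per worker (in-memory stand-ins for real worker processes).
const queues = Array.from({ length: WORKER_COUNT }, () => []);

// Deterministic string hash, so a given document id always maps to the same worker.
function workerFor(docId, workerCount = WORKER_COUNT) {
  let hash = 0;
  for (const ch of String(docId)) {
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash % workerCount;
}

// Describe a partial update: `$set` names only the fields this caller owns,
// so two services writing different fields do not overwrite each other.
function enqueueUpdate(docId, fields) {
  const op = { filter: { _id: docId }, update: { $set: fields } };
  queues[workerFor(docId)].push(op); // same id -> same queue -> arrival order kept
  return op;
}

// Two hypothetical services update different parts of the same entity.
enqueueUpdate('entity-1', { status: 'CRITICAL' });
enqueueUpdate('entity-2', { status: 'WARNING' });
enqueueUpdate('entity-1', { end: new Date('2014-06-18T16:10:00Z') });

const forEntity1 = queues[workerFor('entity-1')]
  .filter((op) => op.filter._id === 'entity-1');
console.log(forEntity1.map((op) => Object.keys(op.update.$set)));
```

In a real deployment each queued operation would be passed to something like `collection.updateOne(op.filter, op.update)` by the worker that owns the queue. Because every update for a given `_id` lands on the same queue, the order in which updates arrive is the order in which they are applied.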

Editor's Notes

  • For each customer:
    aggregate alert notifications from multiple monitoring systems
    group together alerts that belong to the same monitored appliance
    group together, into “incidents”, alerts that are (topo-)logically related
