Distributed Data Systems

A brief overview of distributed data systems in the context of analytics ingestion data pipelines.

  1. Distributed Data Systems: How Do They Even?
  2. About Me - Jared L Kerim - Software Developer (Python) - Mozilla Geolocation Cloud Services Team - CTO at PressureNET
  3. PressureNET (Shameless Plug) - Gathers sensor data from smartphones - Constant stream of data to servers - API to retrieve data - Visualization - Analysis
  4. The First Architecture: Sensors → Web Servers → MySQL → API
  5. The Problem: MySQL - Slow lookups - Takes a lot of disk space - Cost (large relational DBs are expensive) - Schema changes (become slow or impossible)
  6. How Big is “Big”? - PressureNET: 100 req/s, 1.5 billion records - Analytics Systems: 5000 req/s, 100s of billions of records - Ad Buying Service: 500k req/s, trillions of records
  7. The Question: What is ????: Sensors → Web Servers → ???? → API
  8. What do we want to accomplish? - Receive and store large amounts of data - Access it quickly - Small, fast lookups (visualization) - Large batch computations (MapReduce)
  9. Considerations - Durability (we don’t want to lose data) - Redundancy (expect failures!) - Scalability (simple growth, no upper limit)
  10. Durability - Data in a durable store should be ‘safe’ - Don’t remove data from one durable data store until it is confirmed to be in another durable data store - Durable data stores should have redundant backups (hot standbys)
  11. Redundancy - Each stage of your system should have multiple copies - If one copy goes down, another should take over - Redundancy ensures availability
  12. Scalability - The rate of data intake can grow or spike - Your system should be able to add more resources to handle that growth - Requires that your workload be partitionable
  13. Proposed Architecture: Sensors → Ingestors → Queue → Aggregator → S3, DynamoDB
  14. We Are Not Alone - This architecture is widely adopted - Analytics - Ad Serving/Views - Log Analysis - Sensor Data - Game Events - Video Events
  15. Ingestors - A redundant, scalable set of nodes which receive data over HTTP - Can apply early validation and authentication - Stateless, low latency
  16. Queue - A scalable, durable storage mechanism for data ‘in flight’ - Only holds data temporarily - Typically preserves the order in which data was received
  17. Aggregator - A scalable, stateless set of workers which consume data from the queue - Can process data in small batches - Write raw or transformed data to persistent storage such as S3, databases, etc.
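
The ingestor, queue, and aggregator stages described in the slides can be sketched roughly as follows. This is a minimal in-memory illustration, not the deck's implementation: `InMemoryQueue`, `ingest`, `aggregate`, and the `sensor_id` and `reading` field names are all hypothetical, and a plain Python list stands in for persistent storage such as S3. The part worth noticing is the durability rule from the slides: a batch is removed from the queue only after the write to the store has been confirmed.

```python
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in for a durable queue; preserves arrival order."""

    def __init__(self):
        self._items = deque()

    def put(self, record):
        self._items.append(record)

    def peek_batch(self, max_size):
        # Return up to max_size records WITHOUT removing them from the queue.
        return list(self._items)[:max_size]

    def ack(self, count):
        # Remove records only after the caller confirms they are stored.
        for _ in range(count):
            self._items.popleft()

    def __len__(self):
        return len(self._items)


def ingest(queue, payload):
    """Stateless ingestor: apply early validation, then enqueue.

    The required fields are an assumption made for this example.
    """
    if not isinstance(payload, dict):
        return False
    if "sensor_id" not in payload or "reading" not in payload:
        return False
    queue.put(payload)
    return True


def aggregate(queue, store, batch_size=2):
    """Stateless worker: move queued records into `store` in small batches."""
    written = 0
    while len(queue):
        batch = queue.peek_batch(batch_size)
        start = len(store)
        store.extend(batch)  # write the batch to "persistent" storage
        if store[start:] != batch:  # confirm the write landed
            raise RuntimeError("write not confirmed; batch left in queue")
        queue.ack(len(batch))  # only now remove the batch from the queue
        written += len(batch)
    return written
```

To try it, create a queue and an empty list, call `ingest(queue, {"sensor_id": "a", "reading": 1013.2})` a few times, then call `aggregate(queue, store)`. Because neither `ingest` nor `aggregate` keeps state between calls, several copies of each could run side by side against a shared queue, which is what lets these stages be redundant and scalable in the sense the slides describe.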