Elastic Data Analytics Platform @Datadog

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.

Doug Daniels discusses the cloud-based platform they have built at Datadog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.

Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.

Published in: Technology

Elastic Data Analytics Platform @Datadog

  1. 1. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/datadog-cloud
  2. 2. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  3. 3. The Evolution of a Data Project
  4. 4. The Evolution of a Data Project Python script
  5. 5. The Evolution of a Data Project Python script SQL on live DB
  6. 6. The Evolution of a Data Project Python script SQL on reporting DB SQL on live DB
  7. 7. The Evolution of a Data Project Python script SQL on reporting DB SQL on live DB Terrible confusion
  8. 8. The Evolution of a Data Project Python script SQL on reporting DB SQL on live DB Terrible confusion Hadoop / Spark cluster
  9. 9. What needs fixing image: Pexels
  10. 10. What needs fixing image: Pexels • One cluster: data lock-in.
  11. 11. What needs fixing image: Pexels • One cluster: data lock-in. • Want cluster time? You have to wait.
  12. 12. What needs fixing image: Pexels • One cluster: data lock-in. • Want cluster time? You have to wait. • Clusters are underutilized and EXPENSIVE
  13. 13. Elastic Big Data Platform @ Datadog Doug Daniels Director, Engineering
  14. 14. What’s our big data platform do? WHOM: Data Engineers, Data Scientists
  15. 15. What’s our big data platform do? WHOM: Data Engineers, Data Scientists • do WHAT: App features, Statistical Analysis/ML, Ad-hoc investigation
  16. 16. What’s our big data platform do? WHOM: Data Engineers, Data Scientists • do WHAT: App features, Statistical Analysis/ML, Ad-hoc investigation • WITH: Spark, Hadoop (Pig), Python (Luigi)
  17. 17. Exploring the platform COPIOUS TOOLING CLOUD STORAGE ELASTIC COMPUTE
  18. 18. CLOUD STORAGE
  19. 19. What do we store?
  20. 20. 150 Integrations …and more
  21. 21. What’s time series data? timestamp 1447020511 metric system.cpu.idle value 98.16687 tags host:i-xyz, role:cassandra, …
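The data point on the slide above can be modelled as a small record. This is a minimal sketch: the `Point` class, its field names, and the `tag_dict` helper are illustrative assumptions, not Datadog's internal format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Point:
    """One time series sample, mirroring the four fields on the slide."""
    timestamp: int                 # Unix epoch seconds
    metric: str                    # metric name, e.g. "system.cpu.idle"
    value: float                   # measured value
    tags: Tuple[str, ...] = ()     # "key:value" strings

    def tag_dict(self) -> Dict[str, str]:
        # Split each "key:value" tag once, on the first colon.
        return dict(tag.split(":", 1) for tag in self.tags)


point = Point(1447020511, "system.cpu.idle", 98.16687,
              ("host:i-xyz", "role:cassandra"))
```

Keeping tags as plain `key:value` strings matches how they appear on the slide; `tag_dict` is only a convenience for lookups.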
  22. 22. We collect over a trillion of these per day …and growing!
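To put "a trillion per day" in perspective, a rough conversion to a per-second rate helps. The calculation assumes exactly one trillion points spread evenly over the day, so it is a lower bound on the average rate, not a measured figure.

```python
POINTS_PER_DAY = 1_000_000_000_000   # "over a trillion", taken as a lower bound
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400

points_per_second = POINTS_PER_DAY / SECONDS_PER_DAY
print(f"{points_per_second:,.0f} points per second")  # roughly 11.6 million
```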
  23. 23. Where to put the petabytes? Amazon S3. Amazon S3
  24. 24. How data gets to S3 116 - Buffer - Sort + Dedupe - Upload GO - Partition + Sort - Write Parquet - Update Metastore LUIGI/SPARK/PIG HIVE METASTORE Internal Format AMAZON S3 Parquet Metadata Kafka
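The first stage on the slide above (buffer, sort + dedupe, upload) can be sketched in a few lines. The slide shows that stage written in Go; Python is used here only for illustration. The record layout, the choice of `(metric, timestamp)` as the dedupe key, and the `upload` callback are assumptions made for this sketch.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[int, str, float]  # (timestamp, metric, value); hypothetical layout


def sort_and_dedupe(buffer: Iterable[Record]) -> List[Record]:
    """Keep the last value seen per (metric, timestamp), sorted by that key."""
    latest: Dict[Tuple[str, int], float] = {}
    for timestamp, metric, value in buffer:
        latest[(metric, timestamp)] = value   # later duplicates overwrite earlier ones
    return [(ts, metric, value) for (metric, ts), value in sorted(latest.items())]


def flush(buffer: Iterable[Record], upload: Callable[[List[Record]], None]) -> int:
    """Sort and dedupe the buffered records, then hand one batch to `upload`."""
    batch = sort_and_dedupe(buffer)
    upload(batch)          # stand-in for the real upload to object storage
    return len(batch)
```

Sorting before upload means each uploaded object is internally ordered, which makes the later partition and sort step cheaper.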
  25. 25. Isn’t this a job for HDFS?
  26. 26. What we don’t love about HDFS
  27. 27. What we don’t love about HDFS • Causes the “one cluster” problem
  28. 28. What we don’t love about HDFS • Causes the “one cluster” problem • Come for the storage, get stuck with the servers
  29. 29. What we don’t love about HDFS • Causes the “one cluster” problem • Come for the storage, get stuck with the servers • No Java? No data!
  30. 30. S3 is flexible! • Read data from as many clusters as you want
  31. 31. S3 is flexible! • Read data from as many clusters as you want • Store unlimited stuff(*) with no management * Accepting laws of physics and your credit card limit
  32. 32. S3 is flexible! • Read data from as many clusters as you want • Store unlimited stuff(*) with no management • Rock solid: durability (99.999999999%), availability (99.99%) * Accepting laws of physics and your credit card limit
  33. 33. S3 is flexible! • Read data from as many clusters as you want • Store unlimited stuff(*) with no management • Rock solid: durability (99.999999999%), availability (99.99%) • Access from any programming language * Accepting laws of physics and your credit card limit
  34. 34. Decouple data and compute (BREAK THE RULES!)
  35. 35. Breaking the rules is fine. In benchmarks: S3 is ~2X slower than HDFS
  36. 36. Breaking the rules is fine. In benchmarks: S3 is ~2X slower than HDFS
  37. 37. It’s not all roses
  38. 38. Listing is slooooow (A CAUTIONARY TALE)
  39. 39. How to fix slow listing • Bigger files • Parallelize it
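Parallelizing a listing usually means splitting the key space into prefixes and listing each prefix concurrently. The sketch below uses an in-memory set of keys as a stand-in for a bucket; `list_prefix`, `parallel_list`, and the key names are hypothetical, and a real implementation would call the storage API inside `list_prefix`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Set


def list_prefix(store: Set[str], prefix: str) -> List[str]:
    """Stand-in for one slow, sequential listing call on a single prefix."""
    return sorted(key for key in store if key.startswith(prefix))


def parallel_list(store: Set[str], prefixes: Iterable[str],
                  max_workers: int = 8) -> List[str]:
    """List every prefix concurrently and concatenate results in prefix order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_prefix = pool.map(lambda prefix: list_prefix(store, prefix), prefixes)
        return [key for keys in per_prefix for key in keys]
```

Threads suit this pattern because each listing call spends its time waiting on the network. Fewer, bigger files reduce the number of keys to list in the first place.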
  40. 40. No way to quickly move data HDFS: Task → write → Intermediate → atomic move → Final
  41. 41. No way to quickly move data HDFS: Task → write → Intermediate → atomic move → Final S3: Task → write
  42. 42. No way to quickly move data • Say goodbye to speculative execution
  43. 43. No way to quickly move data • Say goodbye to speculative execution • Say hello to better task timeouts
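Without an atomic move, two speculative attempts of the same task can both write to the final location, which is why speculation gets turned off. A minimal sketch of what that might look like in configuration follows. The property names are the standard Spark and Hadoop MapReduce ones, but whether the talk's platform sets exactly these is an assumption, and `spark_submit_args` is a hypothetical helper.

```python
from typing import Dict, List

# Standard Spark property; disabling it stops duplicate task attempts.
SPARK_SETTINGS: Dict[str, str] = {"spark.speculation": "false"}

# Standard Hadoop MapReduce properties with the same effect for Pig jobs.
HADOOP_SETTINGS: Dict[str, str] = {
    "mapreduce.map.speculative": "false",
    "mapreduce.reduce.speculative": "false",
}


def spark_submit_args(settings: Dict[str, str]) -> List[str]:
    """Render settings as `--conf key=value` pairs for spark-submit."""
    args: List[str] = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```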
  44. 44. But really: We 💜S3 This is a great system. ✓ Data accessible from many clusters ✓ Storage is easy to manage ✓ It’s a multi-language paradise up in here
  45. 45. ELASTIC COMPUTE CLOUD STORAGE
  46. 46. One cluster to compute it all TRADITIONALLY
  47. 47. Instead, we run many, many clusters • New cluster for every automated job • 10–20 clusters at a time • Median lifetime: 2hrs
  48. 48. Why so many clusters?
  49. 49. Total isolation We know what’s happening and why
  50. 50. No more waiting on loaded clusters • Tailor each cluster to the work you want to do • Scale up when you need results faster • Data scientists and data engineers don’t have to wait 🕐🕓🕥
  51. 51. Pick the best hardware for each job • c3 for CPU-bound jobs • r3 for memory-bound jobs • m1.xlarge if you don’t care (cheap!) == ~30% savings over general purpose hardware
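The hardware choice above amounts to a lookup from job profile to instance family: c3 (compute-optimized) for CPU-bound work, r3 (memory-optimized) for memory-bound work, and m1.xlarge otherwise. The profile labels and the `pick_instance` helper are illustrative assumptions.

```python
from typing import Optional

# Job profile -> EC2 instance family named on the slide.
INSTANCE_FOR_PROFILE = {
    "cpu": "c3",      # compute-optimized family
    "memory": "r3",   # memory-optimized family
}
DEFAULT_INSTANCE = "m1.xlarge"   # the cheap "don't care" option


def pick_instance(profile: Optional[str] = None) -> str:
    """Return the instance choice for a job profile, defaulting to the cheap one."""
    return INSTANCE_FOR_PROFILE.get(profile, DEFAULT_INSTANCE)
```

Because every job gets its own cluster, this choice can be made per job instead of once for a shared cluster.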
  52. 52. 100% spot-instance clusters, all the time.* * (ok, most of the time)
  53. 53. 100% spot-instance clusters, all the time.* * (ok, most of the time) Ridiculous savings! Disappearing clusters!
  54. 54. How we do spot clusters • Bid the on-demand price, pay the spot price In the big data platform
  55. 55. How we do spot clusters • Bid the on-demand price, pay the spot price • Fallback to on-demand instances if you can’t get spot In the big data platform
  56. 56. How we do spot clusters • Bid the on-demand price, pay the spot price • Fallback to on-demand instances if you can’t get spot • Monitor everything: jobs, clusters, spot market In the big data platform
  57. 57. How we do spot clusters • Bid the on-demand price, pay the spot price • Fallback to on-demand instances if you can’t get spot • Monitor everything: jobs, clusters, spot market • ☞ Save up to 80% off the on-demand price In the big data platform
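The bidding rule above can be sketched as a small decision function: bid the on-demand price, pay the (lower) spot price when the bid clears, and fall back to on-demand otherwise. The function names and the prices in the usage note are hypothetical.

```python
from typing import Tuple


def choose_capacity(spot_price: float, on_demand_price: float,
                    spot_available: bool = True) -> Tuple[str, float]:
    """Return (market, hourly price paid) under the bid-at-on-demand rule."""
    bid = on_demand_price                      # never bid above on-demand
    if spot_available and spot_price <= bid:
        return "spot", spot_price              # you pay the spot price, not the bid
    return "on-demand", on_demand_price        # fallback when spot is unavailable


def savings(price_paid: float, on_demand_price: float) -> float:
    """Fraction saved relative to the on-demand price."""
    return 1 - price_paid / on_demand_price
```

With a made-up spot price of 0.07 against an on-demand price of 0.35, `choose_capacity(0.07, 0.35)` picks spot and `savings(0.07, 0.35)` comes out at about 0.8, in line with the "up to 80%" on the slide.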
  58. 58. Switch hardware when the market gets volatile Monitor the spot price
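One way to act on spot-price monitoring is to measure how much each instance type's price has been moving and avoid the volatile ones. This sketch uses the coefficient of variation as the volatility measure. The threshold, the instance type names, and the price histories are assumptions, not the method used in the talk.

```python
from statistics import mean, pstdev
from typing import Dict, List


def volatility(prices: List[float]) -> float:
    """Coefficient of variation: standard deviation relative to the mean."""
    return pstdev(prices) / mean(prices)


def pick_stable_type(history: Dict[str, List[float]],
                     max_volatility: float = 0.25) -> str:
    """Cheapest type among the calm ones; least volatile type if none is calm."""
    calm = [t for t, prices in history.items() if volatility(prices) <= max_volatility]
    if calm:
        return min(calm, key=lambda t: mean(history[t]))
    return min(history, key=lambda t: volatility(history[t]))
```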
  59. 59. We like this strategy a lot! Cluster is oversubscribed; everyone waiting in line to do their work Lots of expensive hardware sits idle when everyone’s gone ✓ No waiting for the cluster you need ✓ No waste from hardware sitting idle ✓ Spot clusters are affordable enough to use everywhere
  60. 60. What’s challenging, though?
  61. 61. Many things that disappear.
  62. 62. ELASTIC COMPUTE CLOUD STORAGE COPIOUS TOOLING
  63. 63. Web and APIs Platform as a service CLI Jobs, Clusters, Schedules, Users, Code, Monitoring, Logs, and more
  64. 64. Big Data Platform Architecture DATA Amazon S3
  65. 65. Big Data Platform Architecture DATA Amazon S3 CLUSTER EMR
  66. 66. Big Data Platform Architecture DATA Amazon S3 CLUSTER EMR WORKER Pig Workers Spark Workers Luigi Workers
  67. 67. Big Data Platform Architecture DATA Amazon S3 CLUSTER EMR WORKER Pig Workers Spark Workers Luigi Workers STORAGE Metadata DB Queueing Logs
  68. 68. Big Data Platform Architecture DATA Amazon S3 CLUSTER EMR WEB Web API WORKER Pig Workers Spark Workers Luigi Workers STORAGE Metadata DB Queueing Logs
  69. 69. Big Data Platform Architecture DATA Amazon S3 CLUSTER EMR WEB Web API WORKER Pig Workers Spark Workers Luigi Workers USER CLI API Clients Job Scheduler STORAGE Metadata DB Queueing Logs
  70. 70. Big Data Platform Architecture DATA Amazon S3 CLUSTER EMR WEB Web API WORKER Pig Workers Spark Workers Luigi Workers USER CLI API Clients Job Scheduler STORAGE Metadata DB Queueing Logs Datadog Monitoring
  71. 71. How to find the right cluster when they disappear?
  72. 72. Cluster tagging for discovery #anomaly-detection #monitor-report
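Tag-based discovery means looking a cluster up by what it does, not by an ID that changes every run. A minimal sketch, assuming a hypothetical in-memory registry of cluster records with `id`, `state`, and `tags` fields:

```python
from typing import Dict, List


def find_clusters(clusters: List[Dict], tag: str) -> List[str]:
    """Return IDs of running clusters that carry the given tag."""
    return [
        cluster["id"]
        for cluster in clusters
        if tag in cluster["tags"] and cluster["state"] == "RUNNING"
    ]


# Hypothetical registry contents; the IDs are made up.
registry = [
    {"id": "j-AAA", "state": "RUNNING", "tags": {"anomaly-detection"}},
    {"id": "j-BBB", "state": "TERMINATED", "tags": {"anomaly-detection"}},
    {"id": "j-CCC", "state": "RUNNING", "tags": {"monitor-report"}},
]
```

Here `find_clusters(registry, "anomaly-detection")` returns only the running cluster, so callers never need to know which short-lived cluster is current.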
  73. 73. How to monitor many disappearing clusters?
  74. 74. Dashboards Monitors Dynamic Monitoring on Tags anomaly-detection cluster_tags: anomaly-detection
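A monitor defined on a tag keeps working as clusters come and go, because it aggregates over whatever clusters currently carry the tag. The sketch below illustrates the idea with plain dictionaries. The sample layout, the `failed_jobs` field, and the threshold are assumptions and not the Datadog monitor API.

```python
from collections import defaultdict
from typing import Dict, List


def failures_by_tag(samples: List[Dict]) -> Dict[str, int]:
    """Sum failed-job counts per cluster tag, ignoring cluster identity."""
    totals: Dict[str, int] = defaultdict(int)
    for sample in samples:
        for tag in sample["cluster_tags"]:
            totals[tag] += sample["failed_jobs"]
    return dict(totals)


def triggered(samples: List[Dict], threshold: int = 0) -> List[str]:
    """Tags whose total failures exceed the threshold."""
    return sorted(tag for tag, count in failures_by_tag(samples).items()
                  if count > threshold)
```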
  75. 75. How to debug problems when the cluster’s gone?
  76. 76. Debugging In a Post-Cluster World
  77. 77. Debugging In a Post-Cluster World Send all logs to S3 • HDFS • YARN • Pig • Spark
  78. 78. Debugging In a Post-Cluster World Visualize the pipeline • Lipstick for Pig • Spark History Server • Luigi task flow Send all logs to S3 • HDFS • YARN • Pig • Spark
  79. 79. Debugging In a Post-Cluster World Visualize the pipeline • Lipstick for Pig • Spark History Server • Luigi task flow Preserve historical monitoring data Keep history, by tag, after the cluster disappears Send all logs to S3 • HDFS • YARN • Pig • Spark
  80. 80. How to handle certain cluster failure in your jobs?
  81. 81. Automatic cleanup and restart Luigi: design for failure. A B
  82. 82. Automatic cleanup and restart Luigi: design for failure. B
  83. 83. Automatic cleanup and restart Luigi: design for failure. ❌
  84. 84. Automatic cleanup and restart Luigi: design for failure.
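The "design for failure" idea can be shown with a tiny workflow runner: a task counts as complete only if its output exists, outputs are written atomically, and a rerun skips whatever already finished. This is a simplified stand-in written for illustration and is not the Luigi API; the `Task` class and `build` function are hypothetical.

```python
import os
from typing import List, Optional, Sequence


class Task:
    """Minimal workflow task: complete if and only if its output file exists."""

    def __init__(self, name: str, output_path: str,
                 requires: Sequence["Task"] = ()):
        self.name = name
        self.output_path = output_path
        self.requires = tuple(requires)

    def complete(self) -> bool:
        return os.path.exists(self.output_path)

    def run(self) -> None:
        # Write to a temp file, then rename, so a crash never leaves a
        # partial output that would look complete on restart.
        tmp_path = self.output_path + ".tmp"
        with open(tmp_path, "w") as handle:
            handle.write(self.name)
        os.replace(tmp_path, self.output_path)


def build(task: Task, ran: Optional[List[str]] = None) -> List[str]:
    """Run dependencies first, skip completed tasks, return names actually run."""
    ran = [] if ran is None else ran
    for dependency in task.requires:
        build(dependency, ran)
    if not task.complete():
        task.run()
        ran.append(task.name)
    return ran
```

If task B's output is lost when a cluster disappears, calling `build` again reruns only B, because A's output is still in place.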
  85. 85. ELASTIC COMPUTE CLOUD STORAGE COPIOUS TOOLING
  86. 86. Recommendations for Cloud Big Data
  87. 87. Recommendations for Cloud Big Data • Use S3 for permanent data, not HDFS
  88. 88. Recommendations for Cloud Big Data • Use S3 for permanent data, not HDFS • Start from EMR if building yourself
  89. 89. Recommendations for Cloud Big Data • Use S3 for permanent data, not HDFS • Start from EMR if building yourself • Look into a PaaS: Netflix Genie, Qubole, Databricks
  90. 90. Recommendations for Cloud Big Data • Use S3 for permanent data, not HDFS • Start from EMR if building yourself • Look into a PaaS: Netflix Genie, Qubole, Databricks • Tag your clusters for dynamic monitoring
  91. 91. Recommendations for Cloud Big Data • Use S3 for permanent data, not HDFS • Start from EMR if building yourself • Look into a PaaS: Netflix Genie, Qubole, Databricks • Tag your clusters for dynamic monitoring • Design for failure with a workflow tool (Luigi, Airflow)
  92. 92. Thanks! Want to work with us on Spark, Hadoop, Kafka, Parquet, and more? jobs.datadoghq.com DM me @ddaniels888 or doug@datadoghq.com
