SlideShare a Scribd company logo
1 of 36
Download to read offline
Scaling Monitoring At Databricks From
Prometheus to M3
YY Wan & Nick Lanham
Virtual M3 Day 2/18/21
Introduction
Nick Lanham
Senior Software Engineer
Observability Team
YY Wan
Software Engineer
Observability Team
About
● Founded in 2013 by the original creators of Apache Spark
● Data and AI platform as a service for 5000+ customers
● 1500+ employees, 400+ engineers, >$400M annual recurring revenue
● 3 cloud providers, 50+ regions
● Launching millions of VMs / day to run data engineering and ML workloads, processing exabytes of
data
Agenda
● Monitoring at Databricks before M3
● Deploying M3
○ Architecture
○ Migration
● Lessons Learned
○ Operational advice
○ Things to monitor
○ Updates and upgrades
Monitoring At Databricks
Before M3
Monitoring At Databricks
● Monitoring targets:
○ Cloud-native, majority of services run on Kubernetes
○ Customer Spark workloads run on VMs in customer environments
● Prometheus-based monitoring since 2016
● All service teams use metrics, dashboards, alerts
○ Most engineers are PromQL-literate
● Use-cases: real-time alerting, debugging, SLO reporting, automated event response
● Monitoring and data-drivenness are core to Databricks engineering culture
Prometheus Monitoring System
Scale Numbers
● 50+ regions / k8s clusters across multiple cloud providers
● 100+ microservices
● Infrastructure footprint of 4M+ VMs of Databricks services and customer Apache Spark workers
● Largest single Prometheus instance
○ 900k samples / sec
○ Churn rate: many metrics with only < 100 samples (i.e. metrics from short-lived Spark jobs persist for only < 100
minutes at 1 min scrape interval)
○ Disk usage (15d retention): 4TB
○ Huge AWS VM: x1e.16xlarge, 64 core, 1952GB RAM
Scaling Bottlenecks & Pain Points
Operational
● Frequent capacity issues - OOMs, high disk usage
● Multi-hour Prometheus updates (long WAL recovery process during startup)
UX
● Mental overhead of sharded view of metrics
● Big queries never completing (and causes OOMs)
● Short retention period
● Subject to strict metric whitelist
Searching for a Scalable Monitoring Solution
Requirements:
● High metric volume, cardinality, churn rate
● Minimum 90d retention
● Compatible with PromQL
● Global view of (some) metrics
● High availability setup
Nice-to-have:
● Good update and maintenance story - less manual intervention, no metrics gaps
● Battle-tested in large scale production environment
● Open source
(Mid-2019) Alternatives considered: sharded Prometheus, Thanos, Cortex, Datadog, SignalFx
Why ?
● Fulfilled all our hard requirements
○ Designed for large scale workloads and horizontally scalable
○ Exposes Prometheus API query endpoint
○ High availability with multi-replica setup
○ Designed for multi-region and cloud setup, with global querying feature
● Battle-tested at high scale at Uber in a production environment
● Has a kubernetes operator for automated cluster operations
● Cool features that we would be interested to use
○ Aggregation on ingest
○ Downsampling (potentially longer retention)
Deploying M3
Initial Plan
Making the Write Path Scalable
Building Our Own Rule Engine
Zooming In On M3 Setup
Separating M3 Coordinator Groups
Monitoring M3 & Final Architecture
Migration
Migration
1. Shadow deployment
○ Dual-write metrics to both Prom and M3 storage
○ Evaluate alerts in using both Prom and M3 rule engine
○ Open a querying endpoint for Observability team to test queries and dashboarding
2. Behavior validation
○ Compare alert evaluation between old and new system
○ Compare dashboards side-by-side
3. Incremental rollout strategy
○ Percentage-based rollout of ad-hoc query traffic to M3, staged across environments
○ Per-service rollout of alert evaluation
4. Final outcome: All ad-hoc query traffic and alerts served from M3
Switching Over Ad-Hoc Querying Traffic
Per-Service Migration of Alerts
Outcome
● 1-yr migration (mid-2019 to mid-2020)
● M3 runs as the sole metrics provider in all environments across clouds
○ (beta) Global query endpoint available via M3 for all metrics
● User experience largely unchanged (PromQL everything)
● Retention is widely 90d
● Migration went pretty smoothly, avoided major outages
● Higher confidence to continue scaling metrics workloads into upcoming years
● No more giant VMs with 2TB RAM!!
Lessons Learned
M3 From The Trenches
● System metrics to monitor
● General operational advice
● What to alert on
● How we do updates/upgrades
Overview
● Overall m3 has been amazingly stable
○ By far our biggest issue is running out of disk space
● Across more than 50 deployments only a few have been problematic
○ We'll dive into why, and how to avoid it
M3 at Databricks
● Large number of clusters means things HAVE to be automated
○ We use a combination of spinnaker and jenkins to kubectl apply templates
● About 900k samples per second in large clusters
● About 200k series read per second in large clusters
Key Metrics to Watch
● Memory used (alert if steadily over 60%)
○ We've seen that spikes can cause OOMs if you're consistently over this
○ Resolve by
■ Scale up cluster, or reduce incoming metric load
○ sum(container_memory_rss{filter}) by (kubernetes_pod_name)
● Disk space used (alert if predict_linear full in 14 days)
○ 14 days seems long, but it gives us plenty of time to provision new nodes and allow data to migrate
○ Resolve by
■ Scale up cluster, reduce retention, reduce incoming metric load
○ (kubelet_volume_stats_capacity_bytes{filter} - kubelet_volume_stats_available_bytes{}) /
kubelet_volume_stats_capacity_bytes{}
● Cluster scale-up can be slow
○ Be sure to test how long it takes in your cluster
General Advice
● Avoid a lot of custom things
○ As close to what the operator expects is the best
● Observe query rates and set limits
● Have a good testing env
○ Need to iterate quickly
○ Be able to throw away data
○ Try to have it at scale
● Have a look at the M3 dashboards and learn what things mean
○ https://grafana.com/grafana/dashboards/8126
○ Very dev focused, suggest making your own with key metrics
Other Alerting
● high latency ingesting samples: coordinator_ingest_latency_bucket
● rate(coordinator_write_errors{code! = '4XX'}[1m])
● rate(coordinator_fetch_errors{code! = '4XX'}[1m])
● high out of order samples:
○ rate(database_tick_merged_out_of_order_blocks[5m]) > X
○ this can help catch double scrapes
■ Due to pull based arch, this can cause false alerts
○ inhibit during node startup
Upgrades / Updates
● So far very smooth from compatibility standpoint
○ Only seen one small query eval regression
○ Just did the 1.0 update, also smooth
■ Some api changes
● We manage this via spinnaker + jenkins
○ One pain point here is lack of fully self driving updates (i.e. only kubectl apply)
■ Is actually now available
○ Requires us to be vigilant to ensure our configs and m3db versions stay in sync
● Suggestion: Have a readiness check for coordinators
○ Restarting many at the same time can make k8s unhappy
○ Requires setting a connect consistency on the coordinator config
Metric Spikes
For any high volume system, you will need a way to deal with spikes.
For example: A service adds a label with exploding cardinality
● Have a way to identify the source of the spike
● Be able to cut off that source easily
○ Preferable to OOMing your cluster
Capacity
● Brief overview of capacity planning at Databricks
● We've found that one m3db replica per 50,000 incoming time-series works pretty well
○ We are write heavy
● For same workload need about 50 write coordinators in two deployments (100 total)
Future Work
Some examples of nifty new things M3 will enable us to do now that we're getting operationally mature
● Downsampling for older metrics
○ Expect a significant savings in disk space
● Using different namespaces for metrics with different requirements
● Allowing direct push into M3 from difficult to scrape services
○ E.g. databricks jobs, developer laptops
Conclusion
● Overall a successful migration for us
● Community has been helpful
● Nice new things on the horizon

More Related Content

What's hot

Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkDatabricks
 
Wix's ML Platform
Wix's ML PlatformWix's ML Platform
Wix's ML PlatformRan Romano
 
Opentelemetry - From frontend to backend
Opentelemetry - From frontend to backendOpentelemetry - From frontend to backend
Opentelemetry - From frontend to backendSebastian Poxhofer
 
Lynx project overview (H2020)
Lynx project overview (H2020)Lynx project overview (H2020)
Lynx project overview (H2020)Lynx Project
 
MLOps with serverless architectures (October 2018)
MLOps with serverless architectures (October 2018)MLOps with serverless architectures (October 2018)
MLOps with serverless architectures (October 2018)Julien SIMON
 
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...Neo4j
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019confluent
 
How to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-SourceHow to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-SourceDatabricks
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With PrometheusKnoldus Inc.
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Modeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsModeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsMitul Tiwari
 
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Ed Fernandez
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Cortex: Horizontally Scalable, Highly Available Prometheus
Cortex: Horizontally Scalable, Highly Available PrometheusCortex: Horizontally Scalable, Highly Available Prometheus
Cortex: Horizontally Scalable, Highly Available PrometheusGrafana Labs
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)James Serra
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Databricks
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
 

What's hot (20)

Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering Framework
 
Wix's ML Platform
Wix's ML PlatformWix's ML Platform
Wix's ML Platform
 
Opentelemetry - From frontend to backend
Opentelemetry - From frontend to backendOpentelemetry - From frontend to backend
Opentelemetry - From frontend to backend
 
Lynx project overview (H2020)
Lynx project overview (H2020)Lynx project overview (H2020)
Lynx project overview (H2020)
 
MLOps with serverless architectures (October 2018)
MLOps with serverless architectures (October 2018)MLOps with serverless architectures (October 2018)
MLOps with serverless architectures (October 2018)
 
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
 
How to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-SourceHow to Build a ML Platform Efficiently Using Open-Source
How to Build a ML Platform Efficiently Using Open-Source
 
Kubeflow
KubeflowKubeflow
Kubeflow
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Modeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsModeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systems
 
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
 
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Cortex: Horizontally Scalable, Highly Available Prometheus
Cortex: Horizontally Scalable, Highly Available PrometheusCortex: Horizontally Scalable, Highly Available Prometheus
Cortex: Horizontally Scalable, Highly Available Prometheus
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 

Similar to Scaling Monitoring At Databricks From Prometheus to M3

Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedShubham Tagra
 
Rally--OpenStack Benchmarking at Scale
Rally--OpenStack Benchmarking at ScaleRally--OpenStack Benchmarking at Scale
Rally--OpenStack Benchmarking at ScaleMirantis
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedShubham Tagra
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...NETWAYS
 
PHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
PHP At 5000 Requests Per Second: Hootsuite’s Scaling StoryPHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
PHP At 5000 Requests Per Second: Hootsuite’s Scaling Storyvanphp
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
How to Design for Database High Availability
How to Design for Database High AvailabilityHow to Design for Database High Availability
How to Design for Database High AvailabilityEDB
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...InfluxData
 
Google Cloud - Stand Out Features
Google Cloud - Stand Out FeaturesGoogle Cloud - Stand Out Features
Google Cloud - Stand Out FeaturesGDG Cloud Bengaluru
 
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...Mydbops
 
IBM MQ - better application performance
IBM MQ - better application performanceIBM MQ - better application performance
IBM MQ - better application performanceMarkTaylorIBM
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Slack in the Age of Prometheus
Slack in the Age of PrometheusSlack in the Age of Prometheus
Slack in the Age of PrometheusGeorge Luong
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesAlexander Penev
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
What's New in Alluxio 2.3
What's New in Alluxio 2.3What's New in Alluxio 2.3
What's New in Alluxio 2.3Alluxio, Inc.
 
Big data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsBig data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsClaudiu Coman
 

Similar to Scaling Monitoring At Databricks From Prometheus to M3 (20)

Enabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speedEnabling Presto to handle massive scale at lightning speed
Enabling Presto to handle massive scale at lightning speed
 
Rally--OpenStack Benchmarking at Scale
Rally--OpenStack Benchmarking at ScaleRally--OpenStack Benchmarking at Scale
Rally--OpenStack Benchmarking at Scale
 
Enabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speedEnabling presto to handle massive scale at lightning speed
Enabling presto to handle massive scale at lightning speed
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
 
PHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
PHP At 5000 Requests Per Second: Hootsuite’s Scaling StoryPHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
PHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
How to Design for Database High Availability
How to Design for Database High AvailabilityHow to Design for Database High Availability
How to Design for Database High Availability
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
 
Google Cloud - Stand Out Features
Google Cloud - Stand Out FeaturesGoogle Cloud - Stand Out Features
Google Cloud - Stand Out Features
 
Managing 600 instances
Managing 600 instancesManaging 600 instances
Managing 600 instances
 
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...
Mastering MongoDB Atlas: Essentials of Diagnostics and Debugging in the Cloud...
 
IBM MQ - better application performance
IBM MQ - better application performanceIBM MQ - better application performance
IBM MQ - better application performance
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Slack in the Age of Prometheus
Slack in the Age of PrometheusSlack in the Age of Prometheus
Slack in the Age of Prometheus
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
What's New in Alluxio 2.3
What's New in Alluxio 2.3What's New in Alluxio 2.3
What's New in Alluxio 2.3
 
Big data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsBig data @ Hootsuite analtyics
Big data @ Hootsuite analtyics
 

More from LibbySchulze

Running distributed tests with k6.pdf
Running distributed tests with k6.pdfRunning distributed tests with k6.pdf
Running distributed tests with k6.pdfLibbySchulze
 
Extending Kubectl.pptx
Extending Kubectl.pptxExtending Kubectl.pptx
Extending Kubectl.pptxLibbySchulze
 
Enhancing Data Protection Workflows with Kanister And Argo Workflows
Enhancing Data Protection Workflows with Kanister And Argo WorkflowsEnhancing Data Protection Workflows with Kanister And Argo Workflows
Enhancing Data Protection Workflows with Kanister And Argo WorkflowsLibbySchulze
 
Fallacies in Platform Engineering.pdf
Fallacies in Platform Engineering.pdfFallacies in Platform Engineering.pdf
Fallacies in Platform Engineering.pdfLibbySchulze
 
Intro to Fluvio.pptx.pdf
Intro to Fluvio.pptx.pdfIntro to Fluvio.pptx.pdf
Intro to Fluvio.pptx.pdfLibbySchulze
 
Enhance your Kafka Infrastructure with Fluvio.pptx
Enhance your Kafka Infrastructure with Fluvio.pptxEnhance your Kafka Infrastructure with Fluvio.pptx
Enhance your Kafka Infrastructure with Fluvio.pptxLibbySchulze
 
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdf
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdfCNCF On-Demand Webinar_ LitmusChaos Project Updates.pdf
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdfLibbySchulze
 
Oh The Places You'll Sign.pdf
Oh The Places You'll Sign.pdfOh The Places You'll Sign.pdf
Oh The Places You'll Sign.pdfLibbySchulze
 
Rancher MasterClass - Avoiding-configuration-drift.pptx
Rancher  MasterClass - Avoiding-configuration-drift.pptxRancher  MasterClass - Avoiding-configuration-drift.pptx
Rancher MasterClass - Avoiding-configuration-drift.pptxLibbySchulze
 
vFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptx
vFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptxvFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptx
vFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptxLibbySchulze
 
CNCF Live Webinar: Low Footprint Java Containers with GraalVM
CNCF Live Webinar: Low Footprint Java Containers with GraalVMCNCF Live Webinar: Low Footprint Java Containers with GraalVM
CNCF Live Webinar: Low Footprint Java Containers with GraalVMLibbySchulze
 
EnRoute-OPA-Integration.pdf
EnRoute-OPA-Integration.pdfEnRoute-OPA-Integration.pdf
EnRoute-OPA-Integration.pdfLibbySchulze
 
AirGap_zusammen_neu.pdf
AirGap_zusammen_neu.pdfAirGap_zusammen_neu.pdf
AirGap_zusammen_neu.pdfLibbySchulze
 
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...LibbySchulze
 
OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr...
OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr...OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr...
OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr...LibbySchulze
 
CNCF_ A step to step guide to platforming your delivery setup.pdf
CNCF_ A step to step guide to platforming your delivery setup.pdfCNCF_ A step to step guide to platforming your delivery setup.pdf
CNCF_ A step to step guide to platforming your delivery setup.pdfLibbySchulze
 
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdf
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdfCNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdf
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdfLibbySchulze
 
Securing Windows workloads.pdf
Securing Windows workloads.pdfSecuring Windows workloads.pdf
Securing Windows workloads.pdfLibbySchulze
 
Securing Windows workloads.pdf
Securing Windows workloads.pdfSecuring Windows workloads.pdf
Securing Windows workloads.pdfLibbySchulze
 
Advancements in Kubernetes Workload Identity for Azure
Advancements in Kubernetes Workload Identity for AzureAdvancements in Kubernetes Workload Identity for Azure
Advancements in Kubernetes Workload Identity for AzureLibbySchulze
 

More from LibbySchulze (20)

Running distributed tests with k6.pdf
Running distributed tests with k6.pdfRunning distributed tests with k6.pdf
Running distributed tests with k6.pdf
 
Extending Kubectl.pptx
Extending Kubectl.pptxExtending Kubectl.pptx
Extending Kubectl.pptx
 
Enhancing Data Protection Workflows with Kanister And Argo Workflows
Enhancing Data Protection Workflows with Kanister And Argo WorkflowsEnhancing Data Protection Workflows with Kanister And Argo Workflows
Enhancing Data Protection Workflows with Kanister And Argo Workflows
 
Fallacies in Platform Engineering.pdf
Fallacies in Platform Engineering.pdfFallacies in Platform Engineering.pdf
Fallacies in Platform Engineering.pdf
 
Intro to Fluvio.pptx.pdf
Intro to Fluvio.pptx.pdfIntro to Fluvio.pptx.pdf
Intro to Fluvio.pptx.pdf
 
Enhance your Kafka Infrastructure with Fluvio.pptx
Enhance your Kafka Infrastructure with Fluvio.pptxEnhance your Kafka Infrastructure with Fluvio.pptx
Enhance your Kafka Infrastructure with Fluvio.pptx
 
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdf
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdfCNCF On-Demand Webinar_ LitmusChaos Project Updates.pdf
CNCF On-Demand Webinar_ LitmusChaos Project Updates.pdf
 
Oh The Places You'll Sign.pdf
Oh The Places You'll Sign.pdfOh The Places You'll Sign.pdf
Oh The Places You'll Sign.pdf
 
Rancher MasterClass - Avoiding-configuration-drift.pptx
Rancher  MasterClass - Avoiding-configuration-drift.pptxRancher  MasterClass - Avoiding-configuration-drift.pptx
Rancher MasterClass - Avoiding-configuration-drift.pptx
 
vFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptx
vFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptxvFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptx
vFunction Konveyor Meetup - Why App Modernization Projects Fail - Aug 2022.pptx
 
CNCF Live Webinar: Low Footprint Java Containers with GraalVM
CNCF Live Webinar: Low Footprint Java Containers with GraalVMCNCF Live Webinar: Low Footprint Java Containers with GraalVM
CNCF Live Webinar: Low Footprint Java Containers with GraalVM
 
EnRoute-OPA-Integration.pdf
EnRoute-OPA-Integration.pdfEnRoute-OPA-Integration.pdf
EnRoute-OPA-Integration.pdf
 
AirGap_zusammen_neu.pdf
AirGap_zusammen_neu.pdfAirGap_zusammen_neu.pdf
AirGap_zusammen_neu.pdf
 
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...
 
OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr...
OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr...OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr...
OTel Me All About OpenTelemetry The Current & Future State, Navigating the Pr...
 
CNCF_ A step to step guide to platforming your delivery setup.pdf
CNCF_ A step to step guide to platforming your delivery setup.pdfCNCF_ A step to step guide to platforming your delivery setup.pdf
CNCF_ A step to step guide to platforming your delivery setup.pdf
 
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdf
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdfCNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdf
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdf
 
Securing Windows workloads.pdf
Securing Windows workloads.pdfSecuring Windows workloads.pdf
Securing Windows workloads.pdf
 
Securing Windows workloads.pdf
Securing Windows workloads.pdfSecuring Windows workloads.pdf
Securing Windows workloads.pdf
 
Advancements in Kubernetes Workload Identity for Azure
Advancements in Kubernetes Workload Identity for AzureAdvancements in Kubernetes Workload Identity for Azure
Advancements in Kubernetes Workload Identity for Azure
 

Recently uploaded

一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样ayvbos
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查ydyuyu
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge GraphsEleniIlkou
 
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call GirlsMira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call GirlsPriya Reddy
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirtrahman018755
 
Abu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu DhabiAbu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu DhabiMonica Sydney
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfJOHNBEBONYAP1
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdfMatthew Sinclair
 
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime NagercoilNagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoilmeghakumariji156
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"growthgrids
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查ydyuyu
 
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制pxcywzqs
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Roommeghakumariji156
 
一比一原版田纳西大学毕业证如何办理
一比一原版田纳西大学毕业证如何办理一比一原版田纳西大学毕业证如何办理
一比一原版田纳西大学毕业证如何办理F
 
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...kumargunjan9515
 
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime BalliaBallia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Balliameghakumariji156
 
Call girls Service in Ajman 0505086370 Ajman call girls
Call girls Service in Ajman 0505086370 Ajman call girlsCall girls Service in Ajman 0505086370 Ajman call girls
Call girls Service in Ajman 0505086370 Ajman call girlsMonica Sydney
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdfMatthew Sinclair
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdfMatthew Sinclair
 

Recently uploaded (20)

一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call GirlsMira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
Abu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu DhabiAbu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
Abu Dhabi Escorts Service 0508644382 Escorts in Abu Dhabi
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime NagercoilNagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
 
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
 
一比一原版田纳西大学毕业证如何办理
一比一原版田纳西大学毕业证如何办理一比一原版田纳西大学毕业证如何办理
一比一原版田纳西大学毕业证如何办理
 
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...
 
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime BalliaBallia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
 
Call girls Service in Ajman 0505086370 Ajman call girls
Call girls Service in Ajman 0505086370 Ajman call girlsCall girls Service in Ajman 0505086370 Ajman call girls
Call girls Service in Ajman 0505086370 Ajman call girls
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 

Scaling Monitoring At Databricks From Prometheus to M3

  • 1. Scaling Monitoring At Databricks From Prometheus to M3 YY Wan & Nick Lanham Virtual M3 Day 2/18/21
  • 2. Introduction Nick Lanham Senior Software Engineer Observability Team YY Wan Software Engineer Observability Team
  • 3. About ● Founded in 2013 by the original creators of Apache Spark ● Data and AI platform as a service for 5000+ customers ● 1500+ employees, 400+ engineers, >$400M annual recurring revenue ● 3 cloud providers, 50+ regions ● Launching millions of VMs / day to run data engineering and ML workloads, processing exabytes of data
  • 4. Agenda ● Monitoring at Databricks before M3 ● Deploying M3 ○ Architecture ○ Migration ● Lessons Learned ○ Operational advice ○ Things to monitor ○ Updates and upgrades
  • 6. Monitoring At Databricks ● Monitoring targets: ○ Cloud-native, majority of services run on Kubernetes ○ Customer Spark workloads run on VMs in customer environments ● Prometheus-based monitoring since 2016 ● All service teams use metrics, dashboards, alerts ○ Most engineers are PromQL-literate ● Use-cases: real-time alerting, debugging, SLO reporting, automated event response ● Monitoring and data-drivenness are core to Databricks engineering culture
  • 8. Scale Numbers ● 50+ regions / k8s clusters across multiple cloud providers ● 100+ microservices ● Infrastructure footprint of 4M+ VMs of Databricks services and customer Apache Spark workers ● Largest single Prometheus instance ○ 900k samples / sec ○ Churn rate: many metrics with only < 100 samples (i.e. metrics from short-lived Spark jobs persist for only < 100 minutes at 1 min scrape interval) ○ Disk usage (15d retention): 4TB ○ Huge AWS VM: x1e.16xlarge, 64 core, 1952GB RAM
  • 9. Scaling Bottlenecks & Pain Points Operational ● Frequent capacity issues - OOMs, high disk usage ● Multi-hour Prometheus updates (long WAL recovery process during startup) UX ● Mental overhead of sharded view of metrics ● Big queries never completing (and causes OOMs) ● Short retention period ● Subject to strict metric whitelist
  • 10. Searching for a Scalable Monitoring Solution Requirements: ● High metric volume, cardinality, churn rate ● Minimum 90d retention ● Compatible with PromQL ● Global view of (some) metrics ● High availability setup Nice-to-have: ● Good update and maintenance story - less manual intervention, no metrics gaps ● Battle-tested in large scale production environment ● Open source (Mid-2019) Alternatives considered: sharded Prometheus, Thanos, Cortex, Datadog, SignalFx
  • 11. Why ? ● Fulfilled all our hard requirements ○ Designed for large scale workloads and horizontally scalable ○ Exposes Prometheus API query endpoint ○ High availability with multi-replica setup ○ Designed for multi-region and cloud setup, with global querying feature ● Battle-tested at high scale at Uber in a production environment ● Has a kubernetes operator for automated cluster operations ● Cool features that we would be interested to use ○ Aggregation on ingest ○ Downsampling (potentially longer retention)
  • 14. Making the Write Path Scalable
  • 15. Building Our Own Rule Engine
  • 16. Zooming In On M3 Setup
  • 18. Monitoring M3 & Final Architecture
  • 20. Migration 1. Shadow deployment ○ Dual-write metrics to both Prom and M3 storage ○ Evaluate alerts in using both Prom and M3 rule engine ○ Open a querying endpoint for Observability team to test queries and dashboarding 2. Behavior validation ○ Compare alert evaluation between old and new system ○ Compare dashboards side-by-side 3. Incremental rollout strategy ○ Percentage-based rollout of ad-hoc query traffic to M3, staged across environments ○ Per-service rollout of alert evaluation 4. Final outcome: All ad-hoc query traffic and alerts served from M3
  • 21. Switching Over Ad-Hoc Querying Traffic
  • 23. Outcome ● 1-yr migration (mid-2019 to mid-2020) ● M3 runs as the sole metrics provider in all environments across clouds ○ (beta) Global query endpoint available via M3 for all metrics ● User experience largely unchanged (PromQL everything) ● Retention is widely 90d ● Migration went pretty smoothly, avoided major outages ● Higher confidence to continue scaling metrics workloads into upcoming years ● No more giant VMs with 2TB RAM!!
  • 25. M3 From The Trenches ● System metrics to monitor ● General operational advice ● What to alert on ● How we do updates/upgrades
  • 26. Overview ● Overall m3 has been amazingly stable ○ By far our biggest issue is running out of disk space ● Across more than 50 deployments only a few have been problematic ○ We'll dive into why, and how to avoid it
  • 27. M3 at Databricks ● Large number of clusters means things HAVE to be automated ○ We use a combination of spinnaker and jenkins to kubectl apply templates ● About 900k samples per second in large clusters ● About 200k series read per second in large clusters
  • 28. Key Metrics to Watch ● Memory used (alert if steadily over 60%) ○ We've seen that spikes can cause OOMs if you're consistently over this ○ Resolve by ■ Scale up cluster, or reduce incoming metric load ○ sum(container_memory_rss{filter}) by (kubernetes_pod_name) ● Disk space used (alert if predict_linear full in 14 days) ○ 14 days seems long, but it gives us plenty of time to provision new nodes and allow data to migrate ○ Resolve by ■ Scale up cluster, reduce retention, reduce incoming metric load ○ (kubelet_volume_stats_capacity_bytes{filter} - kubelet_volume_stats_available_bytes{}) / kubelet_volume_stats_capacity_bytes{} ● Cluster scale-up can be slow ○ Be sure to test how long it takes in your cluster
  • 29. General Advice ● Avoid a lot of custom things ○ As close to what the operator expects is the best ● Observe query rates and set limits ● Have a good testing env ○ Need to iterate quickly ○ Be able to throw away data ○ Try to have it at scale ● Have a look at the M3 dashboards and learn what things mean ○ https://grafana.com/grafana/dashboards/8126 ○ Very dev focused, suggest making your own with key metrics
  • 30.
  • 31. Other Alerting ● high latency ingesting samples: coordinator_ingest_latency_bucket ● rate(coordinator_write_errors{code! = '4XX'}[1m]) ● rate(coordinator_fetch_errors{code! = '4XX'}[1m]) ● high out of order samples: ○ rate(database_tick_merged_out_of_order_blocks[5m]) > X ○ this can help catch double scrapes ■ Due to pull based arch, this can cause false alerts ○ inhibit during node startup
  • 32. Upgrades / Updates ● So far very smooth from compatibility standpoint ○ Only seen one small query eval regression ○ Just did the 1.0 update, also smooth ■ Some api changes ● We manage this via spinnaker + jenkins ○ One pain point here is lack of fully self driving updates (i.e. only kubectl apply) ■ Is actually now available ○ Requires us to be vigilant to ensure our configs and m3db versions stay in sync ● Suggestion: Have a readiness check for coordinators ○ Restarting many at the same time can make k8s unhappy ○ Requires setting a connect consistency on the coordinator config
  • 33. Metric Spikes For any high volume system, you will need a way to deal with spikes. For example: A service adds a label with exploding cardinality ● Have a way to identify the source of the spike ● Be able to cut off that source easily ○ Preferable to OOMing your cluster
  • 34. Capacity ● Brief overview of capacity planning at Databricks ● We've found that one m3db replica per 50,000 incoming time-series works pretty well ○ We are write heavy ● For same workload need about 50 write coordinators in two deployments (100 total)
  • 35. Future Work Some examples of nifty new things M3 will enable us to do now that we're getting operationally mature ● Downsampling for older metrics ○ Expect a significant savings in disk space ● Using different namespaces for metrics with different requirements ● Allowing direct push into M3 from difficult to scrape services ○ E.g. databricks jobs, developer laptops
  • 36. Conclusion ● Overall a successful migration for us ● Community has been helpful ● Nice new things on the horizon