Monitoring Spark Applications
Tzach Zohar @ Kenshoo, March/2016
Who am I
System Architect @ Kenshoo
Java backend for 10 years
Working with Scala + Spark for 2 years
https://www.linkedin.com/in/tzachzohar
Who’s Kenshoo
10-year-old, Tel Aviv-based startup
Industry Leader in Digital Marketing
500+ employees
Heavy data shop
http://kenshoo.com/
And who’re you?
Agenda
Why Monitor
Spark UI
Spark REST API
Spark Metric Sinks
Applicative Metrics
The Importance of Being Earnest
Why Monitor
Failures
Performance
Know your data
Correctness of output
Monitoring Distributed Systems
No single log file
No single User Interface
Often - no single framework (e.g. Spark + YARN + HDFS…)
Spark UI
Spark UI
See http://spark.apache.org/docs/latest/monitoring.html#web-interfaces
The first go-to tool for understanding what’s what
Created per SparkContext
Spark UI
Jobs -> Stages -> Tasks
Spark UI
Jobs -> Stages -> Tasks
Spark UI
Use the “DAG Visualization” in Job Details to:
Understand flow
Detect caching opportunities
Spark UI
Jobs -> Stages -> Tasks
Detect unbalanced stages
Detect GC issues
Spark UI
Jobs -> Stages -> Tasks -> “Event Timeline”
Detect stragglers
Detect repartitioning opportunities
Spark UI Disadvantages
“Ad-Hoc”, no history* (*unless the Spark History Server is running with event logging enabled)
Human readable, but not machine readable
Data points, not data trends
Spark UI Disadvantages
UI can quickly become hard to use…
Spark REST API
Spark’s REST API
See http://spark.apache.org/docs/latest/monitoring.html#rest-api
Programmatic access to UI’s data (jobs, stages, tasks, executors, storage…)
Useful for aggregations over similar jobs
Spark’s REST API
Example: calculate total shuffle statistics:
import scala.io.Source.fromURL
import org.json4s._
import org.json4s.jackson.JsonMethods.parse // json4s; the native backend works just as well

object SparkAppStats {
  case class SparkStage(name: String, shuffleWriteBytes: Long, memoryBytesSpilled: Long, diskBytesSpilled: Long)
  implicit val formats = DefaultFormats
  val url = "http://<host>:4040/api/v1/applications/<app-name>/stages"

  def main(args: Array[String]): Unit = {
    val json = fromURL(url).mkString
    val stages: List[SparkStage] = parse(json).extract[List[SparkStage]]
    println("stages count: " + stages.size)
    println("shuffleWriteBytes: " + stages.map(_.shuffleWriteBytes).sum)
    println("memoryBytesSpilled: " + stages.map(_.memoryBytesSpilled).sum)
    println("diskBytesSpilled: " + stages.map(_.diskBytesSpilled).sum)
  }
}
Example: calculate total shuffle statistics:
Example output:
stages count: 1435
shuffleWriteBytes: 8488622429
memoryBytesSpilled: 120107947855
diskBytesSpilled: 1505616236
Spark’s REST API
Example: calculate total time per job name:
// same imports and implicit json4s Formats as in the previous example, plus java.util.Date
val url = "http://<host>:4040/api/v1/applications/<app-name>/jobs"

case class SparkJob(jobId: Int, name: String, submissionTime: Date, completionTime: Option[Date], stageIds: List[Int]) {
  def getDurationMillis: Option[Long] = completionTime.map(_.getTime - submissionTime.getTime)
}

def main(args: Array[String]): Unit = {
  val json = fromURL(url).mkString
  parse(json)
    .extract[List[SparkJob]]
    .filter(j => j.getDurationMillis.isDefined) // only completed jobs
    .groupBy(_.name)
    .mapValues(list => (list.map(_.getDurationMillis.get).sum, list.size))
    .foreach { case (name, (time, count)) => println(s"TIME: $time\tAVG: ${time / count}\tNAME: $name") }
}
Spark’s REST API
Example: calculate total time per job name:
Example output:
TIME: 182570 AVG: 16597 NAME: count at MyAggregationService.scala:132
TIME: 230973 AVG: 1297 NAME: parquet at MyRepository.scala:99
TIME: 120393 AVG: 2188 NAME: collect at MyCollector.scala:30
TIME: 5645 AVG: 627 NAME: collect at MyCollector.scala:103
But that’s still ad-hoc, right?
Spark Metric Sinks
Metrics
See http://spark.apache.org/docs/latest/monitoring.html#metrics
Spark uses the popular dropwizard.metrics library (previously codahale.metrics and, before that, yammer.metrics)
Metrics exposes an easy Java API for creating and updating metrics stored in memory, e.g.:
// Gauge for executor thread pool's actively executing task counts
metricRegistry.register(name("threadpool", "activeTasks"), new Gauge[Int] {
override def getValue: Int = threadPool.getActiveCount()
})
Metrics
What is metered? Couldn’t find any detailed documentation of this
This trick flushes most of them out: search sources for “metricRegistry.register”
Where do these metrics go?
Spark Metric Sinks
A “Sink” is an interface for viewing these metrics, at given intervals or ad-hoc
Available sinks: Console, CSV, SLF4J, Servlet, JMX, Graphite, Ganglia* (*not bundled by default, due to licensing)
We use the Graphite sink to send all metrics to Graphite
$SPARK_HOME/metrics.properties:
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=<your graphite hostname>
*.sink.graphite.port=2003
*.sink.graphite.period=30
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=<token>.<app-name>.<host-name>
… and it’s in Graphite (+ Grafana)
Graphite Sink
Very useful for trend analysis
WARNING: Not suitable for short-running applications (will pollute Graphite with new metrics for each application)
Requires some Graphite tricks to get clear readings (wildcards, sums, derivatives, etc.)
Applicative Metrics
The Missing Piece
Spark meters its internals pretty thoroughly, but what about your internals?
Applicative metrics are a great tool for knowing your data and verifying output correctness
We use Dropwizard Metrics + Graphite for this too (everywhere)
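For the applicative side, here is a minimal sketch of defining such metrics with Dropwizard Metrics (3.x) and its Graphite reporter; the metric names, Graphite host and prefix below are illustrative placeholders, not taken from this deck:

import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

object AppMetrics {
  val registry = new MetricRegistry()
  val inputRecords = registry.counter("input.records")     // incremented per parsed input record
  val parsingFailures = registry.meter("parsing.failures") // marks the rate of bad records

  // start a reporter that pushes all registered metrics to Graphite every 30 seconds
  def startGraphiteReporting(): Unit = {
    val graphite = new Graphite(new InetSocketAddress("<your graphite hostname>", 2003))
    GraphiteReporter.forRegistry(registry)
      .prefixedWith("<token>.<app-name>.<host-name>")
      .convertRatesTo(TimeUnit.SECONDS)
      .convertDurationsTo(TimeUnit.MILLISECONDS)
      .build(graphite)
      .start(30, TimeUnit.SECONDS)
  }
}

Calling inputRecords.inc() or parsingFailures.mark() from the relevant code paths is then enough for the values to show up under the configured prefix in Graphite.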
Counting RDD Elements
rdd.count() might be costly (another action)
Spark Accumulators are a good alternative
Trick: send accumulator results to Graphite, using “Counter-backed Accumulators”
/**
  * Call the returned callback after acting on the returned RDD to get the counter updated
  */
def countSilently[V: ClassTag](rdd: RDD[V], metricName: String, clazz: Class[_]): (RDD[V], Unit => Unit) = {
  val counter: Counter = Metrics.newCounter(new MetricName(clazz, metricName)) // in-memory counter, picked up by the metrics reporter
  val accumulator: Accumulator[Long] = rdd.sparkContext.accumulator(0L, metricName) // updated on the executors as elements pass through
  val countedRdd = rdd.map(v => { accumulator += 1; v }) // identity map that counts elements lazily
  val callback: Unit => Unit = u => counter.inc(accumulator.value) // copies the accumulated value into the counter
  (countedRdd, callback)
}
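A hedged usage sketch (parsedRecords, outputPath and the metric name are hypothetical): wrap the RDD, run an action on the wrapped RDD, then fire the callback so the accumulated value reaches the counter:

val (countedRdd, updateCounter) = countSilently(parsedRecords, "outputRecords", getClass)
countedRdd.saveAsTextFile(outputPath) // the action materializes the RDD and drives the accumulator
updateCounter(())                     // copy the final accumulator value into the Graphite-reported counter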
Counting RDD Elements
We Measure...
Input records
Output records
Parsing failures
Average job time
Data “freshness” histogram
Much much more...
WARNING: it’s addictive...
Conclusions
Spark provides a wide variety of monitoring options
Each one should be used when appropriate - no single one is sufficient on its own
Metrics + Graphite + Grafana can give you visibility into any numeric time series
Questions?
Thank you
