SlideShare uma empresa Scribd logo
1 de 21
Advanced visualization of
Spark jobs
Zoltán Zvara
zoltan.zvara@sztaki.hu
Márton Balassi
marton.balassi@sztaki.hu
Agenda
• Introduction
• Motivation and the challenges
• The Spark execution visualization
• An elegant way to tackle data skew
• The visualizer extended to Flink
• Future plans
Introduction
• Hungarian Academy of Sciences (MTA SZTAKI)
• Research institute with strong industry ties
• Big Data projects using Spark, Flink, Cassandra, Hadoop etc.
• Multiple telco use cases lately, with challenging data volume and
distribution
Motivation
• We have developped an application aggreagting telco data that tested
well on toy data
• When deploying it against the real dataset the application seemed
healthy
• However it could become suprisingly slow or even crash
• What did go wrong?
Data skew
• When some data-points are very frequent, and we need to process
them on a single machine – aka the Bieber-effect
• Power-law distributions are very common: from social-media to
telecommunication, even in our beloved WordCount
• Default hashing puts these into „random” buckets
• We can do better than that
Aim
1. Most of the telco data-processing workloads suffer from inefficient
communication patterns, introduced by skewed data
2. Help developers to write better applications by detecting issues of
logical and physical execution plans easier
3. Help newcomers to understand a distributed data-processing
system faster
4. Demonstrate and guide the testing of adaptive (re)partitioning
techniques
Spark’s execution model
• Extended MapReduce model, where a previous stage has to complete
all its tasks before the next stage can be scheduled (global
synchronization)
• Data skew can introduce slow tasks
• In most cases, the data distribution
is not known in advance
• „Concept drifts” are common
stage
boundary
& shuffle
even partitioning skewed data
slow task
slow task
Physical execution plan
• Understanding the physical plan
• can lead to designing a better application
• can uncover bottlenecks that are hard to identify otherwise
• can help to pick the right tool for the job at hand
• can be quite difficult for newcomers
• can give insights to new, adaptive, on-the-fly partitioning & optimization
strategies
Current visualizations in Spark
• Current visualization tools
• are missing the fine-grained input & output metrics – also not available in
most sytems, like Spark
• does not capture the data charateristics – hard to accomplish a lightweight
and efficient solution
• does not visualize the physical plan
DAG visualization in Spark
Collecting information during execution
• Data-transfer between tasks are accomplished in
the shuffle phase, in the form of shuffle block fetches
• We distinguish local & remote block fetches
• Data characteristics collected on shuffle write & read
A scalable sampling
• Key-distributions are approximated with a strategy, that
• is not sensitive to early or late concept drifts,
• lightweight and efficient,
• scalable by using a backoff strategy
keys
frequency
counters
keys
frequencysampling rate of 𝑠𝑖 sampling rate of 𝑠𝑖/𝑏
keys
frequency
sampling rate of 𝑠𝑖+𝑗/𝑏
truncateincrease by 𝑠𝑖
𝑇𝐵
Availability through the Spark REST API
• Enhanced Spark REST API to provide block fetch & data characteristics
information of tasks
• New queries to the REST API, for example: „what happend in the last
3 seconds?”
• /api/v1/applications/{appId}/{jobId}/tasks/{timestamp}
THE VISUALIZATION
Our way of tackling data skew
• Goal: use the collected information to handle data skew dynamically,
on-the-fly on any workload (batch or streaming)
• Currently paper in progress
heavy keys create
heavy partitions (slow tasks)
Bieber-key is alone,
size of the biggest partition
is minimized
Making visualization pluggable
• Let us also support Flink
• The execution model is quite different from Spark
• Minimal code change on the data reader module of the visualization
• Expose the necessary metrics through the Flink JM Rest API
• Map Flink’s model to Spark’s
Computational models
Batch data
Kafka, RabbitMQ ...
HDFS, JDBC ...
Stream Data
Spark RDDs break down the computation into stagesFlink computation is fully pipelined by default
The currently available Flink visualizer
Granular visualization of Flink jobs
Future plans
• Open-sourcing of the visualization framework
• Dynamic repartitioning paper is in progress
• Task-classification to make the visualization more compact
• Visual improvements, more features
• Opening PRs against Spark & Flink with the suggested metrics
• A rewrite is coming for the Flink metrics, track it at FLINK-1504
Conclusions
• The devil is in the details
• Visualizations can aid developers to better understand issues and
bottlenecks of certain workloads
• Specially data skew can hinder the completion time of the whole
stage in a batch engine
• Adaptive repartitioning strategies can be suprisingly efficient
Thank you for your attention
Q&A
stage
boundary
& shuffle
even partitioning skewed data
slow task
slow task

Mais conteúdo relacionado

Mais procurados

Magnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedInMagnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedInDatabricks
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Alex Zeltov
 
Data Science with Spark & Zeppelin
Data Science with Spark & ZeppelinData Science with Spark & Zeppelin
Data Science with Spark & ZeppelinVinay Shukla
 
Helium makes Zeppelin fly!
Helium makes Zeppelin fly!Helium makes Zeppelin fly!
Helium makes Zeppelin fly!DataWorks Summit
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaApache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaDatabricks
 
Apache Pulsar: The Next Generation Messaging and Queuing System
Apache Pulsar: The Next Generation Messaging and Queuing SystemApache Pulsar: The Next Generation Messaging and Queuing System
Apache Pulsar: The Next Generation Messaging and Queuing SystemDatabricks
 
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...Spark Summit
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiDatabricks
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduDataWorks Summit
 
Spark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusDatabricks
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Databricks
 

Mais procurados (20)

Magnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedInMagnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedIn
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Data Science with Spark & Zeppelin
Data Science with Spark & ZeppelinData Science with Spark & Zeppelin
Data Science with Spark & Zeppelin
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Helium makes Zeppelin fly!
Helium makes Zeppelin fly!Helium makes Zeppelin fly!
Helium makes Zeppelin fly!
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaApache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
 
Apache Pulsar: The Next Generation Messaging and Queuing System
Apache Pulsar: The Next Generation Messaging and Queuing SystemApache Pulsar: The Next Generation Messaging and Queuing System
Apache Pulsar: The Next Generation Messaging and Queuing System
 
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
 
Machine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFiMachine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFi
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Securing Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise ContextSecuring Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise Context
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
 
Spark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko Korndorf
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 

Destaque

Spark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your applicationSpark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your applicationDatabricks
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark ApplicationsTzach Zohar
 
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs
Flink Forward
 
Advanced Visualizations, Bijilash Babu
Advanced Visualizations, Bijilash BabuAdvanced Visualizations, Bijilash Babu
Advanced Visualizations, Bijilash BabuBijilash Babu
 
How to deploy spark instance using ansible 2.0 in fiware lab v2
How to deploy spark instance using ansible 2.0 in fiware lab v2How to deploy spark instance using ansible 2.0 in fiware lab v2
How to deploy spark instance using ansible 2.0 in fiware lab v2Fernando Lopez Aguilar
 
超高速処理とスケーラビリティを両立するApache GEODE
超高速処理とスケーラビリティを両立するApache GEODE超高速処理とスケーラビリティを両立するApache GEODE
超高速処理とスケーラビリティを両立するApache GEODEMasaki Yamakawa
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like SparkAlpine Data
 
Elasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみたElasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみたRyoji Kurosawa
 
Evaluating NoSQL Performance: Time for Benchmarking
Evaluating NoSQL Performance: Time for BenchmarkingEvaluating NoSQL Performance: Time for Benchmarking
Evaluating NoSQL Performance: Time for BenchmarkingSergey Bushik
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionCloudera, Inc.
 
Clickstream Analysis with Spark
Clickstream Analysis with Spark Clickstream Analysis with Spark
Clickstream Analysis with Spark Josef Adersberger
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesAlexis Seigneurin
 
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Spark Summit
 
Eda Story So far
Eda Story So farEda Story So far
Eda Story So farkirtidesai
 
Business intelligence, Data Analytics & Data Visualization
Business intelligence, Data Analytics & Data VisualizationBusiness intelligence, Data Analytics & Data Visualization
Business intelligence, Data Analytics & Data VisualizationMuthu Natarajan
 
Visualizing big data in the browser using spark
Visualizing big data in the browser using sparkVisualizing big data in the browser using spark
Visualizing big data in the browser using sparkDatabricks
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applicationsKnoldus Inc.
 
Data visualization trends in Business Intelligence: Allison Sapka at Analytic...
Data visualization trends in Business Intelligence: Allison Sapka at Analytic...Data visualization trends in Business Intelligence: Allison Sapka at Analytic...
Data visualization trends in Business Intelligence: Allison Sapka at Analytic...Fitzgerald Analytics, Inc.
 

Destaque (20)

Spark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your applicationSpark Summit EU 2015: SparkUI visualization: a lens into your application
Spark Summit EU 2015: SparkUI visualization: a lens into your application
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark Applications
 
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs
Zoltán Zvara - Advanced visualization of Flink and Spark jobs

Zoltán Zvara - Advanced visualization of Flink and Spark jobs

 
Advanced Visualizations, Bijilash Babu
Advanced Visualizations, Bijilash BabuAdvanced Visualizations, Bijilash Babu
Advanced Visualizations, Bijilash Babu
 
How to deploy spark instance using ansible 2.0 in fiware lab v2
How to deploy spark instance using ansible 2.0 in fiware lab v2How to deploy spark instance using ansible 2.0 in fiware lab v2
How to deploy spark instance using ansible 2.0 in fiware lab v2
 
MLeap: Release Spark ML Pipelines
MLeap: Release Spark ML PipelinesMLeap: Release Spark ML Pipelines
MLeap: Release Spark ML Pipelines
 
超高速処理とスケーラビリティを両立するApache GEODE
超高速処理とスケーラビリティを両立するApache GEODE超高速処理とスケーラビリティを両立するApache GEODE
超高速処理とスケーラビリティを両立するApache GEODE
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like Spark
 
Spark meets Smart Meters
Spark meets Smart MetersSpark meets Smart Meters
Spark meets Smart Meters
 
Elasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみたElasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみた
 
Evaluating NoSQL Performance: Time for Benchmarking
Evaluating NoSQL Performance: Time for BenchmarkingEvaluating NoSQL Performance: Time for Benchmarking
Evaluating NoSQL Performance: Time for Benchmarking
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for productionFaster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
 
Clickstream Analysis with Spark
Clickstream Analysis with Spark Clickstream Analysis with Spark
Clickstream Analysis with Spark
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
 
Eda Story So far
Eda Story So farEda Story So far
Eda Story So far
 
Business intelligence, Data Analytics & Data Visualization
Business intelligence, Data Analytics & Data VisualizationBusiness intelligence, Data Analytics & Data Visualization
Business intelligence, Data Analytics & Data Visualization
 
Visualizing big data in the browser using spark
Visualizing big data in the browser using sparkVisualizing big data in the browser using spark
Visualizing big data in the browser using spark
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
 
Data visualization trends in Business Intelligence: Allison Sapka at Analytic...
Data visualization trends in Business Intelligence: Allison Sapka at Analytic...Data visualization trends in Business Intelligence: Allison Sapka at Analytic...
Data visualization trends in Business Intelligence: Allison Sapka at Analytic...
 

Semelhante a Advanced Visualization of Spark jobs

Introduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackIntroduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackJérôme Kehrli
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark PresentationStephen Borg
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
 
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...confluent
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDatabricks
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014gmalouf678
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsBen Laird
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCOlga Lavrentieva
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
IRMAC April 2015 - DMBOK2 DWBI New Content
IRMAC April 2015 - DMBOK2 DWBI New ContentIRMAC April 2015 - DMBOK2 DWBI New Content
IRMAC April 2015 - DMBOK2 DWBI New ContentMartin Sykora
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualizationhadoopsphere
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 

Semelhante a Advanced Visualization of Spark jobs (20)

Introduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackIntroduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software Stack
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Big Data training
Big Data trainingBig Data training
Big Data training
 
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPC
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
IRMAC April 2015 - DMBOK2 DWBI New Content
IRMAC April 2015 - DMBOK2 DWBI New ContentIRMAC April 2015 - DMBOK2 DWBI New Content
IRMAC April 2015 - DMBOK2 DWBI New Content
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualization
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 

Mais de DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

Mais de DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Último

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Último (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Advanced Visualization of Spark jobs

  • 1. Advanced visualization of Spark jobs Zoltán Zvara zoltan.zvara@sztaki.hu Márton Balassi marton.balassi@sztaki.hu
  • 2. Agenda • Introduction • Motivation and the challenges • The Spark execution visualization • An elegant way to tackle data skew • The visualizer extended to Flink • Future plans
  • 3. Introduction • Hungarian Academy of Sciences (MTA SZTAKI) • Research institute with strong industry ties • Big Data projects using Spark, Flink, Cassandra, Hadoop etc. • Multiple telco use cases lately, with challenging data volume and distribution
  • 4. Motivation • We have developped an application aggreagting telco data that tested well on toy data • When deploying it against the real dataset the application seemed healthy • However it could become suprisingly slow or even crash • What did go wrong?
  • 5. Data skew • When some data-points are very frequent, and we need to process them on a single machine – aka the Bieber-effect • Power-law distributions are very common: from social-media to telecommunication, even in our beloved WordCount • Default hashing puts these into „random” buckets • We can do better than that
  • 6. Aim 1. Most of the telco data-processing workloads suffer from inefficient communication patterns, introduced by skewed data 2. Help developers to write better applications by detecting issues of logical and physical execution plans easier 3. Help newcomers to understand a distributed data-processing system faster 4. Demonstrate and guide the testing of adaptive (re)partitioning techniques
  • 7. Spark’s execution model • Extended MapReduce model, where a previous stage has to complete all its tasks before the next stage can be scheduled (global synchronization) • Data skew can introduce slow tasks • In most cases, the data distribution is not known in advance • „Concept drifts” are common stage boundary & shuffle even partitioning skewed data slow task slow task
  • 8. Physical execution plan • Understanding the physical plan • can lead to designing a better application • can uncover bottlenecks that are hard to identify otherwise • can help to pick the right tool for the job at hand • can be quite difficult for newcomers • can give insights to new, adaptive, on-the-fly partitioning & optimization strategies
  • 9. Current visualizations in Spark • Current visualization tools • are missing the fine-grained input & output metrics – also not available in most sytems, like Spark • does not capture the data charateristics – hard to accomplish a lightweight and efficient solution • does not visualize the physical plan DAG visualization in Spark
  • 10. Collecting information during execution • Data-transfer between tasks are accomplished in the shuffle phase, in the form of shuffle block fetches • We distinguish local & remote block fetches • Data characteristics collected on shuffle write & read
  • 11. A scalable sampling • Key-distributions are approximated with a strategy, that • is not sensitive to early or late concept drifts, • lightweight and efficient, • scalable by using a backoff strategy keys frequency counters keys frequencysampling rate of 𝑠𝑖 sampling rate of 𝑠𝑖/𝑏 keys frequency sampling rate of 𝑠𝑖+𝑗/𝑏 truncateincrease by 𝑠𝑖 𝑇𝐵
  • 12. Availability through the Spark REST API • Enhanced Spark REST API to provide block fetch & data characteristics information of tasks • New queries to the REST API, for example: „what happend in the last 3 seconds?” • /api/v1/applications/{appId}/{jobId}/tasks/{timestamp}
  • 14. Our way of tackling data skew • Goal: use the collected information to handle data skew dynamically, on-the-fly on any workload (batch or streaming) • Currently paper in progress heavy keys create heavy partitions (slow tasks) Bieber-key is alone, size of the biggest partition is minimized
  • 15. Making visualization pluggable • Let us also support Flink • The execution model is quite different from Spark • Minimal code change on the data reader module of the visualization • Expose the necessary metrics through the Flink JM Rest API • Map Flink’s model to Spark’s
  • 16. Computational models Batch data Kafka, RabbitMQ ... HDFS, JDBC ... Stream Data Spark RDDs break down the computation into stagesFlink computation is fully pipelined by default
  • 17. The currently available Flink visualizer
  • 19. Future plans • Open-sourcing of the visualization framework • Dynamic repartitioning paper is in progress • Task-classification to make the visualization more compact • Visual improvements, more features • Opening PRs against Spark & Flink with the suggested metrics • A rewrite is coming for the Flink metrics, track it at FLINK-1504
  • 20. Conclusions • The devil is in the details • Visualizations can aid developers to better understand issues and bottlenecks of certain workloads • Specially data skew can hinder the completion time of the whole stage in a batch engine • Adaptive repartitioning strategies can be suprisingly efficient
  • 21. Thank you for your attention Q&A stage boundary & shuffle even partitioning skewed data slow task slow task