SlideShare uma empresa Scribd logo
1 de 18
© 2022, Amazon Web Services, Inc. or its affiliates.
© 2022, Amazon Web Services, Inc. or its affiliates.
Practical learnings from
running thousands of Flink jobs
HongTeoh, Usamah Jassat
Amazon Kinesis Data Analytics
© 2022, Amazon Web Services, Inc. or its affiliates.
Outline
• Job Stability
• State backend selection and tuning
© 2022, Amazon Web Services, Inc. or its affiliates.
Typical Flink cluster set up on Kubernetes
Jobmanager Taskmanager
Taskmanager
Taskmanager
© 2022, Amazon Web Services, Inc. or its affiliates.
Job instability: Memory
OOMKilled
Using more than allocated to Flink configuration
© 2022, Amazon Web Services, Inc. or its affiliates.
Memory management (Instance/Container/Java)
Instance (Kubernetes node) cAdvisor OS
oom-killer
Taskmanager container
Java (Flink process)
Other processes (e.g. Python, Kinesis producers)
© 2022, Amazon Web Services, Inc. or its affiliates.
Memory investigation learnings
• Record both Java and cAdvisor memory metrics
• Record both virtual + real memory use
• OOM-Killer can cause “broken” containers
© 2022, Amazon Web Services, Inc. or its affiliates.
Flink – Java memory mapping
Flink / Java Heap Metaspace Direct Mapped JNI Overhead
Heap
Metaspace
Network
Task Off-heap
Managed
Overhead
-Xmx -XX:MetaspaceSize
-XX:MaxDirectMemorySize
NOT CONTROLLED
Flink controls*
NOT CONTROLLED
Heap OOM
Metaspace
OOM
OOMKilled
Direct
Memory
OOM
© 2022, Amazon Web Services, Inc. or its affiliates.
Memory configuration learnings
• jvm-overhead too low
• Investigate native memory use using:
• Native MemoryTracking
-XX:NativeMemoryTracking=detail
jcmd 1 VM.native_memory
• jemalloc + jeprof
© 2022, Amazon Web Services, Inc. or its affiliates.
Job instability: “Hung” job / checkpoints
© 2022, Amazon Web Services, Inc. or its affiliates.
Thread dump: Thread stuck on SocketRead
© 2022, Amazon Web Services, Inc. or its affiliates.
Thread dump: Thread stuck on deadlock
jackson-databind version < 2.12
© 2022, Amazon Web Services, Inc. or its affiliates.
State backends
• State backends in Flink
• Comparing HashMap and RocksDB
• RocksDB Optimizations
© 2022, Amazon Web Services, Inc. or its affiliates.
State-backends in Flink
© 2022, Amazon Web Services, Inc. or its affiliates.
Comparing HashMap and RocksDB
HashMap RocksDB
Storage Java Heap Native Memory + Disk
Ceiling Java Heap File System
Incremental Checkpointing Yes (experimental) Yes
R/W Performance Higher Lower*
GC Impact Yes No
In Flight Data Format Java Objects Serialized
© 2022, Amazon Web Services, Inc. or its affiliates.
RockDB or HashMap?
• RocksDB for large state
• Hash Map for performance
© 2022, Amazon Web Services, Inc. or its affiliates.
RocksDB Optimization
Write Stalls
• Increase RocksDB memory
• Increase write threads
• Increase write buffer memory*
• Write buffer size + counts
Slow Reads
• Local Disk
• Increase RocksDB memory
• Increase read cache*
© 2022, Amazon Web Services, Inc. or its affiliates.
© 2022, Amazon Web Services, Inc. or its affiliates.
Thank you!
© 2022, Amazon Web Services, Inc. or its affiliates.
Q&A

Mais conteúdo relacionado

Mais procurados

Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...HostedbyConfluent
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberFlink Forward
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkTimo Walther
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controllerconfluent
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planningconfluent
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 

Mais procurados (20)

Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
Getting up to speed with MirrorMaker 2 | Mickael Maison, IBM and Ryanne Dolan...
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controller
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 

Semelhante a Practical learnings from running thousands of Flink jobs

Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교
Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교
Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교Amazon Web Services Korea
 
2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverless
2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverless2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverless
2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverlessKim Kao
 
Opinionated re:Invent recap with AWS Heroes & Builders
Opinionated re:Invent recap with AWS Heroes & BuildersOpinionated re:Invent recap with AWS Heroes & Builders
Opinionated re:Invent recap with AWS Heroes & BuildersDaniel Zivkovic
 
Running your Java EE 6 Applications in the Cloud
Running your Java EE 6 Applications in the CloudRunning your Java EE 6 Applications in the Cloud
Running your Java EE 6 Applications in the CloudArun Gupta
 
JFokus 2011 - Running your Java EE 6 apps in the Cloud
JFokus 2011 - Running your Java EE 6 apps in the CloudJFokus 2011 - Running your Java EE 6 apps in the Cloud
JFokus 2011 - Running your Java EE 6 apps in the CloudArun Gupta
 
AWS Community Day 2022 Dhiraj Mahapatro_AWS Lambda under the hood _ Best Prac...
AWS Community Day 2022 Dhiraj Mahapatro_AWS Lambda under the hood _ Best Prac...AWS Community Day 2022 Dhiraj Mahapatro_AWS Lambda under the hood _ Best Prac...
AWS Community Day 2022 Dhiraj Mahapatro_AWS Lambda under the hood _ Best Prac...AWS Chicago
 
Getting started with Amazon ECS
Getting started with Amazon ECSGetting started with Amazon ECS
Getting started with Amazon ECSIoannis Polyzos
 
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...Amazon Web Services
 
Running your Java EE 6 applications in the Cloud @ Silicon Valley Code Camp 2010
Running your Java EE 6 applications in the Cloud @ Silicon Valley Code Camp 2010Running your Java EE 6 applications in the Cloud @ Silicon Valley Code Camp 2010
Running your Java EE 6 applications in the Cloud @ Silicon Valley Code Camp 2010Arun Gupta
 
Running your Java EE 6 Apps in the Cloud - JavaOne India 2011
Running your Java EE 6 Apps in the Cloud - JavaOne India 2011Running your Java EE 6 Apps in the Cloud - JavaOne India 2011
Running your Java EE 6 Apps in the Cloud - JavaOne India 2011Arun Gupta
 
JavaOne India 2011 - Running your Java EE 6 Apps in the Cloud
JavaOne India 2011 - Running your Java EE 6 Apps in the CloudJavaOne India 2011 - Running your Java EE 6 Apps in the Cloud
JavaOne India 2011 - Running your Java EE 6 Apps in the CloudArun Gupta
 
Running your Java EE 6 applications in the Cloud (FISL 12)
Running your Java EE 6 applications in the Cloud (FISL 12)Running your Java EE 6 applications in the Cloud (FISL 12)
Running your Java EE 6 applications in the Cloud (FISL 12)Arun Gupta
 
Assembling an AWS CloudFormation Authoring Tool Chain (DEV368-R2) - AWS re:In...
Assembling an AWS CloudFormation Authoring Tool Chain (DEV368-R2) - AWS re:In...Assembling an AWS CloudFormation Authoring Tool Chain (DEV368-R2) - AWS re:In...
Assembling an AWS CloudFormation Authoring Tool Chain (DEV368-R2) - AWS re:In...Amazon Web Services
 
Running your Java EE 6 applications in the Cloud
Running your Java EE 6 applications in the CloudRunning your Java EE 6 applications in the Cloud
Running your Java EE 6 applications in the CloudArun Gupta
 
SRV314 Containerized App Development with AWS Fargate
SRV314 Containerized App Development with AWS FargateSRV314 Containerized App Development with AWS Fargate
SRV314 Containerized App Development with AWS FargateAmazon Web Services
 
Performance Tuning - MuraCon 2012
Performance Tuning - MuraCon 2012Performance Tuning - MuraCon 2012
Performance Tuning - MuraCon 2012eballisty
 
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Amazon Web Services
 

Semelhante a Practical learnings from running thousands of Flink jobs (20)

Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교
Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교
Amazon DocumentDB vs MongoDB 의 내부 아키텍쳐 와 장단점 비교
 
2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverless
2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverless2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverless
2018 10-19-jc conf-embrace-legacy-java-ee-by-aws-serverless
 
Opinionated re:Invent recap with AWS Heroes & Builders
Opinionated re:Invent recap with AWS Heroes & BuildersOpinionated re:Invent recap with AWS Heroes & Builders
Opinionated re:Invent recap with AWS Heroes & Builders
 
Running your Java EE 6 Applications in the Cloud
Running your Java EE 6 Applications in the CloudRunning your Java EE 6 Applications in the Cloud
Running your Java EE 6 Applications in the Cloud
 
JFokus 2011 - Running your Java EE 6 apps in the Cloud
JFokus 2011 - Running your Java EE 6 apps in the CloudJFokus 2011 - Running your Java EE 6 apps in the Cloud
JFokus 2011 - Running your Java EE 6 apps in the Cloud
 
AWS Community Day 2022 Dhiraj Mahapatro_AWS Lambda under the hood _ Best Prac...
AWS Community Day 2022 Dhiraj Mahapatro_AWS Lambda under the hood _ Best Prac...AWS Community Day 2022 Dhiraj Mahapatro_AWS Lambda under the hood _ Best Prac...
AWS Community Day 2022 Dhiraj Mahapatro_AWS Lambda under the hood _ Best Prac...
 
Getting started with Amazon ECS
Getting started with Amazon ECSGetting started with Amazon ECS
Getting started with Amazon ECS
 
Deep dive into AWS fargate
Deep dive into AWS fargateDeep dive into AWS fargate
Deep dive into AWS fargate
 
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...
AWS Startup Day Kyiv: Container services on AWS. Comparing Amazon ECS, AWS Fa...
 
AWS Container services
AWS Container servicesAWS Container services
AWS Container services
 
Running your Java EE 6 applications in the Cloud @ Silicon Valley Code Camp 2010
Running your Java EE 6 applications in the Cloud @ Silicon Valley Code Camp 2010Running your Java EE 6 applications in the Cloud @ Silicon Valley Code Camp 2010
Running your Java EE 6 applications in the Cloud @ Silicon Valley Code Camp 2010
 
Running your Java EE 6 Apps in the Cloud - JavaOne India 2011
Running your Java EE 6 Apps in the Cloud - JavaOne India 2011Running your Java EE 6 Apps in the Cloud - JavaOne India 2011
Running your Java EE 6 Apps in the Cloud - JavaOne India 2011
 
JavaOne India 2011 - Running your Java EE 6 Apps in the Cloud
JavaOne India 2011 - Running your Java EE 6 Apps in the CloudJavaOne India 2011 - Running your Java EE 6 Apps in the Cloud
JavaOne India 2011 - Running your Java EE 6 Apps in the Cloud
 
Running your Java EE 6 applications in the Cloud (FISL 12)
Running your Java EE 6 applications in the Cloud (FISL 12)Running your Java EE 6 applications in the Cloud (FISL 12)
Running your Java EE 6 applications in the Cloud (FISL 12)
 
Assembling an AWS CloudFormation Authoring Tool Chain (DEV368-R2) - AWS re:In...
Assembling an AWS CloudFormation Authoring Tool Chain (DEV368-R2) - AWS re:In...Assembling an AWS CloudFormation Authoring Tool Chain (DEV368-R2) - AWS re:In...
Assembling an AWS CloudFormation Authoring Tool Chain (DEV368-R2) - AWS re:In...
 
Running your Java EE 6 applications in the Cloud
Running your Java EE 6 applications in the CloudRunning your Java EE 6 applications in the Cloud
Running your Java EE 6 applications in the Cloud
 
SRV314 Containerized App Development with AWS Fargate
SRV314 Containerized App Development with AWS FargateSRV314 Containerized App Development with AWS Fargate
SRV314 Containerized App Development with AWS Fargate
 
Performance Tuning - MuraCon 2012
Performance Tuning - MuraCon 2012Performance Tuning - MuraCon 2012
Performance Tuning - MuraCon 2012
 
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
 
Introducing AWS Fargate
Introducing AWS FargateIntroducing AWS Fargate
Introducing AWS Fargate
 

Mais de Flink Forward

Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!Flink Forward
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesFlink Forward
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkFlink Forward
 
Large Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior DetectionLarge Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior DetectionFlink Forward
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 

Mais de Flink Forward (13)

Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
The top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scaleThe top 3 challenges running multi-tenant Flink at scale
The top 3 challenges running multi-tenant Flink at scale
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache Flink
 
Large Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior DetectionLarge Scale Real Time Fraudulent Web Behavior Detection
Large Scale Real Time Fraudulent Web Behavior Detection
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!Near real-time statistical modeling and anomaly detection using Flink!
Near real-time statistical modeling and anomaly detection using Flink!
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 

Último

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 

Último (20)

Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 

Practical learnings from running thousands of Flink jobs

  • 1. © 2022, Amazon Web Services, Inc. or its affiliates. © 2022, Amazon Web Services, Inc. or its affiliates. Practical learnings from running thousands of Flink jobs HongTeoh, Usamah Jassat Amazon Kinesis Data Analytics
  • 2. © 2022, Amazon Web Services, Inc. or its affiliates. Outline • Job Stability • State backend selection and tuning
  • 3. © 2022, Amazon Web Services, Inc. or its affiliates. Typical Flink cluster set up on Kubernetes Jobmanager Taskmanager Taskmanager Taskmanager
  • 4. © 2022, Amazon Web Services, Inc. or its affiliates. Job instability: Memory OOMKilled Using more than allocated to Flink configuration
  • 5. © 2022, Amazon Web Services, Inc. or its affiliates. Memory management (Instance/Container/Java) Instance (Kubernetes node) cAdvisor OS oom-killer Taskmanager container Java (Flink process) Other processes (e.g. Python, Kinesis producers)
  • 6. © 2022, Amazon Web Services, Inc. or its affiliates. Memory investigation learnings • Record both Java and cAdvisor memory metrics • Record both virtual + real memory use • OOM-Killer can cause “broken” containers
  • 7. © 2022, Amazon Web Services, Inc. or its affiliates. Flink – Java memory mapping Flink / Java Heap Metaspace Direct Mapped JNI Overhead Heap Metaspace Network Task Off-heap Managed Overhead -Xmx -XX:MetaspaceSize -XX:MaxDirectMemorySize NOT CONTROLLED Flink controls* NOT CONTROLLED Heap OOM Metaspace OOM OOMKilled Direct Memory OOM
  • 8. © 2022, Amazon Web Services, Inc. or its affiliates. Memory configuration learnings • jvm-overhead too low • Investigate native memory use using: • Native MemoryTracking -XX:NativeMemoryTracking=detail jcmd 1 VM.native_memory • jemalloc + jeprof
  • 9. © 2022, Amazon Web Services, Inc. or its affiliates. Job instability: “Hung” job / checkpoints
  • 10. © 2022, Amazon Web Services, Inc. or its affiliates. Thread dump: Thread stuck on SocketRead
  • 11. © 2022, Amazon Web Services, Inc. or its affiliates. Thread dump: Thread stuck on deadlock jackson-databind version < 2.12
  • 12. © 2022, Amazon Web Services, Inc. or its affiliates. State backends • State backends in Flink • Comparing HashMap and RocksDB • RocksDB Optimizations
  • 13. © 2022, Amazon Web Services, Inc. or its affiliates. State-backends in Flink
  • 14. © 2022, Amazon Web Services, Inc. or its affiliates. Comparing HashMap and RocksDB HashMap RocksDB Storage Java Heap Native Memory + Disk Ceiling Java Heap File System Incremental Checkpointing Yes (experimental) Yes R/W Performance Higher Lower* GC Impact Yes No In Flight Data Format Java Objects Serialized
  • 15. © 2022, Amazon Web Services, Inc. or its affiliates. RockDB or HashMap? • RocksDB for large state • Hash Map for performance
  • 16. © 2022, Amazon Web Services, Inc. or its affiliates. RocksDB Optimization Write Stalls • Increase RocksDB memory • Increase write threads • Increase write buffer memory* • Write buffer size + counts Slow Reads • Local Disk • Increase RocksDB memory • Increase read cache*
  • 17. © 2022, Amazon Web Services, Inc. or its affiliates. © 2022, Amazon Web Services, Inc. or its affiliates. Thank you!
  • 18. © 2022, Amazon Web Services, Inc. or its affiliates. Q&A

Notas do Editor

  1. In a typical setup of Flink cluster on Kubernetes JM containers TM containers
  2. Instance  2 taskmanager containers -> Java process, Python process, Kinesis producers Each measures memory differently: Java process measures heap, non-heap using beans. Only measure memory use of Java process, excludes Python/Kinesis Containers’ memory use is measured using cAdvisor On the instance, the OS bridges the gap between Virtual memory and Real memory Means that if the Java process requests 100 Gb of memory, that memory is allocated in virtual space In reality, it is mapped to a “zero page”, meaning it doesn’t actually take up 100Gb of space, until it is accessed/written to. In Kubernetes set ups, vm.overcommit_memory is recommended to be set to 1 -> Do not limit virtual memory. When all processes use the memory, the oom-killer (OS process) will spin up and terminate a process. Does not care what container it belongs to – can cause containers to be in unhealthy state. Takeaways Measure BOTH Java memory and OS memory (cAdvisor) Use data to drill down into which bucket of memory is over the limit cGroups v2 to treat all processes in container like a single process
  3. Java memory NOT JUST HEAP Metaspace – for classes Direct – directly use ByteBuffers -> significantly faster than allocating in heap Mapped – useful when reading from file -> Map a file into a ByteBuffer Java Native Interface (JNI) – when running non-Java code (RocksDB state backend) Overhead – used for the JVM itself (GC, thread stacks, Symbols) Map to Flink Taskmanager memory configurations Explain limits for each bucket When each limit is exceeded, this is how it surfaces Gotchas