The top 3 challenges running multi-tenant Flink at scale
Sharon Xie (sharon@decodable.co)
Founding Engineer at Decodable
Quick Intro of Decodable
Mental Model
● Connections to external data systems
● Streams of data records
● Pipelines that process data in streams
Flink at Decodable
● Runs the connections and pipelines jobs
● Not directly exposed to the users
● Mixed deployment modes for different use cases
Why Flink?
● Purposely designed for stream processing
● Proven to scale in production
● Mature community
Cool, now the challenges start
The paradox of choice
● Massive Flink configurations
● Different APIs, deployment modes
Cloud resource sharing is great but …
● Noisy neighbors
● Blast radius
● Security
● Observability
Challenge 1: Infra resource management
Problems
● Isolation VS resource sharing
● Cost 💸
● Developer productivity
Design principles
● Use managed services
● Start with max cost efficiency
● Easy to support max isolation
Start with max cost efficiency
[Diagram: cells share one VPC, K8S cluster, DB cluster, and Kafka cluster]
● VPC: subnets, SGs, ...
● K8S Cluster (EKS: cluster + nodes + ...)
○ Cell 1 K8s namespaces: decodable-cell-1-control, decodable-cell-1-data
○ Cell 2 K8s namespaces: decodable-cell-2-control, decodable-cell-2-data
● DB Cluster (RDS Aurora cluster + R/W instances): Cell 1, Cell 2
● Kafka Cluster (MSK, AWS Managed Kafka): topics for each cell
For max isolation deployment
[Diagram: a single cell with its own VPC, K8S Cluster, DB Cluster, and Kafka Cluster]
Managed by Terraform
module cell_1 {
  cluster  = module.clusters.cluster_1
  database = module.databases.database_1
  kafka    = module.kafkas.kafka_1
}
module cluster_1  { network = module.vpcs.1 }
module database_1 { network = module.vpcs.1 }
module kafka_1    { network = module.vpcs.1 }
● Centralized infrastructure management
● Easy to provision a new cell with reusable infra code
Challenge 2: Observability
Information Overload
● Each Flink job has a lot of metrics
○ Operator level, task level, job level, JobManager, TaskManagers
○ Now imagine we have a lot of jobs…
● Flink reports a lot of errors; they are useful for debugging but need classification
○ Connectivity issues with external systems
○ Internal temporary errors
○ Internal errors that require someone to fix (e.g., code errors)
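The three error categories above could be separated with a small rule-based classifier. The following is a minimal sketch, not the classifier from the talk: the category names, the regular expressions, and the `classify` function are all assumptions made for illustration.

```python
import re
from enum import Enum


class ErrorCategory(Enum):
    EXTERNAL_CONNECTIVITY = "external_connectivity"  # problem reaching an external system
    INTERNAL_TRANSIENT = "internal_transient"        # expected to clear on retry or restart
    INTERNAL_ACTIONABLE = "internal_actionable"      # someone has to fix it


# Hypothetical patterns, for illustration only; a real rule set would be
# derived from the errors actually observed in production.
RULES = [
    (re.compile(r"connection refused|unknownhost|timed out", re.IGNORECASE),
     ErrorCategory.EXTERNAL_CONNECTIVITY),
    (re.compile(r"checkpoint expired|taskmanager .* lost", re.IGNORECASE),
     ErrorCategory.INTERNAL_TRANSIENT),
]


def classify(message: str) -> ErrorCategory:
    """Return the category of the first rule whose pattern matches.

    Unmatched errors fall through to INTERNAL_ACTIONABLE so that unknown
    failures reach a person instead of being silently dropped.
    """
    for pattern, category in RULES:
        if pattern.search(message):
            return category
    return ErrorCategory.INTERNAL_ACTIONABLE
```

Defaulting to the "needs a fix" category is a deliberate choice in this sketch: a new, unrecognized error is noisier than it should be until a rule is added, but it is never hidden.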
Internal Observability
Goals
● Optimize for on-call engineers' QoL
● Actionable operational insights
What we did:
● Reduce the likelihood of configuration errors
○ Flink pods have the same configuration
○ Run Flink SQL jobs
● Key metrics / events to monitor
○ Successful completion of checkpoints
○ Time from activation until all tasks are running
○ Job failures / restarts
● Heavy use of Kubernetes tagging
User-facing Observability
Goals:
● Eliminate noise while knowing what’s going on
● Actionable error messages
● Integrate with users’ monitoring tools
What we did
● Expose job-level metrics
● Expose runtime errors that require user action
○ Error message classifier / processor
○ Custom metrics reporter to Kafka
● Streams for audit events & job metrics
Challenge 3: Authentication with external systems
This challenge is REALLY about: managing sensitive info in cloud services
🔥Disclaimer🔥
● Lots of omitted details
● Focus on the rationale
● Don’t just copy
Principles
● Principle of least privilege
● Minimize the # of services with access to the sensitive info
● Audit everything
Handling Sensitive info
● Rely on cloud provider role-based auth wherever possible
● Secrets are stored in AWS Secrets Manager
● APIs can only create secrets (can’t read)
● Secrets are only retrieved at activation time
● Flink pods don’t have access to the Secrets Manager
○ Secrets are created as k8s secrets and mounted to the pod running the job
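The activation-time flow above can be sketched as follows. This is an illustration under stated assumptions, not Decodable's implementation: `fetch_secret` is a hypothetical stand-in for the Secrets Manager lookup, and the `job-<id>` naming scheme is invented. The only fixed detail is that values in a Kubernetes Secret's `data` field are base64-encoded.

```python
import base64
from typing import Callable, Dict


def build_k8s_secret(name: str, namespace: str, values: Dict[str, str]) -> dict:
    """Build a Kubernetes Secret manifest; `data` values are base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


def activate_job(job_id: str, namespace: str,
                 fetch_secret: Callable[[str], Dict[str, str]]) -> dict:
    """Runs inside the one service that is allowed to read the secret store.

    `fetch_secret` is a hypothetical callable standing in for the Secrets
    Manager lookup. The job's pod never calls it; the pod only sees the
    Kubernetes Secret mounted into it.
    """
    values = fetch_secret(job_id)  # retrieved only at activation time
    return build_k8s_secret(f"job-{job_id}", namespace, values)
```

Passing the lookup in as a callable keeps the secret-store credentials out of everything except the activating service, which matches the goal of minimizing the number of services with access.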
Rationale
● Storing sensitive information together with other configuration (in RDS) is bad
○ Many services have access to RDS
○ Hard to enforce more granular access control
○ Increases the risk profile of RDS
● Only 1 service has access to the Secrets Manager
Parting Thoughts
To run Flink in a multi-tenant environment, you need deep knowledge about:
● Cloud infra/architecture
● Networking
● Operating Systems
● Distributed Systems
● …
It’s just hard!
2022
Build real-time data apps & services. Fast.
decodable.co



Editor's Notes

  1. What else? User management / access control. Multi-tenant resource management.
  2. No UDFs. The behavior of SQL operators is known and predictable. Stress the idea of knowing the behavior of these operators on most workloads, to make it clear we're not just YOLO'ing our config.
  3. The API service has the most privilege, as it can R/W secrets. It is still trapped in the cell when 💩 happens. Heavily audit activities in the API service. Easily remove the service account's assumed role when bad things happen.