SlideShare uma empresa Scribd logo
1 de 24
Baixar para ler offline
Calico のデプロイを
ミスって本番クラスタを
壊しそうになった話
2021/03/12
Cloud Native Days Online
WHO I AM?
- Name: Kawabe Katsuya
- Team: CyberAgent Group Infrastructure Unit
- Position: Software, Infra Engineer, 2020 New Graduate
- Hobby: Music, Comic
ABOUT US
@KKawabe108
1. What Happen
2. How To Resolve
TABLE
CONTENTS
問題発覚編: calico のデプロイをミスったことに
よって発生した事象について
解決編: プロダクト側への説明、監視の見直し、
再発防止への取り組み
図参照: https://docs.projectcalico.org/reference/architecture/overview
CNI: calico
アラートはいつも突然に
AKE での監視
Victoria Metrics で複数のクラスタを監視しています
What Happened
Ingress とノードの
BGP ピアがダウンしたというアラートが大量発生
Kubectl get po -A をすると Master ノードに乗っている
Pod がほとんど Evicted されていた 😇
Master ノードに ssh すると、どうやらディスク領域が
圧迫され、Eviction の閾値に到達していた
What Happened: Ingress の実装
Node
calico-node
Node
exporter
Node
calico-node
Node
calico-node
nginx ctrl nginx ctrl nginx ctrl
Node
calico-node
Node
calico-node
calico-node
exporter exporter
Big IP VS
BGP Routing
What Happened: Ingress の実装
Node
calico-node
Node
exporter
Node
calico-node
Node
calico-node
nginx ctrl nginx ctrl nginx ctrl
Node
calico-node
Node
calico-node
calico-node
exporter exporter
Big IP VS
BGP Routing
BGP Link is
Down
なぜ、Diskが圧迫されたのか
What Happened : ipamhandles リソースの爆発
calico-ipam が使用する Pod と IPを紐づけるリソース
通常、calico が Pod の作成と削除に合わせ
制御するリソースのはずだったが・・・
What Happened : ipamhandles リソースの爆発
😇
kubectl get をすると API サーバがメモリを食い潰して死ぬので etcdctl でチェックしてる
ちなみに、Pod の数はおよそ30個ぐらい
What Happened : ipamhandles リソースの爆発
😇
kubectl get をすると API サーバがメモリを食い潰して死ぬので etcdctl でチェックしてる
ちなみに、Pod の数はおよそ30個ぐらい
APIサーバが操作できない =
クラスタが操作不能 =
ヤバイ
What Happened : ipamhandles リソースの爆発
etcd backup
2GB
/var/backup/etcd
etcd backup
2GB
etcd backup
2GB
systemd
🧨
🧨 🧨
ipamhandles の爆発によって、etcd のバックアップデータが肥大化し、
20GB しかないディスクの圧迫へと繋がった
What Happened : まとめ
Step 1 Step 2 Step 3 Step 4
calico の ipamhandles が
爆発する
etcd のバックアップデータが
増加する (2GB)
Master のディスクが圧迫されて
calico-node とその他が Evicted
される
calico-node がダウンしたことに
より、BGPピアが切断され、
アラート発砲
原因と対応
How To Resolve
calico-kube-controllers というコンポーネント
をデプロイしていなかった
calico-node の ClusterRole の権限が
間違っていた
図参照: https://docs.projectcalico.org/reference/architecture/overview
How To Resolve: 反省点
calico-node に delete の権限を渡していないせいで
GCが発生していなかった
元々、3.8 のマニフェストをベースに弄っていたので
発生したミス
新しいマニフェストを公式から落としてそれをベースにすれば
今回のようなミスは発生しなかった
How To Resolve : プロダクトへの対応
今回、アラートが上がったのは監視用に立てているクラスタでプロダクトが利用しているクラスタでは
Pod がそこまで頻繁に作成削除されていなかったので、肥大化はディスクに影響が出るほどではなかった
すぐに事情を説明して、マニフェストの修正を行なった
発生するリスクは抑えたということを確認して、対応終了
How To Resolve: 監視基盤の対応
監視基盤で全クラスタで etcd の db size
と、オブジェクト数を監視するようにした
ディスクサイズのアラートが
Eviction Policy と同等だったので、
それより低く設定し直す
How To Resolve: calico のアップデートについて
極力アップデートしなくていいならしない
マニフェストは基本公式のものをそのまま使うので問題はないはず (Pod CIDR や IPIP の有効化フラグぐらい)
マニフェストが大きいので修正のレビューは三重チェックで丁度いい (それぐらいCNIはクリティカル)
CNI のアップデートには細心の
注意を払いましょう!
Thank you for listening !

Mais conteúdo relacionado

Mais procurados

Using Qt under LGPLv3
Using Qt under LGPLv3Using Qt under LGPLv3
Using Qt under LGPLv3
Burkhard Stubert
 

Mais procurados (20)

Stargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動する
Stargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動するStargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動する
Stargz Snapshotter: イメージのpullを省略しcontainerdでコンテナを高速に起動する
 
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz SnapshotterThe overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
The overview of lazypull with containerd Remote Snapshotter & Stargz Snapshotter
 
Startup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image DistributionStartup Containers in Lightning Speed with Lazy Image Distribution
Startup Containers in Lightning Speed with Lazy Image Distribution
 
Starting up Containers Super Fast With Lazy Pulling of Images
Starting up Containers Super Fast With Lazy Pulling of ImagesStarting up Containers Super Fast With Lazy Pulling of Images
Starting up Containers Super Fast With Lazy Pulling of Images
 
Learning kubernetes
Learning kubernetesLearning kubernetes
Learning kubernetes
 
Build and Run Containers With Lazy Pulling - Adoption status of containerd St...
Build and Run Containers With Lazy Pulling - Adoption status of containerd St...Build and Run Containers With Lazy Pulling - Adoption status of containerd St...
Build and Run Containers With Lazy Pulling - Adoption status of containerd St...
 
Shifter singularity - june 7, 2018 - bw symposium
Shifter  singularity - june 7, 2018 - bw symposiumShifter  singularity - june 7, 2018 - bw symposium
Shifter singularity - june 7, 2018 - bw symposium
 
Distributed tensorflow on kubernetes
Distributed tensorflow on kubernetesDistributed tensorflow on kubernetes
Distributed tensorflow on kubernetes
 
Introduction and Deep Dive Into Containerd
Introduction and Deep Dive Into ContainerdIntroduction and Deep Dive Into Containerd
Introduction and Deep Dive Into Containerd
 
KubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
KubeCon EU 2016: Secure, Cloud-Native Networking with Project CalicoKubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
KubeCon EU 2016: Secure, Cloud-Native Networking with Project Calico
 
Kubernetes for Java developers
Kubernetes for Java developersKubernetes for Java developers
Kubernetes for Java developers
 
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
KubeCon EU 2016: What is OpenStack's role in a Kubernetes world?
 
SCALE 2011 Deploying OpenStack with Chef
SCALE 2011 Deploying OpenStack with ChefSCALE 2011 Deploying OpenStack with Chef
SCALE 2011 Deploying OpenStack with Chef
 
Kubernetes extensibility
Kubernetes extensibilityKubernetes extensibility
Kubernetes extensibility
 
ArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech Talk
ArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech TalkArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech Talk
ArgoCD and Tekton: Match made in Kubernetes heaven | DevNation Tech Talk
 
[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKit
[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKit[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKit
[KubeConEU] Building images efficiently and securely on Kubernetes with BuildKit
 
Rtl sdr software defined radio
Rtl sdr   software defined radioRtl sdr   software defined radio
Rtl sdr software defined radio
 
Using Qt under LGPLv3
Using Qt under LGPLv3Using Qt under LGPLv3
Using Qt under LGPLv3
 
What's new in FreeBSD 10
What's new in FreeBSD 10What's new in FreeBSD 10
What's new in FreeBSD 10
 
Cantainer CI/ CD with Kubernetes
Cantainer CI/ CD with KubernetesCantainer CI/ CD with Kubernetes
Cantainer CI/ CD with Kubernetes
 

Semelhante a 【CNDO2021】Calicoのデプロイをミスって本番クラスタを壊しそうになった話

Cms forum, future of Web Content Management
Cms forum, future of Web Content ManagementCms forum, future of Web Content Management
Cms forum, future of Web Content Management
guest88136a
 
Moving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloudMoving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloud
KubeAcademy
 
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesWindows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Peter Hlavaty
 
ドワンゴでのScala活用事例「ニコニコandroid」
ドワンゴでのScala活用事例「ニコニコandroid」ドワンゴでのScala活用事例「ニコニコandroid」
ドワンゴでのScala活用事例「ニコニコandroid」
Satoshi Goto
 

Semelhante a 【CNDO2021】Calicoのデプロイをミスって本番クラスタを壊しそうになった話 (20)

[Wroclaw #7] Why So Serial?
[Wroclaw #7] Why So Serial?[Wroclaw #7] Why So Serial?
[Wroclaw #7] Why So Serial?
 
Orchestrating Cloud Events - Knative Meetup 2020
Orchestrating Cloud Events - Knative Meetup 2020Orchestrating Cloud Events - Knative Meetup 2020
Orchestrating Cloud Events - Knative Meetup 2020
 
Qa in production singular 2019
Qa in production   singular 2019Qa in production   singular 2019
Qa in production singular 2019
 
Future of WCM - CM Forum Belgium
Future of WCM - CM Forum BelgiumFuture of WCM - CM Forum Belgium
Future of WCM - CM Forum Belgium
 
Cms forum, future of Web Content Management
Cms forum, future of Web Content ManagementCms forum, future of Web Content Management
Cms forum, future of Web Content Management
 
KubeCon 2017 Zero Touch Provision
KubeCon 2017 Zero Touch ProvisionKubeCon 2017 Zero Touch Provision
KubeCon 2017 Zero Touch Provision
 
Kubecon 2017 Zero Touch Kubernetes
Kubecon 2017 Zero Touch KubernetesKubecon 2017 Zero Touch Kubernetes
Kubecon 2017 Zero Touch Kubernetes
 
Immutable Infrastructure & Rethinking Configuration - Interop 2019
Immutable Infrastructure & Rethinking Configuration - Interop 2019Immutable Infrastructure & Rethinking Configuration - Interop 2019
Immutable Infrastructure & Rethinking Configuration - Interop 2019
 
Moving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloudMoving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloud
 
Moving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloudMoving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloud
 
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytesWindows Kernel Exploitation : This Time Font hunt you down in 4 bytes
Windows Kernel Exploitation : This Time Font hunt you down in 4 bytes
 
【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】
【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】
【HITCON FreeTalk 2022 - 我把在網頁框架發現的密碼學漏洞變成 CTF 題了】
 
There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018
There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018
There and Back Again (My DevOps journey) - DevOps Days Copenhagen 2018
 
Testability for developers – Fighting a mess by making it testable
Testability for developers – Fighting a mess by making it testableTestability for developers – Fighting a mess by making it testable
Testability for developers – Fighting a mess by making it testable
 
Recreating "The Clock" with Machine Learning and Web Scraping
Recreating "The Clock" with Machine Learning and Web ScrapingRecreating "The Clock" with Machine Learning and Web Scraping
Recreating "The Clock" with Machine Learning and Web Scraping
 
Basics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDC
Basics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDCBasics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDC
Basics of Kubernetes on BOSH: Run Production-grade Kubernetes on the SDDC
 
Debugging Go in Kubernetes
Debugging Go in KubernetesDebugging Go in Kubernetes
Debugging Go in Kubernetes
 
Simplifying Real Time Data Analytics with Docker, IoT & Cloud
Simplifying Real Time Data Analytics with Docker, IoT & CloudSimplifying Real Time Data Analytics with Docker, IoT & Cloud
Simplifying Real Time Data Analytics with Docker, IoT & Cloud
 
Build and Monitor Machine Learning Services in Kubernetes
Build and Monitor Machine Learning Services in KubernetesBuild and Monitor Machine Learning Services in Kubernetes
Build and Monitor Machine Learning Services in Kubernetes
 
ドワンゴでのScala活用事例「ニコニコandroid」
ドワンゴでのScala活用事例「ニコニコandroid」ドワンゴでのScala活用事例「ニコニコandroid」
ドワンゴでのScala活用事例「ニコニコandroid」
 

Último

The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 

Último (20)

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 

【CNDO2021】Calicoのデプロイをミスって本番クラスタを壊しそうになった話

  • 2. WHO I AM? - Name: Kawabe Katsuya - Team: CyberAgent Group Infrastructure Unit - Position: Software, Infra Engineer, 2020 New Graduate - Hobby: Music, Comic ABOUT US @KKawabe108
  • 3. 1. What Happen 2. How To Resolve TABLE CONTENTS 問題発覚編: calico のデプロイをミスったことに よって発生した事象について 解決編: プロダクト側への説明、監視の見直し、 再発防止への取り組み
  • 4.
  • 7. AKE での監視 Victoria Metrics で複数のクラスタを監視しています
  • 8. What Happened Ingress とノードの BGP ピアがダウンしたというアラートが大量発生 Kubectl get po -A をすると Master ノードに乗っている Pod がほとんど Evicted されていた 😇 Master ノードに ssh すると、どうやらディスク領域が 圧迫され、Eviction の閾値に到達していた
  • 9. What Happened: Ingress の実装 Node calico-node Node exporter Node calico-node Node calico-node nginx ctrl nginx ctrl nginx ctrl Node calico-node Node calico-node calico-node exporter exporter Big IP VS BGP Routing
  • 10. What Happened: Ingress の実装 Node calico-node Node exporter Node calico-node Node calico-node nginx ctrl nginx ctrl nginx ctrl Node calico-node Node calico-node calico-node exporter exporter Big IP VS BGP Routing BGP Link is Down
  • 12. What Happened : ipamhandles リソースの爆発 calico-ipam が使用する Pod と IPを紐づけるリソース 通常、calico が Pod の作成と削除に合わせ 制御するリソースのはずだったが・・・
  • 13. What Happened : ipamhandles リソースの爆発 😇 kubectl get をすると API サーバがメモリを食い潰して死ぬので etcdctl でチェックしてる ちなみに、Pod の数はおよそ30個ぐらい
  • 14. What Happened : ipamhandles リソースの爆発 😇 kubectl get をすると API サーバがメモリを食い潰して死ぬので etcdctl でチェックしてる ちなみに、Pod の数はおよそ30個ぐらい APIサーバが操作できない = クラスタが操作不能 = ヤバイ
  • 15. What Happened : ipamhandles リソースの爆発 etcd backup 2GB /var/backup/etcd etcd backup 2GB etcd backup 2GB systemd 🧨 🧨 🧨 ipamhandles の爆発によって、etcd のバックアップデータが肥大化し、 20GB しかないディスクの圧迫へと繋がった
  • 16. What Happened : まとめ Step 1 Step 2 Step 3 Step 4 calico の ipamhandles が 爆発する etcd のバックアップデータが 増加する (2GB) Master のディスクが圧迫されて calico-node とその他が Evicted される calico-node がダウンしたことに より、BGPピアが切断され、 アラート発砲
  • 18. How To Resolve calico-kube-controllers というコンポーネント をデプロイしていなかった calico-node の ClusterRole の権限が 間違っていた 図参照: https://docs.projectcalico.org/reference/architecture/overview
  • 19. How To Resolve: 反省点 calico-node に delete の権限を渡していないせいで GCが発生していなかった 元々、3.8 のマニフェストをベースに弄っていたので 発生したミス 新しいマニフェストを公式から落としてそれをベースにすれば 今回のようなミスは発生しなかった
  • 20. How To Resolve : プロダクトへの対応 今回、アラートが上がったのは監視用に立てているクラスタでプロダクトが利用しているクラスタでは Pod がそこまで頻繁に作成削除されていなかったので、肥大化はディスクに影響が出るほどではなかった すぐに事情を説明して、マニフェストの修正を行なった 発生するリスクは抑えたということを確認して、対応終了
  • 21. How To Resolve: 監視基盤の対応 監視基盤で全クラスタで etcd の db size と、オブジェクト数を監視するようにした ディスクサイズのアラートが Eviction Policy と同等だったので、 それより低く設定し直す
  • 22. How To Resolve: calico のアップデートについて 極力アップデートしなくていいならしない マニフェストは基本公式のものをそのまま使うので問題はないはず (Pod CIDR や IPIP の有効化フラグぐらい) マニフェストが大きいので修正のレビューは三重チェックで丁度いい (それぐらいCNIはクリティカル)
  • 24. Thank you for listening !