2. Want to play in the World Cup? Take a close look at what your opponents are doing:
● Read online: https://landing.google.com/sre/book.html
● SRE Conference:
a. SREcon 2016 - Netflix: 190 Countries and 5 CORE SREs
b. SREcon16 - Performance Checklists for SREs
c. SREcon16 - The Realities of the Job of Delivering Reliability
d. SouthBay SRE: Cloud Capacity Planning - August 9th 2016
e. Site Reliability Engineering at Dropbox
6. There are no born gurus. Whoever has stepped on enough landmines, and managed to clear every one of them, becomes the guru.
-- Rick Hwang
Be warned that being an expert is more than understanding how a system is
supposed to work. Expertise is gained by investigating why a system doesn’t work.
-- Brian Redman
A system written by a god would have no landmines.
11. A process for troubleshooting
Problem Report → Triage → Examine → Diagnose → Test / Treat → Cure
Consider re-triaging if the situation changes.
37. Levels of Health Check
● Light / Static Health Check
● Layer Health Check
● Deep Health Check
Ref: 淺談系統監控與 AWS CloudWatch 的應用 (A Brief Look at System Monitoring and Using AWS CloudWatch)
38. Light / Static Health Check
[Diagram: Web App architecture with Route 53, ELB (Internet-Facing), Web Servers (ASG), Internal ELB, App Servers (ASG), Third Party Services (Service A, Service B), and a Health-Checker. Legend: Light Health Check, Layer Health Check, Deep Health Check.]
39. Layer Health Check
[Diagram: the same Web App architecture as on slide 38.]
40. Deep Health Check
[Diagram: the same Web App architecture as on slide 38.]
41. Levels of Health Check
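● Light / Static Health Check
○ The application itself is healthy, e.g. Tomcat or IIS is running normally
● Layer Health Check
○ Communication between one app and another is healthy, e.g. Tomcat to Redis
○ When something goes wrong, it helps pinpoint the failing node
● Deep Health Check
○ Confirms that the service's own business logic works: login, checkout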
Ref: 淺談系統監控與 AWS CloudWatch 的應用
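The three levels can be sketched as a small probe runner. This is only an illustration, not the deck's implementation: `HealthReport`, `run_health_checks`, and `first_failure` are hypothetical names, and each probe is assumed to be a plain callable that returns `True` when healthy (for example "Tomcat answers", "Tomcat can reach Redis", "login works").

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Probe = Callable[[], bool]  # assumption: a probe returns True when healthy


@dataclass
class HealthReport:
    light: bool            # the application process itself responds
    layer: Dict[str, bool]  # app-to-app links, e.g. {"redis": True}
    deep: Dict[str, bool]   # business flows, e.g. {"login": True}

    @property
    def healthy(self) -> bool:
        return self.light and all(self.layer.values()) and all(self.deep.values())


def _safe(probe: Probe) -> bool:
    """Run a probe; any exception counts as unhealthy."""
    try:
        return bool(probe())
    except Exception:
        return False


def run_health_checks(light: Probe,
                      layer: Dict[str, Probe],
                      deep: Dict[str, Probe]) -> HealthReport:
    """Run every probe at all three levels and collect the results."""
    return HealthReport(
        light=_safe(light),
        layer={name: _safe(p) for name, p in layer.items()},
        deep={name: _safe(p) for name, p in deep.items()},
    )


def first_failure(report: HealthReport) -> Optional[str]:
    """Return the shallowest failing check, or None if everything passed."""
    if not report.light:
        return "light"
    for name, ok in report.layer.items():
        if not ok:
            return f"layer:{name}"
    for name, ok in report.deep.items():
        if not ok:
            return f"deep:{name}"
    return None
```

Reporting the shallowest failure first mirrors the idea that a layer check helps pinpoint the failing node before anyone starts debugging business logic.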
42. Service Dependencies (Internal)
[Diagram: Service A, Service B, Service C, Service D, and Service E (Third Party)]
43. Levels of Health Check
● Light / Static Health Check: Application Self
● Layer Health Check: App to App
● Deep Health Check: Service Self
● Service Health Check: Service to Services
Ref: 淺談系統監控與 AWS CloudWatch 的應用
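A Service Health Check (service to services) can be sketched as a walk over a dependency map. The function name `unhealthy_dependencies` and the shape of the two dictionaries are assumptions for illustration; the slides do not specify how the dependencies are recorded.

```python
from typing import Dict, List


def unhealthy_dependencies(service: str,
                           own_health: Dict[str, bool],
                           depends_on: Dict[str, List[str]]) -> List[str]:
    """Return every unhealthy service reachable from `service`, itself included.

    own_health: result of each service's own (deep) health check.
    depends_on: direct dependencies of each service (hypothetical data).
    A service missing from own_health is treated as unhealthy.
    """
    seen = set()
    stack = [service]
    bad = []
    while stack:
        current = stack.pop()
        if current in seen:  # guards against shared and cyclic dependencies
            continue
        seen.add(current)
        if not own_health.get(current, False):
            bad.append(current)
        stack.extend(depends_on.get(current, []))
    return sorted(bad)
```

An empty result means the service and everything it depends on passed; a non-empty result names the services to look at first.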
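44. What Health Checks Are For
● When a finished application is handed to other teams (Test, Operation) for deployment, health checks confirm that the deployment is correct and serve as checkpoints
● During CD, a deployment can test itself
● When a request spans many systems, they are the basic reference points for isolating a problem, especially in a microservice architecture
● When the system misbehaves, they are the starting point for investigation
Ref: 淺談系統監控與 AWS CloudWatch 的應用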
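The CD use case can be sketched as a post-deploy gate that polls a health check until it passes or the attempts run out. This is a minimal sketch under assumptions: `wait_until_healthy` is a hypothetical helper, and `probe` stands in for whatever health check the deployed application exposes.

```python
import time
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool],
                       attempts: int = 5,
                       delay_seconds: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` up to `attempts` times; return True once it reports healthy."""
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except Exception:
            pass  # a probe that raises counts as "not healthy yet"
        if attempt < attempts:
            sleep(delay_seconds)  # wait before the next try, not after the last
    return False
```

A pipeline step could call this right after deployment and fail the release, or trigger a rollback, when it returns `False`. The `sleep` parameter is injectable only so the helper can be exercised without real waiting.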