© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Breaking Observability Chaos:
Best Practices to Monitor AWS
Cloud Native Apps
Jon Jozwiak
Solutions Architect
AWS
DEV311
Marcos Ortiz
Solutions Architect
AWS
Companies in every segment are transforming
Enterprises, Startups, B2B, B2C, PubSec
How customers are building modern applications
• Business logic focused
• Containers
• Serverless
• Breaking the monolith
• Distributed architectures
• API based
Today’s world centers around customer experience
• Create agility
• Empower teams
• Invent and innovate
• Happier customers
Observability is key to success
Observability: Metrics, Logs, Events, Traces
https://github.com/aws-samples/aws-iot-core-acmebots-monitoring
Architecture
[Diagram: User, Frontend, and Backend.]
Components: Amazon Cognito, App S3 bucket, AWS IoT, AWS Step Functions, Amazon DynamoDB, IoT certs and keys bucket.
Labeled flows (steps 1 to 3): Sign in; List things; Create/Delete things.
Other labels: Keys and certs; Provisioning; Metadata.
Running bots
[Diagram: the bots fleet runs as Amazon ECS containers (bots) on AWS Fargate. The user signs in through Amazon Cognito and uses the frontend (App S3 bucket).]
Bots fleet: 1. Load keys and certs (IoT certs and keys bucket); 2. Publish/Subscribe (AWS IoT).
Telemetry: publish data every 15 secs; subscribe.
Commands: 1. subscribe; 2. publish; 3. process; 4. publish.
Bot Logic
[State diagram] States: start, charging, working, standBy, end.
Battery life starts at 50% and the initial state is charging.
State notes: "Charges the battery." (shown twice); "Drains the battery."
Transition labels:
• If batteryLife < 100
• If batteryLife = 100
• If batteryLife > 15 or autoCharge = false
• If batteryLife < 15 and autoCharge = true
• stand by requested (shown twice)
• go to work requested
• shutdown (shown three times)
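The bot logic can be read as a small state machine. The Python sketch below is one possible reading, not the code from the sample repository. The names `Bot`, `tick`, and `command`, the one-point battery step per tick, the command names, and the mapping of each label to a specific transition (for example, that a full battery moves a charging bot to standBy, and that standBy also charges) are all assumptions.

```python
from dataclasses import dataclass

CHARGING, WORKING, STAND_BY, END = "charging", "working", "standBy", "end"


@dataclass
class Bot:
    # Per the slide: battery life starts at 50% and the initial state is charging.
    battery_life: float = 50.0
    state: str = CHARGING
    auto_charge: bool = True

    def tick(self, step: float = 1.0) -> str:
        """Advance one time step; `step` (battery points per tick) is an assumption."""
        if self.state == WORKING:
            # Working drains the battery.
            self.battery_life = max(0.0, self.battery_life - step)
            if self.battery_life < 15 and self.auto_charge:
                self.state = CHARGING
        elif self.state in (CHARGING, STAND_BY):
            # Assumed: both charging and standBy charge the battery.
            self.battery_life = min(100.0, self.battery_life + step)
            if self.state == CHARGING and self.battery_life == 100:
                # Assumed target of the "batteryLife = 100" transition.
                self.state = STAND_BY
        return self.state

    def command(self, name: str) -> str:
        """Apply a requested command: 'work', 'standBy', or 'shutdown' (names assumed)."""
        if self.state == END:
            return self.state
        if name == "shutdown":
            self.state = END
        elif name == "work":
            self.state = WORKING
        elif name == "standBy":
            self.state = STAND_BY
        return self.state
```

The slide leaves batteryLife = 15 unspecified; in this sketch the bot keeps working at exactly 15.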
AWS observability portfolio
Complete visibility of cloud resources and applications
• Monitor applications
• Respond to performance changes
• Optimize resource utilization
• Get a unified view of operational health
Analyze and debug production, distributed applications
• Identify performance bottlenecks
• Troubleshoot root cause
• Trace user requests
• For simple & complex applications
1―Operational insights
Operational excellence requires preparation
Align metrics to business needs
• How do I measure a bot's telemetry delay and packet size?
• How do I measure a bot's battery life over time?
How do I implement custom metrics?
[Diagram: Bots → publish (myThings/bot1/telemetry) → AWS IoT (myThings/+/telemetry) → Connectivity Rule (Event, Action) → publish → AcmeBots Backend → putMetricData → Amazon CloudWatch]
Telemetry payload:
...
[
  ...,
  {
    "recorded_at": 1538754230628,
    "version": "2.0",
    "status": "working",
    "batteryLife": 61.666659
  }
]
...
Custom metrics written with putMetricData:
Namespace   Metric                Dimensions
AcmeBots    batteryLife           bot
AcmeBots    telemetryDelay        bot
AcmeBots    telemetryPacketSize   bot
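A minimal sketch of how a backend function might turn one telemetry message into the three metrics in the table. The function names, the choice of units, the use of the last array element as the newest record, and the definitions of delay (receive time minus `recorded_at`) and packet size (payload length in bytes) are assumptions; the slides only name the metrics.

```python
import json
import time
from typing import Any, Dict, List, Optional

NAMESPACE = "AcmeBots"


def bot_from_topic(topic: str) -> str:
    """Extract the bot name from a topic shaped like myThings/<bot>/telemetry."""
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "myThings" or parts[2] != "telemetry":
        raise ValueError(f"unexpected topic: {topic}")
    return parts[1]


def build_metric_data(
    topic: str, payload: bytes, now_ms: Optional[int] = None
) -> List[Dict[str, Any]]:
    """Build CloudWatch MetricData entries from one telemetry payload."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    latest = json.loads(payload)[-1]  # assumed: last array element is the newest
    dimensions = [{"Name": "bot", "Value": bot_from_topic(topic)}]
    values = [
        ("batteryLife", latest["batteryLife"], "Percent"),
        ("telemetryDelay", now_ms - latest["recorded_at"], "Milliseconds"),
        ("telemetryPacketSize", len(payload), "Bytes"),
    ]
    return [
        {"MetricName": name, "Dimensions": dimensions, "Value": float(value), "Unit": unit}
        for name, value, unit in values
    ]


# With boto3 installed and credentials configured, the entries could be sent with:
#   boto3.client("cloudwatch").put_metric_data(Namespace=NAMESPACE, MetricData=entries)
```

Keeping the payload-to-metric mapping in a pure function makes it testable without AWS access; only the final call needs credentials.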
Operational insights takeaways
• Align your metrics to the business need
• Amazon CloudWatch Agent supports custom metrics
• Include dashboards as part of your deployment process
"Type" : "AWS::CloudWatch::Dashboard",
"Properties" : {
  "DashboardName" : String,
  "DashboardBody" : String
}
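As a sketch of what the two properties might hold, the Python helper below builds a dashboard resource with a single metric widget for the batteryLife metric. The helper name `dashboard_resource`, the widget layout, the title, the default region, and the period are illustrative assumptions. The one fixed point is that DashboardBody is a JSON document passed as a string.

```python
import json
from typing import Any, Dict, List


def dashboard_resource(name: str, bots: List[str], region: str = "us-east-1") -> Dict[str, Any]:
    """Return an AWS::CloudWatch::Dashboard resource with one battery-life widget."""
    widget = {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "Battery life per bot",
            "region": region,
            "stat": "Average",
            "period": 60,
            # Each entry: [namespace, metric name, dimension name, dimension value]
            "metrics": [["AcmeBots", "batteryLife", "bot", bot] for bot in bots],
        },
    }
    return {
        "Type": "AWS::CloudWatch::Dashboard",
        "Properties": {
            "DashboardName": name,
            # DashboardBody is a JSON string, not a nested object.
            "DashboardBody": json.dumps({"widgets": [widget]}),
        },
    }
```

Generating the resource alongside the application template keeps the dashboard versioned and deployed with the code it observes.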
2―Detecting and handling errors
“Everything fails all the time.”
Werner Vogels, Amazon CTO
How do I detect if a bot is not charging properly?
• Be aware of failures
• Notify the target audience
• React to failures automatically as much as possible
How do I detect if a bot is properly charging?
[Diagram: a scheduled event triggers searchLowBatteryBots, which reads from Amazon DynamoDB and feeds three paths: Logging (CloudWatch Log), Self Healing (myThings/<id>/cmd, to which bots subscribe), and Alert & Notify (alarm, Amazon SNS, email notification).]
1. Query low battery bots
2. Log details
3. Publish IoT command (forEachBot)
4. Raise alarm and notify
Amazon DynamoDB items:
thingName   batteryLife
bot1        95
bot2        5
bot3        45
bot4        10
bot5        100
bot6        85
bot7        35
bot8        50
bot9        25
bot10       0
Also calls putMetricData to write the
custom metric AcmeBots.lowBatteryCount.
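The four steps can be sketched as a pure Python function over the table rows. The threshold of 15 (borrowed from the bot logic slide), the command payload `{"command": "charge"}`, and the names `search_low_battery_bots` and `plan_actions` are assumptions; the slides do not show the function body.

```python
import json
from typing import Any, Dict, List, Tuple

LOW_BATTERY_THRESHOLD = 15  # assumed; borrowed from the bot logic slide


def search_low_battery_bots(
    items: List[Dict[str, Any]], threshold: float = LOW_BATTERY_THRESHOLD
) -> List[Dict[str, Any]]:
    """Step 1: keep only the bots whose battery life is below the threshold."""
    return [item for item in items if item["batteryLife"] < threshold]


def plan_actions(
    items: List[Dict[str, Any]]
) -> Tuple[List[str], List[Tuple[str, str]], Dict[str, Any]]:
    """Return log lines (step 2), IoT commands (step 3), and the metric behind the alarm (step 4)."""
    low = search_low_battery_bots(items)
    log_lines = [json.dumps({"event": "lowBattery", **bot}) for bot in low]
    # Hypothetical command payload published to each bot's command topic.
    commands = [
        (f"myThings/{bot['thingName']}/cmd", json.dumps({"command": "charge"}))
        for bot in low
    ]
    metric = {"MetricName": "lowBatteryCount", "Value": float(len(low)), "Unit": "Count"}
    return log_lines, commands, metric


# With boto3, each command could be sent with
#   boto3.client("iot-data").publish(topic=topic, payload=payload)
# and the metric with put_metric_data(Namespace="AcmeBots", MetricData=[metric]).
```

With the sample table and this threshold, bot2, bot4, and bot10 are selected, so lowBatteryCount is 3.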
Detecting and handling errors takeaways
Observability: Measure, Detect, Notify, Fix
• Amazon CloudWatch metrics
• Amazon CloudWatch alarms
• EC2 actions/auto-scaling, Amazon SNS
• Amazon SNS triggering AWS Lambda
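One way to connect the detect and notify stages is a CloudWatch alarm on the lowBatteryCount metric with an SNS topic as its action. The sketch below only builds the parameters; the alarm name, statistic, period, threshold, and the function name `low_battery_alarm` are assumptions chosen for illustration.

```python
from typing import Any, Dict


def low_battery_alarm(sns_topic_arn: str, threshold: float = 1.0) -> Dict[str, Any]:
    """Build put_metric_alarm parameters that notify an SNS topic on low-battery bots."""
    return {
        "AlarmName": "AcmeBots-lowBatteryCount",  # hypothetical name
        "Namespace": "AcmeBots",
        "MetricName": "lowBatteryCount",
        "Statistic": "Maximum",
        "Period": 300,  # seconds per evaluation window (assumed)
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        # Notify: the topic can fan out to email, or to AWS Lambda for the fix stage.
        "AlarmActions": [sns_topic_arn],
    }


# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**low_battery_alarm(topic_arn))
```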
3―Reacting to workload events
How can an end user or external application consume your workload's events?
• Is a given bot connected?
• What is its current status?
Make sure you can capture and react to your workload events
Set up rules to match specific events and route them appropriately
How do I track bot connectivity?
[Diagram: Bots (connect, disconnect) → AWS IoT ($aws/events/presence/#) → Connectivity Rule (Event, Action) → publish → AcmeBots Backend → updateItem → Amazon DynamoDB; AcmeBots Backend → putEvent → event → rule → External App]
$aws/events/presence/connected/bot1
"clientId" :"bot1",
"timestamp":1460065214626,
"eventType":"connected",
…
$aws/events/presence/disconnected/bot2
"clientId" :"bot2",
"timestamp":1460065214626,
"eventType":"disconnected",
…
Event sent with putEvent:
…
"source":"acmebots.connectivity",
"detail": {
  "clientId" :"bot1",
  "timestamp":1460065214626,
  "eventType":"connected",
  …
}
…
Amazon DynamoDB table before the events:
thingName   connected   lastSeenAt
bot1        false       1460064614626
bot2        true        1460065175626
After the events:
thingName   connected   lastSeenAt
bot1        true        1460065214626
bot2        false       1460065214626
Also calls putMetricData to write the
custom metric AcmeBots.eventsDelay.
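A sketch of what the backend might compute for each presence message: the table update, the custom event, and the eventsDelay value. The function name `handle_presence`, the `DetailType` string, and the definition of eventsDelay as processing time minus the event timestamp are assumptions; the field names come from the slide.

```python
import json
from typing import Any, Dict


def handle_presence(message: Dict[str, Any], now_ms: int) -> Dict[str, Any]:
    """Derive the DynamoDB item, the custom event, and the delay from a presence message."""
    return {
        # Attributes to write with updateItem, keyed by thingName.
        "item": {
            "thingName": message["clientId"],
            "connected": message["eventType"] == "connected",
            "lastSeenAt": message["timestamp"],
        },
        # Entry shaped for the CloudWatch Events / EventBridge PutEvents API.
        "event_entry": {
            "Source": "acmebots.connectivity",
            "DetailType": "bot connectivity change",  # hypothetical label
            "Detail": json.dumps(message),  # Detail is a JSON string
        },
        # Assumed definition of the eventsDelay metric, in milliseconds.
        "events_delay_ms": now_ms - message["timestamp"],
    }


# With boto3, the event could be sent with
#   boto3.client("events").put_events(Entries=[result["event_entry"]])
```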
How do I track bot status?
[Diagram: Bots → publish telemetry (myThings/bot1/telemetry) → AWS IoT (myThings/+/telemetry) → Status Rule (Event, Action) → publish → AcmeBots IoT Backend → updateItem → Amazon DynamoDB; AcmeBots IoT Backend → putEvent → event → rule → External App]
Event sent with putEvent:
…
"source":"acmebots.status",
"detail": {
  "clientId" :"bot1",
  "timestamp":1460065214626,
  "status" :"working",
  …
}
…
Amazon DynamoDB table before the event:
thingName   connected   lastSeenAt      status
bot1        true        1460065199626   standby
bot2        false       1534375553063   charging
After the event:
thingName   connected   lastSeenAt      status
bot1        true        1460065214626   working
bot2        false       1534375553063   charging
Also calls putMetricData to write the
custom metric AcmeBots.eventsDelay.
Reacting to workload events takeaways
Enables automation
Track and share state change
Time based and event based
AWS services and CloudTrail
Custom events support
Use rules to match and route events
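To illustrate matching and routing, the sketch below implements a deliberately simplified matcher in Python: each pattern field lists the accepted values, and nested objects are matched recursively. It models only exact-value matching, not the full event pattern language of the AWS service, and the names `matches`, `route`, and the target labels are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True if every field named in the pattern matches the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of accepted values.
            return False
    return True


def route(rules: List[Tuple[Dict[str, Any], str]], event: Dict[str, Any]) -> List[str]:
    """Return the targets of every rule whose pattern matches the event."""
    return [target for pattern, target in rules if matches(pattern, event)]
```

For example, a rule with pattern `{"source": ["acmebots.status"], "detail": {"status": ["working"]}}` would route only status events for bots that are working.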
4―Troubleshooting
• What are my service dependencies?
• Where should I improve?
• What are my bottlenecks and error rates?
Make sure you have end-to-end visibility of your workload
Quickly identify performance degradation and anomalies, including latency distribution
Troubleshooting takeaways
• Use Amazon CloudWatch Logs to collect, centralize, and search your logs
• Plan your log retention strategy
• View service maps
• Enable tracing visibility
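Logs are easier to search once centralized if each line is a JSON object. The sketch below uses only the Python standard library; the `JsonFormatter` class, the `fields` key passed through `extra`, and the logger name are assumptions, not part of the sample application.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        # Structured fields passed as extra={"fields": {...}} (a convention assumed here).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(name: str = "acmebots") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage:
#   make_logger().info("low battery", extra={"fields": {"thingName": "bot2", "batteryLife": 5}})
```

With lines in this shape, a CloudWatch Logs JSON filter pattern such as `{ $.batteryLife < 15 }` could select the low-battery entries.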
Best Practices
1. Plan your monitoring strategy working backwards from your business
2. Deliver your monitoring resources as part of your app
3. Collect all 4: Metrics, Logs, Events and Traces
4. Monitor everything: services, limits, costs, API interaction, etc.
5. Leverage other AWS services to enhance observability: AWS CloudTrail, AWS Config, AWS Trusted Advisor, Amazon Macie, Amazon GuardDuty, AWS Budgets, AWS Health, and Amazon RDS Performance Insights
Thank you!
Jon Jozwiak
Marcos Ortiz
https://github.com/aws-samples/aws-iot-core-acmebots-monitoring

Breaking Observability Chaos: Best Practices to Monitor AWS Cloud Native Apps (DEV311) - AWS re:Invent 2018
