AWS re:Invent 2018 (DEV311)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Breaking Observability Chaos: Best Practices to Monitor AWS Cloud Native Apps
DEV311
Jon Jozwiak, Solutions Architect, AWS
Marcos Ortiz, Solutions Architect, AWS
Companies in every segment are transforming
Enterprises, Startups, B2B, B2C, PubSec
How customers are building modern applications
• Business logic focused
• Containers
• Serverless
• Breaking the monolith
• Distributed architectures
• API based
Today’s world centers around customer experience
Create agility
Empower teams
Invent and innovate
Happier customers
Observability is key to success
Observability: Metrics, Logs, Events, Traces
https://github.com/aws-samples/aws-iot-core-acmebots-monitoring
Architecture (diagram)
Components: User, App S3 bucket, Amazon Cognito, AWS IoT, Amazon DynamoDB, AWS Step Functions, IoT certs and keys bucket
Groups: Frontend, Backend
Actions: Sign in, Create/Delete things, List things
Data labels: Keys and certs, Provisioning, Metadata
Running bots (diagram)
Components: User, App S3 bucket (Frontend), Amazon Cognito (Sign in), AWS IoT, IoT certs and keys bucket, Amazon ECS containers (bots) on AWS Fargate (Bots fleet)
Bot startup: 1. Load keys and certs, 2. Publish/Subscribe
Telemetry: publish data every 15 secs; subscribe
Commands: 1. subscribe (shown twice), 2. publish, 3. process, 4. publish
Bot Logic (state diagram)
States: charging, working, standBy (between start and end)
Battery life starts at 50% and initial state is charging.
State notes: Charges the battery (shown twice). Drains the battery.
Transition labels:
• If batteryLife < 100
• If batteryLife = 100
• If batteryLife > 15 or autoCharge = false
• If batteryLife < 15 and autoCharge = true
• stand by requested (shown twice)
• go to work requested
• shutdown (shown three times)
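The flattened diagram does not say which arrow each label belongs to, so the following Python sketch is only one plausible reading. It assumes that charging moves to standBy at 100%, that working moves to charging below 15% when autoCharge is true, that standBy leaves the battery unchanged, and that "go to work" is accepted only from standBy. The `Bot` class, `tick`, `request`, and the 5-point step size are hypothetical names and values, and shutdown is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Bot:
    battery_life: float = 50.0  # slide: battery life starts at 50%
    state: str = "charging"     # slide: initial state is charging
    auto_charge: bool = True


def tick(bot: Bot, step: float = 5.0) -> Bot:
    """Advance the bot by one time step (the step size is an assumption)."""
    if bot.state == "charging":
        bot.battery_life = min(100.0, bot.battery_life + step)
        if bot.battery_life == 100.0:
            bot.state = "standBy"  # assumed target of "batteryLife = 100"
    elif bot.state == "working":
        bot.battery_life = max(0.0, bot.battery_life - step)
        # The slide leaves batteryLife == 15 unspecified; here the bot keeps working.
        if bot.battery_life < 15 and bot.auto_charge:
            bot.state = "charging"
    # Assumption: standBy neither charges nor drains the battery.
    return bot


def request(bot: Bot, command: str) -> Bot:
    """Apply a user command; the command names are hypothetical."""
    if command == "standBy" and bot.state in ("charging", "working"):
        bot.state = "standBy"
    elif command == "work" and bot.state == "standBy":
        bot.state = "working"
    return bot
```

With these assumptions, a fresh bot reaches standBy after ten ticks, and a working bot with autoCharge enabled returns to charging once its battery drops below 15.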
AWS observability portfolio
Complete visibility of cloud resources and applications
• Monitor applications
• Respond to performance changes
• Optimize resource utilization
• Get a unified view of operational health
Analyze and debug production, distributed applications
• Identify performance bottlenecks
• Troubleshoot root cause
• Trace user requests
• For simple & complex applications
1―Operational insights
Operational excellence requires preparation
Align metrics to business needs
How do I measure a bot’s telemetry delay and packet size?
How do I measure a bot’s battery life over time?
How do I implement custom metrics?
Flow (diagram): Bots publish telemetry to AWS IoT on their own topic (for example, myThings/bot1/telemetry). AWS IoT publishes messages matching myThings/+/telemetry to the AcmeBots Backend, which calls putMetricData on Amazon CloudWatch.
Diagram legend: Event, Action, Connectivity, Rule
Sample telemetry payload:
...
[
  ...,
  {
    "recorded_at": 1538754230628,
    "version": "2.0",
    "status": "working",
    "batteryLife": 61.666659
  }
]
...
Custom metrics:
Namespace   Metric                Dimensions
AcmeBots    batteryLife           bot
AcmeBots    telemetryDelay        bot
AcmeBots    telemetryPacketSize   bot
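As a rough illustration of the putMetricData step, the Python sketch below turns one telemetry message into the three metrics listed above. It assumes the payload is a JSON array whose last element is the newest record, that telemetryDelay is the gap between `recorded_at` and processing time, and that telemetryPacketSize is the payload length in bytes; the slides do not confirm these definitions. `build_metric_data` and `publish_metrics` are hypothetical helpers, and `cloudwatch` is expected to be a boto3 CloudWatch client (or any object with a compatible `put_metric_data` method).

```python
import json
import time

NAMESPACE = "AcmeBots"  # namespace shown on the slide


def build_metric_data(bot_id, payload_bytes, now_ms=None):
    """Build the MetricData list for one telemetry message."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    records = json.loads(payload_bytes)
    latest = records[-1]  # assumption: the newest record comes last
    dimensions = [{"Name": "bot", "Value": bot_id}]
    return [
        {"MetricName": "batteryLife", "Dimensions": dimensions,
         "Value": float(latest["batteryLife"]), "Unit": "Percent"},
        {"MetricName": "telemetryDelay", "Dimensions": dimensions,
         "Value": float(now_ms - latest["recorded_at"]), "Unit": "Milliseconds"},
        {"MetricName": "telemetryPacketSize", "Dimensions": dimensions,
         "Value": float(len(payload_bytes)), "Unit": "Bytes"},
    ]


def publish_metrics(cloudwatch, bot_id, payload_bytes):
    """Send the metrics through a CloudWatch client passed in by the caller."""
    cloudwatch.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=build_metric_data(bot_id, payload_bytes),
    )
```

Passing the client in as an argument keeps the metric-building logic testable without AWS credentials.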
Operational insights takeaways
• Align your metrics to the business need
• Amazon CloudWatch Agent supports custom metrics
• Include dashboards as part of your deployment process
"Type" : "AWS::CloudWatch::Dashboard",
"Properties" : {
  "DashboardName" : String,
  "DashboardBody" : String
}
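The DashboardBody property is a JSON string, so one way to keep a dashboard in the deployment pipeline is to generate that string from code. The Python sketch below builds a single metric widget for the batteryLife metric. The widget title, size, period, region default, and the helper names `dashboard_body` and `dashboard_resource` are illustrative assumptions, not values from the talk.

```python
import json


def dashboard_body(bot_ids, region="us-east-1"):
    """Return a DashboardBody JSON string with one battery-life widget."""
    widget = {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "Bot battery life",  # illustrative title
            "region": region,
            "stat": "Average",
            "period": 60,
            # One series per bot: [namespace, metric, dimension name, dimension value]
            "metrics": [["AcmeBots", "batteryLife", "bot", bot_id]
                        for bot_id in bot_ids],
        },
    }
    return json.dumps({"widgets": [widget]})


def dashboard_resource(name, bot_ids):
    """Return a resource definition shaped like the snippet above."""
    return {
        "Type": "AWS::CloudWatch::Dashboard",
        "Properties": {
            "DashboardName": name,
            "DashboardBody": dashboard_body(bot_ids),
        },
    }
```

The returned dictionary can be serialized into a template's Resources section alongside the rest of the application.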
2―Detecting and handling errors
“Everything fails all the time.”
Werner Vogels, Amazon CTO
How do I detect if a bot is not properly charging?
Be aware of failures
Notify target audience
React to failures automatically as much as possible
How do I detect if a bot is properly charging?
Flow (diagram): a scheduled event triggers searchLowBatteryBots, which performs these steps:
1. Query low battery bots (Amazon DynamoDB)
2. Log details (CloudWatch Log)
3. Publish IoT command to myThings/<id>/cmd, for each bot; bots subscribe to this topic
4. Raise alarm and notify (alarm, Amazon SNS, Email notification)
Diagram groups: Logging, Alert & Notify, Self Healing
Amazon DynamoDB table:
thingName   batteryLife
bot1        95
bot2        5
bot3        45
bot4        10
bot5        100
bot6        85
bot7        35
bot8        50
bot9        25
bot10       0
Also calls putMetricData to write the custom metric AcmeBots.lowBatteryCount.
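The steps above can be sketched in Python as follows. The threshold of 15 is borrowed from the bot logic slide and is an assumption here, as are the command payload `{"command": "charge"}`, the QoS level, and the helper names. `iot_data` and `cloudwatch` stand for boto3 `iot-data` and `cloudwatch` clients (or compatible test doubles); the alarm and the Amazon SNS email would be configured on the lowBatteryCount metric rather than in this code.

```python
import json

LOW_BATTERY_THRESHOLD = 15  # assumption: taken from the bot logic slide


def search_low_battery_bots(items, threshold=LOW_BATTERY_THRESHOLD):
    """Return the names of bots whose batteryLife is below the threshold."""
    return [item["thingName"] for item in items
            if item["batteryLife"] < threshold]


def build_charge_command(thing_name):
    """Build publish arguments for the bot's command topic."""
    return {
        "topic": f"myThings/{thing_name}/cmd",
        "qos": 1,
        "payload": json.dumps({"command": "charge"}),  # hypothetical payload
    }


def handle_scheduled_event(items, iot_data, cloudwatch, log=print):
    """Query, log, publish a command per bot, then record the count."""
    low = search_low_battery_bots(items)
    log(json.dumps({"lowBatteryBots": low}))
    for thing_name in low:
        iot_data.publish(**build_charge_command(thing_name))
    cloudwatch.put_metric_data(
        Namespace="AcmeBots",
        MetricData=[{"MetricName": "lowBatteryCount",
                     "Value": float(len(low)), "Unit": "Count"}],
    )
    return low
```

Run against the table on the slide, this selects bot2, bot4, and bot10.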
Detecting and handling errors takeaways
Observability
Measure: Amazon CloudWatch metrics
Detect: Amazon CloudWatch alarms
Notify: EC2 actions/auto-scaling, Amazon SNS
Fix: Amazon SNS triggering AWS Lambda
3―Reacting to workload events
Make sure you can capture and react to your workload events
Set up rules to match specific events and route them appropriately
How can an end-user or external application consume your workload’s events?
Is a given bot connected?
What is its current status?
How do I track bot connectivity?
Flow (diagram): Bots connect to and disconnect from AWS IoT, which publishes presence events under $aws/events/presence/#. The AcmeBots Backend receives them, calls updateItem on Amazon DynamoDB, and calls putEvent to emit a custom event. A rule delivers that event to an External App.
Diagram legend: Event, Action, Connectivity, Rule
Presence event on $aws/events/presence/connected/bot1:
"clientId" :"bot1",
"timestamp":1460065214626,
"eventType":"connected",
…
Presence event on $aws/events/presence/disconnected/bot2:
"clientId" :"bot2",
"timestamp":1460065214626,
"eventType":"disconnected",
…
Custom event:
…
"source":"acmebots.connectivity",
"detail": {
  "clientId" :"bot1",
  "timestamp":1460065214626,
  "eventType":"connected",
  …
}
…
Amazon DynamoDB table before the events:
thingName   connected   lastSeenAt
bot1        false       1460064614626
bot2        true        1460065175626
Amazon DynamoDB table after the events:
thingName   connected   lastSeenAt
bot1        true        1460065214626
bot2        false       1460065214626
Also calls putMetricData to write the
custom metric AcmeBots.eventsDelay.
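A minimal Python sketch of the updateItem step, assuming the table is keyed on thingName and stores `connected` as a boolean and `lastSeenAt` as a number, as the table above suggests. The table name `AcmeBotsThings` and both helper names are hypothetical. The returned dictionary is shaped for the low-level boto3 DynamoDB client, so it could be passed as `dynamodb.update_item(**params)`.

```python
PRESENCE_PREFIX = "$aws/events/presence/"


def parse_presence_topic(topic):
    """Split a presence topic into (event_type, client_id)."""
    if not topic.startswith(PRESENCE_PREFIX):
        raise ValueError(f"not a presence topic: {topic}")
    event_type, _, client_id = topic[len(PRESENCE_PREFIX):].partition("/")
    if event_type not in ("connected", "disconnected") or not client_id:
        raise ValueError(f"unexpected presence topic: {topic}")
    return event_type, client_id


def build_update_item(event, table_name="AcmeBotsThings"):
    """Build update_item arguments from a presence event payload."""
    return {
        "TableName": table_name,  # hypothetical table name
        "Key": {"thingName": {"S": event["clientId"]}},
        # Attribute-name placeholders avoid clashes with reserved words.
        "UpdateExpression": "SET #connected = :connected, #seen = :seen",
        "ExpressionAttributeNames": {
            "#connected": "connected",
            "#seen": "lastSeenAt",
        },
        "ExpressionAttributeValues": {
            ":connected": {"BOOL": event["eventType"] == "connected"},
            ":seen": {"N": str(event["timestamp"])},
        },
    }
```

Applied to the two sample events, this yields the "after" rows shown above: bot1 connected and bot2 disconnected, both with lastSeenAt 1460065214626.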
How do I track bot status?
Flow (diagram): Bots publish telemetry to AWS IoT on their own topic (for example, myThings/bot1/telemetry). AWS IoT publishes messages matching myThings/+/telemetry to the AcmeBots IoT Backend, which calls updateItem on Amazon DynamoDB and calls putEvent to emit a custom event. A rule delivers that event to an External App.
Diagram legend: Event, Action, Status, Rule
Custom event:
…
"source":"acmebots.status",
"detail": {
  "clientId" :"bot1",
  "timestamp":1460065214626,
  "status" :"working",
  …
}
…
Amazon DynamoDB table before the telemetry message:
thingName   connected   lastSeenAt      status
bot1        true        1460065199626   standby
bot2        false       1534375553063   charging
Amazon DynamoDB table after the telemetry message:
thingName   connected   lastSeenAt      status
bot1        true        1460065214626   working
bot2        false       1534375553063   charging
Also calls putMetricData to write the
custom metric AcmeBots.eventsDelay.
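The putEvent step and the eventsDelay metric could look roughly like the Python sketch below. It assumes the event's `timestamp` is copied from the telemetry record's `recorded_at` field and that eventsDelay is the time between that timestamp and processing; neither definition is stated on the slides. The `DetailType` string and the helper names are hypothetical. `events` and `cloudwatch` stand for boto3 `events` and `cloudwatch` clients or compatible doubles.

```python
import json


def thing_name_from_topic(topic):
    """Extract the bot name from a myThings/<id>/telemetry topic."""
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "myThings" or parts[2] != "telemetry":
        raise ValueError(f"unexpected telemetry topic: {topic}")
    return parts[1]


def build_status_event(thing_name, record):
    """Build one put_events entry describing the bot's status."""
    detail = {
        "clientId": thing_name,
        "timestamp": record["recorded_at"],  # assumption: reuse recorded_at
        "status": record["status"],
    }
    return {
        "Source": "acmebots.status",
        "DetailType": "AcmeBots status change",  # hypothetical detail type
        "Detail": json.dumps(detail),
    }


def handle_telemetry(topic, record, events, cloudwatch, now_ms):
    """Emit the status event and record how late it was processed."""
    thing_name = thing_name_from_topic(topic)
    events.put_events(Entries=[build_status_event(thing_name, record)])
    cloudwatch.put_metric_data(
        Namespace="AcmeBots",
        MetricData=[{"MetricName": "eventsDelay",
                     "Value": float(now_ms - record["recorded_at"]),
                     "Unit": "Milliseconds"}],
    )
```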
Reacting to workload events takeaways
Enables automation
Track and share state change
Time based and event based
AWS services and CloudTrail
Custom events support
Use rules to match and route events
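To make "match and route" concrete, here is a simplified Python model of how a rule's event pattern can select events. It handles only exact-value matching, where each pattern field lists the accepted values and nested objects are matched recursively; the managed service supports richer matching that is not modeled here. The `matches` and `route` functions and the target names are hypothetical.

```python
def matches(pattern, event):
    """Return True if every field in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching sub-object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of accepted values.
            return False
    return True


def route(rules, event):
    """Return the targets of every rule whose pattern matches the event."""
    return [target for pattern, target in rules if matches(pattern, event)]


rules = [
    ({"source": ["acmebots.status"], "detail": {"status": ["working"]}}, "notify-ops"),
    ({"source": ["acmebots.connectivity"]}, "external-app"),
]
```

An event with source `acmebots.status` and status `working` would be routed to the first target only, while any `acmebots.connectivity` event would go to the second.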
4―Troubleshooting
Make sure you have end-to-end visibility of your workload
Quickly identify performance degradation and anomalies, including latency distribution
What are my service dependencies?
Where should I improve?
What are my bottlenecks and error rates?
Troubleshooting takeaways
• Use Amazon CloudWatch Logs to collect, centralize, and search your logs
• Plan your log retention strategy
• View Service Maps
• Enable Tracing visibility
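One way to act on the log retention bullet is to apply a retention policy to each log group as part of deployment. The Python sketch below rounds a desired retention up to an accepted value and applies it. The list of accepted values is believed to be valid for `retentionInDays` but may not be exhaustive, so check the API reference before relying on it. The helper names and the log group name in the usage note are hypothetical, and `logs_client` stands for a boto3 `logs` client or a compatible double.

```python
# Values believed to be accepted for retentionInDays (possibly incomplete).
ALLOWED_RETENTION_DAYS = (1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180,
                          365, 400, 545, 731, 1827, 3653)


def pick_retention_days(desired_days):
    """Round a desired retention up to the nearest accepted value."""
    for days in ALLOWED_RETENTION_DAYS:
        if days >= desired_days:
            return days
    return ALLOWED_RETENTION_DAYS[-1]


def apply_retention(logs_client, log_group_names, desired_days):
    """Set the same retention policy on every given log group."""
    days = pick_retention_days(desired_days)
    for name in log_group_names:
        logs_client.put_retention_policy(logGroupName=name,
                                         retentionInDays=days)
    return days
```

For example, `apply_retention(logs_client, ["/acmebots/backend"], 45)` would set a 60-day policy on that log group.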
Best Practices
1. Plan your monitoring strategy working backwards from your business
2. Deliver your monitoring resources as part of your app
3. Collect all 4: Metrics, Logs, Events and Traces
4. Monitor everything: services, limits, costs, API interaction, etc.
5. Leverage other AWS services to enhance observability: AWS CloudTrail, AWS Config, AWS Trusted Advisor, Amazon Macie, Amazon GuardDuty, AWS Budgets, AWS Health, and Amazon RDS Performance Insights
Thank you!
Jon Jozwiak
Marcos Ortiz
https://github.com/aws-samples/aws-iot-core-acmebots-monitoring