SlideShare uma empresa Scribd logo
1 de 43
Baixar para ler offline
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Cost-Effectively Running Distributed
Systems at Scale in the Cloud
Osman Sarood
Infrastructure and Operations Lead
Mist Systems
C M P 3 4 9
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
What to expect?
• Minimize cost using the volatile Amazon EC2 Spot instances
• Ensuring reliability amidst unpredictable server terminations
• Container right-sizing for reliability and autoscaling
• How to monitor real-time applications under chaotic conditions
Heat Map for CPU Load (Hotspots)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Mist architecture
1 TB+
10 Billion+ Msgs
10’s TB+
500+ partitions
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Contract Type Cost Reliable Commitment
Required?
On-demand Very High Yes No
Reserved High Yes Yes
Spot Low No No
Amazon EC2 contract types
• On-demand
• Reserved
• Spot
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Spot instances
• Unused Amazon EC2 capacity
• Controlled by supply and
demand
• If used correctly, they can be ~
80% cheaper
• Be careful! Naive usage may
end up costing more than on-
demand
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Spot market volatility
86 Heterogeneous instance terminations in a day!
(~20% of Production DC)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Mist architecture running on spot
•100% on Spot
•80% of Datacenter
•Huge savings $$$
•Allows us to do more
•Forces reliability
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Cost benefits of using spot instances
System vCPUs Percentage of
total EC2 cost
Savings compared to
On-demad
Savings compared
to RI
Storm 1096 (31%) 17% 74% 60%
Mesos 1864 (52%) 30% 75% 62%
Total Cores 3590 56% 35%
*Includes EBS cost as well
15% non-spot infra constitutes 53% of our cost!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to use spot instances?
• Spot market = (instance type, AZ)
• Prices for difference spot markets are
independent
• AWS Spot Fleet
• Uses spot instances
• Ensures capacity
• Caution: Cost saving scheme of AWS Spot
Fleet is risky!
• Diversify across markets!
• Applications can restart
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How much to overprovision?
• Lead time for spot instance: < 5 mins
• Max number of instances terminated in 5 mins = 9
• Overprovision by 9 * 3 = 27 ( where 3 is a magic multiple)
• The bigger the cluster the better it is
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Unreliable servers help building reliable systems!?
• Chaos Monkey
• Pioneer of chaos engineering
• Controlled Chaos (can be stopped
and resumed)
• Use of spot Instances
• Uncontrolled `Real’ Chaos
• Can’t be turned off and depends on
spot prices
• Enables building reliable
infrastructure. Wait, what?
• Can be volatile
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Coping with spot’s unreliability
• Mist Systems relies on both
stateful and stateless
applications
• Live-aggregators:
• 1.2+ TB in-memory state
• ~ 1000 Mesos containers
• CPU Utilization 60%-80%
• Per-container memory
footprint
• minimum: 32MB
• maximum: 21GB
• mean: 1.3GB
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Real-time aggregation
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Live aggregators
• Stream: continuous flow of msgs from Kafka
• View: a set of tuples containing aggregated data
• Lag: how far behind from current time
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Live aggregators architecture
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Live aggregators—Scheduler
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Live aggregators—Executor
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Performance variations across instance types
CPU utilization (%) for consuming the same stream (cv-sle-le)
Fast
Slow
Thank you!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Osman Sarood
osmansarood@gmail.com
Please complete the session
survey in the mobile app.
!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Appendix
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Coping with spot’s unreliability: Checkpointing
• How far a real-time application is
lagging?
• Checkpointing is critical!
• Each platform we use has a mechanism
for checkpointing:
• Storm: Redis
• Mesos:
• Live-aggregators: Amazon S3
• Real-time: Redis
• API: stateless
More resources
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Spot termination: Lag and then recovery
Dead
Not Rescheduled
Rescheduled
Recovering
10 min older checkpoint restored
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Difference in recovery times
• High Message Rate • Type of computation • Amount of data/msg
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Why do we monitor?
• Detecting the problem: Is a real-time
application falling behind?
• Troubleshooting:
• Resource bottlenecks
• Bottlenecked on an external system
• Application code errors
• Use SignalFX, Graphana, Graphite and
Amazon Cloudwatch
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Detection: Real-time lag
• How far behind is a real-time
service/topology?
• Difference between Kafka timestamp
between:
• Most recent message
• Last consumed message for an
application
• How much lag is acceptable? nano
secs, millisecs, secs?
Most recent msg
t = 21
Consumer 1
t = 8, Lag = 13 sec
Consumer 2
t = 1, Lag = 20 sec
Head
Tail
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Real-time lags
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Attribution: Resource monitoring
• Track key metrics per
container/worker
• CPU, Memory, Network IO
• CPU Kafka stream
• Timing individual code blocks
• Write metrics using different
dimensions to filter and aggregate
• Host name
• Container ID
• Service/Application name
• Environment
• Instance Type
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
CPU metrics for microservices
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Performance variations across instance types
CPU utilization (%) for consuming the same stream
Fast
Slow
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Case study: Lagging application
~ 5:15 starts lagging
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Case study: High CPU utilization
~ 5:15
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Case Study: Host CPU utilization details
~ 5:15 terminated le_wifi_leprepare
~ 7:15
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Case study: Lag recovers
~ 7:15 lag recovery starts
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Lying factor
• Predicting resource requirements is
difficult
• How much CPU, memory, and network
IO?
• Recent resource managers are typically
book keepers:
• Mesos
• Kubernetes
• Lying factor is the difference between (1 -
2 = -1 cores)
• Reserved resources (1 cores)
• Actual used resources (2 cores)
Noisy neighbor
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Right-sizing containers
• Lying factor for a container:
• Negative: Dangerous! Using more
resources than it reserved
(hotspots)
• Positive: Resources wasted
• Ideally should be 0 (reserved =
actual/consumed)
• Manually update resources using
marathon
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Dynamic Load ( Incoming data stream)
Daily Seasonality
Trend
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Autoscaling using lying factor
• Changing load common in data streams
• Complicated due to changing load:
• Seasonality (daily peaks)
• Trend (load increasing overtime)
• Autoscaling based on lying factor for
containers
• Reduce resources when lying factor
crosses a positive threshold
• Increase resources when lying factor
below a number
• Maintain lying factor within a band
Lying Factor
Time
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Lying factor vs. load
Peak Load
Upper threshold: 0.6
Lower threshold: -0.1
0.35
Lying factor = # reserved cores - # actual cores used
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Containers upsized automatically
Actual usage increasing and hence container terminated
0.17
Lying factor = # reserved cores - # actual cores used
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Container downsized automatically
Actual usage decreased and hence container terminated
0.60
Lying factor = # reserved cores - # actual cores used
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Autoscaling in action: Reserved vs. actual usage
Up-sizing
Down-sizing
Actual cores used
Reserved cores
180
360

Mais conteúdo relacionado

Mais procurados

[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...Amazon Web Services
 
Work Anywhere with Amazon Workspaces (Level: 200)
Work Anywhere with Amazon Workspaces (Level: 200)Work Anywhere with Amazon Workspaces (Level: 200)
Work Anywhere with Amazon Workspaces (Level: 200)Amazon Web Services
 
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)Amazon Web Services
 
Amazon EC2 T Instances – Burstable, Cost-Effective Performance (CMP209) - AWS...
Amazon EC2 T Instances – Burstable, Cost-Effective Performance (CMP209) - AWS...Amazon EC2 T Instances – Burstable, Cost-Effective Performance (CMP209) - AWS...
Amazon EC2 T Instances – Burstable, Cost-Effective Performance (CMP209) - AWS...Amazon Web Services
 
以 Amazon EC2 Spot 執行個體有效控制專案成本 (Level: 200)
以 Amazon EC2 Spot 執行個體有效控制專案成本 (Level: 200)以 Amazon EC2 Spot 執行個體有效控制專案成本 (Level: 200)
以 Amazon EC2 Spot 執行個體有效控制專案成本 (Level: 200)Amazon Web Services
 
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...Amazon Web Services
 
[NEW LAUNCH!] Introducing Amazon EC2 A1 Instances Based on the Arm Architectu...
[NEW LAUNCH!] Introducing Amazon EC2 A1 Instances Based on the Arm Architectu...[NEW LAUNCH!] Introducing Amazon EC2 A1 Instances Based on the Arm Architectu...
[NEW LAUNCH!] Introducing Amazon EC2 A1 Instances Based on the Arm Architectu...Amazon Web Services
 
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018Amazon Web Services
 
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...Amazon Web Services
 
Introducing AWS Transfer for SFTP, a Fully Managed SFTP Service for Amazon S3...
Introducing AWS Transfer for SFTP, a Fully Managed SFTP Service for Amazon S3...Introducing AWS Transfer for SFTP, a Fully Managed SFTP Service for Amazon S3...
Introducing AWS Transfer for SFTP, a Fully Managed SFTP Service for Amazon S3...Amazon Web Services
 
Achieving Business Value with AWS (ENT203-R2) - AWS re:Invent 2018
Achieving Business Value with AWS (ENT203-R2) - AWS re:Invent 2018Achieving Business Value with AWS (ENT203-R2) - AWS re:Invent 2018
Achieving Business Value with AWS (ENT203-R2) - AWS re:Invent 2018Amazon Web Services
 
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...Amazon Web Services
 
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...Amazon Web Services
 
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018Amazon Web Services
 
Amazon Linux 2: A Stable, Secure, High-Performance Linux Environment (CMP203-...
Amazon Linux 2: A Stable, Secure, High-Performance Linux Environment (CMP203-...Amazon Linux 2: A Stable, Secure, High-Performance Linux Environment (CMP203-...
Amazon Linux 2: A Stable, Secure, High-Performance Linux Environment (CMP203-...Amazon Web Services
 
AWSome Day - Solutions Architecture Best Practices
AWSome Day - Solutions Architecture Best PracticesAWSome Day - Solutions Architecture Best Practices
AWSome Day - Solutions Architecture Best PracticesAmazon Web Services
 
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...Amazon Web Services
 
How Jupiter Intel Uses AWS to Forecast Weather Impact & Climate Risk (CMP340)...
How Jupiter Intel Uses AWS to Forecast Weather Impact & Climate Risk (CMP340)...How Jupiter Intel Uses AWS to Forecast Weather Impact & Climate Risk (CMP340)...
How Jupiter Intel Uses AWS to Forecast Weather Impact & Climate Risk (CMP340)...Amazon Web Services
 
Accelerating Life Sciences with HPC on AWS - AWS Online Tech Talks
Accelerating Life Sciences with HPC on AWS - AWS Online Tech TalksAccelerating Life Sciences with HPC on AWS - AWS Online Tech Talks
Accelerating Life Sciences with HPC on AWS - AWS Online Tech TalksAmazon Web Services
 

Mais procurados (20)

[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
[NEW LAUNCH!] Introducing Amazon Elastic Inference: Reduce Deep Learning Infe...
 
Work Anywhere with Amazon Workspaces (Level: 200)
Work Anywhere with Amazon Workspaces (Level: 200)Work Anywhere with Amazon Workspaces (Level: 200)
Work Anywhere with Amazon Workspaces (Level: 200)
 
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)
使用 AWS Step Functions 靈活調度 AWS Lambda (Level:200)
 
Amazon EC2 T Instances – Burstable, Cost-Effective Performance (CMP209) - AWS...
Amazon EC2 T Instances – Burstable, Cost-Effective Performance (CMP209) - AWS...Amazon EC2 T Instances – Burstable, Cost-Effective Performance (CMP209) - AWS...
Amazon EC2 T Instances – Burstable, Cost-Effective Performance (CMP209) - AWS...
 
以 Amazon EC2 Spot 執行個體有效控制專案成本 (Level: 200)
以 Amazon EC2 Spot 執行個體有效控制專案成本 (Level: 200)以 Amazon EC2 Spot 執行個體有效控制專案成本 (Level: 200)
以 Amazon EC2 Spot 執行個體有效控制專案成本 (Level: 200)
 
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...
Save up to 90% on Big Data and Machine Learning Workloads with Spot Instances...
 
[NEW LAUNCH!] Introducing Amazon EC2 A1 Instances Based on the Arm Architectu...
[NEW LAUNCH!] Introducing Amazon EC2 A1 Instances Based on the Arm Architectu...[NEW LAUNCH!] Introducing Amazon EC2 A1 Instances Based on the Arm Architectu...
[NEW LAUNCH!] Introducing Amazon EC2 A1 Instances Based on the Arm Architectu...
 
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018
Amazon EC2 Foundations (CMP208-R1) - AWS re:Invent 2018
 
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
Industrialize Machine Learning Using CI/CD Techniques (FSV304-i) - AWS re:Inv...
 
Introducing AWS Transfer for SFTP, a Fully Managed SFTP Service for Amazon S3...
Introducing AWS Transfer for SFTP, a Fully Managed SFTP Service for Amazon S3...Introducing AWS Transfer for SFTP, a Fully Managed SFTP Service for Amazon S3...
Introducing AWS Transfer for SFTP, a Fully Managed SFTP Service for Amazon S3...
 
Achieving Business Value with AWS (ENT203-R2) - AWS re:Invent 2018
Achieving Business Value with AWS (ENT203-R2) - AWS re:Invent 2018Achieving Business Value with AWS (ENT203-R2) - AWS re:Invent 2018
Achieving Business Value with AWS (ENT203-R2) - AWS re:Invent 2018
 
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
Earn Your DevOps Black Belt: Deployment Scenarios with AWS CloudFormation (DE...
 
SRV319 Amazon EC2 Foundations
SRV319 Amazon EC2 FoundationsSRV319 Amazon EC2 Foundations
SRV319 Amazon EC2 Foundations
 
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...
Building an AWS Greengrass Machine Learning Solution at the Edge (IOT403-R1) ...
 
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018
Studio in the Cloud: Producing Content on AWS (MAE202) - AWS re:Invent 2018
 
Amazon Linux 2: A Stable, Secure, High-Performance Linux Environment (CMP203-...
Amazon Linux 2: A Stable, Secure, High-Performance Linux Environment (CMP203-...Amazon Linux 2: A Stable, Secure, High-Performance Linux Environment (CMP203-...
Amazon Linux 2: A Stable, Secure, High-Performance Linux Environment (CMP203-...
 
AWSome Day - Solutions Architecture Best Practices
AWSome Day - Solutions Architecture Best PracticesAWSome Day - Solutions Architecture Best Practices
AWSome Day - Solutions Architecture Best Practices
 
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...
[NEW LAUNCH!] Introducing Amazon Managed Streaming for Kafka (Amazon MSK) (AN...
 
How Jupiter Intel Uses AWS to Forecast Weather Impact & Climate Risk (CMP340)...
How Jupiter Intel Uses AWS to Forecast Weather Impact & Climate Risk (CMP340)...How Jupiter Intel Uses AWS to Forecast Weather Impact & Climate Risk (CMP340)...
How Jupiter Intel Uses AWS to Forecast Weather Impact & Climate Risk (CMP340)...
 
Accelerating Life Sciences with HPC on AWS - AWS Online Tech Talks
Accelerating Life Sciences with HPC on AWS - AWS Online Tech TalksAccelerating Life Sciences with HPC on AWS - AWS Online Tech Talks
Accelerating Life Sciences with HPC on AWS - AWS Online Tech Talks
 

Semelhante a Cost-Effectively Running Distributed Systems at Scale in the Cloud (CMP349) - AWS re:Invent 2018

Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28Amazon Web Services
 
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...Amazon Web Services
 
Running a High-Performance Kubernetes Cluster with Amazon EKS (CON318-R1) - A...
Running a High-Performance Kubernetes Cluster with Amazon EKS (CON318-R1) - A...Running a High-Performance Kubernetes Cluster with Amazon EKS (CON318-R1) - A...
Running a High-Performance Kubernetes Cluster with Amazon EKS (CON318-R1) - A...Amazon Web Services
 
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018Amazon Web Services
 
Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...
Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...
Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...Amazon Web Services
 
CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2aspyker
 
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Amazon Web Services
 
Serverless on AWS: Architectural Patterns and Best Practices
Serverless on AWS: Architectural Patterns and Best PracticesServerless on AWS: Architectural Patterns and Best Practices
Serverless on AWS: Architectural Patterns and Best PracticesVladimir Simek
 
2019 03-13-implementing microservices by ddd
2019 03-13-implementing microservices by ddd2019 03-13-implementing microservices by ddd
2019 03-13-implementing microservices by dddKim Kao
 
Implementing Microservices by DDD
Implementing Microservices by DDDImplementing Microservices by DDD
Implementing Microservices by DDDAmazon Web Services
 
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...Amazon Web Services
 
Another Week, Another Million Containers on Amazon EC2 (CMP376) - AWS re:Inve...
Another Week, Another Million Containers on Amazon EC2 (CMP376) - AWS re:Inve...Another Week, Another Million Containers on Amazon EC2 (CMP376) - AWS re:Inve...
Another Week, Another Million Containers on Amazon EC2 (CMP376) - AWS re:Inve...Amazon Web Services
 
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdfCome scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdfAmazon Web Services
 
Get the Most out of Your Amazon Elasticsearch Service Domain (ANT334-R1) - AW...
Get the Most out of Your Amazon Elasticsearch Service Domain (ANT334-R1) - AW...Get the Most out of Your Amazon Elasticsearch Service Domain (ANT334-R1) - AW...
Get the Most out of Your Amazon Elasticsearch Service Domain (ANT334-R1) - AW...Amazon Web Services
 
SRV203 Optimizing Amazon EC2 for Fun and Profit
 SRV203 Optimizing Amazon EC2 for Fun and Profit SRV203 Optimizing Amazon EC2 for Fun and Profit
SRV203 Optimizing Amazon EC2 for Fun and ProfitAmazon Web Services
 
Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018
Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018
Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018Amazon Web Services
 
Optimize EC2 for Fun and Profit - SRV203 - Anaheim AWS Summit
Optimize EC2 for Fun and Profit - SRV203 - Anaheim AWS SummitOptimize EC2 for Fun and Profit - SRV203 - Anaheim AWS Summit
Optimize EC2 for Fun and Profit - SRV203 - Anaheim AWS SummitAmazon Web Services
 

Semelhante a Cost-Effectively Running Distributed Systems at Scale in the Cloud (CMP349) - AWS re:Invent 2018 (20)

Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
Best practices for optimizing your EC2 costs with Spot Instances | AWS Floor28
 
Amazon EC2 Spot Instances
Amazon EC2 Spot InstancesAmazon EC2 Spot Instances
Amazon EC2 Spot Instances
 
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...
Running Lean Architectures: How to Optimize for Cost Efficiency (ARC202-R2) -...
 
Running a High-Performance Kubernetes Cluster with Amazon EKS (CON318-R1) - A...
Running a High-Performance Kubernetes Cluster with Amazon EKS (CON318-R1) - A...Running a High-Performance Kubernetes Cluster with Amazon EKS (CON318-R1) - A...
Running a High-Performance Kubernetes Cluster with Amazon EKS (CON318-R1) - A...
 
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
 
Amazon EC2 Spot Instances Workshop
Amazon EC2 Spot Instances WorkshopAmazon EC2 Spot Instances Workshop
Amazon EC2 Spot Instances Workshop
 
Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...
Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...
Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...
 
CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2
 
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
 
Serverless on AWS: Architectural Patterns and Best Practices
Serverless on AWS: Architectural Patterns and Best PracticesServerless on AWS: Architectural Patterns and Best Practices
Serverless on AWS: Architectural Patterns and Best Practices
 
2019 03-13-implementing microservices by ddd
2019 03-13-implementing microservices by ddd2019 03-13-implementing microservices by ddd
2019 03-13-implementing microservices by ddd
 
Implementing Microservices by DDD
Implementing Microservices by DDDImplementing Microservices by DDD
Implementing Microservices by DDD
 
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...
Become a Serverless Black Belt - Optimizing Your Serverless Applications - AW...
 
Another Week, Another Million Containers on Amazon EC2 (CMP376) - AWS re:Inve...
Another Week, Another Million Containers on Amazon EC2 (CMP376) - AWS re:Inve...Another Week, Another Million Containers on Amazon EC2 (CMP376) - AWS re:Inve...
Another Week, Another Million Containers on Amazon EC2 (CMP376) - AWS re:Inve...
 
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdfCome scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdf
 
Get the Most out of Your Amazon Elasticsearch Service Domain (ANT334-R1) - AW...
Get the Most out of Your Amazon Elasticsearch Service Domain (ANT334-R1) - AW...Get the Most out of Your Amazon Elasticsearch Service Domain (ANT334-R1) - AW...
Get the Most out of Your Amazon Elasticsearch Service Domain (ANT334-R1) - AW...
 
Cloud Economics - TCO 101
Cloud Economics - TCO 101Cloud Economics - TCO 101
Cloud Economics - TCO 101
 
SRV203 Optimizing Amazon EC2 for Fun and Profit
 SRV203 Optimizing Amazon EC2 for Fun and Profit SRV203 Optimizing Amazon EC2 for Fun and Profit
SRV203 Optimizing Amazon EC2 for Fun and Profit
 
Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018
Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018
Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018
 
Optimize EC2 for Fun and Profit - SRV203 - Anaheim AWS Summit
Optimize EC2 for Fun and Profit - SRV203 - Anaheim AWS SummitOptimize EC2 for Fun and Profit - SRV203 - Anaheim AWS Summit
Optimize EC2 for Fun and Profit - SRV203 - Anaheim AWS Summit
 

Mais de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Cost-Effectively Running Distributed Systems at Scale in the Cloud (CMP349) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Cost-Effectively Running Distributed Systems at Scale in the Cloud Osman Sarood Infrastructure and Operations Lead Mist Systems C M P 3 4 9
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. What to expect? • Minimize cost using the volatile Amazon EC2 Spot instances • Ensuring reliability amidst unpredictable server terminations • Container right-sizing for reliability and autoscaling • How to monitor real-time applications under chaotic conditions Heat Map for CPU Load (Hotspots)
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Mist architecture 1 TB+ 10 Billion+ Msgs 10’s TB+ 500+ partitions
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Contract Type Cost Reliable Commitment Required? On-demand Very High Yes No Reserved High Yes Yes Spot Low No No Amazon EC2 contract types • On-demand • Reserved • Spot
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Spot instances • Unused Amazon EC2 capacity • Controlled by supply and demand • If used correctly, they can be ~ 80% cheaper • Be careful! Naive usage may end up costing more than on- demand
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Spot market volatility 86 Heterogeneous instance terminations in a day! (~20% of Production DC)
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Mist architecture running on spot •100% on Spot •80% of Datacenter •Huge savings $$$ •Allows us to do more •Forces reliability
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Cost benefits of using spot instances System vCPUs Percentage of total EC2 cost Savings compared to On-demad Savings compared to RI Storm 1096 (31%) 17% 74% 60% Mesos 1864 (52%) 30% 75% 62% Total Cores 3590 56% 35% *Includes EBS cost as well 15% non-spot infra constitutes 53% of our cost!
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to use spot instances? • Spot market = (instance type, AZ) • Prices for difference spot markets are independent • AWS Spot Fleet • Uses spot instances • Ensures capacity • Caution: Cost saving scheme of AWS Spot Fleet is risky! • Diversify across markets! • Applications can restart
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How much to overprovision? • Lead time for spot instance: < 5 mins • Max number of instances terminated in 5 mins = 9 • Overprovision by 9 * 3 = 27 ( where 3 is a magic multiple) • The bigger the cluster the better it is
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Unreliable servers help building reliable systems!? • Chaos Monkey • Pioneer of chaos engineering • Controlled Chaos (can be stopped and resumed) • Use of spot Instances • Uncontrolled `Real’ Chaos • Can’t be turned off and depends on spot prices • Enables building reliable infrastructure. Wait, what? • Can be volatile
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Coping with spot’s unreliability • Mist Systems relies on both stateful and stateless applications • Live-aggregators: • 1.2+ TB in-memory state • ~ 1000 Mesos containers • CPU Utilization 60%-80% • Per-container memory footprint • minimum: 32MB • maximum: 21GB • mean: 1.3GB
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Real-time aggregation
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Live aggregators • Stream: continuous flow of msgs from Kafka • View: a set of tuples containing aggregated data • Lag: how far behind from current time
  • 16. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Live aggregators architecture
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Live aggregators—Scheduler
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Live aggregators—Executor
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Performance variations across instance types CPU utilization (%) for consuming the same stream (cv-sle-le) Fast Slow
  • 20. Thank you! © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Osman Sarood osmansarood@gmail.com
  • 21. Please complete the session survey in the mobile app. ! © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 22. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Appendix
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Coping with spot’s unreliability: Checkpointing • How far a real-time application is lagging? • Checkpointing is critical! • Each platform we use has a mechanism for checkpointing: • Storm: Redis • Mesos: • Live-aggregators: Amazon S3 • Real-time: Redis • API: stateless More resources
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Spot termination: Lag and then recovery Dead Not Rescheduled Rescheduled Recovering 10 min older checkpoint restored
  • 25. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Difference in recovery times • High Message Rate • Type of computation • Amount of data/msg
  • 26. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Why do we monitor? • Detecting the problem: Is a real-time application falling behind? • Troubleshooting: • Resource bottlenecks • Bottlenecked on an external system • Application code errors • Use SignalFX, Graphana, Graphite and Amazon Cloudwatch
  • 27. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Detection: Real-time lag • How far behind is a real-time service/topology? • Difference between Kafka timestamp between: • Most recent message • Last consumed message for an application • How much lag is acceptable? nano secs, millisecs, secs? Most recent msg t = 21 Consumer 1 t = 8, Lag = 13 sec Consumer 2 t = 1, Lag = 20 sec Head Tail
  • 28. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Real-time lags
  • 29. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Attribution: Resource monitoring • Track key metrics per container/worker • CPU, Memory, Network IO • CPU Kafka stream • Timing individual code blocks • Write metrics using different dimensions to filter and aggregate • Host name • Container ID • Service/Application name • Environment • Instance Type
  • 30. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. CPU metrics for microservices
  • 31. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Performance variations across instance types CPU utilization (%) for consuming the same stream Fast Slow
  • 32. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Case study: Lagging application ~ 5:15 starts lagging
  • 33. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Case study: High CPU utilization ~ 5:15
  • 34. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Case Study: Host CPU utilization details ~ 5:15 terminated le_wifi_leprepare ~ 7:15
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Case study: Lag recovers ~ 7:15 lag recovery starts
  • 36. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Lying factor • Predicting resource requirements is difficult • How much CPU, memory, and network IO? • Recent resource managers are typically book keepers: • Mesos • Kubernetes • Lying factor is the difference between (1 - 2 = -1 cores) • Reserved resources (1 cores) • Actual used resources (2 cores) Noisy neighbor
  • 37. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Right-sizing containers • Lying factor for a container: • Negative: Dangerous! Using more resources than it reserved (hotspots) • Positive: Resources wasted • Ideally should be 0 (reserved = actual/consumed) • Manually update resources using marathon
  • 38. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Dynamic Load ( Incoming data stream) Daily Seasonality Trend
  • 39. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Autoscaling using lying factor • Changing load common in data streams • Complicated due to changing load: • Seasonality (daily peaks) • Trend (load increasing overtime) • Autoscaling based on lying factor for containers • Reduce resources when lying factor crosses a positive threshold • Increase resources when lying factor below a number • Maintain lying factor within a band Lying Factor Time
  • 40. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Lying factor vs. load Peak Load Upper threshold: 0.6 Lower threshold: -0.1 0.35 Lying factor = # reserved cores - # actual cores used
  • 41. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Containers upsized automatically Actual usage increasing and hence container terminated 0.17 Lying factor = # reserved cores - # actual cores used
  • 42. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Container downsized automatically Actual usage decreased and hence container terminated 0.60 Lying factor = # reserved cores - # actual cores used
  • 43. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Autoscaling in action: Reserved vs. actual usage Up-sizing Down-sizing Actual cores used Reserved cores 180 360