1. 10 Tips for Your Journey to the Public Cloud
Suchi Upadhyayula Sean McCluskey
Director of Product Development, Intuit Director of Quality and Operations, Intuit
May 28, 2015
9. Load Balancing
• Security policy against terminating SSL on ELB
– ELB acts as a dumb pass-through
• Routing logic to support bulk-head pattern (Pods) too complex for
current ELBs
• Developed a proxy layer to:
– Terminate SSL
– Implement routing logic
– Access audit logging
1
10. Securing Sensitive Customer Data
• Multi-layer encryption (integrated with Amazon’s Key Management System) with periodic key
rotation:
– Application encryption of sensitive data
– Encryption in flight
– File level encryption at rest
• Reviewed fields to identify sensitive data to be “application level” encrypted
– Dropping of clear text columns before data ready to ship
• >50TB of data encrypted
2
11. Establishing a Framework for Low Latency
• Prepare for latency impact due to encryption
– Mint planned for 30% degradation
• Continuous measurement of TP50, TP90, TP99 for critical features
– Weekly review of TPs to drive improvements to reduce latency
– Constant tuning of code and single page architecture
– Able to maintain TP50 & TP90 SLAs
• Create a culture of continuous focus on TPs to drive improvements
3
12. Infrastructure as Code
• Configuration change in the infrastructure resulted in a release
failing to deploy and requiring rollback
• What we learned:
– In AWS, operations spends a lot of time writing code: CloudFormation
templates, deployment automation, monitors
– Development rigor was new to the operations team
– Needed to adopt development practices within operations: designs, code
reviews, testing, validation, formal release processes for infrastructure
4
13. Migrating Large Volumes of Data
• Not feasible to copy >50TB (and growing) of secure data “over the
wire”
• Plan for data transport to AWS:
– Encrypted drives physically secure shipped to AWS; 3 days to ship backup
copy to AWS and upload
– Catch up replication
– Final drive shipment needs to be timed so that replication can catch up to the
shipment window and sustain data growth prior to production cutover
5
14. High Availability and Disaster Recovery
• Recovery Time Objective (RTO): time to restore a
service to operation
• Recovery Point Objective (RPO): amount of data
acceptable to lose
• Solve for availability first with Multi-AZ
• Determine acceptable RTO/RPO and solve for regional
failures second
– Balance lower RTO/RPO against increased cost and
complexity
– Recognize the technology you use to handle regional
failures will add complexity that could increase outages
Region US-EAST
Availability
Zone
Availability
Zone
Availability
Zone
Region US-WEST
Availability
Zone
Availability
Zone
Availability
Zone
6
15. Monitoring and Diagnostics
• Disassociate with IPs
– Instances, ELBs, and their IP addresses are dynamic
– Number of instances are constantly changing
– When an instance has issues it can be “blown away”
• Build resilient and self-healing infrastructure
– Monitoring should then be built to compliment this
– If you alert on failure, have the courtesy to alert on healing
7
16. End-to-End Testing
• In addition to validating the full functionality of the production
environment, you also need to validate:
– Build, config, deploy, and validation infrastructure
– Logging, Monitoring, etc system that ensure the environment is healthy
– Access controls and security
– Auto-Scaling
• Continuous synthetic testing in the production environment
– provide an end-to-end test to ensure the customer experience doesn’t degrade
8
17. Managing Costs
• Compute: reserved vs. on-demand
– If compute is “on” for more than 9 hours per day, reserved will save money
– On-demand for seasonal workloads and rare peaks
– Reaper scripts; shutdown unused instances
• Snapshots drove significant cost savings
• Storage is cheap
– A lot of work that yields a small return
• IOPS are not
– Optimizing IOPS per shard saved a lot of money
9
Other,
3.13%
Storage,
3.42%
IOPS,
17.09%Snapshots,
42.17%
Compute,
34.19%
Savings Distribution
18. Release Operations
• Infrastructure deployed independently of applications
– DB schema
– AMI
– Infrastructure as code
– Application
• Support rollbacks for everything (blue-green)
– We can always go back to N-1, ALWAYS!!
10
19. Summary
1. Load balancing: Evaluate if ELB is sufficient and plan ahead
2. Security: Multi-layer encryption, AWS Key Management
3. Low latency: TP50, TP90, TP99 measure and improve
4. Infrastructure as code: Design, review, test templates
5. Migrating large volumes of data: Encrypted drives
6. HA/DR: Multi-AZ, multi-region
7. Monitoring and diagnostics: Disassociate with IP addresses
8. End-to-end testing: Don’t forget to test auto-scaling
9. Managing costs: Compute is more expensive than storage
10. Release operations: Rollback-ready, blue-green