1. 1TB/day
Logging and counting billions of events.
Scaling infrastructure using Amazon Web Services.
Dirk Harms-Merbitz - grasswood@icloud.com
2. Amazon Web Services
• Flexible toolkit for building Internet applications
• Infrastructure as a service
• Enables very fast growth
• No commitments, capex replaced by opex
3. Example
• Customer signs up on web form, specifies number of
users, data retention policies, based on business needs.
• Vendor programmatically spins up an instance from a
custom AMI with EBS volumes or local storage RAIDed as
needed to match performance, size, and cost parameters.
• One customer or one thousand customers, the
infrastructure and scaling of resources are handled by
Amazon.
• Vendor focuses on marketing, support and software
development.
4. The AWS Toolkit
• EC2 = Containers on Demand
• EBS = Elastic Block Store
• S3 = Object storage and static HTTP
• Glacier = Long term storage
5. Elastic Compute Cloud (EC2)
• Container for OS and application software
• Storage is EBS or locally attached
• / on EBS makes it easy to change instance size
• Standard or custom AMI
• An EC2 instance is not a server
6. Elastic Block Store (EBS)
• More reliable than hard drives
• Building blocks for application specific storage
• Combine as needed using RAID and LVM
• Different flavors: Provisioned IOPS (PIOPS), General Purpose SSD (gp2), magnetic
• 1TB max per volume, 10 volumes max per instance, 1TB = $50-$388/mo
• An EBS volume is not a disk
7. Local storage
• Directly attached to an instance
• Lower cost compared to EBS, much faster
• Survives reboots but disappears when instance
is stopped or terminated
• Best used with instance level redundancy:
RAID0 with the same data on multiple instances
allows for very fast processing in parallel
8. Simple Storage Service (S3)
• Stores objects of up to 5TB
• 99.99% availability (four nines), 99.999999999% durability (eleven nines)
• REST and SOAP interfaces - $5/1M requests
• HTTP download, easy for customers to access
• 1TB = $30/mo storage, $120/mo to transfer
9. AWS Glacier
• Glacier Storage
• 99.99% availability (four nines), 99.999999999% durability (eleven nines)
• $10/mo to store 1TB
• Cost for getting data out is based on speed
• Getting data out quickly can become expensive
10. AWS Optimizations
• EBS optimized instances offer better performance. Your
storage and network compete otherwise.
• RAID and LVM are used to combine EBS volumes to
match application storage size and throughput
requirements.
• Two local SSDs in RAID0 give double the size and speed of one.
Data survives reboots, but it must be backed up before
stopping or terminating the instance.
• Cloud is not just AWS: DigitalOcean, Linode, there are
many alternatives. EBS however makes resizing easy.
11. AWS Pro and Con
• Not hardware: Intuitions based on physical hardware won’t
transfer. Everything is throttled.
• Flexible: Used correctly, you don’t have to think about scaling
your hardware to millions of users. Good for short-term use and testing ideas.
• Complex: Easy to use incorrectly, with very low performance and
very high costs possible as a result.
• Expensive Mistakes: Storing 6TB for three years can cost as
much as $83,808 or as little as $4,818.
• If you know what you need, co-location delivers more for less: A
physical 6TB drive is faster, lasts 3-5 years and costs $299.
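The high end of the 6TB figure can be reproduced from the per-TB prices quoted on the earlier slides. The sketch below is only that arithmetic: it assumes a flat monthly price with no request, retrieval, or transfer charges, and the labels for the two EBS prices are assumptions (the slides give only the $50-$388 range). The $4,818 low-end figure is not derivable from the slide prices, so no attempt is made to reproduce it.

```python
# Per-TB monthly prices as quoted on the earlier slides (USD).
# The "most expensive" / "cheapest" labels are assumptions; the slides
# only give a $50-$388 range for EBS.
PRICE_PER_TB_MONTH = {
    "EBS, most expensive flavor": 388,
    "EBS, cheapest flavor": 50,
    "S3 storage": 30,
    "Glacier storage": 10,
}


def storage_cost(price_per_tb_month, terabytes=6, months=36):
    """Total cost of keeping `terabytes` stored for `months` at a flat price."""
    return price_per_tb_month * terabytes * months


for name, price in PRICE_PER_TB_MONTH.items():
    print(f"{name:28s} ${storage_cost(price):>7,}")
```

At $388 per TB per month, 6TB over 36 months comes to $83,808, which matches the slide.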
12. AWS
• Not appropriate for all businesses: Complexity
cost, rental cost, slow technology updates.
• Not appropriate for all applications: nobody
mines bitcoin in AWS.
• Not appropriate as a workaround when
management is slow in approving hardware.
13. Tips & Tricks
• avoid copying data
• use parallel or pexec
• speed up ssh, use mosh
• use fixed length records
• use raw block devices
• use bitmaps
14. avoid copying data
• write to EBS volume A until full
• switch to volume B, continue writing
• detach A and attach to processing instance
• zero copy when a volume is passed around
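The hand-off step could be scripted roughly as follows. This is a minimal sketch, not the author's tooling: it assumes `ec2` behaves like a boto3 EC2 client (`boto3.client("ec2")`), that the writer has already unmounted the volume, and that both instances are in the same availability zone. The function name and the default device name are hypothetical.

```python
def hand_off_volume(ec2, volume_id, target_instance_id, device="/dev/sdf"):
    """Move a full EBS volume from the writer to a processing instance.

    `ec2` is expected to behave like boto3.client("ec2"). The caller is
    assumed to have unmounted the volume on the writer beforehand.
    """
    # Detach from whichever instance currently holds the volume.
    ec2.detach_volume(VolumeId=volume_id)
    # Block until the volume reports the "available" state.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    # Attach to the processing instance; no data is copied at any point.
    ec2.attach_volume(
        VolumeId=volume_id,
        InstanceId=target_instance_id,
        Device=device,
    )
    # Block until the attachment is visible as "in-use".
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
```

The writer keeps filling volume B while this runs, so the only coordination needed is knowing when A is full.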
15. parallel and pexec
• grep, bzip2, wc, awk, sed use only a single CPU core
• GNU parallel or pexec make use of all cores, local and even on neighboring hosts
• pexec -o - -f instances -e x -c -- 'rsync -ae ssh /etc/hosts $x:/etc/hosts'
• parallel ping -c1 ::: host1 host2 host3 host4
• find . -name "*.csv.gz" -print | parallel zgrep "string"
• find . -name "*.csv.gz" -print | parallel zcat >all.txt
• cat all.txt | parallel --pipe grep 'api_key=xyz'
• cat all.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'
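The same fan-out idea, one worker per file with the partial results summed at the end, can be sketched in Python. This is an illustration under assumptions, not a replacement for GNU parallel: the function names are hypothetical, and a thread pool is used to keep the sketch simple (a process pool would be the closer analogue for CPU-bound work).

```python
import gzip
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_matches(path, needle):
    """Count the lines containing `needle` (bytes) in one gzip file."""
    with gzip.open(path, "rb") as f:
        return sum(1 for line in f if needle in line)


def count_matches_parallel(root, needle, workers=None):
    """Sum the per-file counts for every *.csv.gz file under `root`."""
    paths = sorted(Path(root).rglob("*.csv.gz"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each file is scanned by its own worker; partial counts are summed.
        return sum(pool.map(lambda p: count_matches(p, needle), paths))
```

A call such as `count_matches_parallel(".", b"api_key=xyz")` corresponds roughly to the `zgrep` and `wc -l` pipelines above.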
16. ssh and mosh
• 30x faster when reusing ssh connections:
• ControlMaster auto
• ControlPersist yes
• ControlPath ~/.ssh/socket-%r@%h:%p
• mosh (mosh.mit.edu) works well over lossy connections,
• including changing locations and IP addresses
17. fixed length records
• Fixed length records on raw block devices
• No compressing and uncompressing
• No parsing of ASCII
• No file system
• No overflow possible, write pointer wraps
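A wrapping log of fixed-length records might look like the sketch below. The record layout (timestamp, user id, event code) and the class name are hypothetical, and an in-memory or ordinary file object stands in for the raw block device. Sector alignment and persisting the write pointer are left out.

```python
import struct

# Hypothetical 16-byte record: 8-byte timestamp, 4-byte user id,
# 4-byte event code, little-endian with no padding.
RECORD = struct.Struct("<QII")


class RecordRing:
    """Fixed-length records in a fixed number of slots; the pointer wraps."""

    def __init__(self, device, capacity):
        self.device = device      # seekable binary file object
        self.capacity = capacity  # number of record slots on the device
        self.written = 0          # total records written so far

    def append(self, timestamp, user_id, event):
        slot = self.written % self.capacity  # wrap instead of overflowing
        self.device.seek(slot * RECORD.size)
        self.device.write(RECORD.pack(timestamp, user_id, event))
        self.written += 1

    def read(self, slot):
        """Return the (timestamp, user_id, event) tuple stored in `slot`."""
        self.device.seek(slot * RECORD.size)
        return RECORD.unpack(self.device.read(RECORD.size))
```

Because every record has the same size, the position of record n is plain arithmetic, and nothing needs to be parsed or decompressed to find it.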
18. raw block devices
• Counters on raw block devices
• By keeping just the lower byte of a counter in
RAM you can divide access frequency by 256
• RAID0 of SSDs can reach 1000-2000MB/s
• EBS 100MB/s, RAID0 of multiple EBS 800MB/s
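The low-byte trick can be sketched as follows. The class name and the 64-bit counter layout are assumptions, and a seekable file object stands in for the raw block device. Each counter's low byte lives only in RAM, and the device is touched once per 256 increments, when that byte wraps.

```python
import struct

COUNTER = struct.Struct("<Q")  # assumed layout: one 64-bit counter per slot


class BatchedCounters:
    """Counters on a device, with the low byte of each kept in RAM."""

    def __init__(self, device, n):
        self.device = device     # seekable binary file, n * 8 bytes, zeroed
        self.low = bytearray(n)  # lower byte of each counter, RAM only
        self.device_writes = 0   # bookkeeping for this illustration

    def increment(self, i):
        self.low[i] = (self.low[i] + 1) & 0xFF
        if self.low[i] == 0:     # low byte wrapped: carry 256 to the device
            self._add_to_device(i, 256)

    def _add_to_device(self, i, amount):
        self.device.seek(i * COUNTER.size)
        (value,) = COUNTER.unpack(self.device.read(COUNTER.size))
        self.device.seek(i * COUNTER.size)
        self.device.write(COUNTER.pack(value + amount))
        self.device_writes += 1

    def value(self, i):
        """Full count: multiples of 256 from the device plus the RAM byte."""
        self.device.seek(i * COUNTER.size)
        (value,) = COUNTER.unpack(self.device.read(COUNTER.size))
        return value + self.low[i]
```

The trade-off in this sketch is that up to 255 increments per counter are lost if the process dies before the low byte wraps.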
19. bitmaps
• Bitmaps for counting things and other uses
• 100M unique users in 12.5MB of RAM
• Hourly, Daily, Weekly, Quarterly…
• 6TB SSD instance = 7000 bits / person on earth
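A bitmap counter of unique users might be sketched as below. It assumes every user has already been mapped to a dense integer id in the range 0 to n-1, which the slides do not specify, and the class is hypothetical. One bit per user means 100M users need 100,000,000 / 8 = 12,500,000 bytes.

```python
class Bitmap:
    """One bit per user id; a set bit means the user was seen."""

    def __init__(self, n_bits):
        self.bits = bytearray((n_bits + 7) // 8)

    def add(self, user_id):
        self.bits[user_id >> 3] |= 1 << (user_id & 7)

    def __contains__(self, user_id):
        return bool(self.bits[user_id >> 3] & (1 << (user_id & 7)))

    def count(self):
        """Number of distinct ids seen (simple, not tuned for speed)."""
        return bin(int.from_bytes(self.bits, "little")).count("1")

    def union(self, other):
        """Combine two periods, e.g. hourly bitmaps into a daily one."""
        merged = Bitmap(0)
        merged.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
        return merged
```

Seeing the same user twice sets the same bit twice, so `count()` gives uniques directly. Coarser periods come from OR-ing the finer ones together.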