This is a presentation that I gave at the AWS Meetup in Ann Arbor, Michigan back in January. It recounts some experiences that I had while working on a project with RightBrain Networks that involved moving millions of small files around between S3, Glacier and an NFS NAS volume. A good time was had by all.
2. Who the @#%^ is Dave Thompson?
• DevOps/SRE/Systems guy from MI by way of San Francisco
• Current Employer: MuleSoft Inc
• Past Employers: Netflix, Domino’s Pizza, U of M
• Also contributing to the madness at RBN
3. … and what is he talking about?
• Today, we’ll talk about a case study using Glacier with S3, and the various surprises I encountered along the way.
10. Enter RBN!
The proposal: migrate the data from S3 to a cloud storage solution (Zadara), and archive the files to Glacier.
11. Everything Goes According to Plan (Again)!
• Files are copied to Zadara share
• S3 lifecycle configured to archive objects to Glacier
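A lifecycle rule of roughly this shape is what triggers the automatic S3-to-Glacier archival. This is a minimal sketch: the rule ID, empty prefix, and 30-day window are my assumptions, not the project’s actual settings, and the boto3 call is shown commented out so the snippet stands alone.

```python
# Hypothetical lifecycle rule: transition every object in the bucket
# to the GLACIER storage class once it is 30 days old. The rule name,
# prefix, and 30-day threshold are illustrative assumptions.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-to-glacier",
            "Filter": {"Prefix": ""},  # empty prefix = whole bucket
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# With boto3 installed and credentials configured, the rule would be
# applied with something like:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="example-bucket",
#     LifecycleConfiguration=lifecycle_config,
# )
```

Note that the rule applies per object: millions of small files mean millions of individual Glacier archives, which is exactly what bites later.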
12. Except… the Zadara share becomes corrupted after the data is migrated.
13. Amazon Glacier: a Primer
• Glacier is an archival solution provided by AWS.
• It’s closely integrated with S3.
• Use cases for Glacier and S3 are different, though…
14. S3 vs Glacier
• Unlike an S3 GET, a Glacier RETRIEVAL takes ~4 hours
• UPLOAD and RETRIEVAL API requests are 10x more expensive on Glacier than comparable S3 requests
• Bandwidth charges for RETRIEVAL requests apply, even inside us-east-1
15. S3 vs Glacier (cont.)
• This means that Glacier is optimized for compressed archives (i.e. tarball data)
• S3 is about equally well suited to small or large files
• Automatically archiving S3 objects to Glacier can thus lead to great sadness.
18. The New Plan
• Restore files from Glacier back to S3
• Migrate data from S3 to Zadara share
• Archive files back to Glacier in tar.gz chunks
• Create a DynamoDB index mapping each file name to its Glacier archive for future restores
20. Task 0: Calculating Cost
• Glacier pricing model is… interesting
• Costs are fixed per UPLOAD and RETRIEVAL request
• Cost for bandwidth is based on the peak outbound bandwidth consumed in a monthly billing period
• Monthly bandwidth equal to 5% of your total Glacier usage is permitted free of charge
21. The Equation (Oh, boy. Okay, let’s do this.)
• Let X equal the number of RETRIEVE API calls made.
• Let Y equal the amount to restore in GB.
• Let Z equal the total amount of data archived in GB.
• Let T equal the time to restore the data in hours.
• Then the cost can be expressed as:
(0.05 * (X / 1000)) + (((Y / T) - (Z * 0.05 / 30)) * 0.01 * 720)
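The same equation as a small Python helper. The `max(..., 0)` clamp is my addition (a restore rate under the free allowance should cost $0 in bandwidth, not a negative amount); everything else follows the slide’s formula term by term.

```python
def glacier_restore_cost(x, y, z, t):
    """Estimate Glacier restore cost in USD (old pricing model).

    x: number of RETRIEVE API calls made
    y: amount restored, in GB
    z: total data archived in Glacier, in GB
    t: time taken to restore, in hours
    """
    # $0.05 per 1,000 retrieval requests
    request_cost = 0.05 * (x / 1000.0)
    # Peak retrieval rate (GB/hr) minus the free allowance, billed at
    # $0.01 per GB-hour across a 720-hour month. Clamped at zero:
    # staying under the free allowance can't earn a credit.
    billable_rate = max((y / t) - (z * 0.05 / 30), 0)
    bandwidth_cost = billable_rate * 0.01 * 720
    return request_cost + bandwidth_cost
```

For example, restoring 3,000 GB out of a 3,000 GB vault in 120 hours with 100,000 retrieval calls comes out to $5 in requests plus $144 in bandwidth; stretching the same restore over more hours drops the peak rate and, with it, the bill.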
22. Task 1: Restore from Glacier
• Two m2.large instances running a Python daemon
• Multiple iterations, from single-threaded to multithreaded to multiprocessing with threading
After iterating several times to get the speed we needed, I started the process for the ‘last time’ on a Sunday evening.
ETA: ~5 days
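In outline, the restore daemon fanned many retrieval requests out across workers. This is a heavily simplified sketch, not the project’s actual code: the bucket name is hypothetical and the AWS call is stubbed out (shown as a comment) so the snippet runs anywhere. Retrieval requests are I/O-bound, which is why a thread pool helps; the talk’s final version layered threads inside multiple processes as well.

```python
from multiprocessing.pool import ThreadPool

def restore_key(key):
    # In the real daemon this issued a restore request via boto,
    # along the lines of:
    # s3.restore_object(Bucket="example-bucket", Key=key,
    #                   RestoreRequest={"Days": 7})
    # Stubbed here: just return the key so progress can be tracked.
    return key

def restore_all(keys, workers=32):
    # Keep many slow, I/O-bound retrieval requests in flight at once.
    with ThreadPool(workers) as pool:
        return pool.map(restore_key, keys)
```

The catch, as the next slides show, is that how fast you *can* issue requests and how fast you *should* are different questions.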
25. Task 1: Restore from Glacier (cont.)
The Glacier team was not amused.
26. Task 1: Restore from Glacier (cont.)
The restore continued at the ‘suggested’ rate and completed successfully a couple of weeks later.
Task 1 complete!
27. Task 2: Migrate and Archive Data
Now we just needed to migrate the data from S3 to Zadara (again), create tarballs of the files, archive them to Glacier, and create a DynamoDB index so individual files could be looked up later.
Easy!
28. Task 2: Migrate and Archive Data (cont.)
Back to iPython and Boto. Recent experience with Python threading and multiprocessing proved helpful.
30. Great Success!
And the whole thing only took about 10x as long as the client initially estimated!
31. Lessons Learned
• Glacier is optimized for large, compressed files and lower request rates.
• Be very careful about the S3 -> Glacier lifecycle option.
• If you DoS an Amazon service, you get special attention!