How we built, architected and scaled Defensio, from our prototype to the version currently in production.
Presented at ConFoo in Montreal on March 12, 2010.
1. Building Scalable Web Applications for the Cloud
Carl Mercier (@cmercier)
Director of software development, Websense Inc.
Founder, Defensio.com
cmercier@websense.com
OUTSMARTING EVIL SPAM
Friday, March 12, 2010
2. Security for the Social Web
We protect your website from spam,
malicious content,
unwanted URLs and profanity.
3. The Cloud is different
4. Architecture challenges
• We’re an API, not a website
• Many millions of requests per day, nonstop
• Each request can be fast or slow
• Very little caching possible
5. Architecture challenges
• Write intensive
• Traffic comes in spikes
• Any downtime is catastrophic
• 2 different versions of our APIs
• Bootstrapped startup. We’re broke!
6. Getting technical
• Built in Ruby (Rails, Merb and pure Ruby)
• External services written in Perl and C
• 100% hosted on Amazon EC2
• Mix of 32-bit and 64-bit machines
• Mostly m1.small (the cheapest ones)
7. Prototyping/1.0 beta
aka The Spaghetti Release
• Single Ruby on Rails application
• No direction whatsoever
• A few small EC2 instances
• A single MySQL
8. Prototyping/1.0 beta
aka The Spaghetti Release
• Horizontal scaling: start more instances
• This also scaled the website
• Eventually moved MySQL to m1.large
[Diagram: DNS Round Robin → NGINX + API + WEB (multiple instances) → MySQL]
9. What was wrong?
• Unmaintainable code
• Why did it even work?
• but it REALLY did work, and well! :)
• DNS Round Robin
• Very database intensive
10. The Big Rewrite
• Complete code rewrite
• Proper code separation
• Completely tested
• Ruby + Merb + DataMapper
• Replaced DNS RR with HAProxy
• Added Memcached to the mix
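The slides do not say why DNS round robin was replaced. One widely known limitation is that plain DNS rotation keeps handing out hosts that have died, whereas a proxy can skip them. A minimal Python sketch of that idea follows; the class, method and backend names are hypothetical, and this is not HAProxy's implementation or configuration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Rotates over backends, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        # A real proxy would do this after failed health checks.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


# Hypothetical backend names.
balancer = RoundRobinBalancer(["api-1", "api-2", "api-3"])
balancer.mark_down("api-2")
picks = [balancer.pick() for _ in range(4)]
```

With "api-2" marked down, requests alternate between the two remaining instances instead of failing one time in three.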
11. The Big Rewrite
architecture
HAProxy
NGINX + API (Merb) NGINX + API (Merb) NGINX + API (Merb)
MySQL + Memcached
12. Later Improvements
• Dumped HAProxy (single point of failure)
• replaced with Amazon ELB
• Moved Memcached to its own machine
• Decoupled resource-intensive parts
• turned them into web services
13. The Big Rewrite
architecture, revisited
Amazon ELB
NGINX + API (Merb)
many EC2 instances
MySQL
Memcached
Web Service 1 Web Service n
14. Advantages of this architecture
• Easy to scale horizontally OR vertically
• Each unit can be scaled & tweaked independently
• Easy to maintain
• Increased redundancy
15. MySQL Pain
• Traffic keeps growing
• Adding millions of records per day
• Database size growing exponentially
• Most of this data was non-critical
• Stuck with our schemas and indexes
16. Scaling MySQL on EC2
• If your DB fits in memory, don’t worry, be happy!
• It’s painful.
• You should be on EBS or equivalent
• permanent and robust storage
• EBS snapshots
17. Scaling MySQL on EC2
• Scale up (move to a bigger machine)
• More RAM
• Database often IO bound
• RAID 0 (stripe)
• Inconsistent EBS snapshots
18. Scaling MySQL on EC2
• Replication
• headache
• all writes go to master
• Split database
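To illustrate why replication helps little for a write-intensive service, here is a small Python sketch of read/write splitting. The router class and host names are assumptions made for illustration; nothing here is a MySQL API.

```python
import random


class ReplicatedRouter:
    """Sends reads to a replica and everything else to the master."""

    READ_VERBS = ("SELECT", "SHOW")

    def __init__(self, master, replicas, rng=None):
        self.master = master
        self.replicas = list(replicas)
        self.rng = rng or random.Random()

    def route(self, sql):
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb in self.READ_VERBS and self.replicas:
            return self.rng.choice(self.replicas)
        # INSERT, UPDATE, DELETE and anything unknown go to the master.
        return self.master


# Hypothetical host names.
router = ReplicatedRouter("db-master", ["db-replica-1", "db-replica-2"])
write_target = router.route("INSERT INTO documents VALUES (1)")
read_target = router.route("select * from accounts")
```

Adding replicas only spreads the reads. Every write still lands on the single master, which is the bottleneck when the workload is mostly writes.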
19. MongoDB
• Document-oriented database
• Schema-less
• Fast
• Replication, fail-over, auto-sharding
20. Three Data Stores
• MySQL (critical data)
• accounts, keys, account settings, statistics
• MongoDB (semi-critical data)
• documents, reputations
• Memcached (non-critical data)
• short term, very fast updates
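The split above can be pictured as a lookup from the kind of record to the store that holds it. This Python sketch is only an illustration: the record-kind names are hypothetical labels derived from the bullets, not identifiers from the Defensio code base.

```python
# Which store owns which kind of record, following the three tiers above.
STORE_FOR_KIND = {
    # Critical data
    "account": "mysql",
    "api_key": "mysql",
    "account_settings": "mysql",
    "statistics": "mysql",
    # Semi-critical data
    "document": "mongodb",
    "reputation": "mongodb",
    # Non-critical, short-term, very fast updates (hypothetical kind name)
    "short_term_counter": "memcached",
}


def store_for(kind):
    """Return the store name for a record kind; unknown kinds are an error."""
    try:
        return STORE_FOR_KIND[kind]
    except KeyError:
        raise ValueError("no store configured for %r" % kind)
```

Failing loudly on an unknown kind is a deliberate choice in this sketch: silently defaulting to the volatile store could lose data that turns out to matter.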
21. Three Data Stores
Amazon ELB
NGINX + API (Merb)
many EC2 instances
MySQL (m1.small)
MongoDB (64-bit)
Memcached
Web Service 1 Web Service n
22. API 2.0 Challenges
• Completely new API to the user
• Keep support for 1.x
• Asynchronous
• New features, can’t just wrap API 1.x
23. Frontend
• Ruby on Rails
• Accepts HTTP connections
• Knows the API definition for both 1.x and 2.0
• Converts API calls into “jobs”
24. Frontend
• Jobs are put in a queue
• Backend responds with generic response
• Frontend converts response and renders
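A rough Python sketch of the frontend's two conversions: versioned API call to generic job, and generic response back to the caller's format. All field names and response formats here are invented for illustration; the real 1.x and 2.0 API definitions are not shown in the slides.

```python
import json


def to_job(api_version, action, params):
    """Normalise a versioned API call into a version-agnostic job."""
    if api_version == "1.x":
        # Hypothetical: pretend 1.x called the field 'comment_body'.
        content = params["comment_body"]
    elif api_version == "2.0":
        content = params["content"]
    else:
        raise ValueError("unsupported API version: %s" % api_version)
    return {"action": action, "content": content}


def render(api_version, response):
    """Convert the backend's generic response into the caller's format."""
    if api_version == "1.x":
        # Hypothetical plain-text format for the old API.
        return "spam: %s" % ("true" if response["spam"] else "false")
    return json.dumps({"status": "success", "spam": response["spam"]},
                      sort_keys=True)


old_job = to_job("1.x", "audit", {"comment_body": "hello"})
new_job = to_job("2.0", "audit", {"content": "hello"})
```

Both calls produce the same job, so whatever consumes the queue never needs to know which API version the client spoke.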
25. Queue/Messaging: RabbitMQ
• Messaging (AMQP)
• Ultra-fast
• Feature-rich
• Complex (too complex for our needs)
26. Queue/Messaging: Beanstalkd
• Ultra-simple queue
• Not a messaging server (hack it to make it behave like one!)
• Just as fast as RabbitMQ
• Delayed jobs
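The "delayed jobs" idea can be shown with a tiny in-memory queue in Python. This is only a sketch of the concept under an explicit, caller-supplied clock; it is not Beanstalkd's protocol or client API, and the class and method names are hypothetical.

```python
import heapq
import itertools


class DelayedQueue:
    """Jobs become visible to consumers only once their delay has elapsed."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so jobs with the same ready time keep insertion order.
        self._seq = itertools.count()

    def put(self, job, now, delay=0):
        heapq.heappush(self._heap, (now + delay, next(self._seq), job))

    def reserve(self, now):
        """Return the next ready job, or None if nothing is ready yet."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None


q = DelayedQueue()
q.put("recheck-document", now=0, delay=10)  # ready at t=10
q.put("classify-document", now=0)           # ready immediately
first = q.reserve(now=0)
nothing_yet = q.reserve(now=5)
later = q.reserve(now=10)
```

Passing the time in explicitly keeps the example deterministic; a real queue server would use its own clock.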
27. Backend
• Previously our “API” servers
• Doesn’t accept HTTP connections anymore
• Communicates through jobs/response (queue)
• API agnostic. Only knows about jobs/response
• All processing/logic
• Spits a response back in the queue
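A minimal Python sketch of a backend worker that only speaks jobs and responses. It uses the standard library's in-process queue as a stand-in for the real queue server, and the classification rule is a placeholder; the job fields and the blocklist are assumptions, since the slides do not describe the actual logic.

```python
import queue

# Placeholder rule only; the real classification logic is not in the slides.
BLOCKLIST = ("cheap pills",)


def classify(content):
    text = content.lower()
    return any(phrase in text for phrase in BLOCKLIST)


def work_once(jobs, responses):
    """Take one job, process it, and put a generic response on the queue."""
    job = jobs.get_nowait()
    result = {"job_id": job["id"], "spam": classify(job["content"])}
    responses.put(result)
    return result


jobs, responses = queue.Queue(), queue.Queue()
jobs.put({"id": 1, "content": "Buy CHEAP PILLS today"})
jobs.put({"id": 2, "content": "Nice post, thanks"})
work_once(jobs, responses)
work_once(jobs, responses)
```

The worker never sees HTTP or an API version. Any number of identical workers can consume from the same queue, which is what makes this tier easy to scale out.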
28. Current Architecture API 2.0
Amazon ELB
Cluster n
API Frontend (Unicorn + Rails)
many EC2 instances
Queue/Messaging
(Beanstalkd)
Backend (hacked Merb)
many EC2 instances
MySQL (m1.small)
MongoDB (64-bit)
Memcached
Web Service 1 Web Service n
29. Advantages
• Awesome fault tolerance
• Horizontal scaling is easy
• Add capacity to a cluster
• Add clusters
• No more MySQL scaling worries
• Complete schema flexibility w/ MongoDB
[Diagram: current API 2.0 architecture, as on slide 28]
30. When to scale “out” (horizontally)
• All instances are identical clones
• Redundancy
• Fast & easy scaling
• Any single instance is “irrelevant”
31. What we scale “out” (horizontally)
• Frontend
• Backend
• Internal web services
32. When to scale “up” (vertically)
• Multiple instances are hard to manage (eg: database)
• CPU or memory intensive applications
• Scaling out becomes impractical
• Scaling out becomes cost-ineffective
33. I really like scaling out vs. scaling up
34. Bulletproof your app
35. Scale & shrink fast, even automatically
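One way to picture automatic scaling and shrinking is a function that turns the current backlog into a target instance count. The Python sketch below assumes queue depth is the signal; the thresholds and parameter names are hypothetical, and the slides do not say which metric or limits were actually used.

```python
import math


def desired_instances(queue_depth, jobs_per_instance,
                      min_instances=2, max_instances=20):
    """Target number of workers for the current backlog, within fixed bounds."""
    if jobs_per_instance <= 0:
        raise ValueError("jobs_per_instance must be positive")
    needed = math.ceil(queue_depth / jobs_per_instance)
    # Never drop below the redundancy floor or exceed the budget ceiling.
    return max(min_instances, min(max_instances, needed))


quiet = desired_instances(0, 100)
busy = desired_instances(950, 100)
spike = desired_instances(10000, 100)
```

The floor keeps some redundancy during quiet periods, and the ceiling caps spending during a spike.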
36. Most cost-effective
37. Things I learned
• Cloud instances are disposable
• Architect your app accordingly
• Instances should be killed, not fixed
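"Killed, not fixed" can be sketched as a reconcile step: unhealthy instances are scheduled for termination and replaced with fresh ones. This Python sketch only computes the plan; the function name, instance IDs and health flags are hypothetical, and no real EC2 calls are made.

```python
def reconcile(instances, desired_count):
    """Plan replacements.

    instances: dict mapping instance id -> True if healthy, False otherwise.
    Returns (ids to terminate, number of fresh instances to launch).
    """
    to_terminate = sorted(i for i, healthy in instances.items() if not healthy)
    healthy_count = len(instances) - len(to_terminate)
    to_launch = max(0, desired_count - healthy_count)
    return to_terminate, to_launch


# Hypothetical fleet: one of three instances has gone bad.
plan = reconcile({"i-a": True, "i-b": False, "i-c": True}, desired_count=3)
```

Nothing in the plan tries to repair "i-b". This only works if every instance is an identical clone built from an image, so a replacement is as good as the original.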
38. Things I learned
• Pre-optimizing is useless
• Be aware of your bottlenecks
• Architect your application for flexibility
• Deploy different parts to different servers
• Secure your important data
39. Things I learned (about EC2)
• It is pretty reliable; anything else you heard is a myth
• When shit hits the fan, you’re on your own
• Create images
• Automate as much as you can
40. Things I learned (about EC2)
• Auto-scaling is easy, but rarely needed
• IO is inconsistent and mostly sucks
• Slowish (Rackspace Cloud is much faster)
• Large(r) instances are too expensive
41. Questions?
Twitter: @cmercier and @defensio
Email: cmercier@websense.com
Web: www.defensio.com