[@IndeedEng] Redundant Array of Inexpensive Datacenters

Redundant Array of
Inexpensive Datacenters
Charles Valentine and Chris Graf
June 2013

Overview
Charles Valentine
VP, Technology Services

Indeed
● 100 million unique visitors per month
● Over 50 countries and 26 languages
● 3 Billion job searches per month

Indeed Ops
● Assist development in designing new products
● Engineer scalable systems to support applications
● Monitor applications
● Fix systems when they break

Indeed Lingo
Datacenter = Point of Presence

Each Presence is Full Stack
● Applications
● Services
● Read/Write Data systems
● Communications
● Monitoring
We need serious processing power in each
datacenter!

Applications per Datacenter
● Over 40 Java-based web applications
● Over 90 Java-based services

Data Systems
● MySQL databases
● Mongo databases
● Memcached instances
● LSM Trees
● Search indexes
● Numerous other data stores

Goals
● Fast
● Reliable
● Inexpensive

Triple Constraint
Fast
Reliable
Inexpensive

Traditional Method
Fast
Reliable
Inexpensive

Indeed Method
Fast
Reliable
Inexpensive

Fast
Speed is a product feature
● Server Time
● Client Time

1 ms, 3 Billion Times/Month
1 ms = 34 job seeker days per month

20 ms = 22 jobseeker months

100 ms = 9.5 jobseeker years
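The three figures above follow from multiplying the added latency by the monthly search volume. A minimal sketch of that arithmetic, assuming the 3 billion searches per month quoted earlier (the function names are illustrative, not from the talk):

```python
SEARCHES_PER_MONTH = 3_000_000_000  # from the slides: 3 billion job searches per month
SECONDS_PER_DAY = 86_400


def jobseeker_seconds(added_ms, searches=SEARCHES_PER_MONTH):
    """Total waiting time, in seconds per month, caused by added_ms of latency per search."""
    return added_ms * searches / 1000


def as_days(seconds):
    return seconds / SECONDS_PER_DAY


def as_years(seconds):
    return seconds / (SECONDS_PER_DAY * 365)


for ms in (1, 20, 100):
    total = jobseeker_seconds(ms)
    print(f"{ms:>3} ms -> {as_days(total):7.1f} days = {as_years(total):.2f} years per month")
```

One millisecond comes to roughly 34.7 days of collective waiting per month, and 100 ms to roughly 9.5 years.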

Reliable
Reliability is a product feature

Impact of Downtime
8,000
Disappointed Job Seekers every minute

People get hired on Indeed
Every 7 seconds

Availability
● Jobseekers can find jobs
● Less focus on mitigating failure
● More focus on recovering quickly

Availability is Good for Job Seekers
9's

Good
99.9% availability => down for 525 minutes per year
At peak 4,500 jobseekers don't get a job

Better
99.99% availability => down for 52 minutes per year
At peak 450 jobseekers don't get a job

Almost Best
99.999% availability => down for 5 minutes per year
At peak 45 jobseekers don't get a job
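The downtime figures on these slides are the unavailable fraction of a 365-day year. A small sketch of the calculation follows; the missed-hire estimate assumes the "7 seconds" figure means one hire every 7 seconds, which is an interpretation that happens to reproduce the slide numbers, not something the slides state outright:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_pct):
    """Minutes per year a service is down at the given availability percentage."""
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR


def missed_hires(downtime_minutes, seconds_per_hire=7.0):
    """Hires that would not happen during the downtime (assumed rate: one per 7 seconds)."""
    return downtime_minutes * 60 / seconds_per_hire


for pct in (99.9, 99.99, 99.999):
    down = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {down:6.1f} min/year, about {missed_hires(down):.0f} missed hires")
```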

Indeed is Always there for Job
Seekers
Availability > 99.999%
Less than 5 minutes downtime per year

How It Works
Chris Graf
Operations Manager

Maximize Availability
Beyond 99.999%
No downtime, scheduled or otherwise

Maximize Performance
Optimize page load times to the millisecond

Minimize Cost
Minimize cost while meeting performance and
availability goals

Hosting Models
● Traditional Colocation
● The Cloud
● Managed Hosting

Traditional Colocation
● You buy the servers, network gear, cables...
● You send people to set it up
● You send people to fix stuff when it breaks
● You manage your own pipes (maybe)

Traditional Colocation Expansion
1. Acquire rack space
2. Buy the hardware
3. Wait for manufacturing
4. Wait for delivery
5. Send people to the datacenter to set it all up
Expansion can take weeks

Traditional Colocation
Good if you have
● Fairly static environment
● Really beefy hardware
● Some centralized functionality
● Time to wait
● Lots of cap-ex budget
● Willingness to sign long-term deals
● People to do stuff

The Cloud
● You rent access to computing power
● You pay to reserve it even if you aren't using it
● Usually abstracted from hardware layer

Expanding Cloud-based systems
1. Order new instances
2. Wait a few minutes
3. Provision them
Expansion takes minutes.

The Cloud is good!
If you have significant, unpredictable changes
in load

The Cloud is bad!
Costs more if you need all your instances
available all of the time

Managed Hosting
● Rent hardware from provider
● Provider buys and hosts servers, network,
etc.
● Provider deals with hardware issues

Expanding Managed Hosting
1. Order new servers
2. Wait a few hours
3. Provision
Expansion takes hours (depending on
provider)

Indeed Uses Managed Hosting
Least expensive overall
Access to real bare metal hardware
Agile enough

Steps for beyond 99.999% uptime
1. Find a provider
2. Sign contract for 100% uptime with 100%
revenue protection
3. Profit
Right?

Providers "guarantee" availability
"Service Level Agreement" (SLA) guarantees
some percentage of uptime

SLA: brief outages aren't outages
Less than 30 minutes downtime not counted
against "100% SLA"
One 5-minute outage per month < 99.99%
Two 25-minute outages per month < 99.9%
The provider can call that 100% available

SLA: maintenance is not downtime
Scheduled maintenance not counted against
SLA
1 hour maintenance each month < 99.9%
The provider can call that 100% available
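The gap between real availability and SLA availability in the two slides above can be worked through directly. This sketch assumes a 30-day month and a provider that excludes scheduled maintenance and any outage shorter than 30 minutes; the function and parameter names are illustrative:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month


def availability(unplanned=(), scheduled=(), min_counted=30):
    """Return (real, sla) availability percentages for one month.

    real counts every minute of downtime.
    sla  counts only unplanned outages of at least min_counted minutes,
         mirroring the exclusions described on the slides.
    """
    real_down = sum(unplanned) + sum(scheduled)
    sla_down = sum(m for m in unplanned if m >= min_counted)
    real = 100.0 * (1 - real_down / MINUTES_PER_MONTH)
    sla = 100.0 * (1 - sla_down / MINUTES_PER_MONTH)
    return real, sla


print(availability(unplanned=[5]))       # real is just under 99.99, sla is 100.0
print(availability(unplanned=[25, 25]))  # real is under 99.9, sla is 100.0
print(availability(scheduled=[60]))      # real is under 99.9, sla is 100.0
```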

SLA credits don't cover your
business
You get a refund for the services, not for lost
business and lost customer confidence
Providers lose your hosting fees
You lose your revenue

100% is not really 100%
Hosting is complicated
A single datacenter is rarely 100% available

Core network problem
Bug in provider hardware caused total loss of
Internet access under certain load

Power outage
1. Utility power was disrupted
2. Backup generator and UPS couldn't handle load
3. Core network went offline
4. Servers lost power
5. Upon power restoration, router did not recover

Power Outage Aftermath
● Event duration = 54 minutes
● Recovery duration = 12 hours
● 5% monthly credit for affected hardware

Backhoe Induced Fiber Failure
(BIFF)

Wet servers
Tornado peeled back the roof of an AT&T
datacenter in 2004.

Other Disasters
● Hurricanes
● Floods
● Earthquakes
● Fires
● Etc.

Need better uptime than providers
Can only get ~99.7% after asterisks
We have to build something better

Save a document to a hard disk
[Diagram: a document saved to a single hard disk]

Disaster Recovery
Restore from an external USB drive?

Redundant Storage
Simple case - RAID 1
[Diagram: hard disks A and B]

RAID - Save it twice
[Diagram: one document written to hard disks A and B]

RAID - Two copies of everything
[Diagram: hard disks A and B, each holding a copy of the document]

RAID
[Diagram: hard disks A and B, each holding a copy of the document]

RAID == Redundant Array of
Inexpensive Datacenters
[Diagram: jobseekers connected to datacenters A and B]

RAID makes datacenters more
reliable
[Diagram: jobseekers connected to datacenters A and B]

Building a more reliable system
Using inexpensive, less reliable components

99.7% in, 99.999% out
Now our system can get better availability as a
whole than any single provider can give us.
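The "99.7% in, 99.999% out" claim follows from the probability that every datacenter is down at once. A minimal sketch, assuming failures are independent and failover is instantaneous (both optimistic simplifications, not claims from the talk):

```python
def combined_availability(per_datacenter, count):
    """Availability of `count` redundant datacenters, each available `per_datacenter`
    of the time, assuming independent failures and instant failover."""
    return 1 - (1 - per_datacenter) ** count


for n in (1, 2, 3):
    print(f"{n} datacenter(s): {combined_availability(0.997, n) * 100:.5f}%")
```

With two datacenters at 99.7% each, both are down together about 0.0009% of the time, which puts the pair just above 99.999%.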

Expect your datacenter to fail
Failure is inevitable
Design for it

Simpler datacenters with RAID
Only need one of everything inside each
datacenter:
● Firewalls
● Load balancers
● Servers provisioned primarily for capacity not
redundancy

Primary and secondary datacenters
[Diagram: datacenters 1 and 2]

Datacenter level redundancy
Protects against a single datacenter failure

Datacenter level redundancy
Protects against a single datacenter failure
...
But there are problems that can affect more than
one datacenter on the same provider

Denial of service attacks
Distributed denial of service attack against
another customer who had servers in the same
facilities took multiple facilities offline

Network configuration errors
Provider pushed a bad global route which took
their entire global network offline

Protect against global provider
failure
Use multiple providers to get provider-level
redundancy

Provider-level redundancy
[Diagram: datacenters 1 and 2, with two X marks]

Recovering from Failure
● Offline
● Active/Passive
● Active/Active

Offline
● One active datacenter handles all traffic
● Backup systems are offline and incomplete
● Restore backups to new systems
● Downtime during switchover is ~days

Active / Passive (Dark)
● One active datacenter handles all traffic
● A second datacenter has provisioned
systems and all data
● Switch from primary to secondary
● Downtime during switchover is minutes to
hours

Active / Active
● Every datacenter handles traffic
● Data and systems are replicated
● Failover activated automatically
● Downtime during switchover measured in
seconds
● Scales beyond two facilities

Jobseeker Impact
Offline: extended downtime for all jobseekers
Active/Passive: some downtime for all
jobseekers
Active/Active: brief downtime for some
jobseekers

Which jobseekers go to which
datacenter?
Offline: go to single datacenter
Active/Passive: go to single datacenter
Active/Active: go to many datacenters?

Send jobseekers to the best
datacenter
Use dynamic DNS service to send job seekers
to the best, healthy data center

Anycast DNS
Resolving same hostname to different IP
addresses
● Client A: nslookup www.indeed.com
Server: dns.client-a.com
Address: 1.1.1.1
● Client B: nslookup www.indeed.com
Server: dns.client-b.com
Address: 2.2.2.2

DNS Lookup
[Diagram: Jobseeker A asks its DNS server (5.5.5.5) for www.indeed.com. That server asks the Indeed DNS service, which answers 1.1.1.1, and the answer is passed back to the jobseeker.]

Vary response from primary DNS
[Diagram: Jobseeker A, using DNS server 5.5.5.5, gets 1.1.1.1 for www.indeed.com. Jobseeker B, using DNS server 8.8.8.8, gets 2.2.2.2.]

Similar jobseekers get similar
responses
[Diagram: Jobseeker A, using DNS server 5.5.5.5, gets 1.1.1.1 for www.indeed.com. Jobseekers B and C, both using DNS server 8.8.8.8, get 2.2.2.2.]

Remap jobseekers via DNS changes
[Diagram: before the reconfiguration, Jobseeker A, using DNS server 5.5.5.5, gets 1.1.1.1 for www.indeed.com. After the reconfiguration, the same jobseeker gets 2.2.2.2.]

Outsource your DNS service
Doing this well is an investment

Outsource your DNS service
● Robust
● Flexible
● Inexpensive
Our core competency is jobs
Their core competency is DNS

Degradation and Failure
Manually switch datacenter on service
degradation
Automatically switch datacenter on failure
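The two switches described above can be modeled as two independent flags per datacenter: one set by automated health checks, one set by an operator. This is a hypothetical sketch of that idea, not the DNS provider's actual interface; the class, field, and function names are invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class Datacenter:
    name: str
    ip: str
    healthy: bool = True  # flipped automatically when a health check fails
    enabled: bool = True  # cleared manually by an operator on degradation


def eligible_ips(datacenters):
    """IP addresses the DNS service may hand out: healthy and not manually disabled."""
    live = [dc.ip for dc in datacenters if dc.healthy and dc.enabled]
    # Assumption: if every datacenter is marked down, answer with all of them
    # rather than nothing, since the checks themselves may be at fault.
    return live or [dc.ip for dc in datacenters]


dcs = [Datacenter("dc-a", "1.1.1.1"), Datacenter("dc-b", "2.2.2.2")]
dcs[0].healthy = False      # automatic switch: health check failed
print(eligible_ips(dcs))    # ['2.2.2.2']
```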

DNS propagation delays
1. Healthcheck cycle - up to 30 seconds
2. Healthcheck server to nearest PoP
3. Jobseeker's DNS server cache refresh
4. Jobseeker's local DNS cache refresh

DNS Time-to-live (TTL)
TTL tells local name servers and clients how
long to wait before looking up a domain name
again
TTL limits load, but also slows change
propagation

Some clients and servers ignore TTL
We specify a 30 second TTL, but local DNS
servers and clients can ignore it

Impact of propagation delay
90 second traffic hole
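One way to reason about the size of the hole is to add up the four propagation stages listed earlier. The slides give a 30-second health check cycle and a 30-second TTL; the rest of this breakdown is an assumption (stages simply add, the resolver and the client each cache for one full TTL, and the push to the nearest PoP is negligible), and the slides do not state how the 90 seconds was derived:

```python
def worst_case_propagation_seconds(
    healthcheck_cycle=30,  # from the slides: up to 30 seconds
    push_to_pop=0,         # assumption: negligible
    resolver_ttl=30,       # jobseeker's DNS server honors the 30-second TTL
    client_ttl=30,         # jobseeker's local cache honors the 30-second TTL
):
    """Upper bound on how long a well-behaved client keeps using a failed datacenter."""
    return healthcheck_cycle + push_to_pop + resolver_ttl + client_ttl


print(worst_case_propagation_seconds())  # 90
```

Clients and resolvers that ignore the TTL fall outside this bound.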

30 minute tail
Well-behaved clients
Ignoring our TTL

Big Picture
90 second hole
Failing datacenter
Total traffic

Accepting DNS limitations
Complete datacenter failure is extremely rare
Predictable limitation
Massive costs to reduce propagation delay

Remapping Manually
The same system allows us to reroute traffic
whenever we want
● Datacenter maintenance
● Non-critical performance problems
● Non-critical feature loss
● Other degradation of jobseeker experience

Datacenter Redirection
datacenter disabled
traffic moves to others

Anycast DNS for performance
This capability is also used to improve
performance

Closer to the jobseekers
The DNS service can give the IP address of
the datacenter closest to the jobseeker.

Network hops
Based on network hops between jobseeker
DNS server and our DNS service POP

Network paths
Estimates how many networks traffic must
pass through to reach our servers

Count hops
Picks estimated shortest path
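The selection step can be sketched as a minimum over estimated hop counts, restricted to healthy datacenters. The hop counts and datacenter names below are made up for illustration; how the DNS service actually estimates network paths is not described in the talk:

```python
def closest_datacenter(hops_by_datacenter, healthy):
    """Pick the healthy datacenter with the fewest estimated network hops."""
    candidates = {dc: hops for dc, hops in hops_by_datacenter.items() if dc in healthy}
    if not candidates:
        raise LookupError("no healthy datacenter available")
    return min(candidates, key=candidates.get)


# Hypothetical hop estimates from one jobseeker's DNS server.
hops = {"us-west": 3, "us-east": 7, "europe": 12}
print(closest_datacenter(hops, healthy={"us-west", "us-east", "europe"}))  # us-west
print(closest_datacenter(hops, healthy={"us-east", "europe"}))             # us-east
```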

Optimize for network distance
We can push our data center presences closer
to the jobseekers to reduce network latency

Datacenters for redundancy only

Datacenters close to the jobseekers

No downtime for datacenter
replacement
Incrementally send traffic to new datacenters
Incrementally reduce traffic to old data centers
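An incremental cutover can be expressed as a schedule of traffic weights that the DNS answers follow. This is a hypothetical sketch; the step count and the weighted random choice are assumptions for illustration, not details from the talk:

```python
import random


def shift_schedule(steps):
    """Yield (old_pct, new_pct) pairs that move traffic to the new datacenter in equal steps."""
    for i in range(steps + 1):
        new_pct = round(100 * i / steps)
        yield 100 - new_pct, new_pct


def pick_datacenter(weights, rng=random):
    """Choose a datacenter name at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


for old_pct, new_pct in shift_schedule(4):
    print(f"old={old_pct:3d}%  new={new_pct:3d}%")
```

At each step the operator can watch the new datacenter under partial load before moving on, and stepping backward through the same schedule returns traffic to the old one.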

Move West Coast hosting!
-20 ms

Don't move European hosting!
+50 ms!

Search Engine Performance
Source GrabPerf.org

Page Load Time
1,000ms
9,000ms

Summary and Results
Charles Valentine

Traditional Scaling Model
● Higher-capacity network equipment
● Redundant firewalls
● Redundant load balancers
● Bigger Internet connections
● Redundant Internet connections
This is "vertical scaling."

Horizontal Scaling with RAID
Add capacity by adding datacenters
Add redundancy by adding datacenters
Rent "good" datacenters, not "best"

Avoid using proprietary features
● Load balancer
● Security devices
● Virtualization
● Servers

Use free software
No licensing costs or recurring maintenance
fees

Agile Providers
● New hardware racked and ready in a few
hours
● No need to over provision

Automate configuration
● Cobbler
● Puppet

Rent instead of buying
● Obsolete hardware is not your problem
● No depreciation
● No hardware maintenance
● No need to hire people to maintain the hardware

Architect Applications for RAID
Work with your development teams

Traditional Hardware Scaling
● Old hardware supports baseline traffic
● New hardware supports growth

Indeed Hardware Scaling
Old hardware gets replaced by new, on
demand

Moore's Law
Hardware is always getting better
● Faster processors
● More memory per chassis
● Larger, faster disks

Higher capacity, lower cost
● Number of machines drives cost
● Power of machines drives cost
● More machines => more problems
● Compute power grows faster than compute
cost

Replace hardware every 18 months
Managed hosting
+
Moore's Law
+
RAID
=
new and powerful hardware
every 18 months

Amazon EC2?
● Amazon is a single provider
● Costs more to run 24x7
○ 2x without bandwidth cost
● Can't be as close to the jobseeker

What RAID gets you
● Servers closer to your customers
● Disposable datacenters
○ Datacenter-level failover
○ Get modern hardware every 18 months
● Many hosting options

Spend Time On...
● Automation
● Managed DNS
● Investigating Providers
● Monitoring

Spend Less On
● Proprietary hardware
● Network Infrastructure
● Support Contracts
● Software Licenses
● Headcount

Monthly Server Count vs Job Search

Inexpensive
● Cost as a percentage of revenue
● Cost of delivery per job search

Revenue vs Infrastructure Cost

Revenue/Search vs. Cost/Search

Fast
● 100 ms average client time
Reliable
● > 99.999% availability in 2012
Cost Effective
● Cost of delivery < 0.5% of revenue
RAIDing FTW
