Video available: http://youtu.be/hOsA5UpPUSU
Learn how Indeed built one of the fastest and most reliable websites in the world. Indeed Operations ensures indeed.com is always available and always fast for the jobseeker. Operations leaders Charles Valentine and Chris Graf will share how we configure and provision multiple datacenters around the world to provide a massively scalable platform for connecting job seekers with jobs. Charles and Chris will detail a simple and inexpensive method to build a platform that provides DNS-based global load balancing and failover, provider portability, and disposable datacenters.
Speakers:
Charles Valentine (VP of Technology Services at Indeed) leads the Operations, IT, and Security teams. Prior to joining Indeed in 2011, Charles served as VP Technology Services at The Knot.
Chris Graf has managed operations at Indeed since 2011. In that time, Indeed's traffic has grown by more than 300%. Prior to Indeed, Chris managed Web operations in the online gaming industry.
4. Indeed
● 100 million unique visitors per month
● Over 50 countries and 26 languages
● 3 billion job searches per month
7. Indeed Ops
● Assist development in designing new products
● Engineer scalable systems to support applications
● Monitor applications
● Fix systems when they break
9. Each Presence is Full Stack
● Applications
● Services
● Read/Write Data systems
● Communications
● Monitoring
We need serious processing power in each
datacenter!
35. Traditional Colocation
● You buy the servers, network gear, cables...
● You send people to set it up
● You send people to fix stuff when it breaks
● You manage your own pipes (maybe)
36. Traditional Colocation Expansion
1. Acquire rack space
2. Buy the hardware
3. Wait for manufacturing
4. Wait for delivery
5. Send people to the datacenter to set it all up
Expansion can take weeks
37. Traditional Colocation
Good if you have
● Fairly static environment
● Really beefy hardware
● Some centralized functionality
● Time to wait
● Lots of cap-ex budget
● Willingness to sign long-term deals
● People to do stuff
38. The Cloud
● You rent access to computing power
● You pay to reserve it if you aren't using it
● Usually abstracted from hardware layer
47. SLA: brief outages aren't outages
Less than 30 minutes downtime not counted
against "100% SLA"
One 5-minute outage per month < 99.99%
Two 25-minute outages per month < 99.9%
The provider can call that 100% available
48. SLA: maintenance is not downtime
Scheduled maintenance not counted against
SLA
1 hour maintenance each month < 99.9%
The provider can call that 100% available
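The percentages on the two SLA slides above can be checked with a little arithmetic. The sketch below assumes a 30-day month (43,200 minutes); the function name `availability_percent` is illustrative and not from the talk.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month (43,200 minutes)


def availability_percent(downtime_minutes: float) -> float:
    """Return measured availability for one month, as a percentage."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_MONTH)


# Downtime scenarios taken from the slides, in minutes per month.
scenarios = {
    "one 5-minute outage": 5,
    "two 25-minute outages": 2 * 25,
    "one 1-hour maintenance window": 60,
}

for label, minutes in scenarios.items():
    print(f"{label}: {availability_percent(minutes):.3f}%")
# one 5-minute outage: 99.988%
# two 25-minute outages: 99.884%
# one 1-hour maintenance window: 99.861%
```

Each scenario falls below the threshold named on the slide, yet none of them would count against the provider's stated "100%" under the exclusions described.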
49. SLA credits don't cover your
business
You get a refund for the services, not for lost
business and lost customer confidence
Providers lose your hosting fees
You lose your revenue
50. 100% is not really 100%
Hosting is complicated
A single datacenter is rarely 100% available
51. Core network problem
Bug in provider hardware caused total loss of
Internet access under certain load
52. Power outage
1. Utility power was disrupted
2. Backup generator and UPS couldn't handle load
3. Core network went offline
4. Servers lost power
5. Upon power restoration, router did not recover
71. Simpler datacenters with RAID
Only need one of everything inside each
datacenter:
● Firewalls
● Load balancers
● Servers provisioned primarily for capacity not
redundancy
74. Datacenter level redundancy
Protects against a single datacenter failure
...
But there are problems that can affect more than
one datacenter on the same provider
75. Denial of service attacks
Distributed denial of service attack against
another customer who had servers in the same
facilities took multiple facilities offline
82. Offline
● One active datacenter handles all traffic
● Backup systems are offline and incomplete
● Restore backups to new systems
● Downtime during switchover is ~days
83. Active / Passive (Dark)
● One active datacenter handles all traffic
● A second datacenter has provisioned
systems and all data
● Switch from primary to secondary
● Downtime during switchover is minutes to
hours
84. Active / Active
● Every datacenter handles traffic
● Data and systems are replicated
● Failover activated automatically
● Downtime during switchover measured in
seconds
● Scales beyond two facilities
85. Jobseeker Impact
Offline: extended downtime for all jobseekers
Active/Passive: some downtime for all
jobseekers
Active/Active: brief downtime for some
jobseekers
86. Which jobseekers go to which
datacenter?
Offline: go to single datacenter
Active/Passive: go to single datacenter
Active/Active: go to many datacenters?
87. Send jobseekers to the best
datacenter
Use a dynamic DNS service to send jobseekers
to the best healthy datacenter
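One way to picture "best, healthy" is as a filter followed by a ranking. The sketch below is only an illustration: the talk does not say how "best" is measured, so lowest latency is an assumption here, and the `Datacenter` type, its fields, and the address 3.3.3.3 are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Datacenter:
    name: str
    address: str
    healthy: bool
    latency_ms: float  # hypothetical latency from the jobseeker's region


def best_datacenter(candidates: List[Datacenter]) -> Optional[Datacenter]:
    """Pick the lowest-latency datacenter among those passing healthchecks."""
    healthy = [dc for dc in candidates if dc.healthy]
    if not healthy:
        return None  # nothing healthy to send traffic to
    return min(healthy, key=lambda dc: dc.latency_ms)


datacenters = [
    Datacenter("dc-a", "1.1.1.1", healthy=False, latency_ms=20.0),
    Datacenter("dc-b", "2.2.2.2", healthy=True, latency_ms=45.0),
    Datacenter("dc-c", "3.3.3.3", healthy=True, latency_ms=80.0),
]
choice = best_datacenter(datacenters)
print(choice.address if choice else "no healthy datacenter")
# 2.2.2.2
```

The closest datacenter (dc-a) is skipped because it is unhealthy, so the answer falls through to the next-best healthy one.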
88. Anycast DNS
Resolving same hostname to different IP
addresses
● Client A: nslookup www.indeed.com
Server: dns.client-a.com
Address: 1.1.1.1
● Client B: nslookup www.indeed.com
Server: dns.client-b.com
Address: 2.2.2.2
90. Vary response from primary DNS
[Diagram: Jobseeker A uses DNS server 5.5.5.5, which asks the Indeed DNS service and receives www.indeed.com = 1.1.1.1. Jobseeker B uses DNS server 8.8.8.8, which asks the Indeed DNS service and receives www.indeed.com = 2.2.2.2.]
91. Similar jobseekers get similar responses
[Diagram: Jobseeker A uses DNS server 5.5.5.5 and receives www.indeed.com = 1.1.1.1. Jobseekers B and C both use DNS server 8.8.8.8 and both receive www.indeed.com = 2.2.2.2.]
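The idea on this slide, that the answer depends on which DNS server asks rather than on the individual jobseeker, can be sketched as a lookup keyed on the resolver's address. The table, the fallback answer, and the `resolve` function are assumptions for illustration; the addresses are the placeholder values from the slides.

```python
# Hypothetical table: resolver address -> address returned for www.indeed.com.
ANSWER_BY_RESOLVER = {
    "5.5.5.5": "1.1.1.1",
    "8.8.8.8": "2.2.2.2",
}
DEFAULT_ANSWER = "1.1.1.1"  # assumption: fallback for resolvers not in the table


def resolve(hostname: str, resolver_ip: str) -> str:
    """Answer based on which DNS server asked, not on the jobseeker behind it."""
    if hostname != "www.indeed.com":
        raise KeyError(hostname)
    return ANSWER_BY_RESOLVER.get(resolver_ip, DEFAULT_ANSWER)


# Jobseekers B and C share a DNS server, so they receive the same answer.
jobseekers = {"A": "5.5.5.5", "B": "8.8.8.8", "C": "8.8.8.8"}
answers = {name: resolve("www.indeed.com", ip) for name, ip in jobseekers.items()}
print(answers)
# {'A': '1.1.1.1', 'B': '2.2.2.2', 'C': '2.2.2.2'}
```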
92. Remap jobseekers via DNS changes
[Diagram: Before the reconfiguration, Jobseeker A uses DNS server 5.5.5.5 and receives www.indeed.com = 1.1.1.1 from the Indeed DNS service. After the reconfiguration, the same jobseeker and DNS server receive www.indeed.com = 2.2.2.2.]
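A remap of this kind amounts to changing the answer the DNS service hands to a given resolver. The class below is a toy stand-in written for illustration; `DnsService` and its methods are hypothetical and do not represent any real managed-DNS API.

```python
from typing import Dict


class DnsService:
    """Toy stand-in for a managed DNS service (hypothetical, not a real API)."""

    def __init__(self, answer_by_resolver: Dict[str, str]) -> None:
        self._answers = dict(answer_by_resolver)

    def resolve(self, resolver_ip: str) -> str:
        return self._answers[resolver_ip]

    def remap(self, old_address: str, new_address: str) -> int:
        """Point every resolver currently sent to old_address at new_address."""
        moved = 0
        for resolver_ip, address in self._answers.items():
            if address == old_address:
                self._answers[resolver_ip] = new_address
                moved += 1
        return moved


dns = DnsService({"5.5.5.5": "1.1.1.1"})
before = dns.resolve("5.5.5.5")
dns.remap("1.1.1.1", "2.2.2.2")  # the "reconfig" step in the diagram
after = dns.resolve("5.5.5.5")
print(before, "->", after)
# 1.1.1.1 -> 2.2.2.2
```

The jobseeker and their DNS server do nothing differently; only the service's configuration changes.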
97. DNS propagation delays
1. Healthcheck cycle - up to 30 seconds
2. Healthcheck server to nearest PoP
3. Jobseeker's DNS server cache refresh
4. Jobseeker's local DNS cache refresh
98. DNS Time-to-live (TTL)
TTL tells local name servers and clients how
long to wait before looking up a domain name
again
TTL limits load, but also slows change
propagation
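The trade-off can be seen in a small caching sketch: while an entry is within its TTL, no upstream query is made (less load), but a changed answer is not seen either (slower propagation). `TtlCache` is a hypothetical illustration, not how any particular resolver is implemented, and the clock is injected so the example runs without waiting.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Cache lookups for ttl_seconds before asking upstream again."""

    def __init__(self, lookup: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # host -> (address, expiry)
        self.upstream_queries = 0

    def resolve(self, hostname: str) -> str:
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh: no upstream query
        self.upstream_queries += 1
        address = self._lookup(hostname)
        self._entries[hostname] = (address, now + self._ttl)
        return address


# Simulated authoritative answer and clock, so the demo is deterministic.
current = {"address": "1.1.1.1", "time": 0.0}
cache = TtlCache(lambda host: current["address"], ttl_seconds=30,
                 clock=lambda: current["time"])

first = cache.resolve("www.indeed.com")   # upstream lookup at t=0
current["address"] = "2.2.2.2"            # authoritative answer changes
current["time"] = 10.0
stale = cache.resolve("www.indeed.com")   # within TTL: old answer, no query
current["time"] = 31.0
fresh = cache.resolve("www.indeed.com")   # TTL expired: new answer
print(first, stale, fresh, cache.upstream_queries)
# 1.1.1.1 1.1.1.1 2.2.2.2 2
```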
99. Some clients and servers ignore TTL
We specify a 30 second TTL, but local DNS
servers and clients can ignore it
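As a rough worked example, the four stages on the propagation-delay slide can be added up to estimate how long a jobseeker might keep reaching a failed datacenter. Only the 30-second healthcheck cycle and the 30-second TTL come from the talk; the PoP propagation figure and the assumption that the local cache holds an answer for a full TTL are placeholders.

```python
HEALTHCHECK_CYCLE_S = 30  # from the talk: up to 30 seconds
POP_PROPAGATION_S = 5     # assumption: placeholder, not a figure from the talk
RESOLVER_TTL_S = 30       # from the talk: 30-second TTL
CLIENT_TTL_S = 30         # assumption: local cache holds the answer for a full TTL


def upper_bound_failover_seconds(healthcheck_s: int, pop_s: int,
                                 resolver_cache_s: int, client_cache_s: int) -> int:
    """Sum the per-stage worst cases as a rough upper bound."""
    return healthcheck_s + pop_s + resolver_cache_s + client_cache_s


total = upper_bound_failover_seconds(
    HEALTHCHECK_CYCLE_S, POP_PROPAGATION_S, RESOLVER_TTL_S, CLIENT_TTL_S
)
print(f"rough upper bound: {total} seconds")
# rough upper bound: 95 seconds
```

This is only a bound under the stated assumptions. Caches that pass on the remaining TTL overlap rather than add, which shortens the delay, while servers and clients that ignore the TTL can hold a stale answer for longer.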
103. Accepting DNS limitations
Complete datacenter failure is extremely rare
Predictable limitation
Massive costs to reduce propagation delay
104. Remapping Manually
The same system allows us to reroute traffic
whenever we want
● Datacenter maintenance
● Non-critical performance problems
● Non-critical feature loss
● Other degradation of jobseeker experience
127. Traditional Scaling Model
● Higher-capacity network equipment
● Redundant firewalls
● Redundant load balancers
● Bigger Internet connections
● Redundant Internet connections
This is "vertical scaling."
128. Horizontal Scaling with RAID
Add capacity by adding datacenters
Add redundancy by adding datacenters
Rent "good" datacenters, not "best"
136. Rent instead of buying
● Obsolete hardware is not your problem
● No depreciation
● No hardware maintenance
● No need to hire people to maintain the hardware
140. Moore's Law
Hardware is always getting better
● Faster processors
● More memory per chassis
● Larger, faster disks
141. Higher capacity, lower cost
● Number of machines drives cost
● Power of machines drives cost
● More machines => more problems
● Compute power grows faster than compute
cost
142. Replace hardware every 18 months
Managed hosting
+
Moore's Law
+
RAID
=
new and powerful hardware
every 18 months
143. Amazon EC2?
● Amazon is a single provider
● Costs more to run 24x7
○ 2x without bandwidth cost
● Can't be as close to the jobseeker
144. What RAID gets you
● Servers closer to your customers
● Disposable datacenters
○ Datacenter-level failover
○ Get modern hardware every 18 months
● Many hosting options
145. Spend Time On...
● Automation
● Managed DNS
● Investigating Providers
● Monitoring
146. Spend Less On
● Proprietary hardware
● Network Infrastructure
● Support Contracts
● Software Licenses
● Headcount