10. Performance
• Fast network
EC2 10 Gigabit Ethernet
- Cluster Compute
- High Memory Cluster
- Cluster GPU
- High I/O
- High Storage
- Network cards
- VLAN separation
11. Performance
• Fast network
Workload: Read/Write?
What is being stored?
Result set size
- Read / write: adds to replication oplog
- Images? Web pages? Tiny documents?
- What is being returned? Optimised to return certain fields?
12. Performance
• Fast network
Use
Network Throughput
Normal
0-100Mb/s
Replication (Initial Sync)
Burst +100Mb/s
Replication (Oplog)
0-100Mb/s
Backup
Initial Sync + Oplog
15. Performance
• Fast network
Location
Ping RTT Latency
Within USA
40-80ms
Trans-Atlantic
100ms
Trans-Pacific
150ms
Europe - Japan
300ms
Ping - low overhead
Important for replication
26. Eventual Consistency
Use Case
Needs consistency?
Graphs
No
User profile
Yes
Statistics
Depends
Alert config
Yes
Statistics - depends on when they’re updated
40. Performance
SSD vs Spinning
However, CPU usage for SSDs is higher. This may be a driver issue so worth testing your own
hardware. Tests done using Bonnie.
66. Tips: rand()
•Field names
•Covered indexes
•Collections / databases
- Dropping collections faster than remove()
- Split use cases across databases to avoid locking
- Put databases onto different disks / types e.g. SSDs
72. david@asriel ~: scp david@stelmaria:~/local/local.11 .
local.11
100% 2047MB
6.8MB/s
05:01
Restore time
- Needed to resync a database server across the US
- Take too long; oplog not large enough
- Fast internal network but slow internet
79. Monitoring tools
Run yourself
Ganglia
So Server Density is the tool my company produces but if you don’t like it, want to run your
own tools locally or just want to try some others, then that’s fine.
81. Dealing with humans
On-call
-
Sharing out the responsibility
Determining level of response: 24/7 real monitoring or first responder
24/7 real monitoring for HA environments, real people at a screen at all times
First responder: people at the end of a phone
82. Dealing with humans
On-call
1) Ops engineer
- During working hours our dedicated ops engineers take the first level
- Avoids interrupting product engineers for initial fire fighting
83. Dealing with humans
On-call
1) Ops engineer
2) All engineers
- Out of hours we rotate every engineer, product and ops
- Rotation every 7 days on a Tuesday
84. Dealing with humans
On-call
1) Ops engineer
2) All engineers
3) Ops engineer
- Always have a secondary
- This is always an ops engineer
- Thinking is if the issue needs to be escalated then it’s likely a bigger problem that needs
additional systems expertise
85. Dealing with humans
On-call
1) Ops engineer
2) All engineers
3) Ops engineer
4) Others
- Support from design / frontend engineering
- Have to press a button to get them involved
88. Dealing with humans
Uptime reporting
- Weekly internal report on G+
- Gives visibility to entire company about any incidents
- Allows us to discuss incidents to get to that 100% uptime
89. Dealing with humans
Social issues
-
How quickly can you get to a computer?
Are they out drinking on a Friday?
What happens if someone is ill?
What if there’s a sudden emergency: accident? family emergency?
Do they have enough phone battery?
Can you hear the ringtone?
90. Dealing with humans
Backup responder
-
Backup responder
Time out the initial responder
Escalate difficult problems
Essentially human redundancy: phone provider, geographic area, internet connectivity
91. Dealing with outages
Expected
- Outages are going to happen, especially at the beginning
- Costs money for redundancy
- How you deal with them
92. Dealing with outages
Communication
Externally
- Telling people what is happening
- Frequently
- Dependent on audience - we can go into more detail because our customers are techies
- Github do a good job of providing incident writeups but don’t provide a good idea of what
is happening right now
- Generally Amazon and Heroku are good and go into more detail
93. Dealing with outages
Communication
Internally
- Open Skype conferences between the responders
- Usually mostly silence or the sound of the keyboard, but simulates being in the situation
room
- Faster than typing
94. Dealing with outages
Really test your vendors
-
Shows up flaws in vendor support processes
Frustrating when waiting on someone else
You want as much information as possible
Major outage? Everyone will be calling them
95. Dealing with outages
Simulations
- Try and avoid unncessary problems
- Do servers come back up from boot?
- Can hot spares handle the load?
- Test failover: databases, HA firewalls
- Regularly reboot servers
- Wargames can happen at another stage: startups are usually too focused on building things
first