Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1oEGtyD.
Ivan Filho shares lessons learned during the development and release of several large scale services at Microsoft and Google from the perspective of a performance manager. Filmed at qconsf.com.
Ivan Santa Maria Filho is currently the performance technical lead for Google Cloud, and his prior experience includes several large releases, including Bing.com and SQL Azure.
3. Presented at QCon San Francisco
www.qconsf.com
4. Disclaimer
While I use what I know to do my job, the contents of this presentation reflect only my opinions, not those of my present or past employers.
5. The formal performance cycle
1. Formulate a hypothesis
2. Develop a prototype
3. Validate findings
4. Integrate improvements
5. Repeat
This is the cycle you want.
6. The real performance cycle
● If the numbers look good: you're a genius! Done.
● If the numbers look bad: the scenarios are wrong.
● If the scenarios are right: the methodology is flawed.
● If the methodology is proven: the implementation is buggy.
● If the implementation is proven: it is too late to change.
● If team leadership decides to hold the release: it's been so long that the scenarios are now outdated.
This is the cycle you start with.
7. The nature of performance work
There is no standing still. You are either moving forward or falling.
8. The nature of performance work
Developers never stop writing code. Make performance improvement the default activity.
9. The nature of performance work
Don't dig a performance hole. If you're in one, start getting out immediately.
10. Delivering Under Pressure
Make it a team problem. Make it the team's second nature.
"A team is a group of people with a common goal, where every single member is necessary to accomplish that goal, everyone knows their role, and everyone knows each other's role."
11. Leading to a positive cycle
Organizational and personal styles
● Telling: directly tell people what to do
● Selling: influence the team or key stakeholders
● Participating: share the decision-making
● Delegating: trust other leaders, but monitor progress
Scope of influence
● Senior leadership, managers, and engineers lead differently
● The need to explain the why, what, when, where, and how remains
Organization maturity
● High: capable and confident
● Moderate: capable but unwilling; unable but willing
● Low: unable and insecure
12. Why
Create product capability and competitive matrices
● Enterprise products succeed either by saving money or by enabling new things
● Consumer products also succeed because they are "cool"
○ Consumers might, over a long time, change the enterprise
● Enumerate all product capabilities
● Keywords
○ Direct: "throughput" and "latency"
○ Indirect: "fluid", "natural", and "amazing"
13. What
Create product backlog and metrics
● Enumerate all known product features
● Define how to measure success for each one - you need to know when to stop
Great performance metrics are
● Few and memorable
● Intentional, purposeful, and consequential at all levels
● Measurable and actionable
● Targeted at the competition, or raising competitors' barrier to entry
14. Types of performance metrics

Level        | Type               | Owner                 | Example
Strategic    | Competitive        | Business owner        | TPC or YCSB
Tactical     | Customer scenarios | Dev manager, director | TestDFSIO
Foundational | Micro-benchmarks   | Engineers             | FIO 4KB random reads

Notable performance metrics
● Awesomeness: Boeing 737 passenger throughput
● Unintended consequences: FAA on-time departures
● "It's complicated": MS SQL Server replication time to resolution
15. Metric collection methodology
Collection methodology matters
1. Prepare cluster
2. Deploy product and background data
3. Warm-up period
4. Run benchmark (typically all you see)
5. Cool-down period
6. Turn down cluster
Understand what you want to know; common blind spots:
● Ignoring bootstrap/turndown on cloud deployments
● Ignoring perceived versus actual performance in user interfaces
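The six phases above can be sketched as a small harness. This is a minimal illustration (the workload, phase durations, and function names are all invented for the example), not any particular benchmarking tool; note that only the "run" phase latencies are kept, which is typically all a report shows:

```python
import time

def run_benchmark(workload, warmup_s=0.1, run_s=0.2, cooldown_s=0.1):
    """Drive `workload` through warm-up, measured run, and cool-down phases.
    Only latencies from the 'run' phase are recorded."""
    latencies = []
    for phase, duration in [("warmup", warmup_s), ("run", run_s), ("cooldown", cooldown_s)]:
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            start = time.monotonic()
            workload()
            if phase == "run":
                latencies.append(time.monotonic() - start)
    return latencies

# Stand-in workload; a real harness would also prepare and turn down the cluster.
lat = run_benchmark(lambda: sum(range(1000)))
```

A harness structured this way makes it easy to later include the ignored phases (bootstrap, turndown) in what you measure.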
16. Observable performance vs. SLAs
Service Level Agreements (SLAs)
● Contractual obligations between provider and consumer
● A clear unit of measurement and collection methodology
● Report cards
● Validation and remediation mechanisms
● An escalation path on violations
Observable performance
● How the product has behaved lately - no promises, no guarantees
● What most customers actually take dependencies on
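Since observable performance is "how the product behaved lately", it is usually summarized as percentiles over recent samples and compared against the contractual bound. A minimal sketch using the nearest-rank method, with invented sample data and an invented SLA threshold:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile over observed samples."""
    ranked = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ranked)))
    return ranked[rank - 1]

# Invented latency samples (ms) for one service over the last window.
latencies_ms = [12, 15, 11, 90, 14, 13, 250, 16, 12, 14]
p50 = percentile(latencies_ms, 50)   # the typical experience
p99 = percentile(latencies_ms, 99)   # the tail customers remember
sla_ms = 100                         # hypothetical contractual bound
violations = sum(1 for v in latencies_ms if v > sla_ms)
```

The gap between p50 and p99 here is exactly the gap between what you advertise and what customers take dependencies on.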
17. When
Define meaningful deadlines
● Tied to "why" the product exists and "what" defines success
● The team knows what they want to learn each time
Deadlines should improve the team's maturity
● They create the habit of measuring progress against metrics
● They are checkpoints to refine metrics
● They are opportunities to refine estimation and learn the team's reaction time
● Post-mortems should bring data insights, not data dumps
Deadlines should give you something to celebrate.
18. How
Continuously identify cost structure improvements
● Model the cost structure: CapEx and OpEx
○ How does cost grow as the user base grows?
○ How does cost grow as the background data grows?
○ How does cost grow as engagement grows?
● Enumerate risk factors and dependencies
○ How does cost grow as your cloud provider's price sheet changes?
Having an opinion does not equate to knowing
● Create price/performance models using product metrics
● Highly recommended blog: http://perspectives.mvdirona.com/
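A price/performance model can start as a few lines of code that make the growth assumptions explicit. This is a toy sketch: every unit price below is an invented assumption, and only the one-DBA-per-65k-apps ratio echoes the talk:

```python
def monthly_cost(users, data_gb):
    """Toy CapEx/OpEx model; all unit prices are illustrative assumptions."""
    storage_cost = data_gb * 0.04                # $/GB-month of raw storage (assumed)
    serving_cost = (users / 10_000) * 500.0      # $ per 10k-user serving shard (assumed)
    ops_cost = max(1, users // 65_000) * 12_000  # one DBA-equivalent per 65k apps
    return storage_cost + serving_cost + ops_cost

# How does cost grow as the user base and background data grow?
cost_now = monthly_cost(65_000, 1_000)
cost_doubled = monthly_cost(130_000, 2_000)
```

Even a model this crude answers the "how does cost grow as X grows?" questions with numbers instead of opinions, and can be refined as real product metrics arrive.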
19. Where
Determine market-specific metrics
● Supplier latencies: compare to SLAs and product needs
● Local user devices: how fast are the devices?
● Supplier monopolies and the regulatory environment
Validate new markets against performance metrics.
"On the Cloud there are three key elements to performance: Location, location, and location"
- Anthony F. Voellm (my boss)
20. Selected network latencies
[Chart: average query latencies to Google's BigQuery, in ms]
M-LAB raw data: https://code.google.com/p/m-lab/wiki/PDEChartsNDT
21. Recognizing good performance work
Telltale signs
● Aligned in spirit with "why" the product exists
● Tied to "what" the product delivers through a latency or throughput metric
● Scope of work contained by "when" it should be available
● Provides structural advantages in "how" the product is built (TCO is a plus)
● Has leftover assets in case of failure
It is harder to spot than you might think
● Data insight is non-obvious by definition
● You might not want to share some decisions
● It might be about someone failing and learning
22. Non-obvious example - Operations
"I'd like to automate our deployment process"
● Performance metrics might not include bootstrap times
● Potentially disruptive to operations
● Might change how your company negotiates with suppliers
But...
● If it takes 1 DBA for every 65 servers (or 65k apps), how many does it take to run a hosted database service?
● Creative employees do not like manual work - retention
● Not everyone likes to carry pagers - hiring
● If "it takes a village" to run your business, you might have to invade one:
○ Microsoft's 99k+ employees to Google's 30k+
○ Groupon's 10k+ employees to Netflix's 2k+
23. Even less obvious consequences
Google operations as a structural advantage
● A warehouse-scale computer is not just a building full of computers
● End-to-end design: cooling, power, layout, network, …
● Cost control from the start: CapEx, repairs, deployments, ...
● Start simple and iterate
A different way to do it
● Tireless study of the market and competitors
● Concentrate on software only
● Leverage the market and partners
● Start with a commodity solution and work with OEMs
24. Non-obvious example - Power costs
"I'd like to use energy prices to guide storage locations"
● Might rent space from a cloud provider - who cares?
● It likely hurts tail latency
● Migrating data is a pain
But...
● Being able to migrate helps when dealing with supplier pricing
● Scaling out, not up, is the more likely model for the cloud
● It might be a learning opportunity
25. Power costs look promising...
An important planning metric for cloud providers
● Determines which servers are available
○ Expensive energy: newer servers
○ Cheap energy: older servers, if real estate prices and SLAs allow
Google and power management
● Plans using thermal modeling
● Raises the thermostat (to 80°F)
● Manages airflow on the cheap: big slow fans, cooling towers, drippers, isolating components that run hotter, …
● Actively manages available power and the power footprint
● Became an energy trader
26. … But then the network gets you
The challenge with storage today is access, not volume.
[Chart: HDD storage historical prices in $/GB, falling from ~$437K/GB to ~$0.04/GB]
Storage pricing ($/GB)
● Commodity HDD: $0.04
● AWS: $0.037 to $0.095
● Google: $0.042 and $0.085
Access pricing (egress, $/GB)
● AWS: $0 to $0.12, up to "call us"
● Google: $0 to $0.15, up to "call us"
○ Charges per operation
● Ingress is almost always free
27. Notes on large distributed storage
Lessons from Google
● Rare performance problems affect a significant fraction of all requests
● Eliminating all sources of latency variability is impractical
● Tail-tolerant techniques make a predictable whole from less predictable parts
Complicated systems interact in complicated ways
● Global resources (switches and shared file systems)
● Shared resources (locks, cores, memory, and network bandwidth)
● Daemons and background tasks
● Maintenance (data reconstruction, log compaction, SSD garbage collection)
● Power limits and management enforced per CPU and rack
● Network latencies
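One well-known tail-tolerant technique is the hedged request: send the call to a second replica if the first has not answered within a small delay, and take whichever response arrives first. A minimal sketch with a simulated slow replica (the replica names, timings, and tail probability are all invented):

```python
import concurrent.futures
import random
import time

def query_replica(replica_id):
    """Simulated replica call: a small fraction of calls hit the latency tail."""
    time.sleep(0.5 if random.random() < 0.05 else 0.01)
    return f"result-from-{replica_id}"

def hedged_request(replicas, hedge_after_s=0.05):
    """Query the primary; if it has not answered within hedge_after_s,
    also query a backup replica and return whichever finishes first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(query_replica, replicas[0])]
        done, _ = concurrent.futures.wait(futures, timeout=hedge_after_s)
        if not done:  # primary is slow: hedge with a second replica
            futures.append(pool.submit(query_replica, replicas[1]))
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

result = hedged_request(["replica-a", "replica-b"])
```

The extra load is modest (a hedge is only sent for requests already slower than the delay threshold), yet it converts an unpredictable per-replica tail into a much more predictable whole.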
28. Non-obvious - "unexpected" assets
While chasing power costs, you
● Learned how to move data around
● Understood tail latencies and throughput (observable and SLAs)
● Learned how to route requests around data moves
● Developed good storage performance
So now you can
● Improve the product by adding tolerance for high tail latency
● Move your data around when necessary
● Scale out better
● Manage your network traffic around peak demand
30. A note on Order of Growth analysis

get a positive integer n from input       # runs once: T1
if n > 10                                 # runs once: T2
    print "This might take a while..."    # runs at most once: T3
for i = 1 to n                            # runs n+1 times: T4·(n+1)
    for j = 1 to n                        # runs n·(n+1) times: T5·n·(n+1)
        print i * j                       # runs n·n times: T6·n·n
print "Done!"                             # runs once: T7

T1 + T2 + T3 + T4·(n+1) + T5·n·(n+1) + T6·n·n + T7
⇒ (T5+T6)·n² + (T4+T5)·n + (T1+T2+T3+T4+T7)

The algorithm is O(n²), but what about the constants T?
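The counting argument above is easy to verify empirically. A small sketch, with the prints replaced by a counter so we count inner-body executions rather than timing I/O:

```python
def quadratic_work(n):
    """The nested loop from the slide, with printing replaced by a counter."""
    ops = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            ops += 1          # stands in for: print i * j
    return ops

# Doubling n quadruples the inner-loop work, whatever the constant T is.
small = quadratic_work(10)
large = quadratic_work(20)
```

The growth rate is independent of T, but as the next slide shows, the constants decide whether "n²" means microseconds or hours.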
31. Values of T you learn over time

Duration | What
1 ns     | 1 GHz CPU clock cycle
4 ns     | L2 cache reference
17 ns    | Mutex lock/unlock
82 us    | RAM copy of 1 MB (Core i7-2600)
1.1 ms   | 1 MB TCP transfer, 10 Gbps card nominal speed
2 ms     | SSD 1 MB sequential read
4 ms     | Disk seek
5 ms     | HDD 1 MB sequential read
5 ms     | Photon travels 5,000 km in fiber
5 ms     | 1 MB TCP transfer, 1 Gbps card nominal speed
5 ms     | Copy 1 MB over network, same DC
16 ms    | Copy 1 MB over network, same region
~100 ms  | Human-perceived delay
200 ms   | Download 1 MB (fast US ISP)
~350 ms  | Human eye blink
2.1 s    | Download 1 MB (slow US ISP)
5 s      | MySQL 5.5 default timeout
1 h      | Alerting troops 200 miles out, Tang dynasty
1 h      | Download 1 MB (slow US ISP and slow device)
32. Total team effort
To deliver good performance under pressure you need the whole team involved, and it must be their second nature to do so.
● Learn or define what the team values
● Reward successful milestones in their currency
● Make performance work easier
○ Control architecture complexity
○ Make data readily available for analysis
○ Make it easy to add more data
○ Reward developing and sharing tools
33. Know or define the team currency
Companies have currencies
● What justifies a promotion?
● What makes one shine during performance reviews?
People have currencies
● Cash, equity, and stock
● Technical challenges
● Reputation
● Sense of purpose
● Power
34. Common "rewards"

Currency            | Stick                          | Carrot
Money               | Cut pay, fire                  | Money and equity, spot bonuses (logarithmic returns after a "sanitizing level")
Technical challenge | Assign menial tasks            | Ask for help with structural changes
Reputation          | Public shaming                 | Public acknowledgement, honest flattery; a Technical Leader designation; as simple as a certificate or small gift card
Sense of purpose    | Show apathy towards their work | Show how their work helps the team mission and product value proposition
35. Simple improvements
Instrument your product
● Logs over profilers
● Decide on logging formats and routines early
● Implement distributed correlation IDs
Make product milestones work
● Instrument test and deployment tools to collect performance data
● An "it only works if it works in production" mentality
○ "Any sufficiently advanced technology is indistinguishable from magic" - Arthur C. Clarke
● Design A/B testing and canary deployments
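A distributed correlation ID can be as simple as stamping every log line for a request with one random ID, which a real system would also forward to downstream services (for example, in an RPC or HTTP header). A minimal single-process sketch; the logger name and log format are invented for the example:

```python
import contextvars
import logging
import uuid

# The current request's correlation ID.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

logger = logging.getLogger("product")            # logger name is arbitrary
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    correlation_id.set(uuid.uuid4().hex)         # one fresh ID per request
    logger.info("request started")
    logger.info("request finished")              # same ID: lines can be joined later

handle_request()
```

With the ID in every line, logs from many services can be joined after the fact to reconstruct one request's timeline, which is what makes "logs over profilers" workable at scale.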
36. Final remarks
● There is no standing still
● Learn to measure what matters and let everyone know
○ Avoid the "smart talk trap"; avoid debating abstractions
○ Data insights are the greatest sanitizer
● Make performance a whole-team effort
○ Assign owners to all metrics
● Reward structural, quantifiable improvements
● Use the team currency
● Make the work easier by removing blockers