Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1oEGtyD.
Ivan Filho shares lessons learned during the development and release of several large scale services at Microsoft and Google from the perspective of a performance manager. Filmed at qconsf.com.
Ivan Santa Maria Filho is currently the performance technical lead for Google Cloud, and his prior experience includes several large releases, including Bing.com and SQL Azure.
3. Presented at QCon San Francisco
www.qconsf.com
4. Disclaimer
While I use what I know to do my job, the contents of this presentation reflect only my opinions, not those of my present or past employers.
5. The formal performance cycle
1. Formulate a hypothesis
2. Develop a prototype
3. Validate findings
4. Integrate improvements
5. Repeat
This is the cycle you want.
6. The real performance cycle
● If the numbers look good: you're a genius! Done.
● If the numbers look bad: the scenarios are wrong.
● If the scenarios are right: the methodology is flawed.
● If the methodology is proven: the implementation is buggy.
● If the implementation is proven: it is too late to change.
● If team leadership decides to hold the release: it's been so long that the scenarios are now outdated.
This is the cycle you start with.
7. The nature of performance work
There is no standing still. You are either moving forward or falling.
8. The nature of performance work
Developers never stop writing code. Make performance improvement the default activity.
9. The nature of performance work
Don't dig a performance hole. If you're in one, start getting out immediately.
10. Delivering Under Pressure
Make it a team problem. Make it the team's second nature.
"A team is a group of people with a common goal, where every single member is necessary to accomplish that goal, everyone knows their role, and everyone knows each other's role."
11. Leading to a positive cycle
Organizational and personal styles
● Telling: directly tell people what to do
● Selling: influence the team or key stakeholders
● Participating: share the decision-making
● Delegating: trust other leaders, but monitor progress
Scope of influence
● Senior leadership, managers, and engineers lead differently
● The need to explain the why, what, when, where, and how remains
Organization maturity
● High: capable and confident
● Moderate: capable but unwilling; unable but willing
● Low: unable and insecure
12. Why
Create product capability and competitive matrices
● Enterprise products succeed either by saving money or by enabling new things
● Consumer products also succeed because they are "cool"
○ Consumers might, over a long time, change the enterprise
● Enumerate all product capabilities
● Keywords
○ Direct: "throughput" and "latency"
○ Indirect: "fluid", "natural", and "amazing"
13. What
Create product backlog and metrics
● Enumerate all known product features
● Define how to measure success for each one - you need to know when to stop
Great performance metrics are
● Few and memorable
● Intentional, purposeful, and consequential at all levels
● Measurable and actionable
● Targeted at the competition, or raising competitors' barrier to entry
14. Types of performance metrics

Level        | Type               | Owner                 | Example
Strategic    | Competitive        | Business owner        | TPC or YCSB
Tactical     | Customer scenarios | Dev manager, director | TestDFSIO
Foundational | Micro-benchmarks   | Engineers             | FIO 4KB random reads

Notable performance metrics
● Awesomeness: Boeing 737 passenger throughput
● Unintended consequences: FAA on-time departures
● "It's complicated": MS SQL Server replication time to resolution
15. Metric collection methodology
Collection methodology matters
1. Prepare cluster
2. Deploy product and background data
3. Warm-up period
4. Run benchmark (typically all you see)
5. Cool-down period
6. Turn down cluster
Understand what you want to know; common blind spots:
● Ignoring bootstrap/turndown on cloud deployments
● Ignoring perceived versus actual performance in user interfaces
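The six phases above can be sketched as a small harness. This is a minimal illustration (the workload, phase durations, and function names are all invented for the example), not any particular benchmarking tool; note that only the "run" phase latencies are kept, which is typically all a report shows:

```python
import time

def run_benchmark(workload, warmup_s=0.1, run_s=0.2, cooldown_s=0.1):
    """Drive `workload` through warm-up, measured run, and cool-down phases.
    Only latencies from the 'run' phase are recorded."""
    latencies = []
    for phase, duration in [("warmup", warmup_s), ("run", run_s), ("cooldown", cooldown_s)]:
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            start = time.monotonic()
            workload()
            if phase == "run":
                latencies.append(time.monotonic() - start)
    return latencies

# Stand-in workload; a real harness would also prepare and turn down the cluster.
lat = run_benchmark(lambda: sum(range(1000)))
```

A harness structured this way makes it easy to later include the ignored phases (bootstrap, turndown) in what you measure.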
16. Observable performance vs. SLAs
Service Level Agreements (SLAs)
● Contractual obligations between provider and consumer
● A clear unit of measurement and collection methodology
● Report cards
● Validation and remediation mechanisms
● An escalation path on violations
Observable performance
● How the product has behaved lately - no promises, no guarantees
● What most customers actually take dependencies on
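Since observable performance is "how the product behaved lately", it is usually summarized as percentiles over recent samples and compared against the contractual bound. A minimal sketch using the nearest-rank method, with invented sample data and an invented SLA threshold:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile over observed samples."""
    ranked = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ranked)))
    return ranked[rank - 1]

# Invented latency samples (ms) for one service over the last window.
latencies_ms = [12, 15, 11, 90, 14, 13, 250, 16, 12, 14]
p50 = percentile(latencies_ms, 50)   # the typical experience
p99 = percentile(latencies_ms, 99)   # the tail customers remember
sla_ms = 100                         # hypothetical contractual bound
violations = sum(1 for v in latencies_ms if v > sla_ms)
```

The gap between p50 and p99 here is exactly the gap between what you advertise and what customers take dependencies on.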
17. When
Define meaningful deadlines
● Tied to "why" the product exists and "what" defines success
● The team knows what they want to learn each time
Deadlines should improve the team's maturity
● They create the habit of measuring progress against metrics
● They are checkpoints to refine metrics
● They are opportunities to refine estimation and learn the team's reaction time
● Post-mortems should bring data insights, not data dumps
Deadlines should give you something to celebrate.
18. How
Continuously identify cost structure improvements
● Model the cost structure: CapEx and OpEx
○ How does cost grow as the user base grows?
○ How does cost grow as the background data grows?
○ How does cost grow as engagement grows?
● Enumerate risk factors and dependencies
○ How does cost grow as your cloud provider's price sheet changes?
Having an opinion does not equate to knowing
● Create price/performance models using product metrics
● Highly recommended blog: http://perspectives.mvdirona.com/
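A price/performance model can start as a few lines of code that make the growth assumptions explicit. This is a toy sketch: every unit price below is an invented assumption, and only the one-DBA-per-65k-apps ratio echoes the talk:

```python
def monthly_cost(users, data_gb):
    """Toy CapEx/OpEx model; all unit prices are illustrative assumptions."""
    storage_cost = data_gb * 0.04                # $/GB-month of raw storage (assumed)
    serving_cost = (users / 10_000) * 500.0      # $ per 10k-user serving shard (assumed)
    ops_cost = max(1, users // 65_000) * 12_000  # one DBA-equivalent per 65k apps
    return storage_cost + serving_cost + ops_cost

# How does cost grow as the user base and background data grow?
cost_now = monthly_cost(65_000, 1_000)
cost_doubled = monthly_cost(130_000, 2_000)
```

Even a model this crude answers the "how does cost grow as X grows?" questions with numbers instead of opinions, and can be refined as real product metrics arrive.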
19. Where
Determine market-specific metrics
● Supplier latencies: compare to SLAs and product needs
● Local user devices: how fast are the devices?
● Supplier monopolies and the regulatory environment
Validate new markets against performance metrics.
"On the Cloud there are three key elements to performance: Location, location, and location"
- Anthony F. Voellm (my boss)
20. Selected network latencies
[Chart: average query latencies to Google's BigQuery, in ms]
M-LAB raw data: https://code.google.com/p/m-lab/wiki/PDEChartsNDT
21. Recognizing good performance work
Telltale signs
● Aligned in spirit with "why" the product exists
● Tied to "what" the product delivers through a latency or throughput metric
● Scope of work contained by "when" it should be available
● Provides structural advantages in "how" the product is built (TCO is a plus)
● Has leftover assets in case of failure
It is harder to spot than you might think
● Data insight is non-obvious by definition
● You might not want to share some decisions
● It might be about someone failing and learning
22. Non-obvious example - Operations
"I'd like to automate our deployment process"
● Performance metrics might not include bootstrap times
● Potentially disruptive to operations
● Might change how your company negotiates with suppliers
But...
● If it takes 1 DBA for every 65 servers (or 65k apps), how many does it take to run a hosted database service?
● Creative employees do not like manual work - retention
● Not everyone likes to carry pagers - hiring
● If "it takes a village" to run your business, you might have to invade one:
○ Microsoft's 99k+ employees to Google's 30k+
○ Groupon's 10k+ employees to Netflix's 2k+
23. Even less obvious consequences
Google operations as a structural advantage
● A warehouse-scale computer is not just a building full of computers
● End-to-end design: cooling, power, layout, network, …
● Cost control from the start: CapEx, repairs, deployments, ...
● Start simple and iterate
A different way to do it
● Tireless study of the market and competitors
● Concentrate on software only
● Leverage the market and partners
● Start with a commodity solution and work with OEMs
24. Non-obvious example - Power costs
"I'd like to use energy prices to guide storage locations"
● Might rent space from a cloud provider - who cares?
● It likely hurts tail latency
● Migrating data is a pain
But...
● Being able to migrate helps when dealing with supplier pricing
● Scaling out, not up, is the more likely model for the cloud
● It might be a learning opportunity
25. Power costs look promising...
An important planning metric for cloud providers
● Determines which servers are available
○ Expensive energy: newer servers
○ Cheap energy: older servers, if real estate prices and SLAs allow
Google and power management
● Plans using thermal modeling
● Raises the thermostat (to 80°F)
● Manages airflow on the cheap: big slow fans, cooling towers, drippers, isolating components that run hotter, …
● Actively manages available power and the power footprint
● Became an energy trader
26. … But then the network gets you
The challenge with storage today is access, not volume.
[Chart: HDD storage historical prices in $/GB, falling from ~$437K/GB to ~$0.04/GB]
Storage pricing ($/GB)
● Commodity HDD: $0.04
● AWS: $0.037 to $0.095
● Google: $0.042 and $0.085
Access pricing (egress, $/GB)
● AWS: $0 to $0.12, up to "call us"
● Google: $0 to $0.15, up to "call us"
○ Charges per operation
● Ingress is almost always free
27. Notes on large distributed storage
Lessons from Google
● Rare performance problems affect a significant fraction of all requests
● Eliminating all sources of latency variability is impractical
● Tail-tolerant techniques make a predictable whole from less predictable parts
Complicated systems interact in complicated ways
● Global resources (switches and shared file systems)
● Shared resources (locks, cores, memory, and network bandwidth)
● Daemons and background tasks
● Maintenance (data reconstruction, log compaction, SSD garbage collection)
● Power limits and management enforced per CPU and rack
● Network latencies
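One well-known tail-tolerant technique is the hedged request: send the call to a second replica if the first has not answered within a small delay, and take whichever response arrives first. A minimal sketch with a simulated slow replica (the replica names, timings, and tail probability are all invented):

```python
import concurrent.futures
import random
import time

def query_replica(replica_id):
    """Simulated replica call: a small fraction of calls hit the latency tail."""
    time.sleep(0.5 if random.random() < 0.05 else 0.01)
    return f"result-from-{replica_id}"

def hedged_request(replicas, hedge_after_s=0.05):
    """Query the primary; if it has not answered within hedge_after_s,
    also query a backup replica and return whichever finishes first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(query_replica, replicas[0])]
        done, _ = concurrent.futures.wait(futures, timeout=hedge_after_s)
        if not done:  # primary is slow: hedge with a second replica
            futures.append(pool.submit(query_replica, replicas[1]))
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

result = hedged_request(["replica-a", "replica-b"])
```

The extra load is modest (a hedge is only sent for requests already slower than the delay threshold), yet it converts an unpredictable per-replica tail into a much more predictable whole.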
28. Non-obvious - "unexpected" assets
While chasing power costs, you
● Learned how to move data around
● Understood tail latencies and throughput (observable and SLAs)
● Learned how to route requests around data moves
● Developed good storage performance
So now you can
● Improve the product by adding tolerance for high tail latency
● Move your data around when necessary
● Scale out better
● Manage your network traffic around peak demand
30. A note on Order of Growth analysis

get a positive integer n from input       # runs once: T1
if n > 10                                 # runs once: T2
    print "This might take a while..."    # runs at most once: T3
for i = 1 to n                            # runs n+1 times: T4·(n+1)
    for j = 1 to n                        # runs n·(n+1) times: T5·n·(n+1)
        print i * j                       # runs n·n times: T6·n·n
print "Done!"                             # runs once: T7

T1 + T2 + T3 + T4·(n+1) + T5·n·(n+1) + T6·n·n + T7
⇒ (T5+T6)·n² + (T4+T5)·n + (T1+T2+T3+T4+T7)

The algorithm is O(n²), but what about the constants T?
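The counting argument above is easy to verify empirically. A small sketch, with the prints replaced by a counter so we count inner-body executions rather than timing I/O:

```python
def quadratic_work(n):
    """The nested loop from the slide, with printing replaced by a counter."""
    ops = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            ops += 1          # stands in for: print i * j
    return ops

# Doubling n quadruples the inner-loop work, whatever the constant T is.
small = quadratic_work(10)
large = quadratic_work(20)
```

The growth rate is independent of T, but as the next slide shows, the constants decide whether "n²" means microseconds or hours.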
31. Values of T you learn over time

Duration | What
1 ns     | 1 GHz CPU clock cycle
4 ns     | L2 cache reference
17 ns    | Mutex lock/unlock
82 us    | RAM copy of 1 MB (Core i7-2600)
1.1 ms   | 1 MB TCP transfer, 10 Gbps card nominal speed
2 ms     | SSD 1 MB sequential read
4 ms     | Disk seek
5 ms     | HDD 1 MB sequential read
5 ms     | Photon travels 5,000 km in fiber
5 ms     | 1 MB TCP transfer, 1 Gbps card nominal speed
5 ms     | Copy 1 MB over network, same DC
16 ms    | Copy 1 MB over network, same region
~100 ms  | Human-perceived delay
200 ms   | Download 1 MB (fast US ISP)
~350 ms  | Human eye blink
2.1 s    | Download 1 MB (slow US ISP)
5 s      | MySQL 5.5 default timeout
1 h      | Alerting troops 200 miles out, Tang dynasty
1 h      | Download 1 MB (slow US ISP and slow device)
32. Total team effort
To deliver good performance under pressure you need the whole team involved, and it must be their second nature to do so.
● Learn or define what the team values
● Reward successful milestones in their currency
● Make performance work easier
○ Control architecture complexity
○ Make data readily available for analysis
○ Make it easy to add more data
○ Reward developing and sharing tools
33. Know or define the team currency
Companies have currencies
● What justifies a promotion?
● What makes one shine during performance reviews?
People have currencies
● Cash, equity, and stock
● Technical challenges
● Reputation
● Sense of purpose
● Power
34. Common "rewards"

Currency            | Stick                          | Carrot
Money               | Cut pay, fire                  | Money and equity, spot bonuses (logarithmic returns after a "sanitizing level")
Technical challenge | Assign menial tasks            | Ask for help with structural changes
Reputation          | Public shaming                 | Public acknowledgement, honest flattery; a Technical Leader designation; as simple as a certificate or small gift card
Sense of purpose    | Show apathy towards their work | Show how their work helps the team mission and product value proposition
35. Simple improvements
Instrument your product
● Logs over profilers
● Decide on logging formats and routines early
● Implement distributed correlation IDs
Make product milestones work
● Instrument test and deployment tools to collect performance data
● An "it only works if it works in production" mentality
○ "Any sufficiently advanced technology is indistinguishable from magic" - Arthur C. Clarke
● Design A/B testing and canary deployments
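A distributed correlation ID can be as simple as stamping every log line for a request with one random ID, which a real system would also forward to downstream services (for example, in an RPC or HTTP header). A minimal single-process sketch; the logger name and log format are invented for the example:

```python
import contextvars
import logging
import uuid

# The current request's correlation ID.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

logger = logging.getLogger("product")            # logger name is arbitrary
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    correlation_id.set(uuid.uuid4().hex)         # one fresh ID per request
    logger.info("request started")
    logger.info("request finished")              # same ID: lines can be joined later

handle_request()
```

With the ID in every line, logs from many services can be joined after the fact to reconstruct one request's timeline, which is what makes "logs over profilers" workable at scale.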
36. Final remarks
● There is no standing still
● Learn to measure what matters and let everyone know
○ Avoid the "smart talk trap"; avoid debating abstractions
○ Data insights are the greatest sanitizer
● Make performance a whole-team effort
○ Assign owners to all metrics
● Reward structural, quantifiable improvements
● Use the team currency
● Make the work easier by removing blockers