Delivering Performance Under Resource and Schedule Pressure
Ivan Santa Maria Filho
Google Cloud Performance TL
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/performance-manager-google-microsoft

Presented at QCon San Francisco
www.qconsf.com
Disclaimer
While I use what I know to do my job, the contents of this presentation reflect only my opinion, not those of my present or past employers.
The formal performance cycle
1. Formulate a hypothesis
2. Develop a prototype
3. Validate findings
4. Integrate improvements
5. Repeat

This is the cycle you want
The real performance cycle
● If the numbers look good: You’re a genius! Done.
● If the numbers look bad: The scenarios are wrong
● If scenarios are right: The methodology is flawed
● If the methodology is proven: Buggy implementation
● If the implementation is proven: It is too late to change
● If the team leadership decides to hold the release: It’s been so long the scenarios are now outdated

This is the cycle you start with
The nature of performance work
There is no standing still. You are either moving forward or falling.

Developers never stop writing code. Make performance improvement the default activity.

Don’t dig a performance hole. If you’re in one then start to get out immediately.
Delivering Under Pressure
Make it a team problem
Make it the team’s second nature
“A team is a group of people with a common goal, where every single member is necessary to accomplish that goal, everyone knows their role, and everyone knows each other’s role”
Leading to a positive cycle
Organizational and personal styles
● Telling: Directly tell people what to do
● Selling: Influence the team or key stakeholders
● Participating: Share the decision-making
● Delegating: Trust other leaders, but monitor progress

Scope of influence
● Senior leadership, managers, and engineers lead differently
● The need to explain the why, what, when, where, and how remains

Organization maturity
● High: Capable and confident
● Moderate: Capable but unwilling; Unable but willing
● Low: Unable and insecure
Why
Create product capability and competitive matrices
● Enterprise products succeed by either saving money or enabling new things
● Consumer products also succeed because they are “cool”
○ Consumers might, over a long time, change the enterprise
● Enumerate all product capabilities
● Keywords
○ Direct: “throughput” and “latency”
○ Indirect: “fluid”, “natural”, and “amazing”
What
Create product backlog and metrics
● Enumerate all known product features
● Define how to measure success for each one; you need to know when to stop

Great performance metrics are
● Few and memorable
● Intentional, purposeful, and consequential at all levels
● Measurable and actionable
● Targeted at the competition, or raise a barrier to entry for competitors
Types of performance metrics
          Strategic        Tactical                Foundational
Type      Competitive      Customer scenarios      Micro-benchmarks
Owner     Business owner   Dev manager, director   Engineers
Example   TPC or YCSB      TestDFSIO               FIO 4KB random reads

Notable performance metrics
● Awesomeness: Boeing 737 passenger throughput
● Unintended consequences: FAA departure on time
● “It’s complicated”: MS-SQL Server Replication time to resolution
Metric collection methodology
Collection methodology matters
1. Prepare cluster
2. Deploy product and background data
3. Warm-up period
4. Run benchmark (typically all you see)
5. Cool-down period
6. Turn down cluster

Understand what you want to know
● Ignoring bootstrap/turndown on cloud deployments
● Ignoring perceived versus actual performance on user interfaces
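
The collection steps above can be sketched as a small timing harness. This is a minimal sketch rather than a real cluster tool: the phase names mirror the list, while `run_phases`, `benchmark_share`, and the sleep-based placeholder actions are hypothetical stand-ins. The point is that every phase is timed, so bootstrap and turndown costs stay visible next to the benchmark number.

```python
import time
from typing import Callable, Dict, List, Tuple

Phase = Tuple[str, Callable[[], None]]


def run_phases(phases: List[Phase]) -> Dict[str, float]:
    """Run phases in order; return wall-clock seconds spent in each one."""
    durations: Dict[str, float] = {}
    for name, action in phases:
        start = time.perf_counter()
        action()
        durations[name] = time.perf_counter() - start
    return durations


def benchmark_share(durations: Dict[str, float], phase: str = "run_benchmark") -> float:
    """Fraction of the total elapsed time spent in the named phase."""
    total = sum(durations.values())
    return durations[phase] / total if total > 0 else 0.0


if __name__ == "__main__":
    # Placeholder actions (assumptions): swap in real cluster operations.
    phases: List[Phase] = [
        ("prepare_cluster", lambda: time.sleep(0.03)),
        ("deploy_product", lambda: time.sleep(0.02)),
        ("warm_up", lambda: time.sleep(0.01)),
        ("run_benchmark", lambda: time.sleep(0.02)),
        ("cool_down", lambda: time.sleep(0.01)),
        ("turn_down_cluster", lambda: time.sleep(0.02)),
    ]
    durations = run_phases(phases)
    for name, seconds in durations.items():
        print(f"{name:>18}: {seconds * 1000:.1f} ms")
    print(f"benchmark share of total: {benchmark_share(durations):.0%}")
```

Reporting the benchmark's share of the total is one way to notice when the number everyone quotes covers only a small part of what a cloud customer actually waits for.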
Observable performance vs. SLAs
Service Level Agreements (SLAs)
● Contractual obligations between provider and consumer
● Clear unit of measurement and collection methodology
● Report cards
● Validation and Remediation mechanisms
● Escalation path on violations

Observable performance
● How the product behaved lately - no promises, no guarantees
● What most customers take dependencies on
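
The gap between a contractual SLA and observable performance can be made concrete with a percentile check. The sketch below is an illustration under stated assumptions: it uses the nearest-rank percentile definition, and the function names, the choice of the 99th percentile, and the latency limits are all hypothetical, not taken from any real SLA.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def sla_report(samples_ms: Sequence[float], sla_pct: float, sla_limit_ms: float) -> Dict[str, object]:
    """Compare what was observed lately against what the contract promises."""
    observed = percentile(samples_ms, sla_pct)
    return {
        "observed_ms": observed,
        "limit_ms": sla_limit_ms,
        "headroom_ms": sla_limit_ms - observed,
        "violated": observed > sla_limit_ms,
    }


if __name__ == "__main__":
    recent_latencies_ms = list(range(1, 101))  # made-up observations
    print(sla_report(recent_latencies_ms, sla_pct=99, sla_limit_ms=250))
```

A large headroom is the interesting case: customers tend to depend on the observed number, so tightening toward the SLA limit can break them even though no contract is violated.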
When
Define meaningful deadlines
● Tied to “why” the product exists and “what” defines success
● The team knows what they want to learn each time

Deadlines should improve the team maturity
● Creates the habit to measure progress against metrics
● Checkpoint to refine metrics
● Opportunity to refine estimation and learn the team reaction time
● Post-mortems should bring data insights, not data dumps

Deadlines should give you something to celebrate
How
Continuously identify cost structure improvements
● Model the cost structure: CapEx and OpEx
○ How does cost grow as the user base grows?
○ How does cost grow as the background data grows?
○ How does cost grow as engagement grows?
● Enumerate risk factors and dependencies
○ How does cost grow as your cloud provider price sheet changes?

Having an opinion does not equate to knowing
● Create price/performance models using product metrics
● Highly recommended blog: http://perspectives.mvdirona.com/
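
A price/performance model does not need to be elaborate to answer the growth questions above. The following is a minimal sketch under simplifying assumptions: cost is treated as linear in users, stored data, and requests, and every name and price in it (`CostModel`, `fixed_monthly`, the sample values) is hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CostModel:
    fixed_monthly: float   # amortized CapEx plus fixed OpEx (assumed)
    per_user: float        # cost that grows with the user base
    per_gb_stored: float   # cost that grows with background data
    per_request: float     # cost that grows with engagement

    def monthly_cost(self, users: int, stored_gb: float, requests_per_user: float) -> float:
        return (self.fixed_monthly
                + self.per_user * users
                + self.per_gb_stored * stored_gb
                + self.per_request * users * requests_per_user)

    def cost_per_user(self, users: int, stored_gb: float, requests_per_user: float) -> float:
        return self.monthly_cost(users, stored_gb, requests_per_user) / users

    def with_storage_price(self, new_price_per_gb: float) -> "CostModel":
        """Model a change in the provider's price sheet."""
        return replace(self, per_gb_stored=new_price_per_gb)


if __name__ == "__main__":
    model = CostModel(fixed_monthly=1000.0, per_user=0.5, per_gb_stored=0.04, per_request=0.001)
    for users in (1_000, 10_000, 100_000):
        print(users, round(model.cost_per_user(users, stored_gb=10.0 * users, requests_per_user=100), 4))
```

Plotting cost per user against each input, one at a time, is usually enough to show which growth dimension dominates and which supplier price deserves a contingency plan.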
Where
Determine market specific metrics
● Supplier latencies: Compare to SLAs and product needs
● Local user devices: how fast are the devices?
● Supplier monopolies and regulatory environment

Validate new markets against performance metrics
“On the Cloud there are three key elements to performance: Location, location, and location”
- Anthony F. Voellm (my boss)
Selected network latencies
Average query latencies to Google’s BigQuery (in ms)

M-LAB raw data: https://code.google.com/p/m-lab/wiki/PDEChartsNDT
Recognizing good performance work
Telltale signs
● Aligned in spirit with “why” the product exists
● Tied to “what” the product delivers through a latency or throughput metric
● Scope of work contained by “when” it should be available
● Provide structural advantages on “how” the product is built (TCO is a plus)
● Has leftover assets in case of failure

It is harder to spot than you might think
● Data insight is non-obvious by definition
● You might not want to share some decisions
● It might be about someone failing and learning
Non-obvious example - Operations
“I’d like to automate our deployment process”
● Performance metrics might not include bootstrap times
● Potentially disruptive to operations
● Might change how your company negotiates with suppliers

But...
● If it takes 1 DBA for every 65 servers (or 65k apps), how many does it take to run a hosted database service?
● Creative employees do not like manual work - retention
● Not everyone likes to carry pagers - hiring
● If “it takes a village” to run your business you might have to invade one:
○ Microsoft’s 99k+ employees to Google’s 30k+
○ GroupOn’s 10k+ employees to Netflix’s 2k+
Even less obvious consequences
Google operations as structural advantage
● A Warehouse Computer is not a building full of computers
● End to end design: Cooling, power, layout, network, …
● Cost control from start: Capex, repairs, deployments, ...
● Start simple and iterate

A different way to do it
● Tireless study of market and competitors
● Concentrate on software only
● Leverage the market and partners
● Start with a commodity solution and work with OEMs
Non-obvious example - Power costs
“I’d like to use energy prices to guide storage locations”
● Might rent space from a cloud provider. Who cares?
● It likely hurts tail latency
● Migrating data is a pain

But...
● Being able to migrate helps in dealing with supplier pricing
● Scaling out, not up, is a more likely model for Cloud
● It might be a learning opportunity
Power costs look promising...
Important planning metric for Cloud providers
● Determines which servers are available
○ Expensive energy: Newer servers
○ Cheap energy: Older servers if real estate prices and SLAs allow

Google and power management
● Plans using thermal modeling
● Raises the thermostat (to 80°F)
● Manages airflow on the cheap: Big slow fans, cooling towers, drippers, isolate components that run hotter, …
● Actively manages available power and power footprint
● Became an energy trader
… But then the network gets you
The challenge with storage today is access, not volume.
HDD Storage Historical Prices $/GB (chart): from $437K/GB down to $0.04/GB

Storage pricing (GB)
● Commodity HDD: $0.04
● AWS: $0.037 to $0.095
● Google: $0.042 and $0.085

Access Pricing (egress, GB)
● AWS: $0 to $0.12, to “call us”
● Google: $0 to $0.15, to “call us”
○ Charges per operation
● Ingress almost always free
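
A quick calculation shows why access, not volume, dominates. The sketch below uses two prices from the lists above ($0.04 per GB stored and $0.12 per GB of egress) and assumes the storage price is charged per month, which the slide does not state. The function names are hypothetical.

```python
def storage_cost(gb_stored: float, price_per_gb: float) -> float:
    """Cost of keeping the data for one billing period (assumed monthly)."""
    return gb_stored * price_per_gb


def egress_cost(gb_read: float, price_per_gb: float) -> float:
    """Cost of reading data out of the provider."""
    return gb_read * price_per_gb


def break_even_read_fraction(storage_price: float, egress_price: float) -> float:
    """Fraction of the stored data you can read out before egress costs as much as storage."""
    return storage_price / egress_price


if __name__ == "__main__":
    stored_gb = 1000.0
    print("store 1 TB:", storage_cost(stored_gb, 0.04))
    print("read 1 TB once:", egress_cost(stored_gb, 0.12))
    print("break-even read fraction:", round(break_even_read_fraction(0.04, 0.12), 3))
```

Under these assumptions, reading out about a third of the data once already costs as much as storing all of it for the period.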
Notes on large distributed storage
Lessons from Google
● Rare performance problems affect a significant fraction of all requests
● Eliminating all sources of latency variability is impractical
● Tail-tolerant techniques make a predictable whole from less predictable parts

Complicated systems interact in complicated ways
● Global resources (switches and shared file systems)
● Shared resources (locks, cores, memory and net bandwidth)
● Daemons and background tasks
● Maintenance (data reconstruction, log compactions, SSD garbage collection)
● Power limits and management enforced by CPU and rack
● Network latencies
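
The first lesson above follows from simple probability once a request fans out to many servers. The sketch below assumes each server is slow independently with the same probability, which real systems only approximate. The hedging function is a deliberate simplification: it treats a hedged call as slow only when both replicas are slow, whereas real hedged requests usually send the second copy after a delay. All names are hypothetical.

```python
def fraction_slow(p_slow_per_server: float, fan_out: int) -> float:
    """Probability that at least one of fan_out parallel calls is slow."""
    return 1.0 - (1.0 - p_slow_per_server) ** fan_out


def fraction_slow_with_hedge(p_slow_per_server: float, fan_out: int) -> float:
    """Same, when each call goes to two independent replicas and the faster answer wins."""
    p_both_slow = p_slow_per_server ** 2
    return fraction_slow(p_both_slow, fan_out)


if __name__ == "__main__":
    for fan_out in (1, 10, 100):
        plain = fraction_slow(0.01, fan_out)
        hedged = fraction_slow_with_hedge(0.01, fan_out)
        print(f"fan-out {fan_out:>3}: slow {plain:.1%}, with hedge {hedged:.2%}")
```

With a 1-in-100 chance of a slow server and a fan-out of 100, roughly 63% of user requests hit at least one slow server, which is how a rare problem becomes a common one.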
Non-obvious - “unexpected” assets
While chasing power costs
● Learned how to move data around
● Understood tail latencies and throughput (observable and SLAs)
● Learned how to route requests around data moves
● Developed good storage performance

So now you can
● Improve the product by adding high tail latency tolerance features
● Move your data around when necessary
● Scale-out better
● Manage your network traffic around peak demand
BigTable read latencies

Source article: http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext
A note on Order of Growth analysis
Line  Code                                      Cost
1     get a positive integer n from input       Runs once: T1
2     if n > 10                                 Runs once: T2
3         print "This might take a while..."    Runs at most once: T3
4     for i = 1 to n                            Runs n+1 times: T4∙(n+1)
5         for j = 1 to n                        Runs n∙(n+1) times: T5∙n∙(n+1)
6             print i * j                       Runs n∙n times: T6∙n∙n
7     print "Done!"                             Runs once: T7

T1+T2+T3+T4∙(n+1)+T5∙n∙(n+1)+T6∙n∙n+T7 ⇒
(T5+T6)∙n²+(T4+T5)∙n+T1+T2+T3+T4+T7

The algorithm is O(n²), but what about T?
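
The per-line counts for the two loops can be checked by instrumenting the loop tests directly. This is a small sketch, with hypothetical function names, that counts how often lines 4, 5, and 6 of the pseudocode execute, including the final failing test of each loop, and then weights those counts by per-line costs T.

```python
from typing import Dict


def statement_counts(n: int) -> Dict[int, int]:
    """Count executions of pseudocode lines 4 (outer test), 5 (inner test), and 6 (body)."""
    counts = {4: 0, 5: 0, 6: 0}
    i = 1
    while True:
        counts[4] += 1          # outer loop test, including the one that fails
        if i > n:
            break
        j = 1
        while True:
            counts[5] += 1      # inner loop test, including the one that fails
            if j > n:
                break
            counts[6] += 1      # loop body: print i * j
            j += 1
        i += 1
    return counts


def loop_time(n: int, t: Dict[int, float]) -> float:
    """Estimated time of the loops: sum of count times per-line cost T."""
    counts = statement_counts(n)
    return sum(counts[line] * t[line] for line in counts)


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, statement_counts(n))
```

Doubling n roughly quadruples the count for line 6, which is the O(n²) behavior, but the absolute time still depends on T6: printing to a terminal and printing over a slow network link have the same order of growth and very different costs.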
Values of T you learn over time
1 GHz CPU clock (1ns)
1MB TCP tx, 10Gbps card nominal speed (1.1ms)
Photon travels 5,000Km in fiber (5ms)
1MB TCP tx, 1Gbps card nominal speed (5ms)
Human perceived delays (~100ms)
Human eye blink (~350ms)
MySQL 5.5 default timeout (5s)
Alerting troops 200 miles out, Tang dynasty (1h)

Duration  What
4ns       L2 cache reference
17ns      Mutex lock/unlock
82us      RAM copy 1MB (core i7-2600)
2ms       SSD 1MB sequential read
4ms       Disk seek
5ms       HDD 1MB sequential read
5ms       Copy 1 MB over network - same DC
16ms      Copy 1 MB over network - same region
200ms     Download 1 MB (fast US ISP)
2.1s      Download 1 MB (slow US ISP)
1h        Download 1 MB (slow US ISP and slow device)
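
Values of T like these are most useful for back-of-envelope estimates. The sketch below takes the 1 MB figures listed above, assuming the durations pair with the operations in the order shown, and scales them linearly. Linear scaling is itself an assumption that ignores caching, batching, and contention, and the dictionary keys and function name are hypothetical.

```python
# Seconds to move 1 MB, taken from the list above (pairing assumed).
SECONDS_PER_MB = {
    "ram_copy": 82e-6,
    "ssd_sequential_read": 2e-3,
    "hdd_sequential_read": 5e-3,
    "network_same_dc": 5e-3,
    "network_same_region": 16e-3,
    "download_fast_isp": 200e-3,
    "download_slow_isp": 2.1,
}


def estimate_seconds(path: str, megabytes: float) -> float:
    """Rough time to move the given volume over one path, scaled linearly."""
    return SECONDS_PER_MB[path] * megabytes


if __name__ == "__main__":
    for path in SECONDS_PER_MB:
        print(f"{path:>22}: {estimate_seconds(path, 1000):10.2f} s per GB")
```

Even as rough numbers, the ratios make design conversations shorter: a page that ships an extra megabyte to a slow connection costs more than almost anything done on the server.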
Total team effort
To deliver good performance under pressure you need the whole team involved, and it must be their second nature to do so
● Learn or define what the team values
● Reward successful milestones in their currency
● Make performance work easier
○ Control architecture complexity
○ Make data readily available for analysis
○ Make it easy to add more data
○ Reward developing and sharing tools
Know or define the team currency
Companies have currencies
● What justifies a promotion?
● What makes one shine during performance reviews?

People have currencies
● Cash, equity, and stock
● Technical challenges
● Reputation
● Sense of purpose
● Power
Common “rewards”
Currency             Stick                            Carrot
Money                Cut pay, fire                    Money and equity, spot bonuses;
                                                      logarithmic returns after a “sanitizing level”
Technical challenge  Assign menial tasks              Ask for help with structural changes
Reputation           Public shaming                   Public acknowledgement, honest flattery;
                                                      Technical Leader designation;
                                                      as simple as a certificate or small gift card
Sense of purpose     Show apathy towards their work   Show how their work helps the team mission
                                                      and product value proposition
Simple improvements
Instrument your product
● Logs over profilers
● Decide on logging formats and routines early
● Implement distributed correlation IDs
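
The logging advice above can be sketched with the Python standard library. This is one possible shape under stated assumptions, not a prescribed format: the JSON field names, the `perf_sketch` logger name, and the `handle_request` function are hypothetical. Each log line carries a correlation ID, either received from the caller or minted locally, so that lines from different services can be joined later.

```python
import io
import json
import logging
import time
import uuid
from contextvars import ContextVar
from typing import Optional

# Holds the ID of the request currently being handled ("-" when there is none).
correlation_id = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so logs stay easy to parse and query."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "correlation_id": correlation_id.get(),
            "event": record.getMessage(),
            "duration_ms": getattr(record, "duration_ms", None),
        })


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("perf_sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's correlation ID if present, otherwise mint a new one."""
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        start = time.perf_counter()
        time.sleep(0.001)  # placeholder for real work
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("request_done", extra={"duration_ms": round(elapsed_ms, 3)})
    finally:
        correlation_id.reset(token)
    return cid


if __name__ == "__main__":
    buffer = io.StringIO()
    log = make_logger(buffer)
    handle_request(log)
    handle_request(log, "id-from-upstream")
    print(buffer.getvalue(), end="")
```

Passing the same ID to downstream calls, for example in a request header, is what makes the IDs distributed; that part depends on the transport and is left out here.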

Make product milestones work
● Instrument test and deployment tools to collect performance data
● “It only works if it works in production” mentality
○ “Any sufficiently advanced technology is indistinguishable from magic” - Arthur C. Clarke
● Design A/B testing and canary deployments
Final remarks
● There is no standing still
● Learn to measure what matters and let everyone know
○ Avoid the “smart talk trap”, avoid debating abstractions
○ Data insights are the greatest sanitizer
● Make performance a whole team effort
○ Assign owners to all metrics
● Reward structural, quantifiable improvements
● Use the team currency
● Make the work easier by removing blockers
Q&A
● Ivan Santa Maria Filho (ivansmf@google.com)
Delivering Performance Under Schedule and Resource Pressure: Lessons Learned at Google and Microsoft

  • 1. Delivering Performance Under Resource and Schedule Pressure Ivan Santa Maria Filho Google Cloud Performance TL
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/performance-manager-google-microsoft InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon San Francisco www.qconsf.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Disclaimer While I use what I know to do my job, the contents of this presentation reflect only my opinion, not those of my present or past employers.
  • 5. The formal performance cycle
       1. Formulate a hypothesis
       2. Develop a prototype
       3. Validate findings
       4. Integrate improvements
       5. Repeat
       This is the cycle you want
  • 6. The real performance cycle
       ● If the numbers look good: You’re a genius! Done.
       ● If the numbers look bad: The scenarios are wrong
       ● If scenarios are right: The methodology is flawed
       ● If the methodology is proven: Buggy implementation
       ● If the implementation is proven: It is too late to change
       ● If the team leadership decides to hold the release: It’s been so long the scenarios are now outdated
       This is the cycle you start with
  • 7. The nature of performance work There is no standing still. You are either moving forward or falling.
  • 8. The nature of performance work Developers never stop writing code. Make performance improvement the default activity.
  • 9. The nature of performance work Don’t dig a performance hole. If you’re in one then start to get out immediately.
  • 10. Delivering Under Pressure
       Make it a team problem
       Make it the team’s second nature
       “A team is a group of people with a common goal, where every single member is necessary to accomplish that goal, everyone knows their role, and everyone knows each other’s role”
  • 11. Leading to a positive cycle
       Organizational and personal styles
       ● Telling: Directly tell people what to do
       ● Selling: Influence the team or key stakeholders
       ● Participating: Share the decision-making
       ● Delegating: Trust other leaders, but monitor progress
       Scope of influence
       ● Senior leadership, managers, and engineers lead differently
       ● The need to explain the why, what, when, where, and how remains
       Organization maturity
       ● High: Capable and confident
       ● Moderate: Capable but unwilling; Unable but willing
       ● Low: Unable and insecure
  • 12. Why Create product capability and competitive matrices ● Enterprise products succeed because of either money savings or enabling new things ● Consumer products also succeed because they are “cool” ○ Consumers might, over a long time, change the enterprise ● Enumerate all product capabilities ● Keywords ○ Direct: “throughput”, and “latency” ○ Indirect: “fluid”, “natural”, and “amazing”
  • 13. What
       Create product backlog and metrics
       ● Enumerate all known product features
       ● Define how to measure success for each one - need to know when to stop
       Great performance metrics are
       ● Few and memorable
       ● Intentional, purposeful, and consequential at all levels
       ● Measurable and actionable
       ● Targeted at the competition, or raise a barrier to entry for competitors
  • 14. Types of performance metrics
                  Strategic         Tactical                 Foundational
       Type       Competitive       Customer Scenarios       Micro-benchmarks
       Owner      Business owner    Dev manager, director    Engineers
       Example    TPC or YCSB       TestDFSIO                FIO 4KB random reads
       Notable performance metrics
       ● Awesomeness: Boeing 737 passenger throughput
       ● Unintended consequences: FAA departure on time
       ● “It’s complicated”: MS-SQL Server Replication time to resolution
  • 15. Metric collection methodology
       Collection methodology matters
       1. Prepare cluster
       2. Deploy product and background data
       3. Warm-up period
       4. Run benchmark (typically all you see)
       5. Cool-down period
       6. Turndown cluster
       Understand what you want to know
       ● Ignoring bootstrap/turndown on cloud deployments
       ● Ignoring perceived versus actual performance on user interfaces
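One way to keep bootstrap and turndown visible is to time every phase of the run, not only the benchmark itself. The sketch below is a minimal Python illustration under stated assumptions: the phase names mirror the six steps on this slide, and `noop` is a hypothetical placeholder for the real cluster operations, which the slides do not describe.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_phases(phases: List[Tuple[str, Callable[[], None]]]) -> Dict[str, float]:
    """Run each named phase in order and record its wall-clock duration in seconds."""
    durations: Dict[str, float] = {}
    for name, action in phases:
        start = time.perf_counter()
        action()
        durations[name] = time.perf_counter() - start
    return durations


def noop() -> None:
    """Hypothetical placeholder: a real harness would call cluster tooling here."""


# Assumed phase names, taken from the six steps listed on the slide.
PHASES: List[Tuple[str, Callable[[], None]]] = [
    ("prepare_cluster", noop),
    ("deploy_product_and_data", noop),
    ("warm_up", noop),
    ("run_benchmark", noop),
    ("cool_down", noop),
    ("turndown_cluster", noop),
]


def benchmark_share(durations: Dict[str, float]) -> float:
    """Fraction of total elapsed time spent in the benchmark phase itself."""
    total = sum(durations.values())
    return durations["run_benchmark"] / total if total > 0 else 0.0
```

Reporting `benchmark_share` next to the headline number shows how much of the run a benchmark-only measurement leaves out.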
  • 16. Observable performance vs. SLAs
       Service Level Agreements (SLAs)
       ● Contractual obligations between provider and consumer
       ● Clear unit of measurement and collection methodology
       ● Reporting cards
       ● Validation and Remediation mechanisms
       ● Escalation path on violations
       Observable performance
       ● How the product behaved lately - no promises, no guarantees
       ● What most customers take dependencies on
  • 17. When
       Define meaningful deadlines
       ● Tied to “why” the product exists and “what” defines success
       ● The team knows what they want to learn each time
       Deadlines should improve the team maturity
       ● Create the habit of measuring progress against metrics
       ● Checkpoint to refine metrics
       ● Opportunity to refine estimation and learn the team reaction time
       ● Post-mortems should bring data insights, not data dumps
       Deadlines should give you something to celebrate
  • 18. How
       Continuously identify cost structure improvements
       ● Model the cost structure: CapEx and OpEx
         ○ How does cost grow as the user base grows?
         ○ How does cost grow as the background data grows?
         ○ How does cost grow as engagement grows?
       ● Enumerate risk factors and dependencies
         ○ How does cost grow as your cloud provider price sheet changes?
       Having an opinion does not equate to knowing
       ● Create price/performance models using product metrics
       ● Highly recommended blog: http://perspectives.mvdirona.com/
  • 19. Where
       Determine market specific metrics
       ● Supplier latencies: Compare to SLAs and product needs
       ● Local user devices: how fast are the devices?
       ● Supplier monopolies and regulatory environment
       Validate new markets against performance metrics
       “On the Cloud there are three key elements to performance: Location, location, and location” - Anthony F. Voellm (my boss)
  • 20. Selected network latencies Average query latencies to Google’s BigQuery (in ms) M-LAB raw data: https://code.google.com/p/m-lab/wiki/PDEChartsNDT
  • 21. Recognizing good performance work
       Telltale signs
       ● Aligned in spirit with “why” the product exists
       ● Tied to “what” the product delivers through a latency or throughput metric
       ● Scope of work contained by “when” it should be available
       ● Provides structural advantages on “how” the product is built (TCO is a plus)
       ● Has leftover assets in case of failure
       It is harder to spot than you might think
       ● Data insight is non-obvious by definition
       ● You might not want to share some decisions
       ● It might be about someone failing and learning
  • 22. Non-obvious example - Operations
       “I’d like to automate our deployment process”
       ● Performance metrics might not include bootstrap times
       ● Potentially disruptive to operations
       ● Might change how your company negotiates with suppliers
       But...
       ● If it takes 1 DBA for every 65 servers (or 65k apps), how many to run a hosted Database service?
       ● Creative employees do not like manual work - retention
       ● Not everyone likes to carry pagers - hiring
       ● If “it takes a village” to run your business you might have to invade one:
         ○ Microsoft’s 99k+ employees to Google’s 30k+
         ○ Groupon’s 10k+ employees to Netflix’s 2k+
  • 23. Even less obvious consequences
       Google operations as structural advantage
       ● A Warehouse Computer is not a building full of computers
       ● End to end design: Cooling, power, layout, network, …
       ● Cost control from start: Capex, repairs, deployments, ...
       ● Start simple and iterate
       A different way to do it
       ● Tireless study of market and competitors
       ● Concentrate on software only
       ● Leverage the market and partners
       ● Start with a commodity solution and work with OEMs
  • 24. Non-obvious example - Power costs
       “I’d like to use energy prices to guide storage locations”
       ● Might rent space from a cloud provider
       ● Who cares? It likely hurts tail latency
       ● Migrating data is a pain
       But...
       ● Being able to migrate helps deal with supplier pricing
       ● Scaling out, not up, is a more likely model for Cloud
       ● It might be a learning opportunity
  • 25. Power costs look promising...
       Important planning metric for Cloud providers
       ● Determines which servers are available
         ○ Expensive energy: Newer servers
         ○ Cheap energy: Older servers if real estate prices and SLAs allow
       Google and power management
       ● Plans using thermal modeling
       ● Raises the thermostat (to 80°F)
       ● Manages airflow on the cheap: Big slow fans, cooling towers, drippers, isolate components that run hotter, …
       ● Actively manages available power and power footprint
       ● Became an energy trader
  • 26. … But then the network gets you
       The challenge with storage today is access, not volume.
       [Chart: HDD Storage Historical Prices, $/GB; labeled points: $437K/GB and $0.04/GB]
       Storage pricing (GB)
       ● Commodity HDD: $0.04
       ● AWS: $0.037 to $0.095
       ● Google: $0.042 and $0.085
       Access Pricing (egress, GB)
       ● AWS: $0 to $0.12, to “call us”
       ● Google: $0 to $0.15, to “call us”
         ○ Charges per operation
       ● Ingress almost always free
  • 27. Notes on large distributed storage
       Lessons from Google
       ● Rare performance problems affect a significant fraction of all requests
       ● Eliminating all sources of latency variability is impractical
       ● Tail-tolerant techniques make a predictable whole from less predictable parts
       Complicated systems interact in complicated ways
       ● Global resources (switches and shared file systems)
       ● Shared resources (locks, cores, memory and net bandwidth)
       ● Daemons and background tasks
       ● Maintenance (data reconstruction, log compactions, SSD garbage collection)
       ● Power limits and management enforced by CPU and rack
       ● Network latencies
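The first lesson on this slide follows from simple arithmetic once a request fans out to many servers. The Python sketch below is illustrative only and assumes server slowness is independent: if each server is slow with probability p and a request must wait for all N servers, the request is slow with probability 1 - (1 - p)^N. The function name and the example values are assumptions chosen for illustration.

```python
def fraction_of_slow_requests(p_slow_server: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent servers is slow.

    Assumes each server is slow independently with probability `p_slow_server`
    and that the request must wait for every server to answer.
    """
    return 1.0 - (1.0 - p_slow_server) ** fan_out


# With a 1% chance of a slow response per server and a fan-out of 100,
# roughly 63% of requests wait on at least one slow server.
example = fraction_of_slow_requests(0.01, 100)
```

With a fan-out of 1 the same function returns the per-server rate, which is why the problem only shows up at scale.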
  • 28. Non-obvious - “unexpected” assets
       While chasing power costs
       ● Learned how to move data around
       ● Understood tail latencies and throughput (observable and SLAs)
       ● Learned how to route requests around data moves
       ● Developed good storage performance
       So now you can
       ● Improve the product by adding high tail latency tolerance features
       ● Move your data around when necessary
       ● Scale-out better
       ● Manage your network traffic around peak demand
  • 29. BigTable read latencies Published at Source article: http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext
  • 30. A note on Order of Growth analysis
       1  get a positive integer n from input       Runs once: T1
       2  if n > 10                                 Runs once: T2
       3    print "This might take a while..."      Runs maybe once: T3
       4  for i = 1 to n                            Runs n+1 times => T4∙(n+1)
       5    for j = 1 to n                          Runs n+1 times per outer iteration => T5∙(n)∙(n+1)
       6      print i * j                           Runs (n)(n) times => T6∙(n)∙(n)
       7  print "Done!"                             Runs once: T7
       T1+T2+T3+T4∙(n+1)+T5∙(n)∙(n+1)+T6∙(n)∙(n)+T7
       ⇒ T6∙n²+T5∙n²+T5∙n+T4∙n+T1+T2+T3+T4+T7
       ⇒ The algorithm is O(n²), but what about T?
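The execution counts on this slide can be checked by instrumenting the loops directly. The Python sketch below is illustrative, not part of the talk: it rewrites the two `for` loops as explicit tests so that each loop header is counted once per evaluation, including the final failing one, which is where the n+1 terms come from. The function name is hypothetical.

```python
from typing import Dict


def count_line_executions(n: int) -> Dict[int, int]:
    """Count how often pseudocode lines 4, 5 and 6 run for a given n."""
    counts = {4: 0, 5: 0, 6: 0}
    i = 1
    while True:
        counts[4] += 1          # line 4: outer loop test, runs n+1 times
        if i > n:
            break
        j = 1
        while True:
            counts[5] += 1      # line 5: inner loop test, n+1 times per outer pass
            if j > n:
                break
            counts[6] += 1      # line 6: loop body, n*n times in total
            j += 1
        i += 1
    return counts
```

For n = 3 this yields 4, 12 and 9 executions for lines 4, 5 and 6, matching n+1, n(n+1) and n². The counts say nothing about how long each line takes, which is the slide's point about T.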
  • 31. Values of T you learn over time
       Duration   What
       4ns        L2 cache reference
       17ns       Mutex lock/unlock
       82us       RAM copy 1MB (core i7-2600)
       2ms        SSD 1MB sequential read
       4ms        Disk seek
       5ms        HDD 1MB sequential read
       5ms        Copy 1 MB over network - same DC
       16ms       Copy 1 MB over network - same region
       200ms      Download 1 MB (fast US ISP)
       2.1s       Download 1 MB (slow US ISP)
       1h         Download 1 MB (slow US ISP and slow device)
       Reference points
       ● 1 GHz CPU clock (1ns)
       ● 1MB TCP tx, 10Gbps card nominal speed (1.1 ms)
       ● Photon travels 5,000Km in fiber (5ms)
       ● 1MB TCP tx, 1Gbps card nominal speed (5ms)
       ● Human perceived delays (~100ms)
       ● Human eye blink (~350ms)
       ● MySQL 5.5 default timeout (5s)
       ● Alerting troops 200 miles out, Tang dynasty (1h)
  • 32. Total team effort
       To deliver good performance under pressure you need the whole team involved, and it must be their second nature to do so
       ● Learn or define what the team values
       ● Reward successful milestones on their currency
       ● Make performance work easier
         ○ Control architecture complexity
         ○ Make data readily available for analysis
         ○ Make it easy to add more data
         ○ Reward developing and sharing tools
  • 33. Know or define the team currency
       Companies have currencies
       ● What justifies a promotion?
       ● What makes one shine during performance reviews?
       People have currencies
       ● Cash, equity, and stock
       ● Technical challenges
       ● Reputation
       ● Sense of purpose
       ● Power
  • 34. Common “rewards”
       Currency: Money
         Stick: Cut pay, fire
         Carrot: Money and equity, spot bonuses. Logarithmic returns after a “sanitizing level”
       Currency: Technical challenge
         Stick: Assign menial tasks
         Carrot: Ask help with structural changes
       Currency: Reputation
         Stick: Public shaming
         Carrot: Public acknowledgement, honest flattery. Technical Leader designation. As simple as a certificate or small gift card
       Currency: Sense of purpose
         Stick: Show apathy towards their work
         Carrot: Show how their work helps the team mission and product value proposition
  • 35. Simple improvements
       Instrument your product
       ● Logs over profilers
       ● Decide on logging formats and routines early
       ● Implement distributed correlation IDs
       Make product milestones work
       ● Instrument test and deployment tools to collect performance data
       ● “It only works if it works in production” mentality
         ○ “Any sufficiently advanced technology is indistinguishable from magic” - Arthur C. Clarke
       ● Design A/B testing and canary deployments
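A correlation ID is an identifier attached to every log line produced while serving one request, so that lines from different components can be joined later. The sketch below shows one possible way to do this with Python's standard `logging` and `contextvars` modules. The log format, the `handle_request` function, and the choice of a UUID are assumptions for illustration; the slides do not prescribe an implementation.

```python
import contextvars
import logging
import uuid
from typing import Optional, TextIO

# Holds the ID of the request currently being handled ("-" outside any request).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_logger(name: str, stream: Optional[TextIO] = None) -> logging.Logger:
    """Create a logger whose format is fixed up front and includes the ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(correlation_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(CorrelationIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    """Hypothetical request handler: reuse the caller's ID or mint a new one."""
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request started")
        logger.info("request finished")
    finally:
        correlation_id.reset(token)
    return cid
```

In a distributed setting the ID returned by `handle_request` would be forwarded to downstream services, for example in a request header, so that each of them logs under the same value.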
  • 36. Final remarks
       ● There is no standing still
       ● Learn to measure what matters and let everyone know
         ○ Avoid the “smart talk trap”, avoid debating abstractions
         ○ Data insights are the greatest sanitizer
       ● Make performance a whole team effort
         ○ Assign owners to all metrics
       ● Reward structural, quantifiable improvements
       ● Use the team currency
       ● Make the work easier by removing blockers
  • 37. Q&A ● Ivan Santa Maria Filho (ivansmf@google.com)
  • 38. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/performance-manager-google-microsoft