DevOps, a pillar of Microsoft's strategic transformation.
Sam Guckenheimer, Corporate Product Strategist, Visual Studio and Cloud Services
Munil Shah, Partner Director Software Engineering, Visual Studio Team Services
13. One code base with multiple delivery streams
Single master branch, multiple release branches
Shared abstraction layer
Update 1
Update 2
Update N
Visual Studio Team Services
Team Foundation Server
15. Tests should be written at the lowest level possible
Write once, run anywhere - includes production system
Product is designed for testability
Test code is product code, only reliable tests survive
Testing infrastructure is a shared service
L0: Run against raw drop. Only access binaries. Run in the build. Must be fast and reliable.
L1: Run against raw drop, and can hit resources on machine (e.g. SQL test)
L2: Run against “special” deployed product. We analyzed sources of timing issues and eliminated them.
L3: Run against production
18. … and proactively reach out to customers with low availability
Found one of the top customers with low availability. Proactively reached out and resolved their issue.
22. Existing experience Baseline:
36% conversion to project
Hypothesis: Can simplify path to “magic moment”
After 3 experiments, with different treatments:
50% to 100% customers
conversion to project (+18%)
23. All code is deployed, but feature flags control exposure
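A minimal sketch of the idea that code ships everywhere while a flag decides who sees it. The flag names, the `FLAGS` table, and the percentage-based bucketing are assumptions made for illustration; the slides do not describe how the real flag service is implemented.

```python
import hashlib

# Hypothetical flag table: flag name -> percentage of accounts exposed.
FLAGS = {"new_project_wizard": 50, "dark_launch_search": 0}


def bucket(flag, account_id):
    """Map an account to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{account_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, account_id):
    """True if the already-deployed code path is exposed to this account."""
    return bucket(flag, account_id) < FLAGS.get(flag, 0)


def create_project(account_id):
    # Both code paths are deployed; the flag only controls exposure.
    if is_enabled("new_project_wizard", account_id):
        return "new wizard"
    return "classic flow"
```

Hashing the flag name together with the account keeps each account's experience stable across requests, and raising the percentage widens exposure without a new deployment.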
38. Redundant alerts for the same issue
Needed to set right thresholds and tune often
Stateless alerts contributed to further noise
Every alert must be actionable and represent a real issue with the system
Alerts should create a sense of urgency – false alerts dilute that
Consolidate alerts so that only actionable alerts are sent to team
Auto-route according to a health model based on suspected root cause
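The consolidation and routing points above can be sketched as grouping raw alerts by suspected root cause and then looking up an owner. Everything named here (`HEALTH_MODEL`, the team names, the alert fields, the sample alerts) is hypothetical; it illustrates the shape of the approach, not the actual alerting system.

```python
from collections import defaultdict

# Hypothetical health model: suspected root-cause component -> owning team.
HEALTH_MODEL = {"sql": "data-team", "build-agent": "build-team"}


def consolidate(alerts):
    """Group raw alerts by suspected root cause, one incident per cause."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["suspect"]].append(alert["message"])
    return dict(incidents)


def route(incidents, default_team="global-dri"):
    """Return (team, suspect, alert count) for each consolidated incident."""
    return [
        (HEALTH_MODEL.get(suspect, default_team), suspect, len(messages))
        for suspect, messages in incidents.items()
    ]


# Illustrative input: three symptoms of one defect plus one unrelated alert.
raw = [
    {"suspect": "sql", "message": "high memory"},
    {"suspect": "sql", "message": "slow queries"},
    {"suspect": "sql", "message": "request timeouts"},
    {"suspect": "cdn", "message": "cache miss spike"},
]
pages = route(consolidate(raw))
```

With this input, four raw alerts become two pages, and the one with no known owner falls back to the default on-call rotation.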
39. Found 3 errors for memory and performance
All 3 errors are related to the same code defect
Eliminated alert noise:
~928 alerts per week to ~22
Reduced DRI escalations by ~56%
APM component mapped to feature team
Auto-dialer engaged Global DRI
40. Double blind test
Full disclosure at or near end
vs.
Share tactics & lessons learned
Continued evolution
https://www.youtube.com/watch?v=i9qf3VdfcjE
49. Live Site Health
Time to Detect
Time to Communicate
Time To Mitigate
Customer Impact
Incident prevention items
Aging live site problems
SLA per customer account
Customer support metrics (SLA, MPI, top drivers)
Engineering
Bug cap per engineer
Aging bugs in important categories
Pass rate & coverage by test level
Velocity
Time to build
Time to self test
Time to deploy
Time to learn (Telemetry pipe)
Usage
Acquisition
Engagement
Dedication
Churn
Feature usage
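As a rough illustration of the Live Site Health timings in the scorecard above, the sketch below derives time to detect, communicate, and mitigate from incident timestamps. Measuring every duration from the start of customer impact is an assumption (teams define these intervals differently), and the function name, field names, and timestamps are invented for the example.

```python
from datetime import datetime, timedelta


def live_site_timings(started, detected, communicated, mitigated):
    """Compute per-incident durations, each measured from impact start."""
    return {
        "time_to_detect": detected - started,
        "time_to_communicate": communicated - started,
        "time_to_mitigate": mitigated - started,
    }


# Illustrative timestamps only; not data from the talk.
start = datetime(2016, 1, 1, 9, 0)
timings = live_site_timings(
    started=start,
    detected=start + timedelta(minutes=4),
    communicated=start + timedelta(minutes=12),
    mitigated=start + timedelta(minutes=45),
)
```

Tracking the three durations separately shows whether an incident was slow because nobody noticed, because nobody told customers, or because the fix itself took time.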
So… we decided that this wasn’t about the process. It was about the mechanics of our code base, and ensuring that our code base supported the way we wanted to work.
Segue: Takes us back to Team Dashboard (example of team autonomy and enterprise alignment)
Kanban Board - Expedite lanes let you handle live site issues
One thing a live site culture requires is that we are all on the same telemetry pipeline.
Azure and services built on Azure
Where is the problem?