Baki Onur Okutucu - Office365 - Service Management
1. Office 365
Service Management
Baki Onur OKUTUCU
Microsoft MVP – Windows IT Pro
Unit Manager | Telco | BilgeAdam
bakionur.com
bakionur@bakionur.com
BakiOnurOkutucu
BakiOnur
@BakiOnur
2. Helping you run Office
365 successfully
Service Management is core to
your success running Office 365
Aware and empowered about
change
Understand the incident lifecycle
and your role
Know the resources and tools
for staying informed
We want to hear from you.
Feedback please!
3. 1. Loss of control
2. Change. Change. Change.
3. How does support work?
4. What’s my role during incidents and outages?
5. Pace of change impact on user readiness
6. Security of my data
7. Stuff breaking when change happens
8. Can I still customize stuff?
9. Keeping all of my infra current with service requirements
10. What’s my role/job/department look like once we move?
9. Respond to customer feedback
through agile development
Deliver new features and value Build trust and compliance
Continuous innovation with confidence and control
Continuous release cadence
Minor & major updates
Up-to-date, no patching
Insights to help manage change
Direct to customer communications | Organizational readiness content
Security comes first
http://www.trust.office365.com/
Evolving standards
Direct feedback
Real-time information
Common support issues
12. Office Mix
Simplified Admin
Center experience
The New
Office
New Partner Admin
Center
Office 365
Adapter
Embedded
Images OWA Policy Tips
Updated Lync mobile
clients
Office 365 SSO with
SAML 2.0 Identity
Providers
Multi-factor
authentication
Service Pack 1 for
Office 365 ProPlus
SAP and Power BI and
Power Query support
Windows Azure Active
Authentication
DirSync Scoping
and Filtering
Exchange Online
Inactive Mailboxes
PDF support for
SharePoint Online
Lync Online Integrated
Reporting
Office Online
real-time co-authoring
OneNote for Mac,
Android, iPhone, and iPad
updates
Office 365
operated by
21Vianet
Admin App for iOS,
Android, and WP
OWA Calendar Search
OneDrive for Business Storage
increase
Power Map for Excel
SharePoint
Newsfeed App
for Windows 8
Lync meeting
scheduling from OWA
Office Mobile
for iPhone &
Android phones
Rights Management
Services
OneNote
for iPad
Exchange Online
Address Book
Policies
Message Center
EXO: 50 GB Mailboxes
Exchange group
naming policy
OWA for iPhone &
OWA for iPad
New SharePoint
Workflows
Simplified Yammer
login
Office Lens
Power Map GA for all
Excel 2013 users
OneDrive for Business
Improvements
90 Day message
trace
OneDrive for Business
Sync for Windows
Lync Online Remote
PowerShell
Lync mobile
client updates
Office 365 Switch Plans
OneNote for
iPhone and
Android
phones
Azure AD
Password Sync
Lync and SharePoint
Service Reporting
Connecting
Skype & Lync
OneDrive for Business apps for
Windows 8 & iOS
People View in
OWA
1 TB for
OneDrive for
Business
Office 365
Developer APIs
S/MIME
Encryption
Office for
iPad + 1.1
update
Project Lite released
July 2013 – June 2014 highlights
13.
14. (planned maintenance outage)
Public Roadmap 1-3 months
Message Center At availability
Public Roadmap Up to 12 months
System Requirements 12+ months
Message Center At 12 months (ongoing)
Message Center 1-12 months
Service Health Dashboard 5 day minimum
Service Update Communications | Source | Timeframe
Functionality updates | Platform updates
16. Running a service brings Microsoft closer to the customer than ever before
Customer engagement
Send-a-smile in-product feedback
Support and community aggregate
customers’ issues
Old
New
17.
18. Message
Center
In-product notification of critical
changes and new features coming to
Office 365
Provides details on changes and
highlights required admin actions
Notification bell drives admin
awareness and action
Targeted communications to specific
tenant as needed
19.
20.
21. • Embrace what change means from waves to ripples
• Visit and bookmark communications channels
• Stay current on functionality and platform changes
• Download the Office 365 Admin app
• Visit the RoadMap, Office 365 Blog and Service Descriptions
• Provide us your feedback on how we can improve
23. Service Health &
Incidents
Our approach to Service Health &
Continuity
Understanding the incident lifecycle
and your role
Tools and resources to keep you
aware and empowered
24. Redundancy
Physical
Data
Functional
Resiliency
Active load balancing
Recovery across “failure
domains” regularly tested
Human intervention
by exception
Automated recovery alerts
24x7 on-call engineer
On-call engineers are core
product group members
Distributed Workloads
Resilient
Most failures contained
to single service
Service component isolation
Complexity avoidance
and graceful degradation
Standardized hardware
Fully automated
deployment
Built-in workload
management mechanisms
Predictability and
Inspectability
Incident avoidance
Deep internal monitoring and
outside-in monitoring
Diagnostics for SI impacting events
27. Your role
Service
operating
normally
During an
incident
After an
incident
Service
operating
normally
Understand what to expect and where to receive communications from
Microsoft in the case of Service Incident
Download Office 365 Admin app for the tenant admin’s favorite mobile
device
28. During an Incident
During an
incident
During an
incident
After an
incident
Service
operating
normally
Check service status on the Service Health Dashboard
Connect with your account team or partner
Engage internal stakeholders, customers and/or partners as appropriate by role
33. Service Health
Dashboard
First and Best Content
Updated Hourly
Emergency Broadcast System will
automatically redirect customers
http://status.office365.com
34. ? Investigating
Monitors have indicated a service
anomaly and/or Microsoft has
received reports of a potential service
incident. Microsoft is currently
investigating.
Service
Interruption
Microsoft has confirmed that
normal services are being impacted.
Microsoft is taking immediate action
to understand the cause of the failure
and determine best course of action
to restore service.
Service
Degradation
Services are still active, but service
responsiveness and/or delivery times
may be slower than usual. Microsoft
is working to restore normal service
responsiveness.
Restoring
Service
Microsoft has isolated the likely cause
of the incident and is in the process of
restoring service
Extended
Recovery
Services are restored and may be
slower than usual
Service Restored
Normal system services have
been restored
i Additional
Information
There is additional
information provided
Normal Service
The service is healthy
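The legend above can be sketched as a small lookup for admins who want to triage dashboard entries in their own tooling. This is a minimal sketch, not the dashboard's actual data model: the status names come from the slide, while `ServiceStatus`, `OPEN_INCIDENT_STATUSES` and `needs_attention` are hypothetical names, and the choice of which statuses count as "open" is an assumption.

```python
from enum import Enum


class ServiceStatus(Enum):
    """Status labels as listed in the Service Health Dashboard legend."""
    INVESTIGATING = "Investigating"
    SERVICE_INTERRUPTION = "Service Interruption"
    SERVICE_DEGRADATION = "Service Degradation"
    RESTORING_SERVICE = "Restoring Service"
    EXTENDED_RECOVERY = "Extended Recovery"
    SERVICE_RESTORED = "Service Restored"
    ADDITIONAL_INFORMATION = "Additional Information"
    NORMAL_SERVICE = "Normal Service"


# Assumption: statuses an admin would treat as "incident still in progress".
OPEN_INCIDENT_STATUSES = {
    ServiceStatus.INVESTIGATING,
    ServiceStatus.SERVICE_INTERRUPTION,
    ServiceStatus.SERVICE_DEGRADATION,
    ServiceStatus.RESTORING_SERVICE,
    ServiceStatus.EXTENDED_RECOVERY,
}


def needs_attention(status_text: str) -> bool:
    """Return True if the legend label maps to an in-progress incident."""
    return ServiceStatus(status_text) in OPEN_INCIDENT_STATUSES
```

For example, `needs_attention("Service Degradation")` is true, while `needs_attention("Normal Service")` is false.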
36. • Provides tenant specific Office 365
service health and maintenance
information on the go
• Available for Windows Phone, iOS
and Android devices
• Partner Admin App Available
Office 365 Admin App
37. Ability to query a tenant
and see service health
results and Message
center information.
• http://blogs.office.com/2014/07/29/new-office-365-admin-tools/
38. After an Incident
After an
incident
During an
incident
After an
incident
Service
operating
normally
Review PIR which is published within 5 business days on the Service Health Dashboard
Review and document internal processes for future incident planning procedures
Give feedback on incident communication and suggestions for improvements
39. Focus is on future protection from
similar issues
Next steps determined
Post Incident Review
within 5 days
Monthly Service Review
within 30 days
Improvement Cadence
Solid next steps
Tracked through delivery
Continuous Learning
1 immediate next
step in PIR
10 additional changes
in comprehensive plan
40. • Service Management is Core to Office 365
• Understand how to be aware and empowered regarding change and
communications
• Understand the incident lifecycle and your role
• Join the Yammer Office 365 Technical Network
• Know and leverage your resources for staying in the know (Message
Center, Service Health Dashboard, Roadmap, Office Blog)
• Stay tuned for more Service Management Excellence Readiness
Content, Resources and Tooling!
• Feedback! Feedback! Feedback! Tell us how we can improve.
43. Support
• TechNet
• O365 Admin Center
Maintenance
Performance
& Monitoring
• Management Pack
• API
44. Thank you…
Baki Onur OKUTUCU
Microsoft MVP – Windows IT Pro
Unit Manager | Telco | BilgeAdam
bakionur.com
bakionur@bakionur.com
BakiOnurOkutucu
BakiOnur
@BakiOnur
Editor's Notes
Service Management excellence is at the core of the Office 365 experience.
Make sure IT Admins and customers understand what SM is and why it's important
Microsoft has a framework for SM and they need to understand their role in the process
Setting the right expectations for running the cloud service and how it differs to on prem
Change
Communications
Service Health
Incidents
Support
Resources for self-empowerment
This session is designed to give you Office 365 Service Management knowledge of, and insight into, change management, service incidents and support. As a cloud service, Office 365 continues to evolve as we develop new innovations in productivity and enhance existing services. Learn about how we deploy and communicate updates in the Office 365 service. We will discuss the Office 365 roadmap disclosure process and service change communications, review the key channels for you to stay ahead of change, and share a bit of what happens behind the scenes in how we develop, track and deploy updates to the service.
I’m here to help you know how to run the service: IT Pro empowerment.
We want to make you successful, we’ve learned a lot in the last two years, we’ve grown, we’re working with other services in the company to constantly improve.
High level – Office 365 means keeping a service up to date with many moving parts (patching, performance, more scale). It is also about features and functionality.
This is also about how we can now listen to your feedback and provide new features and innovations more quickly than in the past. Features and things you want happen rapidly; you no longer have to wait until we ship a new release on a DVD.
That is not done at the cost of security, privacy and trust – mention Trust Center
We want to ensure you have controls; we don’t look at your data.
And it’s about scale: being able to provide a multi-tenant, resilient, elastic, fault-tolerant platform. There are also still plenty of areas you can customize to fit your business needs.
So what is different?
Like on prem, there are things you will most certainly get like world class service, security and infrastructure, however now with Office 365 and the app model you can build on top of the service.
We acknowledge the challenge of moving to the cloud; it can be a little scary. However, O365 is not limiting, it is actually freeing. Things will be different and we’re going to help make you even more productive, for example with Office Graph, Modern Groups and Power BI
It’s highly configurable, you can mix and match services and have the flexibility between cloud and hybrid.
All consistently providing resiliency, reliability and operations and Service Management as core.
So wherever you are at in the journey, whether you’re in the consider phase, deploy phase or run phase, I’m here to focus on service management excellence. How to successfully operate and optimize the service.
The new frontier of service management excellence is at the core of the 365 experience, it is the next stage of the journey
So let’s talk about how we’re looking at SM.
We’ve created this framework to provide a holistic view of how to get the most benefit of the service. Evolving as a service provider.
It’s meant to be a way to bring you on this journey to help understand what SM is.
It’s a way to show you how to work with us.
It’s the guide to show you how to be successful at running, using and adopting the service.
We’re basing our framework on MOF/ITIL and because the customer is not as engaged in IT, we’re providing a framework to be more agile and advanced.
I’m going to focus on two of these areas today.
We’ve already been doing a lot of work around SM but now we can approach it from a holistic view, so let’s dive a bit deeper.
ITIL & MOF have great structure – but the Office 365 approach adds engineering automation to solve the problem.
Ops model – ops team runs service. When they have a problem, they escalate to the engineering. Challenge is the mean time to resolve the problem is quite long. Even worse in a tiered model – each tier has an SLA, has to escalate.
Direct support model – initial triage, then give it direct to the engineering team – but tier 1 still needs to know how to triage and which developer to bring in.
Instead we went with an engineering ops model. Software looks at the monitoring and automatically dispatches to the appropriate team. Ideally, the system self-heals through reboots etc., but if that fails it escalates directly to the right team. In many cases they are the best people to address the issues. They know what code they checked in last week and how it might affect the operation of the service.
There is still a role for ops. They focus on specialized pieces of the system such as network, deployment etc. We also have roles for incident managers – senior engineering managers who have the knowledge and power to resolve issues and who engage with comms managers to communicate appropriate updates etc.
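The engineering ops model described above (monitoring software tries automated recovery first, then escalates straight to the owning engineering team) can be sketched roughly as follows. This is an illustration only: `Alert`, `handle_alert`, the component names and the team names are all hypothetical, and falling back to an incident manager when no owner is registered is an assumption, not something the notes specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Alert:
    component: str  # hypothetical component identifier, e.g. "mailbox-db"
    detail: str


def handle_alert(
    alert: Alert,
    self_heal: Callable[[Alert], bool],
    owners: Dict[str, str],
) -> str:
    """Try automated recovery first; if it fails, escalate directly to the owning team."""
    if self_heal(alert):
        return "self-healed"
    # Assumption: alerts without a registered owner go to the incident manager.
    team = owners.get(alert.component, "incident-manager")
    return f"escalated to {team}"
```

The point of the design is that there are no intermediate tiers: either the system fixes itself, or the alert goes straight to the people who own the code.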
Let’s look at how we approach service updates.
Continuous releases, built on trust, data of usage
Informed, manage the changes, be successful – channels you have available to you
This has changed over the years.
Feedback: IT Pro Network, Blog, etc.
Office 365 is a continuously improving service focused on providing customers new features that delight users and improve the productivity experience; building a trusted service with high resiliency, hardening against attacks, and broader certification and compliance breadth; and responding to customer feedback. All of this is provided with insights to help manage change to the service through customer communications and insight into upcoming changes.
Segue to the “what we’ve delivered” slide: Before we look at new upcoming features we will walk through the process and types of changes in more detail. First, Office 365 has already been operating with many of these principles since it launched nearly two years ago…
Versionless… Minor increments
The greatest tool I want to give you is an understanding of what it’s like to be subscribed to an evergreen service that has moved to continuous releases.
Change coming every day can be somewhat scary. How can you stay on top of it?
In the wave model, we’d throw out a huge amount of changes and often you postponed releases. However with every release you skip, the greater the risk of impact.
With this new service, you’ll always be up to date.
With continuous innovation, there will still be work for your IT team. Changes will be small, and easier to adopt.
We find that with this process, there is increased adoption
Innovation more rapidly, responsive but very different
Office 365, why Change is important
Look at what we’ve delivered in the last year
Without policy and notification
Stay in the know
Help you get the most benefit
Look at all this innovation
This is just a fraction of what we’ve delivered.
This is the way of the future.
Let’s continue on the journey of types of change and how we communicate.
Ongoing new things coming to the service – you as an admin or user experience
Platform updates – a lot of these happening behind the scenes, this will be communicated differently than ongoing updates
System Requirements – this is to help you plan, as there may be things in your environment that need to be maintained
Lastly, unexpected things like outages, incidents, degradation of capability. Our goal is to communicate as quickly as possible to set expectations
How we classify change and how we communicate it:
Functionality updates (end user features): 1. Improvements to existing features and 2. Introducing new services.
Platform changes in the background.
Here’s how we communicate change and the sources of truth.
Future updates: Major and Minor updates, Roadmap – Change has impact for different businesses; you may need to do your own readiness and documentation.
New introductions: at availability, for things like Office for iPad
Disruptive: We want to respect the change that will affect you. These changes will take more preparation for IT, so we give you at least 12 months’ notice to prepare.
Config Changes: Smaller changes, tweaks IT need to make to optimize the service (IP changes). Little prep is needed. These appear in the Message Center, with notification depending on severity.
Infrastructure: On a continuous basis, we’re managing the service (patching servers, security fixes, improving performance, network optimization, etc.): SHD (which I’ll show you in a bit)
Not all changes are created equal… trust that MS is going to keep maintaining and improving the service.
To dive a bit deeper on disruptive change, Across MS, this is our policy.
SharePoint Online Site Collection Upgrades
IE Browser Support
Tags & Notes
In the spirit of listening to you, we want to hear your feedback.
Finding new features being created and implemented due to your feedback.
Responding to you with customer feedback – in product and from our field – you asked for native PDF in OneDrive and we were able to bring that into the product. We encourage you to talk to your account team; we want to hear from you.
Improving usability – Send a smile – this allows us to see what you like about a feature.
Finally we look into other community networks.
We highly recommend joining the IT Pro Yammer Network – through feedback we were also able to improve the password reset within the product. Password policy used to be PowerShell.
More on other ways of communication, we’ve recently launched the public roadmap. This will give you details on what’s being rolled out as well as changes launched, and what’s in development.
High level detail.
Roadmap pitch.
Primary way to provide information on changes is through the message center. Use MC to see in product messaging about changes coming up.
Feature Story Telling. Newspaper you should read on a daily basis. Hey what’s new. MC points to blog posts by our developers. Destination is the blog.
How do you stay current as well? We centralized 4000 system requirements into the Office 365 SD. This is also translated into 40 languages. We’ll provide one year’s notice if any one of these things changes.
We’ve created this framework to help you get the most benefit of the service.
It’s meant to be a way to bring your customers on this journey to help understand what SM is.
It’s a way to show your customers how they have to change to work with us.
It’s the guide to show your customers how to be successful at running, using and adopting the service.
Xbox experience, making you successful, we’ve learned a lot in the last two years, we’ve grown, we’re working with other services in the company to constantly improve.
We’ve architected Office 365 to ensure the service is reliable and predictable, and we have implemented service continuity by design. From the ground up we’ve dedicated our service to be…
Data is Redundant - Physical – we have your data in multiple data centers. We have options to move. Data is copied between all locations in real time, so data loss is not an issue. Functional: Offline clients, web apps – we offer many ways to access the application (OWA vs. web app). Offline clients sync later when back in the office.
Resiliency – We think of this as “fail forward”. We are always innovating in this space. We run load balancing across locations: think Active Active Active! We used to have large incidents (North America); we’re working our fault size down to a smaller level of infrastructure. We learn from every incident to ensure we change the infrastructure to have less impact.
Distributed workloads – We isolated the workloads so there is no domino effect and failures are contained at the smallest level possible. We also have data centers around the world and we’ll connect you to the closest location, which improves performance. Data will still be distributed in region for compliance.
Human intervention – We have engineers, however we’re automating as much as possible. We try to innovate out humans but they are here for support. The same people that write the code are the ones who respond to the incident (via on-call).
Predictability – engineers can look at the system anytime; we are always monitoring and looking for new scenarios. We have a solid but improving monitoring system to avoid incidents.
We’ve moved more from hardware to software and automation and management components. Our system knows when it’s sick and knows how to fix it. An example is Exchange: if it knows it’s unhealthy, it will remove the users, and engineers will correct it with no impact to users.
Redundancy vs. resiliency
When we talk about redundancy, we’re referring to the various layers of infrastructure that can failover to one another if a primary resource drops out. On the other hand, when we talk about resiliency, we mean the ability of Office 365 as a whole to protect service integrity and recover if one or more of its technology components fails. For example, delivering the Outlook Web App requires server and networking hardware, numerous web server farms, messaging databases, and much more. In this example, service resiliency describes the service’s reaction to one or more of these components suddenly failing. Let’s say a primary web server farm suddenly becomes unavailable. What’s the response of the Outlook Web App service in that particular data center in terms of failing over to a backup server farm, redirecting user requests, or restoring messaging data? Answering those questions is the essence of resiliency.
Redundancy is key with respect to delivering high availability. The Office 365 architecture is designed to provide redundancy at every layer:
Physical Level:
Network and hardware redundancy
Facilities and power redundancy
At least 2 datacenters per region
Physical redundancy at disk, NIC, power supply, and server levels.
Data centers located in seismically safe zones
Data Level:
Content is constantly replicated from a primary data center to one or more secondary data centers
Customer data is stored in a redundant environment with robust backup, restoration, and failover capabilities to enable availability, business continuity, and rapid recovery
Functional Level: Online and offline functionality provide continuity in case of:
Cloud disruptions
Network interruptions
The realities of business life (airplane mode)
When we talk about Resiliency we are referring to the ability of Office 365 as a whole to protect service integrity and recover if one or more of its technology components fails.
Active load balancing to restructure the system against rare extreme load conditions
Automated failover to healthy resources in response to:
Hardware or software failures
Monitoring alerts
Human initiated failover to healthy resources in response to:
Service incidents
Customer reported incidents
Recovery across “failure domains” tested regularly
Incidents will happen, now you have to rely on MS. How we respond makes the difference.
Before an incident, you need to understand your role. Also what do we do and what you can expect from us.
We want to make sure you know where to go for information, when and what type of communication you can expect from us in case of an incident.
Download the admin app to your mobile device.
Prepare ahead of time: think about how you’d communicate and plan if there is an incident.
We post data to the Service Health Dashboard as quickly as possible, within 30 minutes of when the incident is identified.
Connect with your account team and partner (if you have those resources – to be your advocates) to learn more about the details
Take the time to inform your users and team on the processes you’ve put in place determined by the incident
When an incident occurs (detected via monitoring or reported by a customer), we put a message in SHD and post updates hourly. When service is restored we put details in the closure summary, and the PIR is posted so you can review it with your stakeholders.
Slide objective: Walk through the incident notification process and highlight milestones. Provide high level view of roles and responsibilities for CM, IM and Support functions.
Provide overview of the Communication flow
Incident Identification
Automated Alerting
Standardized Notifications
Communication and Incident Managers
Ongoing Communication
Primary Vehicle is SHD
Post Incident Wrap-up
Closure Summary Explanation
Timely Post Incident Reviews
Incident occurs
Incident Manager (IM) evaluates and if warranted contacts the Communication Manager (CM)
If major incident then the Office 365 Service Incident Communication Playbooks are followed
Service Incident Comms Manager posts “Investigating” message to Service Health Dashboard (SHD)
CM updates SHD
Determine true scope of the outage
What caused the outage?
What is the fastest and best path to resolution, making sure the recovery option doesn’t encounter the same issue that caused the outage?
IM/CM makes a determination regarding updating severity, scope, and affected infrastructure
Service Incident may be downgraded
SHD updated to indicate severity/scope/infrastructure change.
SHD may indicate extended update interval for standard issues (one last message sent to SHD to indicate scope/infrastructure change)
Only impacted scope/infrastructure will receive additional alerts
5. CM posts timely message indicating Service Restored to both SHD
CM gathers key facts for the Closure Summary
Service Incident Start and End times (with time zone reference)
Nature of the incident, infrastructure affected, and resolution steps
Any “Next Steps” which may need to be completed after system restoration
6. CM Posts a second “Service Restored” message with the Closure Summary elements identified above
7. If applicable, CM Posts the PIR
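The Closure Summary elements listed above (start and end times with a time zone reference, nature of the incident, infrastructure affected, resolution steps, and any next steps) could be captured in a simple record like the one below. This is a sketch under assumptions: the `ClosureSummary` class and its field names are hypothetical and do not describe Microsoft's internal format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class ClosureSummary:
    start: datetime
    end: datetime
    nature: str
    infrastructure_affected: str
    resolution_steps: str
    next_steps: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The notes ask for a time zone reference on both timestamps.
        if self.start.tzinfo is None or self.end.tzinfo is None:
            raise ValueError("start and end must be timezone-aware")
        if self.end < self.start:
            raise ValueError("end must not precede start")

    def duration_minutes(self) -> int:
        """Length of the incident in whole minutes."""
        return int((self.end - self.start).total_seconds() // 60)
```

Rejecting naive timestamps up front keeps the "with time zone reference" requirement from being forgotten when the summary is written in a hurry.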
We realized we need to be timely, targeted, accurate and flexible.
Timely: We’re working on decreasing the time to post information on an incident. The monitoring we have in place now automates “red alerts”, so the most impacting incidents are posted to SHD. The Support team has a “Big Red Button”: if they see an increase in issues, they can post to SHD. The Enhanced PIR process (Post Incident Review): we heard you, and you want more information. So now we’re providing details on the incident within 48 hours vs. 5 days, and it’s more detailed. During an incident we post updates every hour unless otherwise stated.
Targeted: We have an authenticated dashboard (SHD) that is specific to your tenant. We can upload issues to a specific set of customers. We’re also paying attention to users hitting the SHD to determine if we targeted the scope correctly.
Accurate: this is about the detail we put into a form that is consumable, including the PIR (post incident review). We’ve heard you over the last year and we’ve evolved this. We’re trying to provide as much information as we can at the time. We’re always pushing engineering for more details. We run an SHD survey during high-volume incidents to determine the accuracy.
Flexibility: Message Center, Admin App and the API which I’ll go over more
A lot of you asked for the API and wanted to extend it to your own health dashboards.
Direct to you:
A key piece for you is to understand the various communication channels for staying informed of the incident. SHD
Message center, have more notifications on change and maintenance which can affect users that may not be an incident
PIR: post mortem that reflects what we learned and what we went through, and provides detail
Admin app: focused on admin to have service health information on the go
Other channels on a great scale:
Community and Twitter and social networks will be used for larger issues as well as allow us to
EBS- Status (in the unlikely event SHD goes down) this shows you we’re aware of an issue and we’re working on it.
Status, hover over icon, trying to provide work arounds in messaging, 30 day history, legend on the bottom, view history, see planned maintenance.
Dashboard icon upper left for easy nav,
SHD is the best source for you. We do our best to keep it updated and are committed to hourly updates unless we state otherwise.
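If you track incidents in your own runbook, the hourly-update commitment can be checked with a few lines. This is a minimal sketch: `update_overdue` is a hypothetical helper, and the idea that a stated longer interval simply replaces the one-hour default is an assumption about how "unless we state otherwise" would be applied.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_INTERVAL = timedelta(hours=1)  # hourly updates, per the notes


def update_overdue(
    last_update: datetime,
    now: datetime,
    stated_interval: Optional[timedelta] = None,
) -> bool:
    """Return True if more time has passed than the expected update interval."""
    interval = DEFAULT_INTERVAL if stated_interval is None else stated_interval
    return now - last_update > interval
```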
Here is the legend and here’s what the icons mean.
Multiplatform app also available for Partner
We launched this a few months ago. This API allows you to query a tenant and see service health. Through an authenticated view you can pull it into your own platform. There is lots of flexibility, however the tenant admin or partner with rights will have to
Systems SCOM pack is another way to integrate message center posts back into the processes to manage their environment.
So, what’s your role after the incident? We’ll post the closure or PIR document on SHD within 5 days of the incident. Talk to your account or partner team to learn more, and provide feedback to us.
Our number one goal is to prevent the incident from happening again.
Moving forward, ideally PIRs are not necessary because we provide customers enough detail at Service Incident resolution.
Be confident in MS as we’re constantly learning. Each incident goes through a detailed review with engineering. We document all next steps: how to improve monitoring, processes, communications, etc. We also hold a Monthly Service Review (MSR) with leadership where all the PIRs are reviewed and all the due diligence is tracked to make sure all bug fixes, improvements and process changes are executed.
Building Trust – Control.
We have something called Post Incident Reviews (PIR) after every Service Incident, regardless of how small or big the issue is. A PIR includes Root Cause Analysis for Service Incidents and is completed within 5 business days of an incident. PIR for a particular incident is attached in the Service Health Dashboard for the customers that are impacted. A PIR can always be requested through support.
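To plan your own stakeholder review, the "within 5 business days" window can be estimated as below. This is a sketch with stated assumptions: business days are taken as Monday to Friday, public holidays are ignored, and counting starts on the day after the incident. `add_business_days` and `pir_due_date` are hypothetical helpers, not part of any Microsoft tooling.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance `days` business days (Mon-Fri) from `start`, ignoring holidays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def pir_due_date(incident_date: date) -> date:
    """Latest expected PIR date under the 5-business-day commitment."""
    return add_business_days(incident_date, 5)
```

For an incident on Friday 1 August 2014, this gives Friday 8 August 2014.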
Once PIR is completed, reviews are done at executive levels to ensure that we improve the service to prevent future similar failures.
Service Reliability has been tough the past couple of months, due to multiple drivers ranging from scale (current service has 10x tenants of BPOS) to platform stability to Upgrades (thousands of customers per day, 7 days a week, for over 7 months: 550,000+ as of 11/1 – across Live@edu to Office 365, pre-upgrade to wave15, and FOPE to EOP) to faster release cycles (monthly release cadence has unearthed some regressions).
Reducing the scope and time of failure. Aggressively focused on reducing failure domains and time to restore the service via fault tolerance and better detection.
Changes to the SI communication process
Based on observations and experience as well as feedback from customers, there are some changes being made in the SI process for the O365 multi-tenant service. The following is a summary of these changes, which are aimed at improving the customer experience through clear expectations and delivery against committed SLA’s.
Creation of an enhanced process for SI’s with high call volumes
> For any SI that has more than a specified volume of Support tickets associated to it, we plan to add several enhancements to the SI process
o The specified volume is >100 Support tickets for Exchange, and >50 Support tickets for all other workloads
* There were 4 incidents that would have qualified for this in August, and 7 in September
* This would exclude any long-running/multi-day issues where the calls trickled in over a period of days
o We will deliver an “executive awareness” update (example)
o We will deliver an “enhanced” PIR that includes more robust Summary and Next Steps sections, greater precision on root cause, and an added Timeline of Events (example)
o We will attempt to post the initial PIR within 48 hours of the SI, with the final PIR to be posted within the normal 5 business days (replacing the initial PIR)
o PMG will be leading field bridges during or upon resolution of the incident (depending on how severe / long-running the issue is), as well as a PIR pre-review call prior to the initial posting of the PIR
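The ticket-volume rule above can be sketched in a few lines of Python. This is only an illustration: the function and constant names are hypothetical, and treating the long-running exclusion as part of the rule (rather than only part of the August/September counts) is an assumption.

```python
# Thresholds come from the notes above; every name here is illustrative only.
EXCHANGE_TICKET_THRESHOLD = 100  # "> 100 Support tickets for Exchange"
DEFAULT_TICKET_THRESHOLD = 50    # "> 50 Support tickets for all other workloads"


def qualifies_for_enhanced_process(workload: str,
                                   ticket_count: int,
                                   trickled_over_days: bool = False) -> bool:
    """Return True if an SI would get the enhanced high-call-volume process.

    Assumption: long-running/multi-day issues whose calls trickled in over
    several days are excluded from the rule itself.
    """
    if trickled_over_days:
        return False
    if workload.strip().lower() == "exchange":
        threshold = EXCHANGE_TICKET_THRESHOLD
    else:
        threshold = DEFAULT_TICKET_THRESHOLD
    # The rule is strictly "more than" the threshold.
    return ticket_count > threshold
```

For example, an Exchange SI with 101 tickets would qualify under this reading, while one with exactly 100 would not.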
Change in PIR process for SI’s with no associated Support cases
> For any SI that has *no* known Support cases associated to it at the time of SI resolution, and where the SI was not initiated by Support, we plan to make the Closure Summary the final update on the SI
o At the end of the Closure Summary we will state “Upon analysis of the incident, impact was determined to be minimal and no known support requests were associated with this event. Next steps have been identified and will be remediated appropriately to ensure the issue does not reoccur. Please consider this Closure Summary the final update on the event.”
o As a number of our monitoring/Ops-raised SI’s do not have any Support cases associated to them (23 in August and 17 in September), this should free up some time across the teams
o This change will be implemented for any SI that occurs on or after Monday 10/14
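As a rough sketch, the decision above (Closure Summary as the final update) could be expressed as follows. The names are hypothetical, and the year 2013 for "Monday 10/14" is an assumption inferred from the other dates in these notes.

```python
from datetime import date

# Assumption: "Monday 10/14" refers to 14 October 2013.
EFFECTIVE_DATE = date(2013, 10, 14)


def closure_summary_is_final(support_case_count: int,
                             initiated_by_support: bool,
                             occurred_on: date) -> bool:
    """Return True if the Closure Summary is the final update (no PIR).

    Conditions from the notes: no known Support cases at resolution, the SI
    was not initiated by Support, and the SI occurred on or after the
    effective date.
    """
    return (support_case_count == 0
            and not initiated_by_support
            and occurred_on >= EFFECTIVE_DATE)
```

An SI raised by monitoring on 15 October with no Support cases would end at the Closure Summary, whereas the same SI with a single Support case would still follow the normal PIR process.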
Criteria for a PIR with no associated SI
> Occasionally we get requests for PIRs for incidents where there was no posting on the SHD. If the request comes in from 5 or more customers, we have agreement from the workloads to assume this was a “missed” SI, and to post a resolved status and PIR after the fact. The criteria to deliver such a PIR are:
o Requests from 5 or more customers for the same incident
o Request within 5 business days of the incident
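The two criteria above can be checked with a small sketch like the one below. It rests on several assumptions: each entry in `request_dates` represents a distinct customer, only requests made within 5 business days count toward the total of 5, and public holidays are ignored. All names are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable

MIN_CUSTOMERS = 5          # "Requests from 5 or more customers"
MAX_BUSINESS_DAYS = 5      # "Request within 5 business days of the incident"


def business_days_between(start: date, end: date) -> int:
    """Count weekdays after `start` up to and including `end` (no holidays)."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days += 1
    return days


def qualifies_for_after_the_fact_pir(incident_date: date,
                                     request_dates: Iterable[date]) -> bool:
    """Return True if enough timely customer requests exist for a missed SI.

    Assumption: one entry in `request_dates` per distinct customer.
    """
    timely = [
        d for d in request_dates
        if d >= incident_date
        and business_days_between(incident_date, d) <= MAX_BUSINESS_DAYS
    ]
    return len(timely) >= MIN_CUSTOMERS
```

With an incident on a Monday, requests arriving up to and including the following Monday fall inside the 5-business-day window under this counting.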
Improvements in postings for long-running/multiple-day issues
> While we look to provide meaningful updates in every post on the SHD, long-running incidents can be frustrating to our customers, as the content in the communication can become repetitive while full resolution occurs (for example, mail queues clearing). For issues that span beyond 48 hours from SI initiation, the workloads have agreed to provide updated, relevant information that we can include in the post. An example of the type of content we will be looking to improve is:
“Microsoft is continuing the restoration process for customers who are experiencing delays when attempting to upgrade site collections and personal sites (MySites) from SharePoint 2010 to the latest version. Investigation determined that a programming error caused delays to the upgrade process and a fix has been deployed across the environment to mitigate end-user impact. SharePoint Online engineers are monitoring service health while the upgrade request queue is processed. The next update will be provided by September 18th, 2013 at 7:00 PM UTC.”
We will look to drive the threshold down from 48 hours as we get better optics into the service and recovery ETAs.
We will continue to request feedback from customers via SHD Closure Summaries through the rest of FY14 in an attempt to continually refine this process. Please reach out to Katy Olmstead (katyo) with
learn how to get the most out of Office 365, quickly get your team onboard, and drive adoption
So let’s dive into the areas I just outlined. The objective here is to summarize, drive awareness, and remind you what we have in place TODAY, however
You have likely heard about most, if not all, of these. This is a reminder of the resources we’re providing so that you are better equipped overall.
>run through topics, call out awesome work by Katy, Andy, and the support teams, also mention many of these things will be covered in deep dives throughout TechReady<
Management Pack & API - Official announcement goes out tomorrow, so stay tuned.