SlideShare uma empresa Scribd logo
1 de 28
Design for recovery Prof. Ian Sommerville
Objectives To discuss the notion of ‘failure’ in software systems To explain why this conventional notion of ‘failure’ is not appropriate for many LSCITS To propose an approach to failure management in LSCITS based on recoverability rather than failure avoidance
Complex IT systems Organisational systems that support different functions within an organisation Can usually be considered as systems of systems, ie different parts are systems in their own right Usually distributed and normally constructed by integrating existing systems/components/services Not subject to limitations derived from the laws of physics (so, no natural constraints on their size) Data intensive, with very long lifetime data An integral part of wider socio-technical systems
What is failure? From a reductionist perspective, a failure can be considered to be ‘a deviation from a specification’. An oracle can examine a specification and observe a system’s behaviour and detect failures. Failure is an absolute - the system has either failed or it hasn’t Of course, some failures are more serious than others; it is widely accepted that failures with minor consequences are to be expected and tolerated
A question to the audience A hospital system is designed to maintain information about available beds for incoming patients and to provide information about the number of beds to the admissions unit.  It is assumed that the hospital has a number of empty beds and this changes over time. The variable B reflects the number of empty beds known to the system. Sometimes the system reports that the number of empty beds is the actual number available; sometimes the system reports that fewer than the actual number are available .  In circumstances where the system reports that an incorrect number of beds are available, is this a failure?
Bed management system The percentage of system users who considered the system’s incorrect reporting of the number of available beds to be a failure was 0%. Mostly, the number did not matter so long as it was greater than 1. What mattered was whether or not patients could be admitted to the hospital. When the hospital was very busy (available beds = 0), then people understood that it was practically impossible for the system to be accurate.   They used other methods to find out whether or not a bed was available for an incoming patient.
Failure is a judgement Specifications are a simplification of reality.  Users don’t read and don’t care about specifications Whether or not system behaviour should be considered to be a failure, depends on the judgement of an observer of that behaviour This judgement depends on: The observer’s expectations The observer’s knowledge and experience The observer’s role The observer’s context or situation The observer’s authority
System failure ‘Failures’ are not just catastrophic events but normal, everyday system behaviour that disrupts normal work and that mean that people have to spend more time on a task than necessary A system failure occurs when a direct or indirect user of a system has to carry out extra work, over and above that normally required to carry out some task, in response to some inappropriate system behaviour This extra work constitutes the cost of recovery from system failure
Failures are inevitable Technical reasons When systems are composed of opaque and uncontrolled components, the behaviour of these components cannot be completely understood Failures often can be considered to be failures in data rather than failures in behaviour Socio-technical reasons Changing contexts of use mean that the judgement on what constitutes a failure changes as the effectiveness of the system in supporting work changes Different stakeholders will interpret the same behaviour in different ways because of different interpretations of ‘the problem’
Conflict inevitability Impossible to establish a set of requirements where stakeholder conflicts are all resolved Therefore, successful operation of a system for one set of stakeholders will inevitably mean ‘failure’ for another set of stakeholders Groups of stakeholders in organisations are often in perennial conflict (e.g. managers and clinicians in a hospital). The support delivered by a system depends on the power held at some time by a stakeholder group.
Where are we? Large-scale information systems are inevitably complex systems Such systems cannot be created using a reductionist approach Failures are a judgement and this may change over time Failures are inevitable and cannot be engineered out of a system
The way forward Software design has to be seen as part of a wider process of LSCITS engineering We need to accept that technical system failures will always occur and examine how we can design these systems to allow the broader socio-technical systems, in which these technical systems are used, to recognise, diagnose and recover from these failures
Software dependability A reductionist approach to software dependability takes the view that software failures are a consequence of software faults Techniques to improve dependability include Fault avoidance Fault detection Fault tolerance These approaches have taken us quite a long way in improving software dependability. However, further progress is unlikely to be achieved by further improvement of these techniques as they rely on a reductionist view of failure.
Failure recovery Recognition Recognise that inappropriate behaviour has occurred Hypothesis Formulate an explanation for the unexpected behaviour Recovery Take steps to compensate for the problem that has arisen
Coping with failure Socio-technical systems are remarkably robust because people are good at coping with unexpected situations when things go wrong. We have the unique ability to apply previous experience from different areas to unseen problems. Individuals can take the initiative, adopt responsibilities and, where necessary, break the rules or step outside the normal process of doing things. People can prioritise and focus on the essence of a problem
Recovering from failure Local knowledge  Who to call; who knows what; where things are Process reconfiguration Doing things in a different way from that defined in the ‘standard’ process Work-arounds, breaking the rules (safe violations) Redundancy and diversity Maintaining copies of information in different forms from that maintained in a software system Informal information annotation Using multiple communication channels Trust Relying on others to cope
Design for recovery The aim of a strategy of design for recovery is to: Ensure that system design decisions do not increase the amount of recovery work required Make system design decisions that make it easier to recover from problems (i.e. reduce extra work required) Earlier recognition of problems Visibility to make hypotheses easier to formulate Flexibility to support recovery actions Designing for recovery is an holistic approach to system design and not (just) the identification of ‘recovery requirements’ Should support the natural ability of people and organisations to cope with problems
Problems	 Security and recoverability Automation hiding Process tyranny Multi-organisational systems
Security and recoverability There is an inherent tension between security and recoverability Recoverability Relies on trusting operators of the system not to abuse privileges that they may have been granted to help recover from problems Security Relies on mistrusting users and restricting access to information on a ‘need to know’ basis
Automation hiding A problem with automation is that information becomes subject to organizational policies that restrict access to that information. Even when access is not restricted, we don’t have any shared culture in how to organise a large information store Generally,  authorisation models maintained by the system are based on normal rather than exceptional operation. When problems arise and/or when people are unavailable, breaking the rules to solve these problems is made more difficult.
Process tyranny Increasingly, there is a notion that ‘standard’ business processes can be defined and embedded in systems that support these processes Implicitly or explicitly, the system enforces the use of the ‘standard’ process But this assumes three things: The standard process is always appropriate The standard process has anticipated all possible failures The system can be respond in a timely way to process changes
Multi-organisational systems Many rules enforced in different ways by different systems. No single manager or owner of the system . Who do you call when failures occur? Information is distributed - users may not be aware of where information is located, who owns information, etc. Processes involve remote actors so process reconfiguration is more difficult Restricted information channels (e.g. help unavailable outside normal business hours; no phone numbers published, etc.) Lack of trust. Owners of components will blame other components for system failure. Learning is inhibited and trust compromised.
Design guidelines Local knowledge Process reconfiguration Redundancy and diversity
Local knowledge Local knowledge includes knowledge of who does what, how authority structures can be bypassed, what rules can be broken, etc. Impossible to replicate entirely in distributed systems but some steps can be taken Maintain information about the provenance of data Who provided the data, where the data came from, when it was created, edited, etc. Maintain organisational models Who is responsible for what, contact details
Process reconfiguration Make workflows explicit rather than embedding them in the software Not just ‘continue’ buttons! Users should know where they are and where they are supposed to go Support workflow navigation/interruption/restart Design systems with an ‘emergency mode’ where the the system changes from enforcing policies to auditing actions This would allow the rules to be broken but the system would maintain a log of what has been done and why so that subsequent investigations could trace what happened Support ‘Help, I’m in trouble!’ as well as ‘Help, I need information?’
Redundancy and diversity Maintaining a single ‘golden copy’ of data may be efficient but it may not be effective or desirable Encourage the creation of ‘shadow systems’ and provide import and export from these systems Allow schemas to be extended Schemas for data are rarely designed for problem solving. Always allow informal extension (a free text box) so that annotations, explanations and additional information can be provided  Maintain organisational models To allow for multi-channel communications when things go wrong
Current research Our current work is concerned with the development of responsibility models that make responsibilities across different organisations explicit These models show who is responsible for what and the resources required to discharge responsibilities They provide a basis for maintaining local knowledge about a situation and discovering who to involve when problems have to be solved The role of trust in recovery and the relationship between trust and responsibility is an issue that we have still to address
Summary A reductionist approach to software engineering is no longer viable. on its own, for complex systems engineering Improving existing software engineering methods will help but will not deal with the problems of complexity that are inherent in distributed systems of systems We must learn to live with normal, everyday failures Design for recovery involves designing so that the work required to recover from a failure is minimised Recovery strategies include supporting information redundancy and annotation and maintaining organisational models

Mais conteúdo relacionado

Mais procurados

7 deadly sins of backup and recovery
7 deadly sins of backup and recovery7 deadly sins of backup and recovery
7 deadly sins of backup and recoverygeekmodeboy
 
Commandalot Ugm Pmac 2010 Act
Commandalot Ugm Pmac 2010 ActCommandalot Ugm Pmac 2010 Act
Commandalot Ugm Pmac 2010 ActAdam Tallinger
 
PGodfrey_IS &_DSS_Term_Paper
PGodfrey_IS &_DSS_Term_PaperPGodfrey_IS &_DSS_Term_Paper
PGodfrey_IS &_DSS_Term_PaperPaul Godfrey
 
EMR Implementation Considerations Slides
EMR Implementation Considerations SlidesEMR Implementation Considerations Slides
EMR Implementation Considerations SlidesPiLNAfrica
 
Expert Systems
Expert SystemsExpert Systems
Expert Systemsosmancikk
 
Dynamic Rule Base Construction and Maintenance Scheme for Disease Prediction
Dynamic Rule Base Construction and Maintenance Scheme for Disease PredictionDynamic Rule Base Construction and Maintenance Scheme for Disease Prediction
Dynamic Rule Base Construction and Maintenance Scheme for Disease Predictionijsrd.com
 
Self healing-systems
Self healing-systemsSelf healing-systems
Self healing-systemsSKORDEMIR
 
MIS 07 Expert Systems
MIS 07  Expert SystemsMIS 07  Expert Systems
MIS 07 Expert SystemsTushar B Kute
 
Application of expert system
Application of expert systemApplication of expert system
Application of expert systemDinkar DP
 
Dependability requirements for LSCITS
Dependability requirements for LSCITSDependability requirements for LSCITS
Dependability requirements for LSCITSIan Sommerville
 
Process Safety Blind Spots: EXPOSED [Infographic]
Process Safety Blind Spots: EXPOSED [Infographic]Process Safety Blind Spots: EXPOSED [Infographic]
Process Safety Blind Spots: EXPOSED [Infographic]Darwin Jayson Mariano
 

Mais procurados (20)

Expert system
Expert systemExpert system
Expert system
 
7 deadly sins of backup and recovery
7 deadly sins of backup and recovery7 deadly sins of backup and recovery
7 deadly sins of backup and recovery
 
6401 Wk6APPDonovanC
6401 Wk6APPDonovanC6401 Wk6APPDonovanC
6401 Wk6APPDonovanC
 
Expert system (mis)
Expert system (mis)Expert system (mis)
Expert system (mis)
 
Commandalot Ugm Pmac 2010 Act
Commandalot Ugm Pmac 2010 ActCommandalot Ugm Pmac 2010 Act
Commandalot Ugm Pmac 2010 Act
 
PGodfrey_IS &_DSS_Term_Paper
PGodfrey_IS &_DSS_Term_PaperPGodfrey_IS &_DSS_Term_Paper
PGodfrey_IS &_DSS_Term_Paper
 
Expert system
Expert systemExpert system
Expert system
 
EMR Implementation Considerations Slides
EMR Implementation Considerations SlidesEMR Implementation Considerations Slides
EMR Implementation Considerations Slides
 
Expert Systems
Expert SystemsExpert Systems
Expert Systems
 
Expert systems
Expert systemsExpert systems
Expert systems
 
Ch2
Ch2Ch2
Ch2
 
Ch2
Ch2Ch2
Ch2
 
Dynamic Rule Base Construction and Maintenance Scheme for Disease Prediction
Dynamic Rule Base Construction and Maintenance Scheme for Disease PredictionDynamic Rule Base Construction and Maintenance Scheme for Disease Prediction
Dynamic Rule Base Construction and Maintenance Scheme for Disease Prediction
 
Self healing-systems
Self healing-systemsSelf healing-systems
Self healing-systems
 
MIS 07 Expert Systems
MIS 07  Expert SystemsMIS 07  Expert Systems
MIS 07 Expert Systems
 
expert systems
expert systemsexpert systems
expert systems
 
Application of expert system
Application of expert systemApplication of expert system
Application of expert system
 
Ch12 safety engineering
Ch12 safety engineeringCh12 safety engineering
Ch12 safety engineering
 
Dependability requirements for LSCITS
Dependability requirements for LSCITSDependability requirements for LSCITS
Dependability requirements for LSCITS
 
Process Safety Blind Spots: EXPOSED [Infographic]
Process Safety Blind Spots: EXPOSED [Infographic]Process Safety Blind Spots: EXPOSED [Infographic]
Process Safety Blind Spots: EXPOSED [Infographic]
 

Destaque

Destaque (8)

L6 LSCITS Engineering
L6 LSCITS EngineeringL6 LSCITS Engineering
L6 LSCITS Engineering
 
L1 Intro To Lscits
L1 Intro To LscitsL1 Intro To Lscits
L1 Intro To Lscits
 
L3 Requirements Eng Overview
L3 Requirements Eng OverviewL3 Requirements Eng Overview
L3 Requirements Eng Overview
 
L2 Socio Tech Systems
L2 Socio Tech SystemsL2 Socio Tech Systems
L2 Socio Tech Systems
 
Emergent properties
Emergent propertiesEmergent properties
Emergent properties
 
Intro to requirements eng.
Intro to requirements eng.Intro to requirements eng.
Intro to requirements eng.
 
Requirements engineering challenges
Requirements engineering challengesRequirements engineering challenges
Requirements engineering challenges
 
Architectural patterns for real-time systems
Architectural patterns for real-time systemsArchitectural patterns for real-time systems
Architectural patterns for real-time systems
 

Semelhante a L7 Design For Recovery

Risk Management: A Holistic Organizational Approach
Risk Management: A Holistic Organizational ApproachRisk Management: A Holistic Organizational Approach
Risk Management: A Holistic Organizational ApproachGraydon McKee
 
Walden UniversityMaster of Science in NursingWeek 2 Journa.docx
Walden UniversityMaster of Science in NursingWeek 2 Journa.docxWalden UniversityMaster of Science in NursingWeek 2 Journa.docx
Walden UniversityMaster of Science in NursingWeek 2 Journa.docxcelenarouzie
 
How Enterprise Architects Can Build Resilient, Reliable Software-Based Health...
How Enterprise Architects Can Build Resilient, Reliable Software-Based Health...How Enterprise Architects Can Build Resilient, Reliable Software-Based Health...
How Enterprise Architects Can Build Resilient, Reliable Software-Based Health...Cognizant
 
Working with Health IT Systems is available under a Creative C.docx
Working with Health IT Systems is available under a Creative C.docxWorking with Health IT Systems is available under a Creative C.docx
Working with Health IT Systems is available under a Creative C.docxhelzerpatrina
 
Designing Complex Systems for Recovery (LSCITS EngD 2011)
Designing Complex Systems for Recovery (LSCITS EngD 2011)Designing Complex Systems for Recovery (LSCITS EngD 2011)
Designing Complex Systems for Recovery (LSCITS EngD 2011)Ian Sommerville
 
Performance improvement through mobile devices
Performance improvement through mobile devicesPerformance improvement through mobile devices
Performance improvement through mobile devicesSawad thotathil
 
6 computer systems
6 computer systems6 computer systems
6 computer systemshccit
 
Total Quality Management Lec.6.pptx
Total Quality Management Lec.6.pptxTotal Quality Management Lec.6.pptx
Total Quality Management Lec.6.pptxAynur27
 
Information GovernanceTitleCourse NameTo.docx
Information GovernanceTitleCourse NameTo.docxInformation GovernanceTitleCourse NameTo.docx
Information GovernanceTitleCourse NameTo.docxdirkrplav
 
Planning and Deploying an Effective Vulnerability Management Program
Planning and Deploying an Effective Vulnerability Management ProgramPlanning and Deploying an Effective Vulnerability Management Program
Planning and Deploying an Effective Vulnerability Management ProgramSasha Nunke
 
Commissioning And Procurement
Commissioning And ProcurementCommissioning And Procurement
Commissioning And Procurementalecfraher
 
Commissioning And Procurement
Commissioning And ProcurementCommissioning And Procurement
Commissioning And Procurementalecfraher
 
Building Information System
Building Information SystemBuilding Information System
Building Information SystemRabia Jabeen
 
Depandability in Software Engineering SE16
Depandability in Software Engineering SE16Depandability in Software Engineering SE16
Depandability in Software Engineering SE16koolkampus
 

Semelhante a L7 Design For Recovery (20)

Risk Management: A Holistic Organizational Approach
Risk Management: A Holistic Organizational ApproachRisk Management: A Holistic Organizational Approach
Risk Management: A Holistic Organizational Approach
 
Walden UniversityMaster of Science in NursingWeek 2 Journa.docx
Walden UniversityMaster of Science in NursingWeek 2 Journa.docxWalden UniversityMaster of Science in NursingWeek 2 Journa.docx
Walden UniversityMaster of Science in NursingWeek 2 Journa.docx
 
How Enterprise Architects Can Build Resilient, Reliable Software-Based Health...
How Enterprise Architects Can Build Resilient, Reliable Software-Based Health...How Enterprise Architects Can Build Resilient, Reliable Software-Based Health...
How Enterprise Architects Can Build Resilient, Reliable Software-Based Health...
 
Expert system design
Expert system designExpert system design
Expert system design
 
Ch10
Ch10Ch10
Ch10
 
Working with Health IT Systems is available under a Creative C.docx
Working with Health IT Systems is available under a Creative C.docxWorking with Health IT Systems is available under a Creative C.docx
Working with Health IT Systems is available under a Creative C.docx
 
4
44
4
 
Designing Complex Systems for Recovery (LSCITS EngD 2011)
Designing Complex Systems for Recovery (LSCITS EngD 2011)Designing Complex Systems for Recovery (LSCITS EngD 2011)
Designing Complex Systems for Recovery (LSCITS EngD 2011)
 
Performance improvement through mobile devices
Performance improvement through mobile devicesPerformance improvement through mobile devices
Performance improvement through mobile devices
 
Expert Systems
Expert SystemsExpert Systems
Expert Systems
 
6 computer systems
6 computer systems6 computer systems
6 computer systems
 
Chap3 RE elicitation
Chap3 RE elicitationChap3 RE elicitation
Chap3 RE elicitation
 
Total Quality Management Lec.6.pptx
Total Quality Management Lec.6.pptxTotal Quality Management Lec.6.pptx
Total Quality Management Lec.6.pptx
 
Information GovernanceTitleCourse NameTo.docx
Information GovernanceTitleCourse NameTo.docxInformation GovernanceTitleCourse NameTo.docx
Information GovernanceTitleCourse NameTo.docx
 
Planning and Deploying an Effective Vulnerability Management Program
Planning and Deploying an Effective Vulnerability Management ProgramPlanning and Deploying an Effective Vulnerability Management Program
Planning and Deploying an Effective Vulnerability Management Program
 
W3 requirements engineering processes
W3   requirements engineering processesW3   requirements engineering processes
W3 requirements engineering processes
 
Commissioning And Procurement
Commissioning And ProcurementCommissioning And Procurement
Commissioning And Procurement
 
Commissioning And Procurement
Commissioning And ProcurementCommissioning And Procurement
Commissioning And Procurement
 
Building Information System
Building Information SystemBuilding Information System
Building Information System
 
Depandability in Software Engineering SE16
Depandability in Software Engineering SE16Depandability in Software Engineering SE16
Depandability in Software Engineering SE16
 

Último

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Último (20)

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

L7 Design For Recovery

  • 1. Design for recovery Prof. Ian Sommerville
  • 2. Objectives To discuss the notion of ‘failure’ in software systems To explain why this conventional notion of ‘failure’ is not appropriate for many LSCITS To propose an approach to failure management in LSCITS based on recoverability rather than failure avoidance
  • 3. Complex IT systems Organisational systems that support different functions within an organisation Can usually be considered as systems of systems, ie different parts are systems in their own right Usually distributed and normally constructed by integrating existing systems/components/services Not subject to limitations derived from the laws of physics (so, no natural constraints on their size) Data intensive, with very long lifetime data An integral part of wider socio-technical systems
  • 4. What is failure? From a reductionist perspective, a failure can be considered to be ‘a deviation from a specification’. An oracle can examine a specification and observe a system’s behaviour and detect failures. Failure is an absolute - the system has either failed or it hasn’t Of course, some failures are more serious than others; it is widely accepted that failures with minor consequences are to be expected and tolerated
  • 5. A question to the audience A hospital system is designed to maintain information about available beds for incoming patients and to provide information about the number of beds to the admissions unit. It is assumed that the hospital has a number of empty beds and this changes over time. The variable B reflects the number of empty beds known to the system. Sometimes the system reports that the number of empty beds is the actual number available; sometimes the system reports that fewer than the actual number are available . In circumstances where the system reports that an incorrect number of beds are available, is this a failure?
  • 6. Bed management system The percentage of system users who considered the system’s incorrect reporting of the number of available beds to be a failure was 0%. Mostly, the number did not matter so long as it was greater than 1. What mattered was whether or not patients could be admitted to the hospital. When the hospital was very busy (available beds = 0), then people understood that it was practically impossible for the system to be accurate. They used other methods to find out whether or not a bed was available for an incoming patient.
  • 7. Failure is a judgement Specifications are a simplification of reality. Users don’t read and don’t care about specifications Whether or not system behaviour should be considered to be a failure, depends on the judgement of an observer of that behaviour This judgement depends on: The observer’s expectations The observer’s knowledge and experience The observer’s role The observer’s context or situation The observer’s authority
  • 8. System failure ‘Failures’ are not just catastrophic events but normal, everyday system behaviour that disrupts normal work and that mean that people have to spend more time on a task than necessary A system failure occurs when a direct or indirect user of a system has to carry out extra work, over and above that normally required to carry out some task, in response to some inappropriate system behaviour This extra work constitutes the cost of recovery from system failure
  • 9. Failures are inevitable Technical reasons When systems are composed of opaque and uncontrolled components, the behaviour of these components cannot be completely understood Failures often can be considered to be failures in data rather than failures in behaviour Socio-technical reasons Changing contexts of use mean that the judgement on what constitutes a failure changes as the effectiveness of the system in supporting work changes Different stakeholders will interpret the same behaviour in different ways because of different interpretations of ‘the problem’
  • 10. Conflict inevitability Impossible to establish a set of requirements where stakeholder conflicts are all resolved Therefore, successful operation of a system for one set of stakeholders will inevitably mean ‘failure’ for another set of stakeholders Groups of stakeholders in organisations are often in perennial conflict (e.g. managers and clinicians in a hospital). The support delivered by a system depends on the power held at some time by a stakeholder group.
  • 11. Where are we? Large-scale information systems are inevitably complex systems Such systems cannot be created using a reductionist approach Failures are a judgement and this may change over time Failures are inevitable and cannot be engineered out of a system
  • 12. The way forward Software design has to be seen as part of a wider process of LSCITS engineering We need to accept that technical system failures will always occur and examine how we can design these systems to allow the broader socio-technical systems, in which these technical systems are used, to recognise, diagnose and recover from these failures
  • 13. Software dependability A reductionist approach to software dependability takes the view that software failures are a consequence of software faults Techniques to improve dependability include Fault avoidance Fault detection Fault tolerance These approaches have taken us quite a long way in improving software dependability. However, further progress is unlikely to be achieved by further improvement of these techniques as they rely on a reductionist view of failure.
  • 14. Failure recovery Recognition Recognise that inappropriate behaviour has occurred Hypothesis Formulate an explanation for the unexpected behaviour Recovery Take steps to compensate for the problem that has arisen
  • 15. Coping with failure Socio-technical systems are remarkably robust because people are good at coping with unexpected situations when things go wrong. We have the unique ability to apply previous experience from different areas to unseen problems. Individuals can take the initiative, adopt responsibilities and, where necessary, break the rules or step outside the normal process of doing things. People can prioritise and focus on the essence of a problem
  • 16. Recovering from failure Local knowledge Who to call; who knows what; where things are Process reconfiguration Doing things in a different way from that defined in the ‘standard’ process Work-arounds, breaking the rules (safe violations) Redundancy and diversity Maintaining copies of information in different forms from that maintained in a software system Informal information annotation Using multiple communication channels Trust Relying on others to cope
  • 17. Design for recovery The aim of a strategy of design for recovery is to: Ensure that system design decisions do not increase the amount of recovery work required Make system design decisions that make it easier to recover from problems (i.e. reduce extra work required) Earlier recognition of problems Visibility to make hypotheses easier to formulate Flexibility to support recovery actions Designing for recovery is an holistic approach to system design and not (just) the identification of ‘recovery requirements’ Should support the natural ability of people and organisations to cope with problems
  • 18. Problems Security and recoverability Automation hiding Process tyranny Multi-organisational systems
  • 19. Security and recoverability There is an inherent tension between security and recoverability Recoverability Relies on trusting operators of the system not to abuse privileges that they may have been granted to help recover from problems Security Relies on mistrusting users and restricting access to information on a ‘need to know’ basis
  • 20. Automation hiding A problem with automation is that information becomes subject to organizational policies that restrict access to that information. Even when access is not restricted, we don’t have any shared culture in how to organise a large information store Generally, authorisation models maintained by the system are based on normal rather than exceptional operation. When problems arise and/or when people are unavailable, breaking the rules to solve these problems is made more difficult.
  • 21. Process tyranny Increasingly, there is a notion that ‘standard’ business processes can be defined and embedded in systems that support these processes Implicitly or explicitly, the system enforces the use of the ‘standard’ process But this assumes three things: The standard process is always appropriate The standard process has anticipated all possible failures The system can be respond in a timely way to process changes
  • 22. Multi-organisational systems Many rules enforced in different ways by different systems. No single manager or owner of the system . Who do you call when failures occur? Information is distributed - users may not be aware of where information is located, who owns information, etc. Processes involve remote actors so process reconfiguration is more difficult Restricted information channels (e.g. help unavailable outside normal business hours; no phone numbers published, etc.) Lack of trust. Owners of components will blame other components for system failure. Learning is inhibited and trust compromised.
  • 23. Design guidelines Local knowledge Process reconfiguration Redundancy and diversity
  • 24. Local knowledge Local knowledge includes knowledge of who does what, how authority structures can be bypassed, what rules can be broken, etc. Impossible to replicate entirely in distributed systems but some steps can be taken Maintain information about the provenance of data Who provided the data, where the data came from, when it was created, edited, etc. Maintain organisational models Who is responsible for what, contact details
  • 25. Process reconfiguration Make workflows explicit rather than embedding them in the software Not just ‘continue’ buttons! Users should know where they are and where they are supposed to go Support workflow navigation/interruption/restart Design systems with an ‘emergency mode’ where the the system changes from enforcing policies to auditing actions This would allow the rules to be broken but the system would maintain a log of what has been done and why so that subsequent investigations could trace what happened Support ‘Help, I’m in trouble!’ as well as ‘Help, I need information?’
  • 26. Redundancy and diversity Maintaining a single ‘golden copy’ of data may be efficient but it may not be effective or desirable Encourage the creation of ‘shadow systems’ and provide import and export from these systems Allow schemas to be extended Schemas for data are rarely designed for problem solving. Always allow informal extension (a free text box) so that annotations, explanations and additional information can be provided Maintain organisational models To allow for multi-channel communications when things go wrong
  • 27. Current research Our current work is concerned with the development of responsibility models that make responsibilities across different organisations explicit These models show who is responsible for what and the resources required to discharge responsibilities They provide a basis for maintaining local knowledge about a situation and discovering who to involve when problems have to be solved The role of trust in recovery and the relationship between trust and responsibility is an issue that we have still to address
  • 28. Summary A reductionist approach to software engineering is no longer viable. on its own, for complex systems engineering Improving existing software engineering methods will help but will not deal with the problems of complexity that are inherent in distributed systems of systems We must learn to live with normal, everyday failures Design for recovery involves designing so that the work required to recover from a failure is minimised Recovery strategies include supporting information redundancy and annotation and maintaining organisational models