SlideShare uma empresa Scribd logo
1 de 48
Baixar para ler offline
Operating Systems Meet Fault Tolerance
Bj¨orn D¨obel
06.08.2013
Bj¨orn D¨obel () OS Resilience 06.08.2013 1 / 58
Introduction
“If there’s more than one possible outcome of a job or task,
and one of those outcome will result in disaster or an
undesirable consequence, then somebody will do it that way.”
(Edward Murphy jr.)
Bj¨orn D¨obel () OS Resilience 06.08.2013 2 / 58
Introduction
Outline
Murphy and the OS: Is it really that bad?
Fault-Tolerant Operating Systems
Minix3
CuriOS
L4ReAnimator
Dealing with Hardware Errors
Transparent replication as an OS service
Bj¨orn D¨obel () OS Resilience 06.08.2013 3 / 58
Introduction
Why Things go Wrong
Programming in C:
This pointer is certainly never going to be NULL!
Bj¨orn D¨obel () OS Resilience 06.08.2013 4 / 58
Introduction
Why Things go Wrong
Programming in C:
This pointer is certainly never going to be NULL!
Layering vs. responsibility:
Of course, someone in the higher layers will already have checked
this return value.
Bj¨orn D¨obel () OS Resilience 06.08.2013 4 / 58
Introduction
Why Things go Wrong
Programming in C:
This pointer is certainly never going to be NULL!
Layering vs. responsibility:
Of course, someone in the higher layers will already have checked
this return value.
Concurrency:
This struct is shared between an IRQ handler and a kernel thread.
But they will never execute in parallel.
Bj¨orn D¨obel () OS Resilience 06.08.2013 4 / 58
Introduction
Why Things go Wrong
Programming in C:
This pointer is certainly never going to be NULL!
Layering vs. responsibility:
Of course, someone in the higher layers will already have checked
this return value.
Concurrency:
This struct is shared between an IRQ handler and a kernel thread.
But they will never execute in parallel.
Hardware interaction:
But the device spec said, this was not allowed to happen!
Bj¨orn D¨obel () OS Resilience 06.08.2013 4 / 58
Introduction
Why Things go Wrong
Programming in C:
This pointer is certainly never going to be NULL!
Layering vs. responsibility:
Of course, someone in the higher layers will already have checked
this return value.
Concurrency:
This struct is shared between an IRQ handler and a kernel thread.
But they will never execute in parallel.
Hardware interaction:
But the device spec said, this was not allowed to happen!
Hypocrisy:
I’m a cool OS hacker. I won’t make mistakes, so I don’t need to
test my code!
Bj¨orn D¨obel () OS Resilience 06.08.2013 4 / 58
Introduction
A Classic Study
A. Chou et al.: An empirical study of operating system errors, SOSP 2001
Automated software error detection (today: http://www.coverity.com)
Target: Linux (1.0 - 2.4)
Where are the errors?
How are they distributed?
How long do they survive?
Du bugs cluster in certain locations?
Bj¨orn D¨obel () OS Resilience 06.08.2013 5 / 58
Introduction
Revalidation of Chou’s Results
N. Palix et al.: Faults in Linux: Ten years later, ASPLOS 2011
10 years of work on tools to decrease error counts - has it worked?
Repeated Chou’s analysis until Linux 2.6.34
Bj¨orn D¨obel () OS Resilience 06.08.2013 6 / 58
Introduction
Linux: Lines of Code
Bj¨orn D¨obel () OS Resilience 06.08.2013 7 / 58
Introduction
Fault Rate per Subdirectory (2001)
Bj¨orn D¨obel () OS Resilience 06.08.2013 8 / 58
Introduction
Fault Rate per Subdirectory (2011)
Bj¨orn D¨obel () OS Resilience 06.08.2013 9 / 58
Introduction
Bug Lifetimes (2011)
Bj¨orn D¨obel () OS Resilience 06.08.2013 10 / 58
Introduction
Bug Distribution
Bj¨orn D¨obel () OS Resilience 06.08.2013 11 / 58
Introduction
Break
Faults are an issue.
Hardware-related stu↵ is worst.
Now what can the OS do about it?
Bj¨orn D¨obel () OS Resilience 06.08.2013 12 / 58
Introduction
Minix3 – A Fault-tolerant OS
Userprocesses
User Processes
Server Processes
Device Processes Disk TTY Net Printer Other
File PM Reinc ... Other
Shell Make User ... Other
Kernel
Kernel Clock Task System Task
Bj¨orn D¨obel () OS Resilience 06.08.2013 13 / 58
Introduction
Minix3: Fault Tolerance
Address Space Isolation
Applications only access private memory
Faults do not spread to other components
User-level OS services
Principle of Least Privilege
Fine-grain control over resource access
e.g., DMA only for specific drivers
Small components
Easy to replace (micro-reboot)
Bj¨orn D¨obel () OS Resilience 06.08.2013 14 / 58
Introduction
Minix3: Fault Detection
Fault model: transient errors caused by software bugs
Fix: Component restart
Reincarnation server monitors components
Program termination (crash)
CPU exception (div by 0)
Heartbeat messages
Users may also indicate that something is wrong
Bj¨orn D¨obel () OS Resilience 06.08.2013 15 / 58
Introduction
Repair
Restarting a component is insu cient:
Applications may depend on restarted component
After restart, component state is lost
Minix3: explicit mechanisms
Reincarnation server signals applications about restart
Applications store state at data store server
In any case: program interaction needed
Restarted app: store/recover state
User apps: recover server connection
Bj¨orn D¨obel () OS Resilience 06.08.2013 16 / 58
Introduction
Break
Minix3 fault tolerance:
Architectural Isolation
Explicit monitoring and notifications
Other approaches:
CuriOS: smart session state handling
L4ReAnimator: semi-transparent restart in a capability-based system
Bj¨orn D¨obel () OS Resilience 06.08.2013 17 / 58
Introduction
CuriOS: Servers and Sessions
State recovery is tricky
Minix3: Data Store for application data
But: applications interact
Servers store session-specific state
Server restart requires potential rollback for every participant
Server
State
Client
A
State
Client
B
StateServer
Client A
Client B
Bj¨orn D¨obel () OS Resilience 06.08.2013 18 / 58
Introduction
CuriOS: Server State Regions
CuriOS kernel (CuiK) manages dedicated session memory: Server State
Regions
SSRs are managed by the kernel and attached to a client-server connection
Server
State
Server
Client A
Client State A
Client B
Client State B
Bj¨orn D¨obel () OS Resilience 06.08.2013 19 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corrupt
B’s session state
Server
State
Server
Client A
Client State A
Client B
Client State B
Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corrupt
B’s session state
Server
State
Server
Client A
Client State A
Client B
Client State B
call()
Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corrupt
B’s session state
Server
State
Server
Client A
Client State A
Client B
Client State B
Client
A
State
call()
Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corrupt
B’s session state
Server
State
Server
Client A
Client State A
Client B
Client State B
reply()
Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corrupt
B’s session state
Server
State
Server
Client A
Client State A
Client B
Client State Bcall()
Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corrupt
B’s session state
Server
State
Server
Client A
Client State A
Client B
Client State B
Client
B
State
call()
Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Protecting Sessions
SSR gets mapped only when a client actually invokes the server
Solves another problem: failure while handling A’s request will never corrupt
B’s session state
Server
State
Server
Client A
Client State A
Client B
Client State B
reply()
Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
Introduction
CuriOS: Transparent Restart
CuriOS is a Single-Address-Space OS:
Every application runs on the same page table (with modified access rights)
OS A A B BShared Mem
OS A A B BB Running
OS A A B BA Running
OS A A B BVirt. Memory
Bj¨orn D¨obel () OS Resilience 06.08.2013 21 / 58
Introduction
Transparent Restart
Single Address Space
Each object has unique address
Identical in all programs
Server := C++ object
Restart
Replace old C++ object with new one
Reuse previous memory location
References in other applications remain valid
OS blocks access during restart
Bj¨orn D¨obel () OS Resilience 06.08.2013 22 / 58
Introduction
L4ReAnimator: Restart on L4Re
L4Re Applications
Loader component: ned
Detects application termination: parent signal
Restart: re-execute Lua init script (or parts of it)
Problem after restart: capabilities
No single component knows everyone owning a capability to an object
Minix3 signals won’t work
Bj¨orn D¨obel () OS Resilience 06.08.2013 23 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
Session
Creation
Capability
Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
(1)create
Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
(2) Mapped
during startup
Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
(3) factory.create()
Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
Session Capability
Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
(4) create
Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Session Creation
Client Server
Loader
(5) use
Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
Introduction
L4Re: Server Crash
Client Server
Loader
Session Capability
(6)exitKernel destroys mem-
ory, server objects
(channels...)
Bj¨orn D¨obel () OS Resilience 06.08.2013 25 / 58
Introduction
L4Re: Server Crash
Client Server
Loader
Session Capability
(7)restart
Bj¨orn D¨obel () OS Resilience 06.08.2013 25 / 58
Introduction
L4Re: Restarted Server
Client Server
Loader
Session Capability
Bj¨orn D¨obel () OS Resilience 06.08.2013 26 / 58
Introduction
L4Re: Restarted Server
Client Server
Loader
Session Capability
use
Bj¨orn D¨obel () OS Resilience 06.08.2013 26 / 58
Introduction
L4Re: Restarted Server
Client Server
Loader
Session Capability
use
Error!
Bj¨orn D¨obel () OS Resilience 06.08.2013 26 / 58
Introduction
L4ReAnimator
Only the application itself can detect that a capability vanished
Kernel raises Capability fault
Application needs to re-obtain the capability: execute capability fault handler
Capfault handler: application-specific
Create new communication channel
Restore session state
Programming model:
Capfault handler provided by server implementor
Handling transparent for application developer
Semi-transparency
Bj¨orn D¨obel () OS Resilience 06.08.2013 27 / 58
Introduction
L4ReAnimator: Cleanup
Some channels have resources attached (e.g., frame bu↵er for graphical
console)
Resource may come from a di↵erent resource (e.g., frame bu↵er from
memory manager)
Resources remain intact (stale) upon crash
Client ends up using old version of the resource
Requires additional app-specific knowledge
Unmap handler
Bj¨orn D¨obel () OS Resilience 06.08.2013 28 / 58
Introduction
Summary
L4ReAnimator
Capfault: Clients detect server restarts lazily
Capfault Handler: application-specific knowledge on how to regain access to
the server
Unmap handler: clean up old resources after restart
All these frameworks only deal with software errors.
What about hardware faults? ! Next session!
Bj¨orn D¨obel () OS Resilience 06.08.2013 29 / 58

Mais conteúdo relacionado

Mais procurados

Kernel Recipes 2014 - Testing Video4Linux Applications and Drivers
Kernel Recipes 2014 - Testing Video4Linux Applications and DriversKernel Recipes 2014 - Testing Video4Linux Applications and Drivers
Kernel Recipes 2014 - Testing Video4Linux Applications and DriversAnne Nicolas
 
Study on Android Emulator
Study on Android EmulatorStudy on Android Emulator
Study on Android EmulatorSamael Wang
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Stefano Di Carlo
 
Bootkits: past, present & future
Bootkits: past, present & futureBootkits: past, present & future
Bootkits: past, present & futureAlex Matrosov
 
Defeating x64: The Evolution of the TDL Rootkit
Defeating x64: The Evolution of the TDL RootkitDefeating x64: The Evolution of the TDL Rootkit
Defeating x64: The Evolution of the TDL RootkitAlex Matrosov
 
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMUSFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMULinaro
 
[Ruxcon] Breaking virtualization by switching the cpu to virtual 8086 mode
[Ruxcon] Breaking virtualization by switching the cpu to virtual 8086 mode[Ruxcon] Breaking virtualization by switching the cpu to virtual 8086 mode
[Ruxcon] Breaking virtualization by switching the cpu to virtual 8086 modeMoabi.com
 
Android Boot Time Optimization
Android Boot Time OptimizationAndroid Boot Time Optimization
Android Boot Time OptimizationKan-Ru Chen
 
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE WarsThreading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE Warspsteinb
 
44CON London - Attacking VxWorks: from Stone Age to Interstellar
44CON London - Attacking VxWorks: from Stone Age to Interstellar44CON London - Attacking VxWorks: from Stone Age to Interstellar
44CON London - Attacking VxWorks: from Stone Age to Interstellar44CON
 
Scalability for All: Unreal Engine* 4 with Intel
Scalability for All: Unreal Engine* 4 with Intel Scalability for All: Unreal Engine* 4 with Intel
Scalability for All: Unreal Engine* 4 with Intel Intel® Software
 
Workshop su Android Kernel Hacking
Workshop su Android Kernel HackingWorkshop su Android Kernel Hacking
Workshop su Android Kernel HackingDeveler S.r.l.
 

Mais procurados (13)

Kernel Recipes 2014 - Testing Video4Linux Applications and Drivers
Kernel Recipes 2014 - Testing Video4Linux Applications and DriversKernel Recipes 2014 - Testing Video4Linux Applications and Drivers
Kernel Recipes 2014 - Testing Video4Linux Applications and Drivers
 
Study on Android Emulator
Study on Android EmulatorStudy on Android Emulator
Study on Android Emulator
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
 
Bootkits: past, present & future
Bootkits: past, present & futureBootkits: past, present & future
Bootkits: past, present & future
 
Defeating x64: The Evolution of the TDL Rootkit
Defeating x64: The Evolution of the TDL RootkitDefeating x64: The Evolution of the TDL Rootkit
Defeating x64: The Evolution of the TDL Rootkit
 
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMUSFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
 
[Ruxcon] Breaking virtualization by switching the cpu to virtual 8086 mode
[Ruxcon] Breaking virtualization by switching the cpu to virtual 8086 mode[Ruxcon] Breaking virtualization by switching the cpu to virtual 8086 mode
[Ruxcon] Breaking virtualization by switching the cpu to virtual 8086 mode
 
Explore Android Internals
Explore Android InternalsExplore Android Internals
Explore Android Internals
 
Android Boot Time Optimization
Android Boot Time OptimizationAndroid Boot Time Optimization
Android Boot Time Optimization
 
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE WarsThreading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
Threading Game Engines: QUAKE 4 & Enemy Territory QUAKE Wars
 
44CON London - Attacking VxWorks: from Stone Age to Interstellar
44CON London - Attacking VxWorks: from Stone Age to Interstellar44CON London - Attacking VxWorks: from Stone Age to Interstellar
44CON London - Attacking VxWorks: from Stone Age to Interstellar
 
Scalability for All: Unreal Engine* 4 with Intel
Scalability for All: Unreal Engine* 4 with Intel Scalability for All: Unreal Engine* 4 with Intel
Scalability for All: Unreal Engine* 4 with Intel
 
Workshop su Android Kernel Hacking
Workshop su Android Kernel HackingWorkshop su Android Kernel Hacking
Workshop su Android Kernel Hacking
 

Destaque

Прикладная Информатика 6 (36) 2011
Прикладная Информатика 6 (36) 2011Прикладная Информатика 6 (36) 2011
Прикладная Информатика 6 (36) 2011Vasily Sartakov
 
Сетевая подсистема в L4Re и Genode
Сетевая подсистема в L4Re и GenodeСетевая подсистема в L4Re и Genode
Сетевая подсистема в L4Re и GenodeVasily Sartakov
 
Защита памяти при помощи NX-bit в среде L4Re
Защита памяти при помощи NX-bit в среде L4ReЗащита памяти при помощи NX-bit в среде L4Re
Защита памяти при помощи NX-bit в среде L4ReVasily Sartakov
 
RnD Collaborations in Asia-Pacific Region
RnD Collaborations in Asia-Pacific RegionRnD Collaborations in Asia-Pacific Region
RnD Collaborations in Asia-Pacific RegionVasily Sartakov
 

Destaque (7)

Прикладная Информатика 6 (36) 2011
Прикладная Информатика 6 (36) 2011Прикладная Информатика 6 (36) 2011
Прикладная Информатика 6 (36) 2011
 
Intro
IntroIntro
Intro
 
Memory, IPC and L4Re
Memory, IPC and L4ReMemory, IPC and L4Re
Memory, IPC and L4Re
 
Genode OS Framework
Genode OS FrameworkGenode OS Framework
Genode OS Framework
 
Сетевая подсистема в L4Re и Genode
Сетевая подсистема в L4Re и GenodeСетевая подсистема в L4Re и Genode
Сетевая подсистема в L4Re и Genode
 
Защита памяти при помощи NX-bit в среде L4Re
Защита памяти при помощи NX-bit в среде L4ReЗащита памяти при помощи NX-bit в среде L4Re
Защита памяти при помощи NX-bit в среде L4Re
 
RnD Collaborations in Asia-Pacific Region
RnD Collaborations in Asia-Pacific RegionRnD Collaborations in Asia-Pacific Region
RnD Collaborations in Asia-Pacific Region
 

Semelhante a Operating Systems Meet Fault Tolerance

VaMoS 2021 - Deep Software Variability: Towards Handling Cross-Layer Configur...
VaMoS 2021 - Deep Software Variability: Towards Handling Cross-Layer Configur...VaMoS 2021 - Deep Software Variability: Towards Handling Cross-Layer Configur...
VaMoS 2021 - Deep Software Variability: Towards Handling Cross-Layer Configur...Luc Lesoil
 
IT241_TEST.BANK مرتب (2).pdf
IT241_TEST.BANK مرتب (2).pdfIT241_TEST.BANK مرتب (2).pdf
IT241_TEST.BANK مرتب (2).pdfSHEHABALYAMANI
 
How to Test Enterprise Java Applications
How to Test Enterprise Java ApplicationsHow to Test Enterprise Java Applications
How to Test Enterprise Java ApplicationsAlex Soto
 
Fault tolerance techniques tsp
Fault tolerance techniques tspFault tolerance techniques tsp
Fault tolerance techniques tspPradeep Kumar TS
 
DDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFrameworkDDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFrameworkbanq jdon
 
Lightning talk: highly scalable databases and the PACELC theorem
Lightning talk: highly scalable databases and the PACELC theoremLightning talk: highly scalable databases and the PACELC theorem
Lightning talk: highly scalable databases and the PACELC theoremVishal Bardoloi
 
Virtual Machines Security Internals: Detection and Exploitation
 Virtual Machines Security Internals: Detection and Exploitation Virtual Machines Security Internals: Detection and Exploitation
Virtual Machines Security Internals: Detection and ExploitationMattia Salvi
 
chaos-engineering-Knolx
chaos-engineering-Knolxchaos-engineering-Knolx
chaos-engineering-KnolxKnoldus Inc.
 
Achieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-KernelsAchieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-KernelsJiannan Ouyang, PhD
 
On technology transfer: experience from the CARP project... and beyond
On technology transfer: experience from the CARP project... and beyondOn technology transfer: experience from the CARP project... and beyond
On technology transfer: experience from the CARP project... and beyonddividiti
 
Make A Shoot ‘Em Up Game with Amethyst Framework
Make A Shoot ‘Em Up Game with Amethyst FrameworkMake A Shoot ‘Em Up Game with Amethyst Framework
Make A Shoot ‘Em Up Game with Amethyst FrameworkYodalee
 
2023 Ivanti December Patch Tuesday
2023 Ivanti December Patch Tuesday2023 Ivanti December Patch Tuesday
2023 Ivanti December Patch TuesdayIvanti
 
Predicting Reassignments of Bug Reports — an Exploratory Investigation
Predicting Reassignments of Bug Reports — an Exploratory InvestigationPredicting Reassignments of Bug Reports — an Exploratory Investigation
Predicting Reassignments of Bug Reports — an Exploratory InvestigationAhmed Lamkanfi
 
Filtering Bug Reports for Fix-Time Analysis
Filtering Bug Reports for Fix-Time AnalysisFiltering Bug Reports for Fix-Time Analysis
Filtering Bug Reports for Fix-Time AnalysisAhmed Lamkanfi
 
Truly non-intrusive OpenStack Cinder backup for mission critical systems
Truly non-intrusive OpenStack Cinder backup for mission critical systemsTruly non-intrusive OpenStack Cinder backup for mission critical systems
Truly non-intrusive OpenStack Cinder backup for mission critical systemsDipak Kumar Singh
 
2023 Ivanti September Patch Tuesday
2023 Ivanti September Patch Tuesday2023 Ivanti September Patch Tuesday
2023 Ivanti September Patch TuesdayIvanti
 

Semelhante a Operating Systems Meet Fault Tolerance (20)

3Requirements.ppt
3Requirements.ppt3Requirements.ppt
3Requirements.ppt
 
VaMoS 2021 - Deep Software Variability: Towards Handling Cross-Layer Configur...
VaMoS 2021 - Deep Software Variability: Towards Handling Cross-Layer Configur...VaMoS 2021 - Deep Software Variability: Towards Handling Cross-Layer Configur...
VaMoS 2021 - Deep Software Variability: Towards Handling Cross-Layer Configur...
 
IT241_TEST.BANK مرتب (2).pdf
IT241_TEST.BANK مرتب (2).pdfIT241_TEST.BANK مرتب (2).pdf
IT241_TEST.BANK مرتب (2).pdf
 
How to Test Enterprise Java Applications
How to Test Enterprise Java ApplicationsHow to Test Enterprise Java Applications
How to Test Enterprise Java Applications
 
Fault tolerance techniques tsp
Fault tolerance techniques tspFault tolerance techniques tsp
Fault tolerance techniques tsp
 
DDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFrameworkDDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFramework
 
Lightning talk: highly scalable databases and the PACELC theorem
Lightning talk: highly scalable databases and the PACELC theoremLightning talk: highly scalable databases and the PACELC theorem
Lightning talk: highly scalable databases and the PACELC theorem
 
2337610
23376102337610
2337610
 
Handout2o
Handout2oHandout2o
Handout2o
 
Virtual Machines Security Internals: Detection and Exploitation
 Virtual Machines Security Internals: Detection and Exploitation Virtual Machines Security Internals: Detection and Exploitation
Virtual Machines Security Internals: Detection and Exploitation
 
chaos-engineering-Knolx
chaos-engineering-Knolxchaos-engineering-Knolx
chaos-engineering-Knolx
 
Achieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-KernelsAchieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-Kernels
 
On technology transfer: experience from the CARP project... and beyond
On technology transfer: experience from the CARP project... and beyondOn technology transfer: experience from the CARP project... and beyond
On technology transfer: experience from the CARP project... and beyond
 
Make A Shoot ‘Em Up Game with Amethyst Framework
Make A Shoot ‘Em Up Game with Amethyst FrameworkMake A Shoot ‘Em Up Game with Amethyst Framework
Make A Shoot ‘Em Up Game with Amethyst Framework
 
2023 Ivanti December Patch Tuesday
2023 Ivanti December Patch Tuesday2023 Ivanti December Patch Tuesday
2023 Ivanti December Patch Tuesday
 
Predicting Reassignments of Bug Reports — an Exploratory Investigation
Predicting Reassignments of Bug Reports — an Exploratory InvestigationPredicting Reassignments of Bug Reports — an Exploratory Investigation
Predicting Reassignments of Bug Reports — an Exploratory Investigation
 
Filtering Bug Reports for Fix-Time Analysis
Filtering Bug Reports for Fix-Time AnalysisFiltering Bug Reports for Fix-Time Analysis
Filtering Bug Reports for Fix-Time Analysis
 
Truly non-intrusive OpenStack Cinder backup for mission critical systems
Truly non-intrusive OpenStack Cinder backup for mission critical systemsTruly non-intrusive OpenStack Cinder backup for mission critical systems
Truly non-intrusive OpenStack Cinder backup for mission critical systems
 
AKS: k8s e azure
AKS: k8s e azureAKS: k8s e azure
AKS: k8s e azure
 
2023 Ivanti September Patch Tuesday
2023 Ivanti September Patch Tuesday2023 Ivanti September Patch Tuesday
2023 Ivanti September Patch Tuesday
 

Mais de Vasily Sartakov

Мейнстрим технологии шифрованной памяти
Мейнстрим технологии шифрованной памятиМейнстрим технологии шифрованной памяти
Мейнстрим технологии шифрованной памятиVasily Sartakov
 
Operating Systems Hardening
Operating Systems HardeningOperating Systems Hardening
Operating Systems HardeningVasily Sartakov
 
Особенности Национального RnD
Особенности Национального RnDОсобенности Национального RnD
Особенности Национального RnDVasily Sartakov
 
Introduction to Microkernels
Introduction to MicrokernelsIntroduction to Microkernels
Introduction to MicrokernelsVasily Sartakov
 
Advanced Components on Top of L4Re
Advanced Components on Top of L4ReAdvanced Components on Top of L4Re
Advanced Components on Top of L4ReVasily Sartakov
 
Применение Fiasco.OC
Применение Fiasco.OCПрименение Fiasco.OC
Применение Fiasco.OCVasily Sartakov
 
Разработка встраиваемой операционной системы на базе микроядерной архитектуры...
Разработка встраиваемой операционной системы на базе микроядерной архитектуры...Разработка встраиваемой операционной системы на базе микроядерной архитектуры...
Разработка встраиваемой операционной системы на базе микроядерной архитектуры...Vasily Sartakov
 
Образование, наука, бизнес. Сегодня, завтра, послезавтра
Образование, наука, бизнес. Сегодня, завтра, послезавтраОбразование, наука, бизнес. Сегодня, завтра, послезавтра
Образование, наука, бизнес. Сегодня, завтра, послезавтраVasily Sartakov
 

Mais de Vasily Sartakov (15)

Мейнстрим технологии шифрованной памяти
Мейнстрим технологии шифрованной памятиМейнстрим технологии шифрованной памяти
Мейнстрим технологии шифрованной памяти
 
Operating Systems Hardening
Operating Systems HardeningOperating Systems Hardening
Operating Systems Hardening
 
Особенности Национального RnD
Особенности Национального RnDОсобенности Национального RnD
Особенности Национального RnD
 
Genode Architecture
Genode ArchitectureGenode Architecture
Genode Architecture
 
Genode Components
Genode ComponentsGenode Components
Genode Components
 
Genode Programming
Genode ProgrammingGenode Programming
Genode Programming
 
Genode Compositions
Genode CompositionsGenode Compositions
Genode Compositions
 
Trusted Computing Base
Trusted Computing BaseTrusted Computing Base
Trusted Computing Base
 
System Integrity
System IntegritySystem Integrity
System Integrity
 
Intro
IntroIntro
Intro
 
Introduction to Microkernels
Introduction to MicrokernelsIntroduction to Microkernels
Introduction to Microkernels
 
Advanced Components on Top of L4Re
Advanced Components on Top of L4ReAdvanced Components on Top of L4Re
Advanced Components on Top of L4Re
 
Применение Fiasco.OC
Применение Fiasco.OCПрименение Fiasco.OC
Применение Fiasco.OC
 
Разработка встраиваемой операционной системы на базе микроядерной архитектуры...
Разработка встраиваемой операционной системы на базе микроядерной архитектуры...Разработка встраиваемой операционной системы на базе микроядерной архитектуры...
Разработка встраиваемой операционной системы на базе микроядерной архитектуры...
 
Образование, наука, бизнес. Сегодня, завтра, послезавтра
Образование, наука, бизнес. Сегодня, завтра, послезавтраОбразование, наука, бизнес. Сегодня, завтра, послезавтра
Образование, наука, бизнес. Сегодня, завтра, послезавтра
 

Último

ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 

Último (20)

ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 

Operating Systems Meet Fault Tolerance

  • 1. Operating Systems Meet Fault Tolerance Bj¨orn D¨obel 06.08.2013 Bj¨orn D¨obel () OS Resilience 06.08.2013 1 / 58
  • 2. Introduction “If there’s more than one possible outcome of a job or task, and one of those outcome will result in disaster or an undesirable consequence, then somebody will do it that way.” (Edward Murphy jr.) Bj¨orn D¨obel () OS Resilience 06.08.2013 2 / 58
  • 3. Introduction Outline Murphy and the OS: Is it really that bad? Fault-Tolerant Operating Systems Minix3 CuriOS L4ReAnimator Dealing with Hardware Errors Transparent replication as an OS service Bj¨orn D¨obel () OS Resilience 06.08.2013 3 / 58
  • 4. Introduction Why Things go Wrong Programming in C: This pointer is certainly never going to be NULL! Bj¨orn D¨obel () OS Resilience 06.08.2013 4 / 58
  • 5. Introduction Why Things go Wrong Programming in C: This pointer is certainly never going to be NULL! Layering vs. responsibility: Of course, someone in the higher layers will already have checked this return value. Bj¨orn D¨obel () OS Resilience 06.08.2013 4 / 58
  • 6. Introduction Why Things go Wrong Programming in C: This pointer is certainly never going to be NULL! Layering vs. responsibility: Of course, someone in the higher layers will already have checked this return value. Concurrency: This struct is shared between an IRQ handler and a kernel thread. But they will never execute in parallel. Bj¨orn D¨obel () OS Resilience 06.08.2013 4 / 58
  • 7. Introduction Why Things go Wrong Programming in C: This pointer is certainly never going to be NULL! Layering vs. responsibility: Of course, someone in the higher layers will already have checked this return value. Concurrency: This struct is shared between an IRQ handler and a kernel thread. But they will never execute in parallel. Hardware interaction: But the device spec said, this was not allowed to happen! Bj¨orn D¨obel () OS Resilience 06.08.2013 4 / 58
  • 8. Introduction Why Things go Wrong Programming in C: This pointer is certainly never going to be NULL! Layering vs. responsibility: Of course, someone in the higher layers will already have checked this return value. Concurrency: This struct is shared between an IRQ handler and a kernel thread. But they will never execute in parallel. Hardware interaction: But the device spec said, this was not allowed to happen! Hypocrisy: I’m a cool OS hacker. I won’t make mistakes, so I don’t need to test my code! Bj¨orn D¨obel () OS Resilience 06.08.2013 4 / 58
  • 9. Introduction A Classic Study A. Chou et al.: An empirical study of operating system errors, SOSP 2001 Automated software error detection (today: http://www.coverity.com) Target: Linux (1.0 - 2.4) Where are the errors? How are they distributed? How long do they survive? Du bugs cluster in certain locations? Bj¨orn D¨obel () OS Resilience 06.08.2013 5 / 58
  • 10. Introduction Revalidation of Chou’s Results N. Palix et al.: Faults in Linux: Ten years later, ASPLOS 2011 10 years of work on tools to decrease error counts - has it worked? Repeated Chou’s analysis until Linux 2.6.34 Bj¨orn D¨obel () OS Resilience 06.08.2013 6 / 58
  • 11. Introduction Linux: Lines of Code Bj¨orn D¨obel () OS Resilience 06.08.2013 7 / 58
  • 12. Introduction Fault Rate per Subdirectory (2001) Bj¨orn D¨obel () OS Resilience 06.08.2013 8 / 58
  • 13. Introduction Fault Rate per Subdirectory (2011) Bj¨orn D¨obel () OS Resilience 06.08.2013 9 / 58
  • 14. Introduction Bug Lifetimes (2011) Bj¨orn D¨obel () OS Resilience 06.08.2013 10 / 58
  • 15. Introduction Bug Distribution Bj¨orn D¨obel () OS Resilience 06.08.2013 11 / 58
  • 16. Introduction Break Faults are an issue. Hardware-related stu↵ is worst. Now what can the OS do about it? Bj¨orn D¨obel () OS Resilience 06.08.2013 12 / 58
  • 17. Introduction Minix3 – A Fault-tolerant OS Userprocesses User Processes Server Processes Device Processes Disk TTY Net Printer Other File PM Reinc ... Other Shell Make User ... Other Kernel Kernel Clock Task System Task Bj¨orn D¨obel () OS Resilience 06.08.2013 13 / 58
  • 18. Introduction Minix3: Fault Tolerance Address Space Isolation Applications only access private memory Faults do not spread to other components User-level OS services Principle of Least Privilege Fine-grain control over resource access e.g., DMA only for specific drivers Small components Easy to replace (micro-reboot) Bj¨orn D¨obel () OS Resilience 06.08.2013 14 / 58
  • 19. Introduction Minix3: Fault Detection Fault model: transient errors caused by software bugs Fix: Component restart Reincarnation server monitors components Program termination (crash) CPU exception (div by 0) Heartbeat messages Users may also indicate that something is wrong Bj¨orn D¨obel () OS Resilience 06.08.2013 15 / 58
  • 20. Introduction Repair Restarting a component is insu cient: Applications may depend on restarted component After restart, component state is lost Minix3: explicit mechanisms Reincarnation server signals applications about restart Applications store state at data store server In any case: program interaction needed Restarted app: store/recover state User apps: recover server connection Bj¨orn D¨obel () OS Resilience 06.08.2013 16 / 58
  • 21. Introduction Break Minix3 fault tolerance: Architectural Isolation Explicit monitoring and notifications Other approaches: CuriOS: smart session state handling L4ReAnimator: semi-transparent restart in a capability-based system Bj¨orn D¨obel () OS Resilience 06.08.2013 17 / 58
  • 22. Introduction CuriOS: Servers and Sessions State recovery is tricky Minix3: Data Store for application data But: applications interact Servers store session-specific state Server restart requires potential rollback for every participant Server State Client A State Client B StateServer Client A Client B Bj¨orn D¨obel () OS Resilience 06.08.2013 18 / 58
  • 23. Introduction CuriOS: Server State Regions CuriOS kernel (CuiK) manages dedicated session memory: Server State Regions SSRs are managed by the kernel and attached to a client-server connection Server State Server Client A Client State A Client B Client State B Bj¨orn D¨obel () OS Resilience 06.08.2013 19 / 58
  • 24. Introduction CuriOS: Protecting Sessions SSR gets mapped only when a client actually invokes the server Solves another problem: failure while handling A’s request will never corrupt B’s session state Server State Server Client A Client State A Client B Client State B Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
  • 25. Introduction CuriOS: Protecting Sessions SSR gets mapped only when a client actually invokes the server Solves another problem: failure while handling A’s request will never corrupt B’s session state Server State Server Client A Client State A Client B Client State B call() Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
  • 26. Introduction CuriOS: Protecting Sessions SSR gets mapped only when a client actually invokes the server Solves another problem: failure while handling A’s request will never corrupt B’s session state Server State Server Client A Client State A Client B Client State B Client A State call() Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
  • 27. Introduction CuriOS: Protecting Sessions SSR gets mapped only when a client actually invokes the server Solves another problem: failure while handling A’s request will never corrupt B’s session state Server State Server Client A Client State A Client B Client State B reply() Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
  • 28. Introduction CuriOS: Protecting Sessions SSR gets mapped only when a client actually invokes the server Solves another problem: failure while handling A’s request will never corrupt B’s session state Server State Server Client A Client State A Client B Client State Bcall() Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
  • 29. Introduction CuriOS: Protecting Sessions SSR gets mapped only when a client actually invokes the server Solves another problem: failure while handling A’s request will never corrupt B’s session state Server State Server Client A Client State A Client B Client State B Client B State call() Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
  • 30. Introduction CuriOS: Protecting Sessions SSR gets mapped only when a client actually invokes the server Solves another problem: failure while handling A’s request will never corrupt B’s session state Server State Server Client A Client State A Client B Client State B reply() Bj¨orn D¨obel () OS Resilience 06.08.2013 20 / 58
  • 31. Introduction CuriOS: Transparent Restart CuriOS is a Single-Address-Space OS: Every application runs on the same page table (with modified access rights) OS A A B BShared Mem OS A A B BB Running OS A A B BA Running OS A A B BVirt. Memory Bj¨orn D¨obel () OS Resilience 06.08.2013 21 / 58
  • 32. Introduction Transparent Restart Single Address Space Each object has unique address Identical in all programs Server := C++ object Restart Replace old C++ object with new one Reuse previous memory location References in other applications remain valid OS blocks access during restart Bj¨orn D¨obel () OS Resilience 06.08.2013 22 / 58
  • 33. Introduction L4ReAnimator: Restart on L4Re L4Re Applications Loader component: ned Detects application termination: parent signal Restart: re-execute Lua init script (or parts of it) Problem after restart: capabilities No single component knows everyone owning a capability to an object Minix3 signals won’t work Bj¨orn D¨obel () OS Resilience 06.08.2013 23 / 58
  • 34. Introduction L4Re: Session Creation Client Server Loader Session Creation Capability Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
  • 35. Introduction L4Re: Session Creation Client Server Loader (1)create Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
  • 36. Introduction L4Re: Session Creation Client Server Loader (2) Mapped during startup Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
  • 37. Introduction L4Re: Session Creation Client Server Loader (3) factory.create() Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
  • 38. Introduction L4Re: Session Creation Client Server Loader Session Capability Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
  • 39. Introduction L4Re: Session Creation Client Server Loader (4) create Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
  • 40. Introduction L4Re: Session Creation Client Server Loader (5) use Bj¨orn D¨obel () OS Resilience 06.08.2013 24 / 58
  • 41. Introduction L4Re: Server Crash Client Server Loader Session Capability (6)exitKernel destroys mem- ory, server objects (channels...) Bj¨orn D¨obel () OS Resilience 06.08.2013 25 / 58
  • 42. Introduction L4Re: Server Crash Client Server Loader Session Capability (7)restart Bj¨orn D¨obel () OS Resilience 06.08.2013 25 / 58
  • 43. Introduction L4Re: Restarted Server Client Server Loader Session Capability Bj¨orn D¨obel () OS Resilience 06.08.2013 26 / 58
  • 44. Introduction L4Re: Restarted Server Client Server Loader Session Capability use Bj¨orn D¨obel () OS Resilience 06.08.2013 26 / 58
  • 45. Introduction L4Re: Restarted Server Client Server Loader Session Capability use Error! Bj¨orn D¨obel () OS Resilience 06.08.2013 26 / 58
  • 46. Introduction L4ReAnimator Only the application itself can detect that a capability vanished Kernel raises Capability fault Application needs to re-obtain the capability: execute capability fault handler Capfault handler: application-specific Create new communication channel Restore session state Programming model: Capfault handler provided by server implementor Handling transparent for application developer Semi-transparency Bj¨orn D¨obel () OS Resilience 06.08.2013 27 / 58
  • 47. Introduction L4ReAnimator: Cleanup Some channels have resources attached (e.g., frame bu↵er for graphical console) Resource may come from a di↵erent resource (e.g., frame bu↵er from memory manager) Resources remain intact (stale) upon crash Client ends up using old version of the resource Requires additional app-specific knowledge Unmap handler Bj¨orn D¨obel () OS Resilience 06.08.2013 28 / 58
  • 48. Introduction Summary L4ReAnimator Capfault: Clients detect server restarts lazily Capfault Handler: application-specific knowledge on how to regain access to the server Unmap handler: clean up old resources after restart All these frameworks only deal with software errors. What about hardware faults? ! Next session! Bj¨orn D¨obel () OS Resilience 06.08.2013 29 / 58