SlideShare uma empresa Scribd logo
1 de 33
Baixar para ler offline
Reliability, Availability, and
Serviceability (RAS) on AArch64
Fu Wei (Linaro LEG)
Supreeth Venkatesh (ARM)
ENGINEERS AND DEVICES
WORKING TOGETHER
AGENDA
1. Brief introduction of RAS
○ Definition, Importance, History
2. RAS on AArch64
○ Overview
■ Hardware support
● RAS Extension
■ Software Architecture
● ARM-Trusted-Firmware, UEFI, APEI tables
● SDEI
○ Prototype Solution for Firmware First Error Handling
■ MM Secure Partition, Secure Partition Manager
■ Uncorrected error -- HEST & MM
■ Demo time
3. Status and Future Plans
Brief introduction of RAS
● What is RAS?
● Why do we need RAS?
● History of RAS
ENGINEERS AND DEVICES
WORKING TOGETHER
What is RAS? -- Definition
Reliability
Continuity, Computation
needs be correct and reliable.
Availability
Readiness, System needs to
remain available as long as
possible.
Serviceability
Ability to undergo modifications
and repairs,System should
provide information to
administrator to aid in system
servicing.
The RAS architecture primarily cares about ERRORs produced from HARDWARE .
ENGINEERS AND DEVICES
WORKING TOGETHER
Why do we need RAS? -- Importance
Impacts
Continuity, Computation
needs be correct and reliable.
Inevitability
Although faults are rare,
enterprise systems can be
very large. So failures are
inevitable.
So we have to maintain
system very well, and
Operating Expense (OPEX)
for maintenance is inevitable.
OPEX for maintenance is reduced by
1. replacing only failed parts
2. scheduled maintenance (is
cheaper than unscheduled
service outages)
ENGINEERS AND DEVICES
WORKING TOGETHER
Why do we need RAS? -- Importance
Inevitability
<DRAM Errors in the Wild: A Large-Scale Field Study>
by Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich
Weber
Important Conclusion:
● The incidence of memory errors and the range of
error rates across different DIMMs to be much
higher than previously reported.
● Memory errors are strongly correlated.
● The incidence of CEs increases with age, while
the incidence of UEs decreases with age (due to
re-placements).
● Error rates are unlikely to be dominated by soft
errors.
Benefit from ECC in DIMM
● Single-bit error --> CE
● Avoid (multi-bit errors)UEs from
beginning (Single-bit error, CEs)
● The statistical data of CEs/UEs could
be a reference for maintenance to
reduce the cost of unscheduled service
outage.
ENGINEERS AND DEVICES
WORKING TOGETHER
Server without RAS
How to avoid "Inevitability" ?
To Be Successful in Business, You Need a Little Luck. --Richard Branson
/*
_ooOoo_
o8888888o
88" . "88
(| -_- |)
O = /O
____/`---'____
.' | |// `.
/ ||| : |||// 
/ _||||| -:- |||||- 
| |  - /// | |
| _| ''---/'' | |
 .-__ `-` ___/-. /
___`. .' /--.-- `. . __
."" '< `.____<|>_/___.' >'"".
| | : `- `.;` _ /`;.`/ - ` : | |
  `-. _ __ /__ _/ .-` / /
======`-.____`-.________/___.-`____.-'=====
=
`=---='
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^
Buddha blessed No Error Forever
*/
Emperor Yongzheng
Defeat Ba A Ge (Bug)
ENGINEERS AND DEVICES
WORKING TOGETHER
History
ECC in memory controllers and I/O RAMs
Machine Check Architecture (MCA)
● A mechanism in which the CPU reports
hardware errors to the OS
○ model-specific registers (MSRs)
■ set up machine checking
■ record detected errors
■ the info they contain is CPU
specific
○ Machine Check Exception (MCE)
■ signals the detection of an
uncorrected machine-check
error
■ handler collect information
about error from MSRs
○ Utility: mcelog
PCI-E: Advanced Error Reporting (AER)
Linux kernel
● EDAC (Error Detection and Correction)
○ designed to report and possibly act
on hardware errors
○ inspect the hardware directly
(system-specific handling and
decoding.)
○ only support memory controller and
PCI/AGP errors
Firmware (first) FF
● APEI
● UEFI
RAS on AArch64
● Overview of Hardware & Software
● Prototype Solution for Firmware First Error Handling
ENGINEERS AND DEVICES
WORKING TOGETHER
Hardware support for RAS
● CPU
○ ARMv8-A architecture (a mandatory
extension to ARMv8.2)
○ EL2, EL3, or both
○ Virtualization extension or Security
extensions or both
● GICv3
○ Interrupt routing modes
○ Private and shared interrupts (PPI/SPI)
○ Ability to set an interrupt pending event
signaling and delegation
Interrupt groups/priority
RAS Extension
● ESB (Error Synchronization Barrier) instructions
● RAS Extension registers
● Corrupted data poisoning
ENGINEERS AND DEVICES
WORKING TOGETHER
RAS Extension
ESB instruction
● ESB (Error Synchronization Barrier) can be
used to isolate Unrecoverable errors.
● Software can determine that:
○ The error was reported as
Unrecoverable.
○ The preferred return address of the SEI
is an ESB instruction.
○ The software between that ESB and
the previous ESB can be isolated.
○ ESB might update
DISR_EL1 / DISR (Deferred Interrupt
Status Register) and VDISR_EL2 /
VDISR (Virtual Deferred Interrupt Status
Register)
RAS Extension registers:
● Feature Register/Component ID Register
● Error Record Register
○ Feature
○ Control
○ Record Primary Syndrome
○ Record Address Register
○ Record Miscellaneous Registers
● Hypervisor Configuration Register
● Virtual SError Exception Syndrome Register
● Secure Configuration Register
Or
● Interrupt Register for Fault-Handling and
Recovery
● Device Affinity and Architecture Register
ENGINEERS AND DEVICES
WORKING TOGETHER
RAS Extension -- gather HW error info for FW
ESB instruction
Help to locate Error
RAS Extension registers
● Provide the error info to FW
● Control the Interrupt by FW
ARMv8-A RAS extensions standardize the interface between HW and FW
ENGINEERS AND DEVICES
WORKING TOGETHER
Software Architecture
Firmware First error handling requires standard interfaces between multiple SW
components.
ENGINEERS AND DEVICES
WORKING TOGETHER
SoftwareArchitecture
ENGINEERS AND DEVICES
WORKING TOGETHER
Firmware
ARM Trusted Firmware
● Reference EL3 Runtime (BL31)
○ Standard power control (PSCI)
○ Optional Trusted OS integration
● Trusted boot firmware
○ Optional
○ Compatible with other firmware (like
EDK2)
● Applicable to all segments
● Open Source at GitHub with
BSD-3-clause license
UEFI
Unified Extensible Firmware Interface.
Firmware interface between the
platform and the operating system.
Predominate interfaces are in the boot
services (BS) or pre-OS. Few runtime
(RT) services.
On AArch64, it (tianocore EDK2) works
with ARM TF as BL33 in EL2
ENGINEERS AND DEVICES
WORKING TOGETHER
APEI (ACPI Platform Error Interfaces)
APEI
EINJERST
BERT HEST For runtimeFor last crash
For TestingFor Storage Provides a standard way
to convey error info
from Firmware to OS
ENGINEERS AND DEVICES
WORKING TOGETHER
APEI tables
HEST
(Hardware Error Source Table)
Key info:
HOW to get trigger
WHERE are the error records
HOW to release records’ mem
For ARM64 : GHES v2
HOW to get trigger: Notification Structure
WHERE are the error records:
Error Status Address
(GAS : Generic Address Structure)
HOW to release records’ mem:
Read Ack Register
For IA-32 :
MCE/CMC/NMI
For PCI:
AER Root
Port/Endpoint/Br
idge
For generic
hardware:
GHES (Generic
Hardware Error
Source) V1/V2
ENGINEERS AND DEVICES
WORKING TOGETHER
APEI tables
BERT: Boot Error Record Table
Record fatal errors, then report it in the second boot
CPER (in the Appendix of UEFI spec)
Common Platform Error Record, with this help,
OS can get all kinds of error we could think of.
ENGINEERS AND DEVICES
WORKING TOGETHER
APEI tables -- ERST & EINJ
● ERST: Error Record Serialization Table
○ Operation abstract, provides details necessary
to communicate with on-board persistent
storage for error recording
● EINJ: Error Injection Table
○ Operation abstract, provides a generic
interface which OSPM can inject hardware
errors to the platform without requiring platform
specific software.
ENGINEERS AND DEVICES
WORKING TOGETHER
SDEI usage in RAS
Software Delegated Exception Interface:An interface between FW & OS, for registering, notifying and
servicing system events using SMC/HVC. SDEI Specification (ARM DEN0054A)
ENGINEERS AND DEVICES
WORKING TOGETHER
Prototype Solution for Error Handling
MM Secure Partition, Secure Partition Manager
Uncorrected error -- HEST & MM
ENGINEERS AND DEVICES
WORKING TOGETHER
What Are We Doing
● Define standard interfaces to enable FF handling of AP RAS errors
● Demonstrate use of RAS extensions
● Demonstrate interfaces with reference software and platforms
○ uncorrected DIMM & CPU errors
ENGINEERS AND DEVICES
WORKING TOGETHER
MM Secure Partition
● MM Secure Partition implements
management functions , runs in
S-EL0 to achieve isolation from
S-EL1 & EL3
○ Leverages existing
firmware code based on
EDK2: Standalone MM
● Partition communicates with
ARM TF through a standard
interface: MM_COMMUNICATE
SMC
● Partition is managed by ARM TF
● ARM TF BL31 stage owns EL3
and S-EL1
● Secure partition resources are
described in BL31 platform port
● Minimise code in EL3 and
delegate RAS error handling
ENGINEERS AND DEVICES
WORKING TOGETHER
Secure Partition Manager (SPM)
● Secure Partition Manager in BL31 exports standard ABI to
○ Initialize the partition
○ Delegate SMC requests to the partition
ENGINEERS AND DEVICES
WORKING TOGETHER
UncorrectedError
--HEST&MM
ENGINEERS AND DEVICES
WORKING TOGETHER
Uncorrected Error -- HEST & MM
1. System boot: BootROM-->BL2-->BL3x
a. BL31 initializes SPM (includes MM
dispatcher) and SDEI dispatcher.
b. UEFI (BL33), DXE, UEFI Platform Driver:
i. query SPI (Secure Partition
Image, BL32) for error source info
ii. SPI return error source info back
to UEFI
iii. UEFI map in and mark error
record region as Runtime
Services Data Region
iv. Update/add error source info in
HEST
2. OS starts running: HEST driver scan HEST
table and register error handlers by SDEI
3. UE occurred, the event will be routed to EL3
(SPM)
4. SPM routes the event to RAS error handler in
S-EL0 (MM Foundation)
5. MM Foundation creates the CPER blobs by
the info from RAS Extension
6. SPM notifies SDEI to call the corresponding
OS registered handler
7. OS gets the CPER blobs by Error Status
Address block, process the error, try to
recovery.
8. report the error event by RAS event
9. rasdaemon log error info from RAS event to
recorder
ENGINEERS AND DEVICES
WORKING TOGETHER
Demo Time
● Prototype RAS solution on FVP
○ arm-trusted-firmware (bl1, bl2, bl31)
○ tianocore edk2 (bl32, bl33)
○ Linux kernel, Shell command
Status and Future Plans
● Current development status
● Ongoing development
● TODO list for Reference Solution
ENGINEERS AND DEVICES
WORKING TOGETHER
Current development status
● Hardware
○ ARM engineers are working FVP, LEG team is developing on QEMU
○ RAS spec has released (ARM DDI 0587A)
● Firmware
○ SDEI
■ SDEI Specification released (ARM DEN0054A)
■ SDEI added as hardware error notification type in ACPI 6.2
■ Linux SDEI client implementation v3 patchset has been posted on
kvmarm and devicetree mailing list.
■ ACPICA support for SDEI up-streamed
■ SDEI DT bindings acked
■ ARM TF support posted to github and includes
● SDEI Dispatcher
● Framework for managing interrupts handled in EL3
● OS (Linux):
○ APEI on ARM64 can be enabled in kernel.
○ Memory failure support merged
ENGINEERS AND DEVICES
WORKING TOGETHER
Ongoing development
● ARM TF
○ Simplify error interrupt handling for platform ports
○ Framework for handling External aborts (EA) in design
○ RAS Extensions support in design
■ ESB
■ RAS Error Record driver
● EDK2
○ Driver for creating APEI HEST under development
○ Library for creating APEI CPERs under development
○ Prototyping use of Standalone MM partition to create error records
on QEMU
● OS(Linux)
○ KVM changes for virtualizing SDEI under development
ENGINEERS AND DEVICES
WORKING TOGETHER
TODO list
● Hardware
○ Test on a real hardware (ARMv8.2, including RAS extension)
● Firmware
○ ARM-TF
■ Support for Double fault handling
■ Support for v8.4 RAS Extensions
○ EDK2:
■ Support for BERT
■ ERST and EINJ implementation
ENGINEERS
AND DEVICES
WORKING
TOGETHER
Acknowledgments
● Achin Gupta (ARM)
● John Feeney (Red Hat)
● Leif Lindholm (ARM)
● Supreeth Venkatesh (ARM)
Thank You
#SFO17
BUD17 keynotes and videos on: connect.linaro.org
For further information: www.linaro.org

Mais conteúdo relacionado

Mais procurados

Uboot startup sequence
Uboot startup sequenceUboot startup sequence
Uboot startup sequenceHoucheng Lin
 
Pci express3-device-architecture-optimizations-idf2009-presentation
Pci express3-device-architecture-optimizations-idf2009-presentationPci express3-device-architecture-optimizations-idf2009-presentation
Pci express3-device-architecture-optimizations-idf2009-presentationjkcontee
 
Reliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxReliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxSamsung Open Source Group
 
Linux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewLinux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewRajKumar Rampelli
 
U Boot or Universal Bootloader
U Boot or Universal BootloaderU Boot or Universal Bootloader
U Boot or Universal BootloaderSatpal Parmar
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)shimosawa
 
An Introduction to RISC-V bootflow
An Introduction to RISC-V bootflowAn Introduction to RISC-V bootflow
An Introduction to RISC-V bootflowAtish Patra
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBabak Farrokhi
 
Introduction Linux Device Drivers
Introduction Linux Device DriversIntroduction Linux Device Drivers
Introduction Linux Device DriversNEEVEE Technologies
 
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010Altera Corporation
 
Hardware Probing in the Linux Kernel
Hardware Probing in the Linux KernelHardware Probing in the Linux Kernel
Hardware Probing in the Linux KernelKernel TLV
 
Message Signaled Interrupts
Message Signaled InterruptsMessage Signaled Interrupts
Message Signaled InterruptsAnshuman Biswal
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Andriy Berestovskyy
 

Mais procurados (20)

Uboot startup sequence
Uboot startup sequenceUboot startup sequence
Uboot startup sequence
 
Pci express3-device-architecture-optimizations-idf2009-presentation
Pci express3-device-architecture-optimizations-idf2009-presentationPci express3-device-architecture-optimizations-idf2009-presentation
Pci express3-device-architecture-optimizations-idf2009-presentation
 
Reliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on LinuxReliability, Availability and Serviceability on Linux
Reliability, Availability and Serviceability on Linux
 
Introduction to Modern U-Boot
Introduction to Modern U-BootIntroduction to Modern U-Boot
Introduction to Modern U-Boot
 
USB Drivers
USB DriversUSB Drivers
USB Drivers
 
Linux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewLinux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver Overview
 
U Boot or Universal Bootloader
U Boot or Universal BootloaderU Boot or Universal Bootloader
U Boot or Universal Bootloader
 
Linux device drivers
Linux device drivers Linux device drivers
Linux device drivers
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)
 
ARM AAE - Architecture
ARM AAE - ArchitectureARM AAE - Architecture
ARM AAE - Architecture
 
An Introduction to RISC-V bootflow
An Introduction to RISC-V bootflowAn Introduction to RISC-V bootflow
An Introduction to RISC-V bootflow
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktrace
 
Boot process: BIOS vs UEFI
Boot process: BIOS vs UEFIBoot process: BIOS vs UEFI
Boot process: BIOS vs UEFI
 
Introduction Linux Device Drivers
Introduction Linux Device DriversIntroduction Linux Device Drivers
Introduction Linux Device Drivers
 
PCI Drivers
PCI DriversPCI Drivers
PCI Drivers
 
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010
 
Hardware Probing in the Linux Kernel
Hardware Probing in the Linux KernelHardware Probing in the Linux Kernel
Hardware Probing in the Linux Kernel
 
Qemu Pcie
Qemu PcieQemu Pcie
Qemu Pcie
 
Message Signaled Interrupts
Message Signaled InterruptsMessage Signaled Interrupts
Message Signaled Interrupts
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 

Semelhante a Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203

Assembly programming
Assembly programmingAssembly programming
Assembly programmingOmar Sanchez
 
Spike yuan server ras and uefi cper final
Spike yuan  server ras and uefi cper finalSpike yuan  server ras and uefi cper final
Spike yuan server ras and uefi cper finalparth bera
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersLinaro
 
AAME ARM Techcon2013 002v02 Advanced Features
AAME ARM Techcon2013 002v02 Advanced FeaturesAAME ARM Techcon2013 002v02 Advanced Features
AAME ARM Techcon2013 002v02 Advanced FeaturesAnh Dung NGUYEN
 
Chapter7_InputOutputStorageSystems.pptx
Chapter7_InputOutputStorageSystems.pptxChapter7_InputOutputStorageSystems.pptx
Chapter7_InputOutputStorageSystems.pptxJanethMedina31
 
IO and file systems
IO and file systems IO and file systems
IO and file systems EktaVaswani2
 
Iosystemspre final-160922112930
Iosystemspre final-160922112930Iosystemspre final-160922112930
Iosystemspre final-160922112930marangburu42
 
Virtualization Support in ARMv8+
Virtualization Support in ARMv8+Virtualization Support in ARMv8+
Virtualization Support in ARMv8+Aananth C N
 
VJITSk 6713 user manual
VJITSk 6713 user manualVJITSk 6713 user manual
VJITSk 6713 user manualkot seelam
 
ARM Architecture and Meltdown/Spectre
ARM Architecture and Meltdown/SpectreARM Architecture and Meltdown/Spectre
ARM Architecture and Meltdown/SpectreGlobalLogic Ukraine
 
AAME ARM Techcon2013 004v02 Debug and Optimization
AAME ARM Techcon2013 004v02 Debug and OptimizationAAME ARM Techcon2013 004v02 Debug and Optimization
AAME ARM Techcon2013 004v02 Debug and OptimizationAnh Dung NGUYEN
 
Introduction to-microprocessor
Introduction to-microprocessorIntroduction to-microprocessor
Introduction to-microprocessorankitnav1
 
Introduction to-microprocessor
Introduction to-microprocessorIntroduction to-microprocessor
Introduction to-microprocessorankitnav1
 
Module 5 embedded systems,8051
Module 5 embedded systems,8051Module 5 embedded systems,8051
Module 5 embedded systems,8051Deepak John
 

Semelhante a Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203 (20)

Assembly programming
Assembly programmingAssembly programming
Assembly programming
 
Spike yuan server ras and uefi cper final
Spike yuan  server ras and uefi cper finalSpike yuan  server ras and uefi cper final
Spike yuan server ras and uefi cper final
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 Servers
 
AAME ARM Techcon2013 002v02 Advanced Features
AAME ARM Techcon2013 002v02 Advanced FeaturesAAME ARM Techcon2013 002v02 Advanced Features
AAME ARM Techcon2013 002v02 Advanced Features
 
x86_1.ppt
x86_1.pptx86_1.ppt
x86_1.ppt
 
Chapter7_InputOutputStorageSystems.pptx
Chapter7_InputOutputStorageSystems.pptxChapter7_InputOutputStorageSystems.pptx
Chapter7_InputOutputStorageSystems.pptx
 
Introduction to 8085svv
Introduction to 8085svvIntroduction to 8085svv
Introduction to 8085svv
 
Io systems final
Io systems finalIo systems final
Io systems final
 
IO and file systems
IO and file systems IO and file systems
IO and file systems
 
Iosystemspre final-160922112930
Iosystemspre final-160922112930Iosystemspre final-160922112930
Iosystemspre final-160922112930
 
Virtualization Support in ARMv8+
Virtualization Support in ARMv8+Virtualization Support in ARMv8+
Virtualization Support in ARMv8+
 
VJITSk 6713 user manual
VJITSk 6713 user manualVJITSk 6713 user manual
VJITSk 6713 user manual
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
 
Assignment
AssignmentAssignment
Assignment
 
Important questions
Important questionsImportant questions
Important questions
 
ARM Architecture and Meltdown/Spectre
ARM Architecture and Meltdown/SpectreARM Architecture and Meltdown/Spectre
ARM Architecture and Meltdown/Spectre
 
AAME ARM Techcon2013 004v02 Debug and Optimization
AAME ARM Techcon2013 004v02 Debug and OptimizationAAME ARM Techcon2013 004v02 Debug and Optimization
AAME ARM Techcon2013 004v02 Debug and Optimization
 
Introduction to-microprocessor
Introduction to-microprocessorIntroduction to-microprocessor
Introduction to-microprocessor
 
Introduction to-microprocessor
Introduction to-microprocessorIntroduction to-microprocessor
Introduction to-microprocessor
 
Module 5 embedded systems,8051
Module 5 embedded systems,8051Module 5 embedded systems,8051
Module 5 embedded systems,8051
 

Mais de Linaro

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloLinaro
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaLinaro
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraLinaro
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaLinaro
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018Linaro
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018Linaro
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...Linaro
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Linaro
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Linaro
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteLinaro
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopLinaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allLinaro
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorLinaro
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMULinaro
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MLinaro
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation Linaro
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootLinaro
 

Mais de Linaro (20)

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qa
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening Keynote
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP Workshop
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8M
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted boot
 

Último

Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...itnewsafrica
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Jeffrey Haguewood
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 

Último (20)

Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
Irene Moetsana-Moeng: Stakeholders in Cybersecurity: Collaborative Defence fo...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 

Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203

  • 1. Reliability, Availability, and Serviceability (RAS) on AArch64 Fu Wei (Linaro LEG) Supreeth Venkatesh (ARM)
  • 2. ENGINEERS AND DEVICES WORKING TOGETHER AGENDA 1. Brief introduction of RAS ○ Definition, Importance, History 2. RAS on AArch64 ○ Overview ■ Hardware support ● RAS Extension ■ Software Architecture ● ARM-Trusted-Firmware, UEFI, APEI tables ● SDEI ○ Prototype Solution for Firmware First Error Handling ■ MM Secure Partition, Secure Partition Manager ■ Uncorrected error -- HEST & MM ■ Demo time 3. Status and Future Plans
  • 3. Brief introduction of RAS ● What is RAS? ● Why do we need RAS? ● History of RAS
  • 4. ENGINEERS AND DEVICES WORKING TOGETHER What is RAS? -- Definition Reliability Continuity, Computation needs be correct and reliable. Availability Readiness, System needs to remain available as long as possible. Serviceability Ability to undergo modifications and repairs,System should provide information to administrator to aid in system servicing. The RAS architecture primarily cares about ERRORs produced from HARDWARE .
  • 5. ENGINEERS AND DEVICES WORKING TOGETHER Why do we need RAS? -- Importance Impacts Continuity, Computation needs be correct and reliable. Inevitability Although faults are rare, enterprise systems can be very large. So failures are inevitable. So we have to maintain system very well, and Operating Expense (OPEX) for maintenance is inevitable. OPEX for maintenance is reduced by 1. replacing only failed parts 2. scheduled maintenance (is cheaper than unscheduled service outages)
  • 6. ENGINEERS AND DEVICES WORKING TOGETHER Why do we need RAS? -- Importance Inevitability <DRAM Errors in the Wild: A Large-Scale Field Study> by Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber Important Conclusion: ● The incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported. ● Memory errors are strongly correlated. ● The incidence of CEs increases with age, while the incidence of UEs decreases with age (due to re-placements). ● Error rates are unlikely to be dominated by soft errors. Benefit from ECC in DIMM ● Single-bit error --> CE ● Avoid (multi-bit errors)UEs from beginning (Single-bit error, CEs) ● The statistical data of CEs/UEs could be a reference for maintenance to reduce the cost of unscheduled service outage.
  • 7. ENGINEERS AND DEVICES WORKING TOGETHER Server without RAS How to avoid "Inevitability" ? To Be Successful in Business, You Need a Little Luck. --Richard Branson /* _ooOoo_ o8888888o 88" . "88 (| -_- |) O = /O ____/`---'____ .' | |// `. / ||| : |||// / _||||| -:- |||||- | | - /// | | | _| ''---/'' | | .-__ `-` ___/-. / ___`. .' /--.-- `. . __ ."" '< `.____<|>_/___.' >'"". | | : `- `.;` _ /`;.`/ - ` : | | `-. _ __ /__ _/ .-` / / ======`-.____`-.________/___.-`____.-'===== = `=---=' ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^ Buddha blessed No Error Forever */ Emperor Yongzheng Defeat Ba A Ge (Bug)
  • 8. ENGINEERS AND DEVICES WORKING TOGETHER History ECC in memory controllers and I/O RAMs Machine Check Architecture (MCA) ● A mechanism in which the CPU reports hardware errors to the OS ○ model-specific registers (MSRs) ■ set up machine checking ■ record detected errors ■ the info they contain is CPU specific ○ Machine Check Exception (MCE) ■ signals the detection of an uncorrected machine-check error ■ handler collect information about error from MSRs ○ Utility: mcelog PCI-E: Advanced Error Reporting (AER) Linux kernel ● EDAC (Error Detection and Correction) ○ designed to report and possibly act on hardware errors ○ inspect the hardware directly (system-specific handling and decoding.) ○ only support memory controller and PCI/AGP errors Firmware (first) FF ● APEI ● UEFI
  • 9. RAS on AArch64 ● Overview of Hardware & Software ● Prototype Solution for Firmware First Error Handling
  • 10. ENGINEERS AND DEVICES WORKING TOGETHER Hardware support for RAS ● CPU ○ ARMv8-A architecture (a mandatory extension to ARMv8.2) ○ EL2, EL3, or both ○ Virtualization extension or Security extensions or both ● GICv3 ○ Interrupt routing modes ○ Private and shared interrupts (PPI/SPI) ○ Ability to set an interrupt pending event signaling and delegation Interrupt groups/priority RAS Extension ● ESB (Error Synchronization Barrier) instructions ● RAS Extension registers ● Corrupted data poisoning
  • 11. ENGINEERS AND DEVICES WORKING TOGETHER RAS Extension ESB instruction ● ESB (Error Synchronization Barrier) can be used to isolate Unrecoverable errors. ● Software can determine that: ○ The error was reported as Unrecoverable. ○ The preferred return address of the SEI is an ESB instruction. ○ The software between that ESB and the previous ESB can be isolated. ○ ESB might update DISR_EL1 / DISR (Deferred Interrupt Status Register) and VDISR_EL2 / VDISR (Virtual Deferred Interrupt Status Register) RAS Extension registers: ● Feature Register/Component ID Register ● Error Record Register ○ Feature ○ Control ○ Record Primary Syndrome ○ Record Address Register ○ Record Miscellaneous Registers ● Hypervisor Configuration Register ● Virtual SError Exception Syndrome Register ● Secure Configuration Register Or ● Interrupt Register for Fault-Handling and Recovery ● Device Affinity and Architecture Register
  • 12. ENGINEERS AND DEVICES WORKING TOGETHER RAS Extension -- gather HW error info for FW ESB instruction Help to locate Error RAS Extension registers ● Provide the error info to FW ● Control the Interrupt by FW ARMv8-A RAS extensions standardize the interface between HW and FW
  • 13. ENGINEERS AND DEVICES WORKING TOGETHER Software Architecture Firmware First error handling requires standard interfaces between multiple SW components.
  • 14. ENGINEERS AND DEVICES WORKING TOGETHER SoftwareArchitecture
  • 15. ENGINEERS AND DEVICES WORKING TOGETHER Firmware ARM Trusted Firmware ● Reference EL3 Runtime (BL31) ○ Standard power control (PSCI) ○ Optional Trusted OS integration ● Trusted boot firmware ○ Optional ○ Compatible with other firmware (like EDK2) ● Applicable to all segments ● Open Source at GitHub with BSD-3-clause license UEFI Unified Extensible Firmware Interface. Firmware interface between the platform and the operating system. Predominate interfaces are in the boot services (BS) or pre-OS. Few runtime (RT) services. On AArch64, it (tianocore EDK2) works with ARM TF as BL33 in EL2
  • 16. ENGINEERS AND DEVICES WORKING TOGETHER APEI (ACPI Platform Error Interfaces) APEI EINJERST BERT HEST For runtimeFor last crash For TestingFor Storage Provides a standard way to convey error info from Firmware to OS
  • 17. ENGINEERS AND DEVICES WORKING TOGETHER APEI tables HEST (Hardware Error Source Table) Key info: HOW to get trigger WHERE are the error records HOW to release records’ mem For ARM64 : GHES v2 HOW to get trigger: Notification Structure WHERE are the error records: Error Status Address (GAS : Generic Address Structure) HOW to release records’ mem: Read Ack Register For IA-32 : MCE/CMC/NMI For PCI: AER Root Port/Endpoint/Br idge For generic hardware: GHES (Generic Hardware Error Source) V1/V2
  • 18. ENGINEERS AND DEVICES WORKING TOGETHER APEI tables BERT: Boot Error Record Table Record fatal errors, then report it in the second boot CPER (in the Appendix of UEFI spec) Common Platform Error Record, with this help, OS can get all kinds of error we could think of.
  • 19. ENGINEERS AND DEVICES WORKING TOGETHER APEI tables -- ERST & EINJ ● ERST: Error Record Serialization Table ○ Operation abstract, provides details necessary to communicate with on-board persistent storage for error recording ● EINJ: Error Injection Table ○ Operation abstract, provides a generic interface which OSPM can inject hardware errors to the platform without requiring platform specific software.
  • 20. ENGINEERS AND DEVICES WORKING TOGETHER SDEI usage in RAS Software Delegated Exception Interface:An interface between FW & OS, for registering, notifying and servicing system events using SMC/HVC. SDEI Specification (ARM DEN0054A)
  • 21. ENGINEERS AND DEVICES WORKING TOGETHER Prototype Solution for Error Handling MM Secure Partition, Secure Partition Manager Uncorrected error -- HEST & MM
  • 22. ENGINEERS AND DEVICES WORKING TOGETHER What Are We Doing ● Define standard interfaces to enable FF handling of AP RAS errors ● Demonstrate use of RAS extensions ● Demonstrate interfaces with reference software and platforms ○ uncorrected DIMM & CPU errors
  • 23. ENGINEERS AND DEVICES WORKING TOGETHER MM Secure Partition ● MM Secure Partition implements management functions , runs in S-EL0 to achieve isolation from S-EL1 & EL3 ○ Leverages existing firmware code based on EDK2: Standalone MM ● Partition communicates with ARM TF through a standard interface: MM_COMMUNICATE SMC ● Partition is managed by ARM TF ● ARM TF BL31 stage owns EL3 and S-EL1 ● Secure partition resources are described in BL31 platform port ● Minimise code in EL3 and delegate RAS error handling
  • 24. ENGINEERS AND DEVICES WORKING TOGETHER Secure Partition Manager (SPM) ● Secure Partition Manager in BL31 exports standard ABI to ○ Initialize the partition ○ Delegate SMC requests to the partition
  • 25. ENGINEERS AND DEVICES WORKING TOGETHER UncorrectedError --HEST&MM
  • 26. ENGINEERS AND DEVICES WORKING TOGETHER Uncorrected Error -- HEST & MM 1. System boot: BootROM-->BL2-->BL3x a. BL31 initializes SPM (includes MM dispatcher) and SDEI dispatcher. b. UEFI (BL33), DXE, UEFI Platform Driver: i. query SPI (Secure Partition Image, BL32) for error source info ii. SPI return error source info back to UEFI iii. UEFI map in and mark error record region as Runtime Services Data Region iv. Update/add error source info in HEST 2. OS starts running: HEST driver scan HEST table and register error handlers by SDEI 3. UE occurred, the event will be routed to EL3 (SPM) 4. SPM routes the event to RAS error handler in S-EL0 (MM Foundation) 5. MM Foundation creates the CPER blobs by the info from RAS Extension 6. SPM notifies SDEI to call the corresponding OS registered handler 7. OS gets the CPER blobs by Error Status Address block, process the error, try to recovery. 8. report the error event by RAS event 9. rasdaemon log error info from RAS event to recorder
  • 27. ENGINEERS AND DEVICES WORKING TOGETHER Demo Time ● Prototype RAS solution on FVP ○ arm-trusted-firmware (bl1, bl2, bl31) ○ tianocore edk2 (bl32, bl33) ○ Linux kernel, Shell command
  • 28. Status and Future Plans ● Current development status ● Ongoing development ● TODO list for Reference Solution
  • 29. ENGINEERS AND DEVICES WORKING TOGETHER Current development status ● Hardware ○ ARM engineers are working FVP, LEG team is developing on QEMU ○ RAS spec has released (ARM DDI 0587A) ● Firmware ○ SDEI ■ SDEI Specification released (ARM DEN0054A) ■ SDEI added as hardware error notification type in ACPI 6.2 ■ Linux SDEI client implementation v3 patchset has been posted on kvmarm and devicetree mailing list. ■ ACPICA support for SDEI up-streamed ■ SDEI DT bindings acked ■ ARM TF support posted to github and includes ● SDEI Dispatcher ● Framework for managing interrupts handled in EL3 ● OS (Linux): ○ APEI on ARM64 can be enabled in kernel. ○ Memory failure support merged
  • 30. ENGINEERS AND DEVICES WORKING TOGETHER Ongoing development ● ARM TF ○ Simplify error interrupt handling for platform ports ○ Framework for handling External aborts (EA) in design ○ RAS Extensions support in design ■ ESB ■ RAS Error Record driver ● EDK2 ○ Driver for creating APEI HEST under development ○ Library for creating APEI CPERs under development ○ Prototyping use of Standalone MM partition to create error records on QEMU ● OS(Linux) ○ KVM changes for virtualizing SDEI under development
  • 31. ENGINEERS AND DEVICES WORKING TOGETHER TODO list ● Hardware ○ Test on a real hardware (ARMv8.2, including RAS extension) ● Firmware ○ ARM-TF ■ Support for Double fault handling ■ Support for v8.4 RAS Extensions ○ EDK2: ■ Support for BERT ■ ERST and EINJ implementation
  • 32. ENGINEERS AND DEVICES WORKING TOGETHER Acknowledgments ● Achin Gupta (ARM) ● John Feeney (Red Hat) ● Leif Lindholm (ARM) ● Supreeth Venkatesh (ARM)
  • 33. Thank You #SFO17 BUD17 keynotes and videos on: connect.linaro.org For further information: www.linaro.org