The e820 trap of Linux kernel hibernation
Aug, 2015, COSCUP 2015, Taipei
Joey Lee, SUSE Labs Taipei
Agenda
• Fundamental
• Hibernation (suspend to disk)
• e820, EFI memmap
• e820 shift
• Platform vs. Shutdown
• memory size changing
• EFI memmap shift
• setup_data and nosave regions
• EFI runtime services broken after S4
• Challenges
• Q&A
Fundamental
Memory (physical)
[Figure: physical memory, from pfn = 0 to pfn = max_pfn]
Memory (runtime)
[Figure: runtime memory, from 0 to max_pfn]
Hibernation (suspend to disk)
• Create snapshot image of runtime memory.
• Store snapshot image to swap partition or file.
• Restore snapshot image to memory.
Hibernation (restore)
[Figure: snapshot image restored into memory, 0 to max_pfn]
Memory (physical)
[Figure: physical memory, from pfn = 0 to pfn = max_pfn]
Memory (BIOS memory map)
[Figure: BIOS memory map reported at boot, 0 to max_pfn]
e820
• Wikipedia:
• e820 is shorthand to refer to the facility by which the
BIOS of x86-based computer systems reports the
memory map to the operating system or boot loader.
• It is accessed via the int 15h call, by setting the AX
register to value E820 in hexadecimal. It reports which
memory address ranges are usable and which are
reserved for use by the BIOS.
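To make the reported format concrete, here is a minimal Python sketch that decodes one raw entry as returned by the int 15h, AX=E820h call: a 64-bit base address, a 64-bit length and a 32-bit type, little-endian, 20 bytes in total. The function name and the sample bytes are illustrative assumptions, not something taken from the slides.

```python
import struct

# One e820 entry: 64-bit base, 64-bit length, 32-bit type (little-endian, 20 bytes).
E820_ENTRY = struct.Struct("<QQI")

def decode_e820_entry(raw: bytes):
    """Return (start, end_inclusive, type) for one 20-byte e820 entry."""
    base, length, etype = E820_ENTRY.unpack(raw)
    return base, base + length - 1, etype

# Hypothetical sample: a usable (type 1) range, built here only for illustration.
sample = E820_ENTRY.pack(0x68F45000, 0xE0B000, 1)
start, end, etype = decode_e820_entry(sample)
print(f"[mem {start:#018x}-{end:#018x}] type {etype}")
```

The inclusive end address is printed because that is the form the kernel uses in its BIOS-e820 boot messages.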
e820 entry type
Type    Kernel Define   String in dmesg                       Description
Type 1  E820_RAM        usable, System RAM                    Usable (normal) RAM
Type 2  E820_RESERVED   reserved, reserved                    Reserved - unusable
Type 3  E820_ACPI       ACPI data, ACPI Tables                ACPI reclaimable memory
Type 4  E820_NVS*       ACPI NVS, ACPI Non-volatile Storage   ACPI NVS memory, ACPI Non-Volatile-Sleeping Memory (NVS)
Type 5  E820_UNUSABLE   Unusable, Unusable memory             Area containing bad memory
* drivers/acpi/nvs.c::suspend_nvs_*() handle ACPI NVS for S4
Memory (BIOS memory map)
[Figure: BIOS memory map reported at boot, 0 to max_pfn]
Memory (runtime)
[Figure: memory from 0 to max_pfn at boot and as seen by the OS, divided into ACPI NVS, reserved, ACPI data and usable regions]
EFI memory map
• EFI spec v2.5
• EFI_BOOT_SERVICES.GetMemoryMap()
• Returns the current memory map.
• 6.2 Memory Allocation Services
• Table 25. Memory Type Usage before
ExitBootServices()
• Table 26. Memory Type Usage after ExitBootServices()
e820 entry type vs. EFI memory region type
E820 Type  E820 entry type  EFI memory region type
Type 1     E820_RAM         EFI_LOADER_CODE (type 1)
                            EFI_LOADER_DATA (type 2)
                            EFI_BOOT_SERVICES_CODE (type 3)
                            EFI_BOOT_SERVICES_DATA (type 4)
                            EFI_CONVENTIONAL_MEMORY (type 7)
Type 2     E820_RESERVED    EFI_RESERVED_TYPE (type 0)
                            EFI_RUNTIME_SERVICES_CODE (type 5)
                            EFI_RUNTIME_SERVICES_DATA (type 6)
                            EFI_MEMORY_MAPPED_IO (type 11)
                            EFI_MEMORY_MAPPED_IO_PORT_SPACE (type 12)
                            EFI_PAL_CODE (type 13)
Type 3     E820_ACPI        EFI_ACPI_RECLAIM_MEMORY (type 9)
Type 4     E820_NVS         EFI_ACPI_MEMORY_NVS (type 10)
Type 5     E820_UNUSABLE    EFI_UNUSABLE_MEMORY (type 8)
New*       E820_PMEM        EFI_PERSISTENT_MEMORY (type 14)
* v4.2-rc4
arch/x86/boot/compressed/eboot.c::setup_e820()
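The table above can be read as a lookup from EFI memory region type to e820 entry type. The sketch below only transcribes that table into a Python dictionary. It is not the code of setup_e820(), and the helper name efi_to_e820 is an assumption made for illustration.

```python
# EFI memory region type number -> e820 entry type, transcribed from the table.
EFI_TO_E820 = {
    0: "E820_RESERVED",   # EFI_RESERVED_TYPE
    1: "E820_RAM",        # EFI_LOADER_CODE
    2: "E820_RAM",        # EFI_LOADER_DATA
    3: "E820_RAM",        # EFI_BOOT_SERVICES_CODE
    4: "E820_RAM",        # EFI_BOOT_SERVICES_DATA
    5: "E820_RESERVED",   # EFI_RUNTIME_SERVICES_CODE
    6: "E820_RESERVED",   # EFI_RUNTIME_SERVICES_DATA
    7: "E820_RAM",        # EFI_CONVENTIONAL_MEMORY
    8: "E820_UNUSABLE",   # EFI_UNUSABLE_MEMORY
    9: "E820_ACPI",       # EFI_ACPI_RECLAIM_MEMORY
    10: "E820_NVS",       # EFI_ACPI_MEMORY_NVS
    11: "E820_RESERVED",  # EFI_MEMORY_MAPPED_IO
    12: "E820_RESERVED",  # EFI_MEMORY_MAPPED_IO_PORT_SPACE
    13: "E820_RESERVED",  # EFI_PAL_CODE
    14: "E820_PMEM",      # EFI_PERSISTENT_MEMORY
}

def efi_to_e820(efi_type: int) -> str:
    """Look up the e820 entry type for an EFI memory region type number."""
    return EFI_TO_E820[efi_type]

print(efi_to_e820(4), efi_to_e820(6))
```

Note that boot services regions become E820_RAM while runtime services regions become E820_RESERVED, so a change in how firmware lays out these regions changes the e820 view the kernel sees.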
e820 shift
e820 shift (1)
Boot 1:
Boot 2:
e820 shift (2)
• Boot:
• [ 0.000000] BIOS-e820: [mem 0x0000000068f45000-0x0000000069d4ffff] usable
• Resume Boot:
• [ 0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
• [ 0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]
• [ 17.410733] PM: Image loading progress: 0%
• [ 17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
• [ 17.933469] IP: [<ffffffff810a1cf0>] load_image_lzo+0x810/0xe40
• The page fault address lies in a usable e820 entry on the original boot, but in a reserved entry on the resume boot.
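The overlap can be checked mechanically. The following Python sketch parses the two BIOS-e820 lines quoted on this slide and looks up the faulting address in each map. It assumes the faulting virtual address is in the kernel direct mapping with base 0xffff880000000000 (the usual x86_64 value for kernels of that period without randomization). The helper names are hypothetical.

```python
import re

# Assumed direct-mapping base for x86_64 kernels of that era (no KASLR).
PAGE_OFFSET = 0xFFFF880000000000

E820_LINE = re.compile(r"BIOS-e820: \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]\s+(.+)")

def parse_e820(lines):
    """Extract (start, end_inclusive, type_string) from BIOS-e820 dmesg lines."""
    entries = []
    for line in lines:
        m = E820_LINE.search(line)
        if m:
            entries.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3).strip()))
    return entries

def type_of(entries, phys):
    """Return the type string of the entry containing phys, or None."""
    for start, end, kind in entries:
        if start <= phys <= end:
            return kind
    return None

boot = parse_e820(["[ 0.000000] BIOS-e820: [mem 0x0000000068f45000-0x0000000069d4ffff] usable"])
resume = parse_e820(["[ 0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved"])

fault_phys = 0xFFFF880069D4F000 - PAGE_OFFSET
print(hex(fault_phys), type_of(boot, fault_phys), type_of(resume, fault_phys))
```

Under that assumption the physical address is 0x69d4f000, the last page of the usable range on the original boot and the first page of the reserved range on the resume boot.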
e820 shift (3)
[Figure: memory maps of Boot and Resume Boot side by side (ACPI NVS, reserved, ACPI data and usable regions, 0 to max_pfn); an address that is usable at Boot falls in a reserved region at Resume Boot]
Checking e820 shift:
• Lee, Chun-Yi [PATCH] PM / hibernate: avoid unsafe pages
in e820 reserved regions:
• 84c91b7ae commit in v3.17-rc1
• https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=84c91b7
• Reverted by f82daee49 commit in v4.0
• Waiting for Yinghai Lu's “[PATCH] x86: Kill E820_RESERVED_KERN”
• Lee, Chun-Yi [PATCH] Hibernate: save e820 table to
snapshot header for comparison
• https://lkml.org/lkml/2014/8/11/166
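The idea of saving the e820 table with the snapshot and comparing it on resume can be sketched as follows. This is only an illustration of the comparison step in Python, not the approach of the patch itself: the digest scheme, the function names and the tuple layout (start, size, type) are all assumptions.

```python
import hashlib
import struct

def e820_digest(entries):
    """Hash a list of (start, size, type) tuples in a canonical (sorted) order."""
    h = hashlib.sha256()
    for start, size, etype in sorted(entries):
        h.update(struct.pack("<QQI", start, size, etype))
    return h.hexdigest()

def e820_changed(saved_digest, current_entries):
    """True when the current table no longer matches the digest saved at hibernation."""
    return saved_digest != e820_digest(current_entries)

# Ranges taken from the "e820 shift (2)" log lines; type 1 = usable, type 2 = reserved.
boot_table = [(0x68F45000, 0xE0B000, 1)]
resume_table = [(0x69D4F000, 0xC4000, 2)]

saved = e820_digest(boot_table)
print(e820_changed(saved, boot_table), e820_changed(saved, resume_table))
```

A check like this lets resume refuse an image early instead of faulting in the middle of loading it.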
Platform vs. Shutdown (1)
• Different modes of hibernation:
• cat /sys/power/disk
[platform] shutdown reboot suspend
• Platform mode depends on _S4 support by BIOS:
[ 1.080004] ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [_S4_]
(20130725/hwxface-571)
• ACPI spec 6.0:
• Table 7-234 BIOS-Supplied Control Methods for System-Level Functions
• _S4: Package that defines system _S4 state mode.
• 16.3.2 BIOS Initialization of Memory (since ACPI v1.0):
• Note: The memory information returned from the system address map
reporting interfaces should be the same before and after an S4 sleep.
OSPM will invoke E820 interfaces on IA-PC-based legacy systems or the
GetMemoryMap() interface on UEFI-enabled systems
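The currently selected hibernation mode is the bracketed word in /sys/power/disk. A small Python sketch for reading it, with a hypothetical helper name, could look like this:

```python
def parse_power_disk(text: str):
    """Return (selected_mode, all_modes) from the contents of /sys/power/disk."""
    modes, selected = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]   # strip the brackets marking the active mode
            selected = token
        modes.append(token)
    return selected, modes

# Sample content as shown on the slide; on a real system read the sysfs file instead.
selected, modes = parse_power_disk("[platform] shutdown reboot suspend")
print(selected, modes)
```

On a live system the input would come from reading /sys/power/disk, and the set of modes offered depends on what the platform supports.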
Platform vs. Shutdown (2)
• Documentation/power/swsusp.txt in kernel
• Q: What is the difference between "platform" and "shutdown"?
• A: "platform" is actually right thing to do where supported, but
"shutdown" is most reliable (except on ACPI systems).
• Linux Kernel bug #77571:
• https://bugzilla.kernel.org/show_bug.cgi?id=77571
• The same page fault occurs when writing the snapshot image to the page buffer.
• The bug reporter used “shutdown” rather than “platform”.
After switching to “platform”, the reporter could no longer reproduce the issue.
• Prefer “platform” when the BIOS supports _S4.
Users should be aware that “shutdown” carries risk.
Memory size mismatch (1)
• PM: Loading and decompressing image data (495448 pages)...
[ 3.834831] PM: Image mismatch: memory size
[ 3.834851] PM: Read 1981792 kbytes in 0.01 seconds (198179.20 MB/s)
[ 3.836147] PM: Error -1 resuming
[ 3.836162] PM: Failed to load hibernation image, recovering.
• Normally: On node 0 totalpages: 4177255
When the issue happened: On node 0 totalpages: 4177256 <== mismatch
• for_each_online_node(nid)
      phys_pages += node_present_pages(nid);
• kernel/power/snapshot.c::check_header()
      if (!reason && info->num_physpages != get_num_physpages())
          reason = "memory size";
      if (reason) {
          printk(KERN_ERR "PM: Image mismatch: %s\n", reason);
          return -EPERM;
      }
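The effect of this check can be modelled in a few lines of Python. The sketch below is a simplified stand-in for the kernel logic, not a port of it: it sums per-node page counts and compares the total with the value recorded in the image header. The totalpages figures from the log above are used as stand-ins for the per-node present page counts, which is an assumption.

```python
def count_phys_pages(present_pages_per_node):
    """Sum present pages over all online nodes (models get_num_physpages())."""
    return sum(present_pages_per_node.values())

def check_header(image_num_physpages, present_pages_per_node):
    """Return a mismatch reason string, or None when the image may be loaded."""
    if image_num_physpages != count_phys_pages(present_pages_per_node):
        return "memory size"
    return None

image_pages = 4177255                               # recorded when the image was created
print(check_header(image_pages, {0: 4177255}))      # same memory size on resume
print(check_header(image_pages, {0: 4177256}))      # one extra page on resume
```

A difference of a single page is enough to reject the image, which is why a small shift in the firmware memory map makes resume fail.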
Memory size mismatch (2)
• Boot
[Figure: memory map of Boot]
Memory size mismatch (3)
• Resume Boot
[Figure: memory map of Resume Boot]
EFI memmap shift
Misidentification of nosave region (1)
[Figure annotations: 1 page; in usable; not aligned; EFI_LOADER_DATA]
setup_data and E820_RESERVED_KERN
• setup_data: a linked list that carries data along with boot_params
to later boot stages.
• Allocated in the EFI stub, reserved via memblock and e820.
• Yinghai Lu [PATCH] x86, boot: clean up setup_data
handling
• https://lkml.org/lkml/2015/2/28/272
• SETUP_E820_EXT, SETUP_EFI, SETUP_DTB,
SETUP_PCI, SETUP_KASLR
• These setup_data chunks are not page-aligned when
allocated. That leaves holes between e820 entries, which the
kernel then registers as 1-page nosave regions. <== random
address per boot!
Misidentification of nosave region (2)
• arch/x86/kernel/e820.c
Registers the hole between two e820 regions as a 1-page nosave region
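How an unaligned chunk turns into a 1-page nosave region can be seen with a small model in Python. The loop below is loosely modelled on the hole-detection logic this slide refers to in arch/x86/kernel/e820.c. It is a sketch written from that description, so the real kernel code may differ, and the sample addresses are invented for illustration.

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT

def pfn_up(addr):
    """First page frame number at or above addr (round up)."""
    return (addr + PAGE_SIZE - 1) >> PAGE_SHIFT

def pfn_down(addr):
    """Page frame number containing addr (round down)."""
    return addr >> PAGE_SHIFT

RAM, RESERVED_KERN, RESERVED = "RAM", "RESERVED_KERN", "RESERVED"

def mark_nosave_regions(entries):
    """entries: sorted (addr, size, type) tuples. Returns nosave PFN ranges [start, end)."""
    nosave = []
    pfn = 0
    for addr, size, etype in entries:
        if pfn < pfn_up(addr):                 # gap between previous end and this start
            nosave.append((pfn, pfn_up(addr)))
        pfn = pfn_down(addr + size)
        if etype not in (RAM, RESERVED_KERN):  # non-RAM regions are never saved
            nosave.append((pfn_up(addr), pfn))
    return nosave

entries = [
    (0x0000, 0x2800, RAM),
    (0x2800, 0x0800, RESERVED_KERN),  # hypothetical setup_data chunk, start not page-aligned
    (0x3000, 0x1000, RAM),
]
print(mark_nosave_regions(entries))
```

Because the RAM entry ends in the middle of page 2, rounding its end down and the next start up leaves a one-page gap that is registered as nosave. With page-aligned entries the same loop reports nothing.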
Kill E820_RESERVED_KERN
• Yinghai Lu [PATCH] x86: Kill E820_RESERVED_KERN
• https://lkml.org/lkml/2015/2/28/274
• Cleans up the setup_data handler and removes E820_RESERVED_KERN from
the e820 regions, because setup_data is already protected by memblock.
• Avoids wasting memory and fixes the page alignment problem in e820.
• Linux Kernel bug #96111 Unreliable hibernation on Lenovo X230
• https://bugzilla.kernel.org/show_bug.cgi?id=96111
• 84c91b7ae commit in v3.17-rc1
Reverted by f82daee49 commit in v4.0
• Chen, Yu C [RFC PATCH] PM / hibernate: make sure each resuming
page is in current memory zones
• Waiting for Yinghai Lu's patch to kill E820_RESERVED_KERN
EFI runtime services broken after S4 (1)
On some machines
EFI runtime services broken after S4 (2)
• Resume Boot:
VA 0xfffffffefd244e60 is in the Runtime Data region after resuming from hibernation:
[ 0.125865] efi: mem26: [Runtime Data |RUN| | | | |WB|WT|WC|UC] pa=[0x00000000bb3e5000-0x00000000bb445000) va=[0xfffffffefd1e5000-0xfffffffefd245000) (0MB)
• Boot:
VA 0xfffffffefd244e60 was not mapped to any PA in the hibernated kernel (image kernel):
[ 0.111002] efi: mem24: [Runtime Code |RUN| | | | |WB|WT|WC|UC] pa=[0x00000000bb385000-0x00000000bb3e5000) va=[0xfffffffefd585000-0xfffffffefd5e5000) (0MB)
[ 0.125883] efi: mem25: [Runtime Data |RUN| | | | |WB|WT|WC|UC] pa=[0x00000000bb3e5000-0x00000000bb445000) va=[0xfffffffefd3e5000-0xfffffffefd445000) (0MB)
[ 0.140764] efi: mem29: [Boot Data | | | | | |WB|WT|WC|UC] pa=[0x00000000bb7ff000-0x00000000bb800000) va=[0xfffffffefd1ff000-0xfffffffefd200000) (0MB)
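The mismatch can be verified by testing the address against the half-open va ranges from both logs. The Python sketch below assumes the address in question is 0xfffffffefd244e60, the 16-digit form that is consistent with the listed ranges. The helper name is hypothetical.

```python
def region_of(regions, va):
    """Return the name of the region whose [start, end) va range contains va."""
    for name, start, end in regions:
        if start <= va < end:
            return name
    return None

# va ranges copied from the efi: memNN lines of each boot.
resume_boot = [
    ("mem26 Runtime Data", 0xFFFFFFFEFD1E5000, 0xFFFFFFFEFD245000),
]
boot = [
    ("mem24 Runtime Code", 0xFFFFFFFEFD585000, 0xFFFFFFFEFD5E5000),
    ("mem25 Runtime Data", 0xFFFFFFFEFD3E5000, 0xFFFFFFFEFD445000),
    ("mem29 Boot Data", 0xFFFFFFFEFD1FF000, 0xFFFFFFFEFD200000),
]

va = 0xFFFFFFFEFD244E60  # assumed 16-digit form of the address on the slide
print(region_of(resume_boot, va), region_of(boot, va))
```

The physical range of Runtime Data is identical in both boots (0xbb3e5000 to 0xbb445000). Only its virtual address differs, so a pointer kept from one mapping is invalid in the other.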
Memory mapping of EFI runtime services (1)
• Borislav Petkov [PATCH] EFI: Runtime services virtual mapping
• d2f7cbe7, in the kernel since v3.14
• We map the EFI regions needed for runtime services non-contiguously,
with preserved alignment on virtual addresses starting from -4G down
for a total max space of 64G.
• Documentation/x86/x86_64/mm.txt
->trampoline_pgd:
We map EFI runtime services in the aforementioned PGD in the
virtual range of 64Gb (arbitrarily set, can be raised if needed)
0xffffffef00000000 - 0xffffffff00000000
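The two addresses follow directly from "-4G down for 64G". A short Python check of that arithmetic, treating addresses as unsigned 64-bit values:

```python
GIB = 1 << 30
MASK64 = (1 << 64) - 1

top = (-4 * GIB) & MASK64            # "-4G" viewed as an unsigned 64-bit address
bottom = (top - 64 * GIB) & MASK64   # 64G further down

print(hex(top), hex(bottom), (top - bottom) // GIB)
```

So the window spans 0xffffffef00000000 up to 0xffffffff00000000, and every va value in the efi: memNN log lines of this talk falls inside it.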
Memory mapping of EFI runtime services (2)
• Virtual memory map x86_64 of runtime service – trampoline_pgd
[Figure: virtual address space from 0x0000000000000000 to 0xffffffffffffffff, with labels 0x00000000bb385000, 0x00000000bb3e5000, 0xffffffef00000000, 0xffffffff00000000, 4 G and 64 G; regions: Runtime Code, Runtime Data, Boot Code, Boot Data and 1:1 mapping workaround]
arch/x86/platform/efi/efi_64.c::efi_map_region()
Memory mapping of EFI runtime services (3)
• In -4G area:
[Figure: the 64 G window between 0xffffffef00000000 and 0xffffffff00000000 holding Runtime Code, Runtime Data, Boot Code and Boot Data regions, 2M-aligned]
arch/x86/platform/efi/efi_64.c::efi_map_region()
Should fix runtime services address after S4
• Lee, Chun-Yi [PATCH] x86_64/efi: Mapping Boot and
Runtime EFI memory regions to different starting virtual
address
• The VA of EFI runtime services may change across
hibernation, but that is fine as long as the PA does not
change.
• The EFI page table should be checked in more detail during
hibernation recovery.
Challenges
Hibernation's Challenge
• KASLR (Kernel address space layout randomization)
• Exclusive with hibernation
• Intel Rapid Start
• A replacement of kernel hibernation
• May also conflict with KASLR
• NVDIMM
• Do not need hibernation anymore
Q&A
SUSE is Hiring
Please search “SUSE Careers”
and
http://www.104.com.tw/
SUMMIT 2015
OPENSUSE ASIA
Taipei,R.O.C(Taiwan)
Bring you to the free world
Join us on:
www.opensuse.org

Mais conteúdo relacionado

Mais procurados

/proc/irq/&lt;irq>/smp_affinity
/proc/irq/&lt;irq>/smp_affinity/proc/irq/&lt;irq>/smp_affinity
/proc/irq/&lt;irq>/smp_affinity
Takuya ASADA
 

Mais procurados (20)

Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)Linux Memory Management with CMA (Contiguous Memory Allocator)
Linux Memory Management with CMA (Contiguous Memory Allocator)
 
spinlock.pdf
spinlock.pdfspinlock.pdf
spinlock.pdf
 
Linux Kernel Module - For NLKB
Linux Kernel Module - For NLKBLinux Kernel Module - For NLKB
Linux Kernel Module - For NLKB
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
The Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast StorageThe Linux Block Layer - Built for Fast Storage
The Linux Block Layer - Built for Fast Storage
 
qemu + gdb + sample_code: Run sample code in QEMU OS and observe Linux Kernel...
qemu + gdb + sample_code: Run sample code in QEMU OS and observe Linux Kernel...qemu + gdb + sample_code: Run sample code in QEMU OS and observe Linux Kernel...
qemu + gdb + sample_code: Run sample code in QEMU OS and observe Linux Kernel...
 
Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...
 
Linux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKBLinux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKB
 
Physical Memory Management.pdf
Physical Memory Management.pdfPhysical Memory Management.pdf
Physical Memory Management.pdf
 
Linux kernel debugging
Linux kernel debuggingLinux kernel debugging
Linux kernel debugging
 
Static partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-VStatic partitioning virtualization on RISC-V
Static partitioning virtualization on RISC-V
 
Launch the First Process in Linux System
Launch the First Process in Linux SystemLaunch the First Process in Linux System
Launch the First Process in Linux System
 
/proc/irq/&lt;irq>/smp_affinity
/proc/irq/&lt;irq>/smp_affinity/proc/irq/&lt;irq>/smp_affinity
/proc/irq/&lt;irq>/smp_affinity
 
New Ways to Find Latency in Linux Using Tracing
New Ways to Find Latency in Linux Using TracingNew Ways to Find Latency in Linux Using Tracing
New Ways to Find Latency in Linux Using Tracing
 
Linux Kernel - Virtual File System
Linux Kernel - Virtual File SystemLinux Kernel - Virtual File System
Linux Kernel - Virtual File System
 
Accelerating Virtual Machine Access with the Storage Performance Development ...
Accelerating Virtual Machine Access with the Storage Performance Development ...Accelerating Virtual Machine Access with the Storage Performance Development ...
Accelerating Virtual Machine Access with the Storage Performance Development ...
 
Memory management in Linux kernel
Memory management in Linux kernelMemory management in Linux kernel
Memory management in Linux kernel
 
Anatomy of the loadable kernel module (lkm)
Anatomy of the loadable kernel module (lkm)Anatomy of the loadable kernel module (lkm)
Anatomy of the loadable kernel module (lkm)
 
Reverse Mapping (rmap) in Linux Kernel
Reverse Mapping (rmap) in Linux KernelReverse Mapping (rmap) in Linux Kernel
Reverse Mapping (rmap) in Linux Kernel
 
NVMe Over Fabrics Support in Linux
NVMe Over Fabrics Support in LinuxNVMe Over Fabrics Support in Linux
NVMe Over Fabrics Support in Linux
 

Destaque

Comp tia flashcards set 1 (15 cards) acpi cmos
Comp tia flashcards set 1 (15 cards) acpi   cmosComp tia flashcards set 1 (15 cards) acpi   cmos
Comp tia flashcards set 1 (15 cards) acpi cmos
Sue Long Smith
 
Note - (EDK2) Acpi Tables Compile and Install
Note - (EDK2) Acpi Tables Compile and InstallNote - (EDK2) Acpi Tables Compile and Install
Note - (EDK2) Acpi Tables Compile and Install
boyw165
 

Destaque (17)

LCU13: ACPI power state mapping
LCU13: ACPI power state mappingLCU13: ACPI power state mapping
LCU13: ACPI power state mapping
 
Status update-qemu-pcie
Status update-qemu-pcieStatus update-qemu-pcie
Status update-qemu-pcie
 
70 271 Stu Chap07
70 271 Stu Chap0770 271 Stu Chap07
70 271 Stu Chap07
 
Extracting Linux kernel feature model changes with FMDiff
Extracting Linux kernel feature model changes with FMDiff Extracting Linux kernel feature model changes with FMDiff
Extracting Linux kernel feature model changes with FMDiff
 
Comp tia flashcards set 1 (15 cards) acpi cmos
Comp tia flashcards set 1 (15 cards) acpi   cmosComp tia flashcards set 1 (15 cards) acpi   cmos
Comp tia flashcards set 1 (15 cards) acpi cmos
 
Kernel Recipes 2015: Representing device-tree peripherals in ACPI
Kernel Recipes 2015: Representing device-tree peripherals in ACPIKernel Recipes 2015: Representing device-tree peripherals in ACPI
Kernel Recipes 2015: Representing device-tree peripherals in ACPI
 
Q2.12: Power Management Across OSs
Q2.12: Power Management Across OSsQ2.12: Power Management Across OSs
Q2.12: Power Management Across OSs
 
Note - (EDK2) Acpi Tables Compile and Install
Note - (EDK2) Acpi Tables Compile and InstallNote - (EDK2) Acpi Tables Compile and Install
Note - (EDK2) Acpi Tables Compile and Install
 
BIOS, Linux and Firmware Test Suite in-between
BIOS, Linux and  Firmware Test Suite in-betweenBIOS, Linux and  Firmware Test Suite in-between
BIOS, Linux and Firmware Test Suite in-between
 
Las16 200 - firmware summit - ras what is it- why do we need it
Las16 200 - firmware summit - ras what is it- why do we need itLas16 200 - firmware summit - ras what is it- why do we need it
Las16 200 - firmware summit - ras what is it- why do we need it
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)
 
Hardware Probing in the Linux Kernel
Hardware Probing in the Linux KernelHardware Probing in the Linux Kernel
Hardware Probing in the Linux Kernel
 
High Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux KernelHigh Performance Storage Devices in the Linux Kernel
High Performance Storage Devices in the Linux Kernel
 
Linux Initialization Process (2)
Linux Initialization Process (2)Linux Initialization Process (2)
Linux Initialization Process (2)
 
UEFI presentation
UEFI presentationUEFI presentation
UEFI presentation
 
Power aware operating system
Power aware operating systemPower aware operating system
Power aware operating system
 
BUD17-TR01: Philosophy of Open Source
BUD17-TR01: Philosophy of Open SourceBUD17-TR01: Philosophy of Open Source
BUD17-TR01: Philosophy of Open Source
 

Semelhante a The e820 trap of Linux kernel hibernation

Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Jagadisha Maiya
 
Cpu And Memory Events
Cpu And Memory EventsCpu And Memory Events
Cpu And Memory Events
Aero Plane
 

Semelhante a The e820 trap of Linux kernel hibernation (20)

Когда предрелизный не только софт
Когда предрелизный не только софтКогда предрелизный не только софт
Когда предрелизный не только софт
 
Troubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device DriversTroubleshooting Linux Kernel Modules And Device Drivers
Troubleshooting Linux Kernel Modules And Device Drivers
 
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
Troubleshooting linux-kernel-modules-and-device-drivers-1233050713693744-1
 
Operating Systems (slides)
Operating Systems (slides)Operating Systems (slides)
Operating Systems (slides)
 
My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)My First 100 days with an Exadata (PPT)
My First 100 days with an Exadata (PPT)
 
SFO15-TR9: PSCI, ACPI (and UEFI to boot)
SFO15-TR9: PSCI, ACPI (and UEFI to boot)SFO15-TR9: PSCI, ACPI (and UEFI to boot)
SFO15-TR9: PSCI, ACPI (and UEFI to boot)
 
Computer Organization and Architecture 10th - William Stallings, Ch01.pdf
Computer Organization and Architecture 10th - William Stallings, Ch01.pdfComputer Organization and Architecture 10th - William Stallings, Ch01.pdf
Computer Organization and Architecture 10th - William Stallings, Ch01.pdf
 
Advanced Root Cause Analysis
Advanced Root Cause AnalysisAdvanced Root Cause Analysis
Advanced Root Cause Analysis
 
Analisis_avanzado_vmware
Analisis_avanzado_vmwareAnalisis_avanzado_vmware
Analisis_avanzado_vmware
 
C C N A Day2
C C N A  Day2C C N A  Day2
C C N A Day2
 
Defense_Presentation
Defense_PresentationDefense_Presentation
Defense_Presentation
 
Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_Trusted firmware deep_dive_v1.0_
Trusted firmware deep_dive_v1.0_
 
Introduction to Modern U-Boot
Introduction to Modern U-BootIntroduction to Modern U-Boot
Introduction to Modern U-Boot
 
intel_x86_pm.pptx
intel_x86_pm.pptxintel_x86_pm.pptx
intel_x86_pm.pptx
 
linux-memory-explained.pdf
linux-memory-explained.pdflinux-memory-explained.pdf
linux-memory-explained.pdf
 
OS_Intro_Chap_1.ppt
OS_Intro_Chap_1.pptOS_Intro_Chap_1.ppt
OS_Intro_Chap_1.ppt
 
1 study of motherboard
1 study of motherboard1 study of motherboard
1 study of motherboard
 
How to-boot-linuxl-on-your-soc-boards
How to-boot-linuxl-on-your-soc-boardsHow to-boot-linuxl-on-your-soc-boards
How to-boot-linuxl-on-your-soc-boards
 
Cpu And Memory Events
Cpu And Memory EventsCpu And Memory Events
Cpu And Memory Events
 
Embedded Fest 2019. Руслан Биловол. Linux Boot: The Big Bang theory
Embedded Fest 2019. Руслан Биловол. Linux Boot: The Big Bang theoryEmbedded Fest 2019. Руслан Биловол. Linux Boot: The Big Bang theory
Embedded Fest 2019. Руслан Биловол. Linux Boot: The Big Bang theory
 

Último

Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 

Último (20)

Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 

The e820 trap of Linux kernel hibernation

  • 1. The e820 trap of Linux kernelThe e820 trap of Linux kernel hibernationhibernation AugAug, 2015, COSCUP 2015, Taipei, 2015, COSCUP 2015, Taipei Joey Lee, SUSE Labs Taipei
  • 2. 2 Agenda • Fundamental • Hibernation (suspen to disk) • e820, EFI memmap • e820 shift • Platform vs. Shutdown • memory size changing • EFI memmap shift • setup_data and nosave regions • EFI runtime services broken after S4 • Challenges • Q&A
  • 4. 4 Memory (physical) pfn = 0 pfn = max_pfn
  • 6. 6 Hibernation (suspend to disk) • Create snapshot image of runtime memory. • Store snapshot image to swap partition or file. • Restore snapshot image to memory.
  • 8. 8 Memory (physical) pfn = 0 pfn = max_pfn
  • 9. 9 Memory (BIOS memory map) 0 max_pfn 0 max_pfn Boot Boot
  • 10. 10 e820 • Wikipedia: • e820 is shorthand to refer to the facility by which the BIOS of x86-based computer systems reports the memory map to the operating system or boot loader. • It is accessed via the int 15h call, by setting the AX register to value E820 in hexadecimal. It reports which memory address ranges are usable and which are reserved for use by the BIOS.
  • 11. 11
  • 12. 12 e820 entry type Type Kernel Define String in dmesg Description Type 1 E820_RAM usable, System RAM Usable (normal) RAM Type 2 E820_RESERVED reserved, reserved Reserved - unusable Type 3 E820_ACPI ACPI data, ACPI Tables ACPI reclaimable memory Type 4 E820_NVS* ACPI NVS, ACPI Non-volatile Storage ACPI NVS memory, ACPI Non-Volatile-Sleeping Memory (NVS) Type 5 E820_UNUSABLE Unusable, Unusable memory Area containing bad memory * drivers/acpi/nvs.c::suspend_nvs_*() handle ACPI NVS for S4
  • 13. 13 Memory (BIOS memory map) 0 max_pfn 0 max_pfn Boot Boot
  • 14. 14 Memory (runtime) 0 max_pfn 0 max_pfn Boot ACPI NVS reserved ACPI data reserved Boot useable useable useable useable useable useable 0 max_pfn Boot ACPI NVS reserved ACPI data reserved useable useable useable useable useable useable OS
  • 15. 15 EFI memory map • EFI spec v2.5 • EFI_BOOT_SERVICES.GetMemoryMap() • Returns the current memory map. • 6.2 Memory Allocation Services • Table 25. Memory Type Usage before ExitBootServices() • Table 26. Memory Type Usage after ExitBootServices()
  • 16. 16
  • 17. 17 e820 entry type vs. EFI memory region type E820 Type E820 entry type EFI memory region type Type 1 E820_RAM EFI_LOADER_CODE (type 1) EFI_LOADER_DATA (type 2) EFI_BOOT_SERVICES_CODE (type 3) EFI_BOOT_SERVICES_DATA (type 4) EFI_CONVENTIONAL_MEMORY (type 7) Type 2 E820_RESERVED EFI_RESERVED_TYPE (type 0) EFI_RUNTIME_SERVICES_CODE (type 5) EFI_RUNTIME_SERVICES_DATA (type 6) EFI_MEMORY_MAPPED_IO (type 11) EFI_MEMORY_MAPPED_IO_PORT_SPACE (type 12) EFI_PAL_CODE (type 13) Type 3 E820_ACPI EFI_ACPI_RECLAIM_MEMORY (type 9) Type 4 E820_NVS EFI_ACPI_MEMORY_NVS (type 10) Type 5 E820_UNUSABLE EFI_UNUSABLE_MEMORY (type 8) New* E820_PMEM EFI_PERSISTENT_MEMORY (type 14) * v4.2-rc4 arch/x86/boot/compressed/eboot.c::setup_e820()
  • 19. 19
  • 20. 20
  • 22. 22 e820 shift (2) • Boot: • [ 0.000000] BIOS-e820: [mem 0x0000000068f45000-0x0000000069d4ffff] usable • Resume Boot: • [ 0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved • [ 0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff] • [ 17.410733] PM: Image loading progress: 0% • [ 17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000 • [ 17.933469] IP: [<ffffffff810a1cf0>] load_image_lzo+0x810/0xe40 • Page fault address is in usable memory entry when boot, but in reserved memory entry when resume boot.
  • 23. 23 e820 shift (3) 0 max_pfn Boot ACPI NVS reserved ACPI data reserved useable useable useable useable useable useable max_pfn Boot ACPI NVS reserved ACPI data reserved useable useable useable useable useable useable 0 Boot Resume Boot Useable address in reserved region
  • 24. Checking e820 shift
      • Lee, Chun-Yi: [PATCH] PM / hibernate: avoid unsafe pages in e820 reserved regions
          • Commit 84c91b7ae, in v3.17-rc1
          • https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=84c91b7
          • Reverted by commit f82daee49 in v4.0
          • Waiting for Yinghai Lu's “[PATCH] x86: Kill E820_RESERVED_KERN”
      • Lee, Chun-Yi: [PATCH] Hibernate: save e820 table to snapshot header for comparison
          • https://lkml.org/lkml/2014/8/11/166
  • 25. Platform vs. Shutdown (1)
      • Different modes of hibernation:
          • cat /sys/power/disk
            [platform] shutdown reboot suspend
      • Platform mode depends on _S4 support by the BIOS:
          [ 1.080004] ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [_S4_] (20130725/hwxface-571)
      • ACPI spec 6.0:
          • Table 7-234, BIOS-Supplied Control Methods for System-Level Functions
              • _S4: Package that defines system _S4 state mode.
          • 16.3.2 BIOS Initialization of Memory (since ACPI v1.0):
              • Note: The memory information returned from the system address map reporting interfaces should be the same before and after an S4 sleep. OSPM will invoke E820 interfaces on IA-PC-based legacy systems or the GetMemoryMap() interface on UEFI-enabled systems
  • 26. Platform vs. Shutdown (2)
      • Documentation/power/swsusp.txt in the kernel:
          • Q: What is the difference between "platform" and "shutdown"?
          • A: "platform" is actually right thing to do where supported, but "shutdown" is most reliable (except on ACPI systems).
      • Linux Kernel bug #77571:
          • https://bugzilla.kernel.org/show_bug.cgi?id=77571
          • The same page fault occurs when writing the snapshot image to the page buffer.
          • The bug reporter used “shutdown” rather than “platform”. After switching to “platform”, the reporter could no longer reproduce the issue.
      • It is better to use “platform” when the BIOS supports _S4. Users should be aware that “shutdown” carries a risk.
  • 27. Memory size mismatch (1)
      • Log of the failed resume:
          PM: Loading and decompressing image data (495448 pages)...
          [ 3.834831] PM: Image mismatch: memory size
          [ 3.834851] PM: Read 1981792 kbytes in 0.01 seconds (198179.20 MB/s)
          [ 3.836147] PM: Error -1 resuming
          [ 3.836162] PM: Failed to load hibernation image, recovering.
      • Normally:
          On node 0 totalpages: 4177255
        When the issue happened:
          On node 0 totalpages: 4177256 <== mismatch
      • How the number of physical pages is counted:
          for_each_online_node(nid)
              phys_pages += node_present_pages(nid);
      • kernel/power/snapshot.c::check_header()
          if (!reason && info->num_physpages != get_num_physpages())
              reason = "memory size";
          if (reason) {
              printk(KERN_ERR "PM: Image mismatch: %s\n", reason);
              return -EPERM;
          }
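The comparison in check_header() and the numbers in the log can be replayed with a small model. This is a sketch under stated assumptions: `check_memory_size` and `image_kbytes` are hypothetical names, the page size is assumed to be 4 KiB, and which of the two totalpages values was stored in the image is not stated on the slide.

```python
PAGE_KB = 4  # assumption: 4 KiB pages, as on x86

def image_kbytes(pages):
    """Size in kbytes of an image made of the given number of pages."""
    return pages * PAGE_KB

def check_memory_size(saved_pages, current_pages):
    """Mirror of the check_header() comparison: return a reason string on mismatch."""
    if saved_pages != current_pages:
        return "memory size"
    return None

# 495448 pages of 4 KiB match the "Read 1981792 kbytes" line in the log.
print(image_kbytes(495448))

# A difference of a single page between the two boots is enough to reject the image.
print(check_memory_size(4177255, 4177256))
```

The check is an exact equality test, so a one-page change in the firmware memory map between boots makes the whole image unusable.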
  • 28. Memory size mismatch (2) [Figure: memory map of Boot]
  • 29. Memory size mismatch (3) [Figure: memory map of Resume Boot]
  • 30. EFI memmap shift
  • 31. Misidentification of nosave region (1) [Figure labels: 1 page; in usable; not aligned; EFI_LOADER_DATA]
  • 32. setup_data and E820_RESERVED_KERN
      • setup_data: a linked list that carries data along with boot_params to later boot stages.
      • Allocated in the EFI stub, reserved via memblock and e820.
      • Yinghai Lu: [PATCH] x86, boot: clean up setup_data handling
          • https://lkml.org/lkml/2015/2/28/272
      • SETUP_E820_EXT, SETUP_EFI, SETUP_DTB, SETUP_PCI, SETUP_KASLR
      • These setup_data chunks are not page-aligned when allocated. That leaves holes between e820 entries, which the kernel then registers as 1-page nosave regions. <== random address per boot!
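The effect of a non-page-aligned chunk can be illustrated with a simplified model of the hole detection. This is only a sketch loosely modeled on the pfn rounding in arch/x86/kernel/e820.c; `nosave_holes` is a hypothetical name and the addresses are made up for illustration.

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT

def pfn_up(addr):
    """Round a byte address up to a page frame number."""
    return (addr + PAGE_SIZE - 1) >> PAGE_SHIFT

def pfn_down(addr):
    """Round a byte address down to a page frame number."""
    return addr >> PAGE_SHIFT

def nosave_holes(entries):
    """entries: sorted (start, size) byte ranges. Return (start_pfn, end_pfn) gaps."""
    holes = []
    pfn = 0
    for start, size in entries:
        if pfn < pfn_up(start):
            holes.append((pfn, pfn_up(start)))
        pfn = pfn_down(start + size)
    return holes

# Made-up layout: a chunk at [0x5010, 0x5100) is carved out of usable memory.
unaligned = [(0x0, 0x5010), (0x5100, 0x3f00)]
aligned = [(0x0, 0x5000), (0x5000, 0x4000)]

print(nosave_holes(unaligned))  # one hole covering pfn 5 only
print(nosave_holes(aligned))    # no holes
```

Because the chunk address differs on every boot, the resulting 1-page hole moves too, so the nosave regions of the image kernel and the resume kernel no longer match.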
  • 33. Misidentification of nosave region (2)
      • arch/x86/kernel/e820.c registers the hole between two e820 regions as a 1-page nosave region.
  • 34. Kill E820_RESERVED_KERN
      • Yinghai Lu: [PATCH] x86: Kill E820_RESERVED_KERN
          • https://lkml.org/lkml/2015/2/28/274
          • Cleans up the setup_data handling and removes E820_RESERVED_KERN from the e820 regions, because setup_data is already protected by memblock.
          • Avoids wasting memory and fixes the page alignment problem in e820.
      • Linux Kernel bug #96111: Unreliable hibernation on Lenovo X230
          • https://bugzilla.kernel.org/show_bug.cgi?id=96111
          • Commit 84c91b7ae in v3.17-rc1, reverted by commit f82daee49 in v4.0
          • Chen, Yu C: [RFC PATCH] PM / hibernate: make sure each resuming page is in current memory zones
          • Waiting for Yinghai Lu's patch to kill E820_RESERVED_KERN
  • 35. EFI runtime services broken after S4 (1)
      • On some machines
  • 36. EFI runtime services broken after S4 (2)
      • Resume boot: VA 0xfffffffefd244e60 is in the Runtime Data region after the hibernate resume:
          [ 0.125865] efi: mem26: [Runtime Data |RUN| | | | |WB|WT|WC|UC] pa=[0x00000000bb3e5000-0x00000000bb445000) va=[0xfffffffefd1e5000-0xfffffffefd245000) (0MB)
      • Boot: VA 0xfffffffefd244e60 was not mapped to any PA in the hibernating kernel (image kernel):
          [ 0.111002] efi: mem24: [Runtime Code |RUN| | | | |WB|WT|WC|UC] pa=[0x00000000bb385000-0x00000000bb3e5000) va=[0xfffffffefd585000-0xfffffffefd5e5000) (0MB)
          [ 0.125883] efi: mem25: [Runtime Data |RUN| | | | |WB|WT|WC|UC] pa=[0x00000000bb3e5000-0x00000000bb445000) va=[0xfffffffefd3e5000-0xfffffffefd445000) (0MB)
          [ 0.140764] efi: mem29: [Boot Data | | | | | |WB|WT|WC|UC] pa=[0x00000000bb7ff000-0x00000000bb800000) va=[0xfffffffefd1ff000-0xfffffffefd200000) (0MB)
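The two efi: log excerpts above can be compared mechanically. This is a minimal sketch: `find_region` is a hypothetical helper, the ranges are copied from the log lines, and the address is assumed to be 0xfffffffefd244e60 (16 hex digits), which is the reading that fits the quoted Runtime Data range.

```python
def find_region(regions, va):
    """Return the name of the region whose half-open VA range contains va."""
    for name, start, end in regions:
        if start <= va < end:
            return name
    return None

# VA ranges from the resume-boot and boot logs on the slide.
resume_boot = [("mem26 Runtime Data", 0xfffffffefd1e5000, 0xfffffffefd245000)]
boot = [
    ("mem24 Runtime Code", 0xfffffffefd585000, 0xfffffffefd5e5000),
    ("mem25 Runtime Data", 0xfffffffefd3e5000, 0xfffffffefd445000),
    ("mem29 Boot Data", 0xfffffffefd1ff000, 0xfffffffefd200000),
]

va = 0xfffffffefd244e60  # assumed full form of the address on the slide
print(find_region(resume_boot, va))
print(find_region(boot, va))

# The same physical Runtime Data region is mapped 2 MiB apart in the two boots.
shift = 0xfffffffefd3e5000 - 0xfffffffefd1e5000
print(hex(shift))
```

The physical range of the Runtime Data region is identical in both logs; only its virtual address differs, by exactly 2 MiB.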
  • 37. Memory mapping of EFI runtime services (1)
      • Borislav Petkov: [PATCH] EFI: Runtime services virtual mapping
          • Commit d2f7cbe7, merged since the v3.14 kernel
          • We map the EFI regions needed for runtime services non-contiguously, with preserved alignment on virtual addresses starting from -4G down for a total max space of 64G.
      • Documentation/x86/x86_64/mm.txt
          • ->trampoline_pgd: We map EFI runtime services in the aforementioned PGD in the virtual range of 64Gb (arbitrarily set, can be raised if needed)
            0xffffffef00000000 - 0xffffffff00000000
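The "-4G" and "64G" figures follow directly from the two addresses quoted from mm.txt. A short worked check, where the names `EFI_VA_START` and `EFI_VA_END` are used here only as labels for the top and bottom of the window:

```python
GIB = 1 << 30

EFI_VA_START = 0xffffffff00000000  # top of the window, mapping grows downward
EFI_VA_END = 0xffffffef00000000    # bottom of the window

# -4G expressed as an unsigned 64-bit address.
minus_4g = (1 << 64) - 4 * GIB

window_gib = (EFI_VA_START - EFI_VA_END) // GIB

print(hex(minus_4g), window_gib)

# Assumed example: an EFI runtime VA of this form lies inside the window.
va = 0xfffffffefd244e60
print(EFI_VA_END <= va < EFI_VA_START)
```

So the mapping starts at -4G (0xffffffff00000000) and may extend down by at most 64 GiB, to 0xffffffef00000000.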
  • 38. Memory mapping of EFI runtime services (2)
      • Virtual memory map (x86_64) of runtime services: trampoline_pgd
      • [Figure: address space from 0x0000000000000000 to 0xffffffffffffffff. Runtime Code (PA 0x00000000bb385000), Runtime Data (PA 0x00000000bb3e5000), Boot Code, and Boot Data are mapped into the 64 G window between 0xffffffef00000000 and 0xffffffff00000000 (-4 G), and also through a 1:1 mapping workaround.]
      • arch/x86/platform/efi/efi_64.c::efi_map_region()
  • 39. Memory mapping of EFI runtime services (3)
      • In the -4G area:
      • [Figure: Runtime Code, Runtime Data, Boot Code, and Boot Data regions placed in the 64 G window between 0xffffffef00000000 and 0xffffffff00000000, 2M-aligned.]
      • arch/x86/platform/efi/efi_64.c::efi_map_region()
  • 40. Should fix runtime services address after S4
      • Lee, Chun-Yi: [PATCH] x86_64/efi: Mapping Boot and Runtime EFI memory regions to different starting virtual address
      • The VA of EFI runtime services may change across hibernation, but that is fine as long as the PA does not change.
      • The EFI page table should be checked in more detail during hibernation recovery.
  • 42. Hibernation's Challenge
      • KASLR (Kernel address space layout randomization)
          • Exclusive with hibernation
      • Intel Rapid Start
          • A replacement for kernel hibernation
          • May also conflict with KASLR
      • NVDIMM
          • Hibernation is no longer needed
  • 44. SUSE is Hiring: please search “SUSE Careers” and http://www.104.com.tw/

Editor's Notes

  1. Theory Mathematics