How to debug OCFS2 hang problem
- L3 bug handling experience sharing
Gang He <ghe@suse.com>
Apr 26th, 2019
Understand the problem
Problem description
The customer has set up a new two-node SLES 11 SP4
cluster and is running application tests on it.
The file system periodically hangs and
processes get stuck in the "D" (uninterruptible sleep) state.
All processes stuck in the "D" state were in the ocfs2_cluster_lock code, for example:
[<ffffffffa066f800>] __ocfs2_cluster_lock+0x3b0/0xa60 [ocfs2]
[<ffffffffa0677528>] ocfs2_inode_lock_full_nested+0x178/0x510 [ocfs2]
[<ffffffffa06ec791>] ocfs2_get_acl+0x61/0x120 [ocfs2]
[<ffffffffa06ec95a>] ocfs2_acl_chmod+0x6a/0xe0 [ocfs2]
[<ffffffffa0681121>] ocfs2_setattr+0x671/0xab0 [ocfs2]
[<ffffffff8117de8e>] notify_change+0x17e/0x2d0
[<ffffffff8116136c>] sys_fchmodat+0xdc/0x150
[<ffffffff8147c187>] sysenter_dispatch+0x7/0x32
[<ffffffffffffffff>] 0xffffffffffffffff
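A minimal sketch of how stacks like the one above can be gathered on a live node, assuming a root shell and a kernel that exposes /proc/<pid>/stack. The helper name list_d_state is made up for illustration; it only filters `ps` output for states that begin with "D".

```shell
# list_d_state: read lines of "PID STAT COMMAND" on stdin and print
# "PID COMMAND" for every task in uninterruptible sleep (state D).
list_d_state() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Usage on a live node (needs root to read /proc/<pid>/stack):
#   ps -eo pid=,stat=,comm= | list_d_state | while read -r pid comm; do
#       echo "== $pid ($comm) =="
#       cat "/proc/$pid/stack"
#   done
```

If most of the printed stacks end in __ocfs2_cluster_lock, the hang is likely a cluster lock problem rather than a storage I/O stall.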
Interact with the customer
• Mail communication
Make sure the ocfs2 cluster setup is correct.
Understand the customer application scenarios.
Provide tentative suggestions/patches.
• Remote session with the customer
Reproduce the bug.
Find the ocfs2-related hung processes.
Collect the related data.
Collect data from the customer site
• supportconfig/hb_report
SLES HA cluster related data.
• dlm_tool
DLM lock related dump.
• o2image
OCFS2 file system meta-data image.
• echo "c" > /proc/sysrq-trigger
Linux core dump file.
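The collection steps above can be sketched as one helper. This is only an illustration: the names run, collect_ocfs2_data and DRY_RUN are hypothetical, the lockspace name is assumed to be one of those printed by `dlm_tool ls`, and the sysrq trigger is deliberately left out because it panics the node.

```shell
# run: execute a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# collect_ocfs2_data <dlm-lockspace> <ocfs2-device> <output-dir>
collect_ocfs2_data() {
    lockspace=$1
    device=$2
    outdir=$3
    run supportconfig                     # general SLES / HA data
    run dlm_tool ls                       # list the DLM lockspaces
    run dlm_tool lockdebug "$lockspace"   # lock dump of one lockspace
    run o2image "$device" "$outdir/ocfs2-meta.o2image"  # metadata-only image
}
```

Setting DRY_RUN=1 first prints the commands without running them, which is a cheap way to review them with the customer before the remote session.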
Generate core dump in HA cluster
• Why is no Linux core dump left after triggering a panic?
Because the fencing mechanism resets the machine while
Kdump is still writing the dump.
• Solutions
1) Use the stonith:fence_kdump resource agent;
refer to the SLE HA guide for more
details.
2) Disable the hardware watchdog and use the soft watchdog;
see the detailed steps on the next slide.
Use soft watchdog temporarily
• Disable hardware watchdog
Edit the /etc/modprobe.conf file and add two lines that prevent
the related kernel modules from loading. (Note: this step
depends on your machine's hardware watchdog
configuration.)
blacklist iTCO_wdt
blacklist iTCO_vendor_support
• Enable soft watchdog
Edit the /etc/init.d/boot.local file and add one line that loads
the soft watchdog kernel module at boot.
modprobe softdog
• Reboot the machine for the changes to take effect
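The two file edits above can be applied idempotently with a small helper. This is a sketch under assumptions: append_once and setup_soft_watchdog are hypothetical names, the optional root prefix exists only so the function can be tried against a scratch directory, and the blacklisted modules are the iTCO ones from this slide, which may not match your hardware.

```shell
# append_once <file> <line>: add the line only if it is not already there.
append_once() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# setup_soft_watchdog [root-prefix]: apply the edits from this slide.
setup_soft_watchdog() {
    root=${1:-}
    append_once "$root/etc/modprobe.conf" "blacklist iTCO_wdt"
    append_once "$root/etc/modprobe.conf" "blacklist iTCO_vendor_support"
    append_once "$root/etc/init.d/boot.local" "modprobe softdog"
}
```

Running setup_soft_watchdog with no argument as root edits the real files; a reboot is still needed afterwards.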
Analyze the problem
Prepare crash analysis environment
• Crash-setup
This tool can help you quickly set up a crash analysis environment on the L3 server based
on the vmcore file, but access from the Beijing site is very slow, and the HA-related
KMP debuginfo/debugsource rpms are missing.
• By yourself
Install the related debuginfo/debugsource rpms
kernel-default-3.0.101-108.68.1
kernel-default-devel-3.0.101-108.68.1
kernel-default-base-3.0.101-108.68.1
kernel-default-debugsource-3.0.101-108.68.1
kernel-default-debuginfo-3.0.101-108.68.1
ocfs2-kmp-default-1.6_3.0.101_63-0.23.40
ocfs2-debugsource-1.6-3.0.101_63-0.23.40
ocfs2-debuginfo-1.6-3.0.101_63-0.23.40
Basic crash analysis skills
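As a rough illustration of the kind of commands involved, the sketch below writes a short crash command file for one hung task. The helper name write_crash_cmds and the file paths are hypothetical; the commands themselves (mod -s, foreach UN bt, bt, files) are standard crash commands, but which ones are useful depends on the case.

```shell
# write_crash_cmds <pid>: print a crash command list for one hung task.
write_crash_cmds() {
    pid=$1
    cat <<EOF
mod -s ocfs2
foreach UN bt
bt $pid
files $pid
quit
EOF
}

# Usage (paths are placeholders):
#   write_crash_cmds 31017 > /tmp/crash.cmds
#   crash -i /tmp/crash.cmds <vmlinux-with-debuginfo> <vmcore>
```

Here mod -s loads the ocfs2 module symbols, foreach UN bt prints the stack of every uninterruptible task, and bt and files focus on a single PID.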
Verify the problematic directories/files
Analyze the hung processes - I
Analyze the hung processes - II
Check DLM lock dump
From the DLM lock dumps of the two nodes, we can see that
node04 (the master of this DLM lock resource) has granted a
PR meta lock on inode 14797221 (0xe1c9a5) to
one process.
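Lock dumps show the inode number in hex, so a conversion helper saves time when searching them. A minimal sketch; inode_to_hex is a made-up name, and the 16-digit zero padding is an assumption about how the number appears inside the lock resource name.

```shell
# inode_to_hex <inode-number>: 16-digit, zero-padded, lower-case hex.
inode_to_hex() {
    printf '%016x\n' "$1"
}

inode_to_hex 14797221   # prints 0000000000e1c9a5
```

The result can then be used as a search string, for example by piping a lock dump through grep "$(inode_to_hex 14797221)".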
Analyze the hung processes - III
Analyze the hung processes - IV
Analyze the hung processes - V
Analyze the hung processes - VI
Root cause
The root cause is process 31017: it had acquired
the DLM EX lock on the inode (14797222) in ocfs2_setattr(),
and then tried to acquire the DLM PR lock on the same inode again in
ocfs2_get_acl(). This recursive locking led
to a deadlock, which in turn blocked the related processes across
the cluster.
The fix patches are as follows:
commit 439a36b8ef38657f765b80b775e2885338d72451
Author: Eric Ren <zren@suse.com>
Date: Wed Feb 22 15:40:41 2017 -0800
ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock
commit b891fa5024a95c77e0d6fd6655cb74af6fb77f46
Author: Eric Ren <zren@suse.com>
Date: Wed Feb 22 15:40:44 2017 -0800
ocfs2: fix deadlock issue when taking inode lock at vfs entry points
commit 8818efaaacb78c60a9d90c5705b6c99b75d7d442
Author: Eric Ren <zren@suse.com>
Date: Fri Jun 23 15:08:55 2017 -0700
ocfs2: fix deadlock caused by recursive locking in xattr
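To see whether a given kernel-source checkout already carries these fixes, one option is to search the patch files for the upstream commit IDs. A sketch, assuming each backported patch records its upstream ID in a "Git-commit:" header line; the function name find_fix_patches is hypothetical.

```shell
# find_fix_patches <tree> <commit>...: for each upstream commit ID, print
# the first patch file that references it, or MISSING if none does.
find_fix_patches() {
    tree=$1
    shift
    for commit in "$@"; do
        match=$(grep -rl "^Git-commit: $commit" "$tree" 2>/dev/null | head -n 1) || true
        echo "$commit ${match:-MISSING}"
    done
}

# Usage (the checkout path is a placeholder):
#   find_fix_patches /path/to/kernel-source \
#       439a36b8ef38657f765b80b775e2885338d72451 \
#       b891fa5024a95c77e0d6fd6655cb74af6fb77f46 \
#       8818efaaacb78c60a9d90c5705b6c99b75d7d442
```

Any commit reported as MISSING is a candidate for backporting to the customer's branch.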
Solve the problem
The fix process
• Find kernel patches (from the upstream/yourself).
• Test the patches based on the customer version.
Pass the ocfs2 test suites.
• Create the fix branch.
e.g. origin/users/ghe/SLE12-SP4/bsc1128902
• L3 creates the corresponding PTF rpm.
• The customer verifies the PTF rpm.
• Submit the patches to the upstream if they are new.
• Add the patches to SUSE kernel-source.
• Close the bug in SUSE Bugzilla.
SUSE kernel source maintenance
• Kernel-source
url: user@kerncvs.suse.de:/home/git/kernel-source.git
Linux tarball plus lots of patches
• Kernel
url: git://kerncvs.suse.de/kernel.git
SUSE Linux kernel source (patches applied)
• Code branches for various SLES versions.
origin/SLE12-SP4
origin/SLE15-SP1
origin/SLE15-SP1-UPDATE
...
• Changes propagate automatically among branches.
http://kerncvs.suse.de/
Automatically propagate among branches
Add patch to SUSE kernel-source
• Format the patch from the Linus git tree
cd /torvalds
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git format-patch -1 commit-id
• Add three keywords to the patch, e.g.
Patch-mainline: v4.11-rc1
Git-commit: b891fa5024a95c77e0d6fd6655cb74af6fb77f46
References: bsc#1086695
Note: the patch must include at least one SUSE related e-mail address.
• Set LINUX_GIT environment variable
This variable points to your local Linus git directory, e.g. LINUX_GIT=/torvalds/linux
• Push the patch to SUSE kernel-source, e.g.
git checkout -b users/ghe/SLE12-SP2/for-next origin/SLE12-SP2
./scripts/git_sort/series_insert.py patches.fixes/ocfs2-try-to-reuse-extent-block-in-dealloc-without-m.patch
git add patches.fixes/ocfs2-try-to-reuse-extent-block-in-dealloc-without-m.patch
./scripts/log
git push -v ssh://ghe@kerncvs.suse.de/srv/git/kernel-source.git users/ghe/SLE12-SP2/for-next
• Reference
https://pes.suse.de/L3/Kernel_git_repositories/
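Before running series_insert.py, it can help to confirm that the patch file carries the three tags listed above. A minimal sketch; check_patch_tags is a hypothetical helper, not part of the kernel-source scripts, and it only checks that each tag is present, not that its value is valid.

```shell
# check_patch_tags <patch-file>: report any of the three required tags
# that are missing; return non-zero if at least one is missing.
check_patch_tags() {
    patch=$1
    status=0
    for tag in Patch-mainline Git-commit References; do
        if ! grep -q "^$tag: " "$patch"; then
            echo "missing tag: $tag"
            status=1
        fi
    done
    return $status
}

# Usage:
#   check_patch_tags patches.fixes/ocfs2-try-to-reuse-extent-block-in-dealloc-without-m.patch
```

The function prints nothing and returns 0 when all three tags are found.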
How to debug ocfs2 hang problem

Mais conteúdo relacionado

Mais procurados

Kernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power ManagementKernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power ManagementAnne Nicolas
 
CLUG 2010 09 - systemd - the new init system
CLUG 2010 09 - systemd - the new init systemCLUG 2010 09 - systemd - the new init system
CLUG 2010 09 - systemd - the new init systemPaulWay
 
1.3 runlevels, shutdown, and reboot v3
1.3 runlevels, shutdown, and reboot v31.3 runlevels, shutdown, and reboot v3
1.3 runlevels, shutdown, and reboot v3Acácio Oliveira
 
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...Anne Nicolas
 
101 1.3 runlevels, shutdown, and reboot v2
101 1.3 runlevels, shutdown, and reboot v2101 1.3 runlevels, shutdown, and reboot v2
101 1.3 runlevels, shutdown, and reboot v2Acácio Oliveira
 
101 1.3 runlevels , shutdown, and reboot
101 1.3 runlevels , shutdown, and reboot101 1.3 runlevels , shutdown, and reboot
101 1.3 runlevels , shutdown, and rebootAcácio Oliveira
 
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFUSENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFBrendan Gregg
 
On-Demand Image Resizing
On-Demand Image ResizingOn-Demand Image Resizing
On-Demand Image ResizingJonathan Lee
 
How to assign unowned disk in the netapp cluster 8.3
How to assign unowned disk in the netapp cluster 8.3 How to assign unowned disk in the netapp cluster 8.3
How to assign unowned disk in the netapp cluster 8.3 Saroj Sahu
 
Kernel Recipes 2019 - BPF at Facebook
Kernel Recipes 2019 - BPF at FacebookKernel Recipes 2019 - BPF at Facebook
Kernel Recipes 2019 - BPF at FacebookAnne Nicolas
 
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecksKernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecksAnne Nicolas
 
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedKernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedAnne Nicolas
 
BSides Denver: Stealthy, hypervisor-based malware analysis
BSides Denver: Stealthy, hypervisor-based malware analysisBSides Denver: Stealthy, hypervisor-based malware analysis
BSides Denver: Stealthy, hypervisor-based malware analysisTamas K Lengyel
 
Whitepaper MS SQL Server on Linux
Whitepaper MS SQL Server on LinuxWhitepaper MS SQL Server on Linux
Whitepaper MS SQL Server on LinuxRoger Eisentrager
 
HKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightHKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightLinaro
 

Mais procurados (20)

Systemd cheatsheet
Systemd cheatsheetSystemd cheatsheet
Systemd cheatsheet
 
Kernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power ManagementKernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power Management
 
CLUG 2010 09 - systemd - the new init system
CLUG 2010 09 - systemd - the new init systemCLUG 2010 09 - systemd - the new init system
CLUG 2010 09 - systemd - the new init system
 
1.3 runlevels, shutdown, and reboot v3
1.3 runlevels, shutdown, and reboot v31.3 runlevels, shutdown, and reboot v3
1.3 runlevels, shutdown, and reboot v3
 
SystemV vs systemd
SystemV vs systemdSystemV vs systemd
SystemV vs systemd
 
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...
Embedded Recipes 2018 - Finding sources of Latency In your system - Steven Ro...
 
101 1.3 runlevels, shutdown, and reboot v2
101 1.3 runlevels, shutdown, and reboot v2101 1.3 runlevels, shutdown, and reboot v2
101 1.3 runlevels, shutdown, and reboot v2
 
101 1.3 runlevels , shutdown, and reboot
101 1.3 runlevels , shutdown, and reboot101 1.3 runlevels , shutdown, and reboot
101 1.3 runlevels , shutdown, and reboot
 
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFUSENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
 
Rac introduction
Rac introductionRac introduction
Rac introduction
 
kdump: usage and_internals
kdump: usage and_internalskdump: usage and_internals
kdump: usage and_internals
 
On-Demand Image Resizing
On-Demand Image ResizingOn-Demand Image Resizing
On-Demand Image Resizing
 
How to assign unowned disk in the netapp cluster 8.3
How to assign unowned disk in the netapp cluster 8.3 How to assign unowned disk in the netapp cluster 8.3
How to assign unowned disk in the netapp cluster 8.3
 
Kernel Recipes 2019 - BPF at Facebook
Kernel Recipes 2019 - BPF at FacebookKernel Recipes 2019 - BPF at Facebook
Kernel Recipes 2019 - BPF at Facebook
 
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecksKernel Recipes 2015: Solving the Linux storage scalability bottlenecks
Kernel Recipes 2015: Solving the Linux storage scalability bottlenecks
 
First steps on CentOs7
First steps on CentOs7First steps on CentOs7
First steps on CentOs7
 
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedKernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
 
BSides Denver: Stealthy, hypervisor-based malware analysis
BSides Denver: Stealthy, hypervisor-based malware analysisBSides Denver: Stealthy, hypervisor-based malware analysis
BSides Denver: Stealthy, hypervisor-based malware analysis
 
Whitepaper MS SQL Server on Linux
Whitepaper MS SQL Server on LinuxWhitepaper MS SQL Server on Linux
Whitepaper MS SQL Server on Linux
 
HKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with CoresightHKG18-TR14 - Postmortem Debugging with Coresight
HKG18-TR14 - Postmortem Debugging with Coresight
 

Semelhante a How to debug ocfs2 hang problem

Containers with systemd-nspawn
Containers with systemd-nspawnContainers with systemd-nspawn
Containers with systemd-nspawnGábor Nyers
 
Discoverer 11.1.1.7 web logic (10.3.6) & ebs r12 12.1.3) implementation guide...
Discoverer 11.1.1.7 web logic (10.3.6) & ebs r12 12.1.3) implementation guide...Discoverer 11.1.1.7 web logic (10.3.6) & ebs r12 12.1.3) implementation guide...
Discoverer 11.1.1.7 web logic (10.3.6) & ebs r12 12.1.3) implementation guide...ginniapps
 
Crash_Report_Mechanism_In_Tizen
Crash_Report_Mechanism_In_TizenCrash_Report_Mechanism_In_Tizen
Crash_Report_Mechanism_In_TizenLex Yu
 
Debugging linux issues with eBPF
Debugging linux issues with eBPFDebugging linux issues with eBPF
Debugging linux issues with eBPFIvan Babrou
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsHisaki Ohara
 
LCU14 114- Upstreaming 201
LCU14 114- Upstreaming 201LCU14 114- Upstreaming 201
LCU14 114- Upstreaming 201Linaro
 
Advanced Root Cause Analysis
Advanced Root Cause AnalysisAdvanced Root Cause Analysis
Advanced Root Cause AnalysisEric Sloof
 
How to-mount-3 par-san-virtual-copy-onto-rhel-servers-by-Dusan-Baljevic
How to-mount-3 par-san-virtual-copy-onto-rhel-servers-by-Dusan-BaljevicHow to-mount-3 par-san-virtual-copy-onto-rhel-servers-by-Dusan-Baljevic
How to-mount-3 par-san-virtual-copy-onto-rhel-servers-by-Dusan-BaljevicCircling Cycle
 
Reverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemReverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemCyber Security Alliance
 
Docker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in PragueDocker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in Praguetomasbart
 
My old security advisories on HMI/SCADA and industrial software released betw...
My old security advisories on HMI/SCADA and industrial software released betw...My old security advisories on HMI/SCADA and industrial software released betw...
My old security advisories on HMI/SCADA and industrial software released betw...Luigi Auriemma
 
Armboot process zeelogic
Armboot process zeelogicArmboot process zeelogic
Armboot process zeelogicAleem Shariff
 
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0Yury Velikanov
 
Network Automation Tools
Network Automation ToolsNetwork Automation Tools
Network Automation ToolsEdwin Beekman
 
Powervc upgrade from_1.3.0.2_to_1.3.2.0
Powervc upgrade from_1.3.0.2_to_1.3.2.0Powervc upgrade from_1.3.0.2_to_1.3.2.0
Powervc upgrade from_1.3.0.2_to_1.3.2.0Gobinath Panchavarnam
 
NFD9 - Matt Peterson, Data Center Operations
NFD9 - Matt Peterson, Data Center OperationsNFD9 - Matt Peterson, Data Center Operations
NFD9 - Matt Peterson, Data Center OperationsCumulus Networks
 
图文详解安装Net backup 6.5备份恢复oracle 10g rac 数据库
图文详解安装Net backup 6.5备份恢复oracle 10g rac 数据库图文详解安装Net backup 6.5备份恢复oracle 10g rac 数据库
图文详解安装Net backup 6.5备份恢复oracle 10g rac 数据库maclean liu
 
hacking-embedded-devices.pptx
hacking-embedded-devices.pptxhacking-embedded-devices.pptx
hacking-embedded-devices.pptxssuserfcf43f
 

Semelhante a How to debug ocfs2 hang problem (20)

Rac 12c optimization
Rac 12c optimizationRac 12c optimization
Rac 12c optimization
 
Containers with systemd-nspawn
Containers with systemd-nspawnContainers with systemd-nspawn
Containers with systemd-nspawn
 
Discoverer 11.1.1.7 web logic (10.3.6) & ebs r12 12.1.3) implementation guide...
Discoverer 11.1.1.7 web logic (10.3.6) & ebs r12 12.1.3) implementation guide...Discoverer 11.1.1.7 web logic (10.3.6) & ebs r12 12.1.3) implementation guide...
Discoverer 11.1.1.7 web logic (10.3.6) & ebs r12 12.1.3) implementation guide...
 
Crash_Report_Mechanism_In_Tizen
Crash_Report_Mechanism_In_TizenCrash_Report_Mechanism_In_Tizen
Crash_Report_Mechanism_In_Tizen
 
Debugging linux issues with eBPF
Debugging linux issues with eBPFDebugging linux issues with eBPF
Debugging linux issues with eBPF
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructions
 
LCU14 114- Upstreaming 201
LCU14 114- Upstreaming 201LCU14 114- Upstreaming 201
LCU14 114- Upstreaming 201
 
Advanced Root Cause Analysis
Advanced Root Cause AnalysisAdvanced Root Cause Analysis
Advanced Root Cause Analysis
 
Analisis_avanzado_vmware
Analisis_avanzado_vmwareAnalisis_avanzado_vmware
Analisis_avanzado_vmware
 
How to-mount-3 par-san-virtual-copy-onto-rhel-servers-by-Dusan-Baljevic
How to-mount-3 par-san-virtual-copy-onto-rhel-servers-by-Dusan-BaljevicHow to-mount-3 par-san-virtual-copy-onto-rhel-servers-by-Dusan-Baljevic
How to-mount-3 par-san-virtual-copy-onto-rhel-servers-by-Dusan-Baljevic
 
Reverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemReverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande Modem
 
Docker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in PragueDocker and friends at Linux Days 2014 in Prague
Docker and friends at Linux Days 2014 in Prague
 
My old security advisories on HMI/SCADA and industrial software released betw...
My old security advisories on HMI/SCADA and industrial software released betw...My old security advisories on HMI/SCADA and industrial software released betw...
My old security advisories on HMI/SCADA and industrial software released betw...
 
Armboot process zeelogic
Armboot process zeelogicArmboot process zeelogic
Armboot process zeelogic
 
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0
 
Network Automation Tools
Network Automation ToolsNetwork Automation Tools
Network Automation Tools
 
Powervc upgrade from_1.3.0.2_to_1.3.2.0
Powervc upgrade from_1.3.0.2_to_1.3.2.0Powervc upgrade from_1.3.0.2_to_1.3.2.0
Powervc upgrade from_1.3.0.2_to_1.3.2.0
 
NFD9 - Matt Peterson, Data Center Operations
NFD9 - Matt Peterson, Data Center OperationsNFD9 - Matt Peterson, Data Center Operations
NFD9 - Matt Peterson, Data Center Operations
 
图文详解安装Net backup 6.5备份恢复oracle 10g rac 数据库
图文详解安装Net backup 6.5备份恢复oracle 10g rac 数据库图文详解安装Net backup 6.5备份恢复oracle 10g rac 数据库
图文详解安装Net backup 6.5备份恢复oracle 10g rac 数据库
 
hacking-embedded-devices.pptx
hacking-embedded-devices.pptxhacking-embedded-devices.pptx
hacking-embedded-devices.pptx
 

Último

How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 

Último (20)

How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 

How to debug ocfs2 hang problem

  • 1. How to debug OCFS2 hang problem - L3 bug handling experience sharing Gang He <ghe@suse.com> Apr 26th, 2019
  • 3. 3 Problem description The customer has setup a new SLES11sp4 2 node cluster and is running some application tests on it, they see the file system periodically hangs up and processes get into a "D" state. All processes stuck in "D" state were in the ocfs2_cluster_lock code. for example, [<ffffffffa066f800>] __ocfs2_cluster_lock+0x3b0/0xa60 [ocfs2] [<ffffffffa0677528>] ocfs2_inode_lock_full_nested+0x178/0x510 [ocfs2] [<ffffffffa06ec791>] ocfs2_get_acl+0x61/0x120 [ocfs2] [<ffffffffa06ec95a>] ocfs2_acl_chmod+0x6a/0xe0 [ocfs2] [<ffffffffa0681121>] ocfs2_setattr+0x671/0xab0 [ocfs2] [<ffffffff8117de8e>] notify_change+0x17e/0x2d0 [<ffffffff8116136c>] sys_fchmodat+0xdc/0x150 [<ffffffff8147c187>] sysenter_dispatch+0x7/0x32 [<ffffffffffffffff>] 0xffffffffffffffff
  • 4. 4 Interact with the customer • Mail communication Make sure the ocfs2 cluster setup is correct. Understand the customer application scenarios. Provide tentative suggestions/patches. • Remote session with the customer Reproduce bug. Find ocfs2 related hung processes. Collect the related data.
  • 5. 5 Collect data from the customer site • supportconfig/hb_report SLES HA cluster related data. • dlm_tool DLM lock related dump. • o2image OCFS2 file system meta-data image. • echo "c" > /proc/sysrq-trigger Linux core dump file.
  • 6. 6 Generate core dump in HA cluster • Why is no Linux core dump left after trigger panic? Since the fence mechanism resets the machine when it is doing the Kdump. • Solutions 1) use stonith:fence_kdump resource agent please refer to SLE-HA-guide document for more details. 2) disable hardware watchdog and use soft watchdog see the detailed steps on the next page.
  • 7. Use soft watchdog temporarily
    • Disable hardware watchdog
      Edit the /etc/modprobe.conf file to add two lines that disable loading the related kernel modules. (Note: this step depends on your machine's hardware watchdog configuration.)
        blacklist iTCO_wdt
        blacklist iTCO_vendor_support
    • Enable soft watchdog
      Edit the /etc/init.d/boot.local file to add one line that loads the soft watchdog kernel module at boot.
        modprobe softdog
    • Reboot the machine for the changes to take effect
  • 9. Prepare crash analysis environment
    • Crash-setup
      This tool can quickly set up a crash analysis environment on the L3 server based on the vmcore file, but access from the Beijing site is very slow, and the HA related KMP debuginfo/debugsource rpms are missing.
    • By yourself
      Install the related debuginfo/debugsource rpms:
        kernel-default-3.0.101-108.68.1
        kernel-default-devel-3.0.101-108.68.1
        kernel-default-base-3.0.101-108.68.1
        kernel-default-debugsource-3.0.101-108.68.1
        kernel-default-debuginfo-3.0.101-108.68.1
        ocfs2-kmp-default-1.6_3.0.101_63-0.23.40
        ocfs2-debugsource-1.6-3.0.101_63-0.23.40
        ocfs2-debuginfo-1.6-3.0.101_63-0.23.40
  • 11. Verify the problematic directories/files
  • 12. Analyze the hung processes - I
  • 13. Analyze the hung processes - II
  • 14. Check DLM lock dump
    From the DLM lock dumps of the two nodes, we can see that node04 (the master of this DLM lock resource) has granted a PR Meta lock on inode 14797221 (0xe1c9a5) to one process.
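The decimal inode number and the hex value quoted above are the same number, and converting between them is handy when searching a lock dump. A minimal sketch follows; `inode_to_hex` and `inode_to_lock_field` are hypothetical helper names, and the 16-digit zero-padded form assumes that OCFS2 lock resource names embed the block number that way, which should be checked against the actual dump.

```shell
# inode_to_hex: print a decimal inode number in 0x-prefixed hex.
inode_to_hex() {
    printf '0x%x\n' "$1"
}

# inode_to_lock_field: print the inode number as 16 zero-padded hex digits.
# Assumption: this matches the block-number field inside the lock name.
inode_to_lock_field() {
    printf '%016x\n' "$1"
}

inode_to_hex 14797221          # prints 0xe1c9a5
inode_to_lock_field 14797221   # prints 0000000000e1c9a5
```

The second form can then be used as a `grep` pattern on the lock dump files collected from both nodes.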
  • 15. Analyze the hung processes - III
  • 16. Analyze the hung processes - IV
  • 17. Analyze the hung processes - V
  • 18. Analyze the hung processes - VI
  • 19. Root cause
    The root cause is process 31017, which had taken the DLM EX lock on the inode (14797222) in ocfs2_setattr() and then tried to take the DLM PR lock on the same inode again in ocfs2_get_acl(). This recursive locking led to a deadlock, which in turn blocked the related processes across the cluster.
    The fix patches are as below:
      commit 439a36b8ef38657f765b80b775e2885338d72451
      Author: Eric Ren <zren@suse.com>
      Date: Wed Feb 22 15:40:41 2017 -0800
          ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock
      commit b891fa5024a95c77e0d6fd6655cb74af6fb77f46
      Author: Eric Ren <zren@suse.com>
      Date: Wed Feb 22 15:40:44 2017 -0800
          ocfs2: fix deadlock issue when taking inode lock at vfs entry points
      commit 8818efaaacb78c60a9d90c5705b6c99b75d7d442
      Author: Eric Ren <zren@suse.com>
      Date: Fri Jun 23 15:08:55 2017 -0700
          ocfs2: fix deadlock caused by recursive locking in xattr
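To see whether a given kernel-source checkout already carries these three fixes, one option is to search the patch files for the Git-commit tag described on the "Add patch to SUSE kernel-source" slide. The sketch below is only an illustration: `has_fix` and `check_fixes` are hypothetical helper names, and it assumes every backported patch records its upstream commit in a `Git-commit:` line.

```shell
# has_fix DIR COMMIT: succeed if any file under DIR has a "Git-commit: COMMIT" line.
has_fix() {
    grep -rqs "^Git-commit: $2" "$1"
}

# check_fixes DIR: report which of the three upstream fixes are tagged under DIR.
check_fixes() {
    for c in 439a36b8ef38657f765b80b775e2885338d72451 \
             b891fa5024a95c77e0d6fd6655cb74af6fb77f46 \
             8818efaaacb78c60a9d90c5705b6c99b75d7d442; do
        if has_fix "$1" "$c"; then
            echo "present: $c"
        else
            echo "missing: $c"
        fi
    done
}
```

For example, `check_fixes patches.fixes` run from the top of a kernel-source branch checkout would list each commit as present or missing on that branch.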
  • 21. The fix process
    • Find kernel patches (from upstream or write them yourself).
    • Test the patches based on the customer version; they must pass the ocfs2 test suites.
    • Create the fix branch, e.g. origin/users/ghe/SLE12-SP4/bsc1128902
    • L3 creates the corresponding PTF rpm.
    • The customer verifies the PTF rpm.
    • Submit the patches to upstream if they are new.
    • Add the patches to SUSE kernel-source.
    • Close the bug in SUSE bugzilla.
  • 22. SUSE kernel source maintenance
    • Kernel-source url: user@kerncvs.suse.de:/home/git/kernel-source.git
      Linux tarball plus lots of patches
    • Kernel url: git://kerncvs.suse.de/kernel.git
      SUSE Linux kernel source (patches applied)
    • Code branches for various SLES versions.
      origin/SLE12-SP4
      origin/SLE15-SP1
      origin/SLE15-SP1-UPDATE
      ...
    • Automatically propagate among branches.
      http://kerncvs.suse.de/
  • 24. Add patch to SUSE kernel-source
    • Format patch from the Linus git
      cd /torvalds
      git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
      git format-patch commit-id -1
    • Add three keywords to the patch, e.g.
      Patch-mainline: v4.11-rc1
      Git-commit: b891fa5024a95c77e0d6fd6655cb74af6fb77f46
      References: bsc#1086695
      Note: the patch must include at least one SUSE related e-mail address.
    • Set LINUX_GIT environment variable
      This variable points to your local Linus git directory, e.g.
      LINUX_GIT=/torvalds/linux
    • Push the patch to SUSE kernel-source, e.g.
      git checkout -b users/ghe/SLE12-SP2/for-next origin/SLE12-SP2
      ./scripts/git_sort/series_insert.py patches.fixes/ocfs2-try-to-reuse-extent-block-in-dealloc-without-m.patch
      git add patches.fixes/ocfs2-try-to-reuse-extent-block-in-dealloc-without-m.patch
      ./scripts/log
      git push -v ssh://ghe@kerncvs.suse.de/srv/git/kernel-source.git users/ghe/SLE12-SP2/for-next
    • Reference
      https://pes.suse.de/L3/Kernel_git_repositories/
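Adding the three keywords by hand is easy to get wrong, so a small helper can do it. The sketch below is a hypothetical convenience function, not part of the kernel-source scripts: `add_suse_tags` is an invented name, and it assumes the tags belong at the end of the mail-style header block, that is, just before the first blank line of a `git format-patch` file.

```shell
# add_suse_tags PATCH MAINLINE COMMIT REFS
# Insert the three SUSE tags just before the first blank line of PATCH.
# Assumption: PATCH is a git format-patch file whose header block ends
# at the first empty line; a file without one is left unchanged.
add_suse_tags() {
    patch_file=$1
    tmp=$(mktemp) || return 1
    awk -v m="$2" -v c="$3" -v r="$4" '
        !inserted && /^$/ {
            print "Patch-mainline: " m
            print "Git-commit: " c
            print "References: " r
            inserted = 1
        }
        { print }
    ' "$patch_file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back in place so the original file permissions are kept.
    cat "$tmp" > "$patch_file"
    rm -f "$tmp"
}
```

A call such as `add_suse_tags 0001-some-fix.patch v4.11-rc1 b891fa5024a95c77e0d6fd6655cb74af6fb77f46 'bsc#1086695'` (the file name is a placeholder) would produce the header shown in the example on this slide; the result should still be reviewed before running series_insert.py.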