Bug 5687 - [ANCK-5.10-15][rc1.1][anolis8.8][x86] Running the stress-ng io subsystem test for 12h triggers a crash: "Kernel panic - not syncing: Fatal exception"
Summary: [ANCK-5.10-15][rc1.1][anolis8.8][x86] Running the stress-ng io subsystem test for 12h triggers a crash: "Kernel p...
Status: REOPENED
Alias: None
Product: ANCK 5.10 Dev
Classification: ANCK
Component: general/others
Version: 5.10.y-15
Hardware: x86_64 Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: Ferry Meng
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-06-30 10:25 UTC by liuyaqing
Modified: 2023-07-18 14:21 UTC
CC List: 5 users

See Also:


Attachments
rc1.1 kernel triggers crash (9.93 MB, image/bmp)
2023-06-30 10:25 UTC, liuyaqing
vmcore-dmesg.txt (2.14 MB, text/plain)
2023-06-30 10:30 UTC, liuyaqing
Retest shows no anomalies (851.27 KB, image/jpeg)
2023-07-12 10:01 UTC, Ferry Meng

Description liuyaqing alibaba_cloud_group 2023-06-30 10:25:07 UTC
Created attachment 815
rc1.1 kernel triggers crash

Description of problem:
Running the stress-ng io subsystem test for 12h on the 5.10.134-15_rc1.1.an8.x86_64 kernel triggers a crash: Kernel panic - not syncing: Fatal exception

# cat /etc/os-release
NAME="Anolis OS"
VERSION="8.8"
ID="anolis"
ID_LIKE="rhel fedora centos"
VERSION_ID="8.8"
PLATFORM_ID="platform:an8"
PRETTY_NAME="Anolis OS 8.8"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"

# free -mh
              total        used        free      shared  buff/cache   available
Mem:          7.4Gi       254Mi       4.4Gi       1.0Mi       2.8Gi       6.9Gi
Swap:            0B          0B          0B



Version-Release number of selected component (if applicable):
# uname -r
5.10.134-15_rc1.1.an8.x86_64

# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-5.10.134-15_rc1.1.an8.x86_64 root=UUID=5430caa2-16ed-402b-afd3-f2e7f9baa552 ro cryptomgr.notests rcupdate.rcu_cpu_stall_timeout=300 vring_force_dma_api rhgb quiet biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200n8 noibrs nvme_core.io_timeout=4294967295 nvme_core.admin_timeout=4294967295 mem_encrypt=on kvm_amd.sev=1 kvm_amd.sev_es=1 cgroup.memory=nokmem crashkernel=0M-2G:0M,2G-8G:192M,8G-:256M

# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
BIOS Vendor ID:      Alibaba Cloud
CPU family:          25
Model:               17
Model name:          AMD EPYC 9T24 96-Core Processor
BIOS Model name:     pc-i440fx-2.1
Stepping:            1
CPU MHz:             3697.621
BogoMIPS:            5399.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            32768K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr rdpru wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm arch_capabilities

How reproducible:
Not yet reproduced

Steps to Reproduce:
1. Download the rc1.1 kernel, install it, and reboot so that the new kernel takes effect

2. git clone https://github.com/ColinIanKing/stress-ng.git && cd stress-ng && make && make install

3. echo 1 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/hardlockup_panic
echo 60 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo 3 > /sys/kernel/mm/transparent_hugepage/hugetext_enabled
echo 1 > /sys/kernel/mm/duptext/enabled
echo 1 > /sys/fs/cgroup/memory/memory.allow_duptext
echo 1 > /proc/sys/kernel/sched_group_identity_enabled
grubby --update-kernel=ALL --args="mem_encrypt=on kvm_amd.sev=1 kvm_amd.sev_es=1"
ulimit -s unlimited

4. systemd-run --unit=stresstest --slice=test nohup stress-ng -a 1 --class io -t 12h --metrics --vm-bytes 90% --vm-hang 10 --oomable --times --verify -v -Y /disk1/tmpdir/stress-ng/stress-statistic-12.yaml --log-file /disk1/tmpdir/stress-ng/stress-logfile-12.txt --temp-path /disk1/tmpdir/stress-ng &
systemctl set-property stresstest.service CPUQuota=80% MemoryLimit=4G
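
Before starting a 12h run, it can help to read back the files written in step 3 and confirm that each value took effect. The following is a minimal sketch, not part of the original report; the helper name `show_setting` is hypothetical, and the path list is copied from step 3.

```sh
#!/bin/sh
# Print "path = value" for one setting file, or mark it as missing.
# show_setting is a hypothetical helper name, not part of the original report.
show_setting() {
    if [ -r "$1" ]; then
        printf '%s = %s\n' "$1" "$(cat "$1")"
    else
        printf '%s = (missing or unreadable)\n' "$1"
    fi
}

# Read back every file that step 3 writes to.
for f in \
    /proc/sys/kernel/panic \
    /proc/sys/kernel/hardlockup_panic \
    /proc/sys/kernel/watchdog_thresh \
    /proc/sys/kernel/hung_task_timeout_secs \
    /proc/sys/kernel/hung_task_panic \
    /sys/kernel/mm/transparent_hugepage/hugetext_enabled \
    /sys/kernel/mm/duptext/enabled \
    /sys/fs/cgroup/memory/memory.allow_duptext \
    /proc/sys/kernel/sched_group_identity_enabled
do
    show_setting "$f"
done
```

A path reported as missing suggests that the running kernel does not expose that knob, so the corresponding `echo` in step 3 would not have applied.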

Actual results:
Running the io subsystem test for 12h triggers a crash: Kernel panic - not syncing: Fatal exception

Expected results:
The io subsystem test completes normally, with no crash, hang, or reboot

Additional info:
Comment 1 liuyaqing alibaba_cloud_group 2023-06-30 10:30:31 UTC
Created attachment 816
vmcore-dmesg.txt
Comment 2 Joseph Qi alibaba_cloud_group 2023-06-30 11:31:30 UTC
For the detailed analysis, see:
https://bugzilla.openanolis.cn/show_bug.cgi?id=5583

*** This bug has been marked as a duplicate of bug 5583 ***
Comment 3 shuancue alibaba_cloud_group 2023-07-07 16:54:48 UTC
Could Qi Jiang (齐江) please confirm whether this is the same class of problem as the soft lockup issue mentioned?
Comment 4 Joseph Qi alibaba_cloud_group 2023-07-07 17:14:43 UTC
(In reply to shuancue from comment #3)
> Could Qi Jiang (齐江) please confirm whether this is the same class of problem as the soft lockup issue mentioned?

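According to vmcore-dmesg, io_uring soft lockups had been triggering continuously before the crash, but because hung_task_panic was set to 0, they did not directly cause a crash.
The crash was ultimately triggered by: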
general protection fault, probably for non-canonical address 0xffff2a805d23e190: 0000 [#1] SMP NOPTI
[215547.686554] RIP: 0010:mntget+0x11/0x20

This needs further analysis.
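
The two observations in this comment can be checked against the attached vmcore-dmesg.txt with a short script. The sketch below is illustrative and not part of the original analysis: the function names are hypothetical, the search strings are assumed to match the kernel's usual "soft lockup" and "general protection fault" messages, and the address check assumes 4-level paging (48-bit virtual addresses), which is consistent with the lscpu flags above not listing la57.

```sh
#!/bin/sh
# Hypothetical helpers for triaging a saved kernel log; names are illustrative.

# Count soft lockup reports in a saved kernel log.
count_soft_lockups() {
    grep -c 'soft lockup' "$1" || true
}

# Print the first general protection fault line, if any.
first_gpf() {
    grep -m 1 'general protection fault' "$1" || true
}

# Classify a 64-bit address, assuming 48-bit virtual addresses:
# bits 63..47 must all be equal for the address to be canonical.
# Works on the hex string, so it does not depend on shell integer width.
is_canonical48() {
    hex=$(printf '%s' "${1#0x}" | tr 'ABCDEF' 'abcdef')
    hex=$(printf '%16s' "$hex" | tr ' ' '0')      # left-pad to 16 hex digits
    top=$(printf '%s' "$hex" | cut -c1-4)         # bits 63..48
    nib=$(printf '%s' "$hex" | cut -c5)           # bits 47..44
    case "$top$nib" in
        ffff[89abcdef]|0000[01234567]) echo canonical ;;
        *) echo non-canonical ;;
    esac
}

# The faulting address from the log: bits 63..48 are all ones, but
# bit 47 (top bit of the nibble "2") is zero.
is_canonical48 0xffff2a805d23e190   # prints "non-canonical"
```

Running `count_soft_lockups vmcore-dmesg.txt` and `first_gpf vmcore-dmesg.txt` would show how many soft lockup reports precede the fault. Note that whether a soft lockup itself panics is governed by kernel.softlockup_panic, a separate sysctl from hung_task_panic, and the reproduction steps do not set it.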
Comment 5 Ferry Meng alibaba_cloud_group 2023-07-12 10:01:30 UTC
Created attachment 845
Retest shows no anomalies

On the same machine, now running the 5.10.134-15_rc3.an8.x86_64 kernel.
The test was rerun twice, with no anomalies either time.