Created attachment 815 [details]
Crash triggered on the rc1.1 kernel

Description of problem:
Running the stress-ng io class test for 12h on the 5.10.134-15_rc1.1.an8.x86_64 kernel triggered a crash: Kernel panic - not syncing: Fatal exception

# cat /etc/os-release
NAME="Anolis OS"
VERSION="8.8"
ID="anolis"
ID_LIKE="rhel fedora centos"
VERSION_ID="8.8"
PLATFORM_ID="platform:an8"
PRETTY_NAME="Anolis OS 8.8"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"

# free -mh
              total        used        free      shared  buff/cache   available
Mem:          7.4Gi       254Mi       4.4Gi       1.0Mi       2.8Gi       6.9Gi
Swap:            0B          0B          0B

Version-Release number of selected component (if applicable):
# uname -r
5.10.134-15_rc1.1.an8.x86_64

# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-5.10.134-15_rc1.1.an8.x86_64 root=UUID=5430caa2-16ed-402b-afd3-f2e7f9baa552 ro cryptomgr.notests rcupdate.rcu_cpu_stall_timeout=300 vring_force_dma_api rhgb quiet biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200n8 noibrs nvme_core.io_timeout=4294967295 nvme_core.admin_timeout=4294967295 mem_encrypt=on kvm_amd.sev=1 kvm_amd.sev_es=1 cgroup.memory=nokmem crashkernel=0M-2G:0M,2G-8G:192M,8G-:256M

# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
BIOS Vendor ID: Alibaba Cloud
CPU family: 25
Model: 17
Model name: AMD EPYC 9T24 96-Core Processor
BIOS Model name: pc-i440fx-2.1
Stepping: 1
CPU MHz: 3697.621
BogoMIPS: 5399.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr rdpru wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm arch_capabilities

How reproducible:
To be reproduced

Steps to Reproduce:
1. Download the rc1.1 kernel, install it, and reboot so that the new kernel takes effect.
2. git clone https://github.com/ColinIanKing/stress-ng.git && cd stress-ng && make && make install
3. echo 1 > /proc/sys/kernel/panic
   echo 1 > /proc/sys/kernel/hardlockup_panic
   echo 60 > /proc/sys/kernel/watchdog_thresh
   echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
   echo 0 > /proc/sys/kernel/hung_task_panic
   echo 3 >/sys/kernel/mm/transparent_hugepage/hugetext_enabled
   echo 1 >/sys/kernel/mm/duptext/enabled
   echo 1 >/sys/fs/cgroup/memory/memory.allow_duptext
   echo 1 > /proc/sys/kernel/sched_group_identity_enabled
   grubby --update-kernel=ALL --args="mem_encrypt=on kvm_amd.sev=1 kvm_amd.sev_es=1"
   ulimit -s unlimited
4. systemd-run --unit=stresstest --slice=test nohup stress-ng -a 1 --class io -t 12h --metrics --vm-bytes 90% --vm-hang 10 --oomable --times --verify -v -Y /disk1/tmpdir/stress-ng/stress-statistic-12.yaml --log-file /disk1/tmpdir/stress-ng/stress-logfile-12.txt --temp-path /disk1/tmpdir/stress-ng &
   systemctl set-property stresstest.service CPUQuota=80% MemoryLimit=4G

Actual results:
The 12h io subsystem test run triggered a crash: Kernel panic - not syncing: Fatal exception

Expected results:
The io subsystem test completes normally, with no crash, hang, reboot, or similar failure.

Additional info:
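Because the run takes 12h, it may be worth confirming that the knobs written in step 3 actually took effect before starting step 4. The following is a minimal sketch, not part of the original report: the helper names check_knob and verify_step3_knobs are hypothetical, and only the plain-integer /proc/sys entries from step 3 are checked, since some sysfs files may read back in a different format than what was written.

```shell
# check_knob FILE EXPECTED
# Succeeds only if FILE is readable and its content equals EXPECTED.
# (Hypothetical helper for illustration; not from the original report.)
check_knob() {
    actual=$(cat "$1" 2>/dev/null) || { echo "MISSING $1"; return 1; }
    if [ "$actual" = "$2" ]; then
        echo "OK $1 = $actual"
    else
        echo "MISMATCH $1: expected $2, got $actual"
        return 1
    fi
}

# verify_step3_knobs
# Checks the /proc/sys values written in step 3 of the reproduction steps.
# Returns non-zero if any of them is missing or differs.
verify_step3_knobs() {
    rc=0
    check_knob /proc/sys/kernel/panic 1                     || rc=1
    check_knob /proc/sys/kernel/hardlockup_panic 1          || rc=1
    check_knob /proc/sys/kernel/watchdog_thresh 60          || rc=1
    check_knob /proc/sys/kernel/hung_task_timeout_secs 1200 || rc=1
    check_knob /proc/sys/kernel/hung_task_panic 0           || rc=1
    return $rc
}
```

Calling verify_step3_knobs as root after step 3 prints one line per knob and returns non-zero if any value is not what the steps set.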
Created attachment 816 [details] vmcore-dmesg.txt
For the detailed analysis, see: https://bugzilla.openanolis.cn/show_bug.cgi?id=5583

*** This bug has been marked as a duplicate of bug 5583 ***
Could 齐江 please confirm whether this is the same kind of issue as the soft lockup problem mentioned?
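(In reply to shuancue from comment #3)
> Could 齐江 please confirm whether this is the same kind of issue as the soft lockup problem mentioned?

According to vmcore-dmesg, io_uring soft lockups had been triggering continuously before the crash, but because hung_task_panic was set to 0, they did not crash the system directly.

What finally triggered the crash was:

general protection fault, probably for non-canonical address 0xffff2a805d23e190: 0000 [#1] SMP NOPTI
[215547.686554] RIP: 0010:mntget+0x11/0x20

This needs further analysis.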
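As a side note on the "non-canonical address" in the message above: with 48-bit virtual addressing (assumed here, since the lscpu flags in the description do not list la57), an address is canonical only if bits 63..47 are all equal. The sketch below checks this on the hex string; the function name is_canonical_48 is hypothetical and the check says nothing about why the pointer was corrupted.

```shell
# is_canonical_48 ADDR
# Succeeds if ADDR (hex, optional 0x prefix, at most 16 digits) is canonical
# under 48-bit virtual addressing, i.e. bits 63..47 are all 0 or all 1.
# (Hypothetical helper; assumes 4-level paging, not 5-level/la57.)
is_canonical_48() {
    hex=$(printf '%s' "${1#0x}" | tr 'A-F' 'a-f')
    # Left-pad to 16 hex digits so digit positions map to fixed bit ranges.
    while [ "${#hex}" -lt 16 ]; do hex="0$hex"; done
    case "$hex" in
        0000[0-7]*)   return 0 ;;  # lower half: bits 63..47 all 0
        ffff[89a-f]*) return 0 ;;  # upper half: bits 63..47 all 1
        *)            return 1 ;;
    esac
}

if is_canonical_48 0xffff2a805d23e190; then
    echo "canonical"
else
    echo "non-canonical"
fi
```

For 0xffff2a805d23e190 the top 16 bits are all 1 but the fifth hex digit is 2, so bit 47 is 0 and the address falls outside both canonical halves, which agrees with the kernel's message.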
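Created attachment 845 [details]
Retest shows no anomalies

On the same machine, now running the 5.10.134-15_rc3.an8.x86_64 kernel, the test was rerun twice and no anomalies were observed either time.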