[Defect description]:
While running the stress-ng class stress test, the kernel crashed with "Kernel panic - not syncing: Fatal hardware error!"
The same problem was seen earlier on a Yitian machine running the ali6000 kernel, where it was diagnosed as a hardware fault rather than a kernel bug. It is recorded here for reference.

Excerpt from vmcore-dmesg:

      KERNEL: /usr/lib/debug/usr/lib/modules/6.6.71-3_rc2.an23.aarch64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 128 [OFFLINE: 5]
        DATE: Fri Mar 14 20:46:14 CST 2025
      UPTIME: 4 days, 03:58:40
LOAD AVERAGE: 87.64, 86.98, 86.94
       TASKS: 1793
    NODENAME: 16f5Lab15
     RELEASE: 6.6.71-3_rc2.an23.aarch64
     VERSION: #1 SMP PREEMPT_DYNAMIC Fri Mar 7 12:23:12 CST 2025
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 128 GB
       PANIC: "Kernel panic - not syncing: Fatal hardware error!"
         PID: 40160
     COMMAND: "read_all"
        TASK: ffff00081173d000  [THREAD_INFO: ffff00081173d000]
         CPU: 46
       STATE: TASK_RUNNING (PANIC)

[359921.405094] Call trace:
[359921.405094]  machine_kexec+0x40/0x200
[359921.405096]  __crash_kexec+0x70/0xd8
[359921.405099]  panic+0x308/0x388
[359921.405102]  __ghes_panic+0x7c/0x88
[359921.405104]  ghes_in_nmi_queue_one_entry+0x404/0x468
[359921.405106]  ghes_sdei_critical_callback+0x34/0x70
[359921.405108]  sdei_event_handler+0x24/0x98
[359921.405110]  do_sdei_event+0x88/0x170
[359921.405112]  __sdei_handler+0x54/0x208
[359921.405113]  __sdei_asm_handler+0xe8/0x188
[359921.405115]  pci_get_rom_size+0x44/0x1b8
[359921.405117]  pci_map_rom+0xa8/0x170
[359921.405119]  pci_read_rom+0x50/0xf8
[359921.405121]  sysfs_kf_bin_read+0x70/0x98
[359921.405123]  kernfs_file_read_iter+0x98/0x198
[359921.405124]  kernfs_fop_read_iter+0x2c/0x48
[359921.405125]  vfs_read+0x200/0x2c0
[359921.405128]  ksys_read+0x70/0x108
[359921.405130]  __arm64_sys_read+0x20/0x30
[359921.405132]  el0_svc_common.constprop.0+0x60/0x138
[359921.405134]  do_el0_svc+0x20/0x30
[359921.405136]  el0_svc+0x44/0x1a8
[359921.405137]  el0t_64_sync_handler+0xf8/0x128
[359921.405139]  el0t_64_sync+0x17c/0x180
[359921.405139] ---[ end trace 0000000000000000 ]---
[359921.405140] Bye!

[Reproduction rate]
Seen only once so far.

[Reproduction environment]
Kernel: 6.6.71-3_rc2.an23.aarch64

# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.6.71-3_rc1.an23.aarch64 root=UUID=bedec06f-d570-431d-bce1-749030567aeb ro rhgb selinux=0 console=tty0 cgroup.memory=nokmem iommu.passthrough=1 iommu.strict=0 nospectre_bhb ssbd=force-off no_hash_pointers crashkernel=0M-2G:0M,2G-64G:256M,64G-:512M

# cat /etc/os-release
NAME="Anolis OS"
VERSION="23.2"
ID="anolis"
VERSION_ID="23.2"
PLATFORM_ID="platform:an23"
PRETTY_NAME="Anolis OS 23.2"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
BUG_REPORT_URL="https://bugzilla.openanolis.cn/"

Memory information:
# free -h
               total        used        free      shared  buff/cache   available
Mem:           7.3Gi       290Mi       7.0Gi       716Ki       231Mi       7.0Gi
Swap:             0B          0B          0B

CPU information:
# lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                ARM
  BIOS Vendor ID:         Alibaba Cloud
  Model name:             Neoverse-N2
    BIOS Model name:      virt-rhel7.6.0  CPU @ 3.0GHz
    BIOS CPU family:      1
    Model:                0
    Thread(s) per core:   1
    Core(s) per socket:   2
    Socket(s):            1
    Stepping:             r0p0
    Frequency boost:      disabled
    CPU(s) scaling MHz:   100%
    CPU max MHz:          3000.0000
    CPU min MHz:          3000.0000
    BogoMIPS:             100.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh
Caches (sum of all):
  L1d:                    128 KiB (2 instances)
  L1i:                    128 KiB (2 instances)
  L2:                     2 MiB (2 instances)
  L3:                     64 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0,1
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Vulnerable
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Mitigation; CSV2, but not BHB
  Srbds:                  Not affected
  Tsx async abort:        Not affected

[Steps to reproduce]:
1. Install the test kernel and reboot.
2. Download and build the test suite, configure the environment, then start the test:

# Download and build the test suite
git clone http://code.alibaba-inc.com/alikernel/ltp.git
export CFLAGS="-fcommon"   # required with gcc 10
make autotools
./configure
make
make install

# Environment settings
echo 1 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/hardlockup_panic
echo 1 > /proc/sys/kernel/softlockup_panic
echo 60 > /proc/sys/kernel/watchdog_thresh
echo 150 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo '0 4 0 7' > /proc/sys/kernel/printk
echo 1 > /proc/sys/kernel/sched_group_balancer

# Prepare the test script
# 'EOF' is quoted so that $(...) and $((...)) are evaluated when load.sh runs,
# not while the file is being written (where nr_cpu would still be unset).
cat <<-'EOF' > /opt/ltp/load.sh
#!/bin/bash
nr_cpu=$(nproc)
mem_kb=$(grep ^MemTotal /proc/meminfo | awk '{print $2}')
./runltp \
-c $((nr_cpu / 2)) \
-m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 \
-D $((nr_cpu / 10)),1,0,1 \
-i 2 \
-B ext4 \
-R -p -q \
-t 72h \
-d /disk1/tmpdir/ltp \
-b /dev/vdb1 -B ext4 -z /dev/vdb2 -Z ext4
EOF
chmod a+x /opt/ltp/load.sh

# Run the test
nohup ./load.sh &> ltp-stress.log &

[Expected result]:
The LTP stress run completes normally.

[Actual result]:
A crash occurred while ltp-stress was running.
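
Note on the load.sh arithmetic: the runltp arguments in the reproduction steps are derived from the CPU count and MemTotal with integer shell arithmetic. The sketch below only reproduces that arithmetic so the resulting values can be checked for a given machine. `ltp_args` is a hypothetical helper name, not part of LTP, and the two input pairs are illustrative round numbers, not values measured on the affected host.

```shell
#!/bin/bash
# Sketch: reproduce the integer arithmetic used in load.sh for given inputs.
# ltp_args is a hypothetical helper for illustration; it does not run LTP.
ltp_args() {
    local nr_cpu=$1 mem_kb=$2
    # Shell arithmetic truncates, so small CPU counts give 0 for nr_cpu/4 and nr_cpu/10.
    echo "-c $((nr_cpu / 2)) -m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 -D $((nr_cpu / 10)),1,0,1"
}

# Assumed inputs: 128 CPUs with MemTotal of 134217728 kB (128 GiB).
ltp_args 128 134217728   # -c 64 -m 32,4,536870912,1 -D 12,1,0,1
# Assumed inputs: 2 CPUs with MemTotal of 8000000 kB.
ltp_args 2 8000000       # -c 1 -m 0,4,2048000000,1 -D 0,1,0,1
```

On a 2-CPU system the first field of -m and the first field of -D both evaluate to 0, which is worth keeping in mind when comparing runs across machines of different sizes.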