Bug 19528 - [ANCK-6.6-3][aarch64] Crash during ltp-stress stress test: "Kernel panic - not syncing: Fatal hardware error!"
Summary: [ANCK-6.6-3][aarch64] Crash during ltp-stress stress test: "Kernel panic - not syncing: Fat...
Status: NEW
Alias: None
Product: Antest
Classification: Infrastructures
Component: Test cases
Version: unspecified
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: shuming
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-03-17 14:22 UTC by wangpingping
Modified: 2025-03-17 14:24 UTC (History)
CC List: 5 users

See Also:


Attachments

Description wangpingping alibaba_cloud_group 2025-03-17 14:22:54 UTC
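[Defect description]:
While running the stress-ng class stress test, a crash occurred: "Kernel panic - not syncing: Fatal hardware error!"
The same problem was seen earlier on a Yitian machine running the ali6000 kernel and was diagnosed as a hardware problem, not a kernel problem. It is recorded here for reference.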


Partial vmcore-dmesg output:
      KERNEL: /usr/lib/debug/usr/lib/modules/6.6.71-3_rc2.an23.aarch64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 128 [OFFLINE: 5]
        DATE: Fri Mar 14 20:46:14 CST 2025
      UPTIME: 4 days, 03:58:40
LOAD AVERAGE: 87.64, 86.98, 86.94
       TASKS: 1793
    NODENAME: 16f5Lab15
     RELEASE: 6.6.71-3_rc2.an23.aarch64
     VERSION: #1 SMP PREEMPT_DYNAMIC Fri Mar  7 12:23:12 CST 2025
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 128 GB
       PANIC: "Kernel panic - not syncing: Fatal hardware error!"
         PID: 40160
     COMMAND: "read_all"
        TASK: ffff00081173d000  [THREAD_INFO: ffff00081173d000]
         CPU: 46
       STATE: TASK_RUNNING (PANIC)

[359921.405094] Call trace:
[359921.405094]  machine_kexec+0x40/0x200
[359921.405096]  __crash_kexec+0x70/0xd8
[359921.405099]  panic+0x308/0x388
[359921.405102]  __ghes_panic+0x7c/0x88
[359921.405104]  ghes_in_nmi_queue_one_entry+0x404/0x468
[359921.405106]  ghes_sdei_critical_callback+0x34/0x70
[359921.405108]  sdei_event_handler+0x24/0x98
[359921.405110]  do_sdei_event+0x88/0x170
[359921.405112]  __sdei_handler+0x54/0x208
[359921.405113]  __sdei_asm_handler+0xe8/0x188
[359921.405115]  pci_get_rom_size+0x44/0x1b8
[359921.405117]  pci_map_rom+0xa8/0x170
[359921.405119]  pci_read_rom+0x50/0xf8
[359921.405121]  sysfs_kf_bin_read+0x70/0x98
[359921.405123]  kernfs_file_read_iter+0x98/0x198
[359921.405124]  kernfs_fop_read_iter+0x2c/0x48
[359921.405125]  vfs_read+0x200/0x2c0
[359921.405128]  ksys_read+0x70/0x108
[359921.405130]  __arm64_sys_read+0x20/0x30
[359921.405132]  el0_svc_common.constprop.0+0x60/0x138
[359921.405134]  do_el0_svc+0x20/0x30
[359921.405136]  el0_svc+0x44/0x1a8
[359921.405137]  el0t_64_sync_handler+0xf8/0x128
[359921.405139]  el0t_64_sync+0x17c/0x180
[359921.405139] ---[ end trace 0000000000000000 ]---
[359921.405140] Bye!
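
One way to read the call trace above: the frames from `__sdei_asm_handler` upward belong to the SDEI/GHES error handler, and the first frame below `__sdei_asm_handler` is the code that was running when the event was delivered. The following is a minimal sketch of extracting that frame from saved dmesg text. The function name `interrupted_frame` is hypothetical, and the sketch assumes the trace uses the same `symbol+offset/size` layout as shown here.

```shell
#!/bin/bash
# Print the symbol that was interrupted by the SDEI event: the first
# call-trace entry that follows the __sdei_asm_handler frame.
# Reads dmesg-style text on stdin. (Hypothetical helper, not part of LTP
# or the crash utility.)
interrupted_frame() {
    awk '
        found {                      # first frame after the SDEI entry stub
            sym = $NF                # e.g. "pci_get_rom_size+0x44/0x1b8"
            sub(/\+.*/, "", sym)     # drop the "+offset/size" suffix
            print sym
            exit
        }
        index($0, "__sdei_asm_handler+") { found = 1 }
    '
}

# Example with three frames copied from the trace in this report.
interrupted_frame <<'EOF'
[359921.405113]  __sdei_asm_handler+0xe8/0x188
[359921.405115]  pci_get_rom_size+0x44/0x1b8
[359921.405117]  pci_map_rom+0xa8/0x170
EOF
# prints: pci_get_rom_size
```

For this report the result is `pci_get_rom_size`, reached from `pci_read_rom` through a sysfs read issued by the `read_all` process.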


[Reproduction rate]
Observed only once so far

[Reproduction environment]
Kernel:
6.6.71-3_rc2.an23.aarch64


# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.6.71-3_rc1.an23.aarch64 root=UUID=bedec06f-d570-431d-bce1-749030567aeb ro rhgb selinux=0 console=tty0 cgroup.memory=nokmem iommu.passthrough=1 iommu.strict=0 nospectre_bhb ssbd=force-off no_hash_pointers crashkernel=0M-2G:0M,2G-64G:256M,64G-:512M

# cat /etc/os-release
NAME="Anolis OS"
VERSION="23.2"
ID="anolis"
VERSION_ID="23.2"
PLATFORM_ID="platform:an23"
PRETTY_NAME="Anolis OS 23.2"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
BUG_REPORT_URL="https://bugzilla.openanolis.cn/"


Memory information:
# free -h
               total        used        free      shared  buff/cache   available
Mem:           7.3Gi       290Mi       7.0Gi       716Ki       231Mi       7.0Gi
Swap:             0B          0B          0B

CPU information:
# lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                ARM
  BIOS Vendor ID:         Alibaba Cloud
  Model name:             Neoverse-N2
    BIOS Model name:      virt-rhel7.6.0  CPU @ 3.0GHz
    BIOS CPU family:      1
    Model:                0
    Thread(s) per core:   1
    Core(s) per socket:   2
    Socket(s):            1
    Stepping:             r0p0
    Frequency boost:      disabled
    CPU(s) scaling MHz:   100%
    CPU max MHz:          3000.0000
    CPU min MHz:          3000.0000
    BogoMIPS:             100.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh
Caches (sum of all):
  L1d:                    128 KiB (2 instances)
  L1i:                    128 KiB (2 instances)
  L2:                     2 MiB (2 instances)
  L3:                     64 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0,1
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Vulnerable
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Mitigation; CSV2, but not BHB
  Srbds:                  Not affected
  Tsx async abort:        Not affected


[Steps to reproduce]:
1. Install the test kernel and reboot
2. # Download and build the test suite
git clone http://code.alibaba-inc.com/alikernel/ltp.git 
export CFLAGS="-fcommon"               # needed with gcc 10
make autotools
./configure
make
make install

Environment setup:
echo 1 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/hardlockup_panic
echo 1 > /proc/sys/kernel/softlockup_panic
echo 60 > /proc/sys/kernel/watchdog_thresh
echo 150 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo '0 4 0 7' > /proc/sys/kernel/printk
echo 1 > /proc/sys/kernel/sched_group_balancer
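
The list above writes `watchdog_thresh` twice (60, then 150), and some of these knobs may not exist on every kernel build, so reading the values back is the only way to see what is actually in effect. A minimal sketch follows. The helper name `show_settings` is hypothetical, and the directory is a parameter only so the function can be tried against a scratch directory.

```shell
#!/bin/bash
# Print "name = value" for each sysctl file named on the command line.
# $1 is the directory to read from (/proc/sys/kernel on a real system);
# the remaining arguments are file names inside it.
show_settings() {
    local dir=$1; shift
    local name
    for name in "$@"; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            # The knob does not exist (or is unreadable) on this kernel.
            printf '%s = <not present>\n' "$name"
        fi
    done
}

# Usage on the test machine, after the echo commands have been run:
# show_settings /proc/sys/kernel panic hardlockup_panic softlockup_panic \
#     watchdog_thresh hung_task_timeout_secs hung_task_panic printk \
#     sched_group_balancer
```

Attaching this output to the report would make it unambiguous which watchdog and panic settings were active during the run.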

# Prepare the test script
cat <<-'EOF' > /opt/ltp/load.sh
#!/bin/bash
nr_cpu=$(nproc)
mem_kb=$(grep ^MemTotal /proc/meminfo | awk '{print $2}')
./runltp \
 -c $((nr_cpu / 2)) \
 -m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 \
 -D $((nr_cpu / 10)),1,0,1 \
 -i 2 \
 -B ext4 \
 -R -p -q \
 -t 72h \
 -d /disk1/tmpdir/ltp \
 -b /dev/vdb1 -B ext4 -z /dev/vdb2 -Z ext4
EOF
chmod a+x /opt/ltp/load.sh

# Run the test
nohup ./load.sh &> ltp-stress.log &
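
The load parameters in load.sh are derived from the CPU count and MemTotal with integer division, so the effective values depend on the machine. The sketch below repeats the same expressions for a given input. The function name `ltp_load_args` is hypothetical, and the MemTotal value used (exactly 128 GiB expressed in kB) is an illustrative assumption, not a figure from this report.

```shell
#!/bin/bash
# Print the -c, -m and -D arguments that load.sh would compute for a
# given CPU count ($1) and MemTotal in kB ($2). Uses the same integer
# expressions as load.sh; nothing is executed against runltp.
ltp_load_args() {
    local nr_cpu=$1 mem_kb=$2
    echo "-c $((nr_cpu / 2))"
    echo "-m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1"
    echo "-D $((nr_cpu / 10)),1,0,1"
}

# 128 CPUs (as in the vmcore header) and an assumed MemTotal of 134217728 kB.
ltp_load_args 128 134217728
# prints:
# -c 64
# -m 32,4,536870912,1
# -D 12,1,0,1
```

With the 2-CPU guest shown in the lscpu output, `nr_cpu / 4` and `nr_cpu / 10` both truncate to 0, so the `-m` and `-D` process counts would be 0 on that machine.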


[Expected result]:
ltp stress runs to completion normally

[Actual result]:
A crash occurs while ltp-stress is running