[Defect description]:
While running the ltp-stress stress test, the kernel crashed and the machine hung; no complete vmcore was produced. Part of vmcore-dmesg follows:

[41230.985762] Call trace:
[41230.985763]  machine_kexec+0x40/0x200
[41230.985766]  __crash_kexec+0x70/0xd8
[41230.985768]  panic+0x308/0x388
[41230.985771]  watchdog_timer_fn+0x2cc/0x2d8
[41230.985773]  __hrtimer_run_queues+0x19c/0x370
[41230.985775]  hrtimer_interrupt+0xec/0x248
[41230.985776]  arch_timer_handler_phys+0x30/0x50
[41230.985779]  handle_percpu_devid_irq+0x8c/0x230
[41230.985782]  generic_handle_domain_irq+0x30/0x50
[41230.985783]  __gic_handle_irq_from_irqson.isra.0+0x140/0x260
[41230.985786]  gic_handle_irq+0x2c/0xa0
[41230.985787]  call_on_irq_stack+0x24/0x30
[41230.985789]  do_interrupt_handler+0x80/0x90
[41230.985791]  el1_interrupt+0x44/0xa8
[41230.985793]  el1h_64_irq_handler+0x14/0x20
[41230.985794]  el1h_64_irq+0x78/0x80
[41230.985795]  arch_counter_get_cntpct+0x14/0x18
[41230.985797]  ktime_get+0x48/0xa8
[41230.985799]  memcg_lat_stat_start+0x24/0x50
[41230.985801]  __alloc_pages_direct_compact+0x58/0x388
[41230.985804]  __alloc_pages_slowpath+0x6b8/0x918
[41230.985805]  __alloc_pages+0x34c/0x428
[41230.985807]  alloc_pages+0x98/0x138
[41230.985809]  folio_alloc+0x1c/0x40
[41230.985812]  filemap_alloc_folio+0x3c/0xc0
[41230.985814]  __filemap_get_folio+0x1e8/0x470
[41230.985816]  iomap_get_folio+0x6c/0x88
[41230.985818]  iomap_write_begin+0x1c0/0x308
[41230.985820]  iomap_write_iter+0xf4/0x280
[41230.985822]  iomap_file_buffered_write+0x88/0xf0
[41230.985823]  xfs_file_buffered_write+0x98/0x2d0 [xfs]
[41230.985868]  xfs_file_write_iter+0x104/0x150 [xfs]
[41230.985915]  vfs_write+0x1a4/0x2f8
[41230.985918]  ksys_write+0x70/0x108
[41230.985920]  __arm64_sys_write+0x20/0x30
[41230.985923]  el0_svc_common.constprop.0+0x60/0x138
[41230.985925]  do_el0_svc+0x20/0x30
[41230.985928]  el0_svc+0x44/0x1a8
[41230.985929]  el0t_64_sync_handler+0xf8/0x128
[41230.985931]  el0t_64_sync+0x17c/0x180
[41230.985932] ---[ end trace 0000000000000000 ]---
[41230.985934] Bye!

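The trace above shows panic() being called from watchdog_timer_fn, which interrupted a task inside __alloc_pages_direct_compact during a buffered XFS write. Since the issue has been seen only once, it may help to reduce each captured trace to a plain list of symbols so that later occurrences can be compared with diff. The following is a minimal sketch, assuming the timestamped "symbol+0xoff/0xsize" line format shown above; the helper name trace_symbols is hypothetical and not part of any kdump tooling.

```shell
#!/bin/bash
# Hypothetical helper: read kernel log lines on stdin and print one symbol
# name per call-trace frame. Lines that are not frames (for example
# "Call trace:" or "Bye!") are dropped.
# Expected input format: "[41230.985763]  machine_kexec+0x40/0x200"
trace_symbols() {
    sed -n -E 's/^\[ *[0-9]+\.[0-9]+\] +([A-Za-z0-9_.]+)\+0x[0-9a-f]+\/0x[0-9a-f]+.*$/\1/p'
}

# Usage (the file name is an assumption): trace_symbols < vmcore-dmesg.txt
```

Module suffixes such as [xfs] and the offsets are discarded on purpose, so two traces from differently built kernels can still be compared frame by frame.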
[Reproduction probability]
Observed only once so far.

[Reproduction environment]
Kernel: 6.6.71-3_rc2.an23.aarch64

# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.6.71-3_rc1.an23.aarch64 root=UUID=bedec06f-d570-431d-bce1-749030567aeb ro rhgb selinux=0 console=tty0 cgroup.memory=nokmem iommu.passthrough=1 iommu.strict=0 nospectre_bhb ssbd=force-off no_hash_pointers crashkernel=0M-2G:0M,2G-64G:256M,64G-:512M

# cat /etc/os-release
NAME="Anolis OS"
VERSION="23.2"
ID="anolis"
VERSION_ID="23.2"
PLATFORM_ID="platform:an23"
PRETTY_NAME="Anolis OS 23.2"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
BUG_REPORT_URL="https://bugzilla.openanolis.cn/"

Memory:
# free -h
               total        used        free      shared  buff/cache   available
Mem:           7.3Gi       290Mi       7.0Gi       716Ki       231Mi       7.0Gi
Swap:             0B          0B          0B

CPU:
# lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                ARM
  BIOS Vendor ID:         Alibaba Cloud
  Model name:             Neoverse-N2
    BIOS Model name:      virt-rhel7.6.0  CPU @ 3.0GHz
    BIOS CPU family:      1
    Model:                0
    Thread(s) per core:   1
    Core(s) per socket:   2
    Socket(s):            1
    Stepping:             r0p0
    Frequency boost:      disabled
    CPU(s) scaling MHz:   100%
    CPU max MHz:          3000.0000
    CPU min MHz:          3000.0000
    BogoMIPS:             100.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh
Caches (sum of all):
  L1d:                    128 KiB (2 instances)
  L1i:                    128 KiB (2 instances)
  L2:                     2 MiB (2 instances)
  L3:                     64 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0,1
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Vulnerable
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Mitigation; CSV2, but not BHB
  Srbds:                  Not affected
  Tsx async abort:        Not affected

[Reproduction steps]:
1. Install the test kernel and reboot.
2. Download and build the test suite:
git clone http://code.alibaba-inc.com/alikernel/ltp.git
export CFLAGS="-fcommon"  # required with gcc 10
make autotools
./configure
make
make install

Environment setup:
echo 1 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/hardlockup_panic
echo 1 > /proc/sys/kernel/softlockup_panic
echo 60 > /proc/sys/kernel/watchdog_thresh
echo 150 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo '0 4 0 7' > /proc/sys/kernel/printk
echo 1 > /proc/sys/kernel/sched_group_balancer

# Prepare the test script
cat <<-'EOF' > /opt/ltp/load.sh
#!/bin/bash
nr_cpu=$(nproc)
mem_kb=$(grep ^MemTotal /proc/meminfo | awk '{print $2}')
./runltp \
    -c $((nr_cpu / 2)) \
    -m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 \
    -D $((nr_cpu / 10)),1,0,1 \
    -i 2 \
    -B ext4 \
    -R -p -q \
    -t 72h \
    -d /disk1/tmpdir/ltp \
    -b /dev/vdb1 -B ext4 -z /dev/vdb2 -Z ext4
EOF
chmod a+x /opt/ltp/load.sh

# Run the test
nohup ./load.sh &> ltp-stress.log &

[Expected result]:
The ltp stress run completes normally.

[Actual result]:
A crash occurred while ltp-stress was running.
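
Note on the load parameters: the -c, -m and -D arguments in load.sh are derived from the CPU count and MemTotal with integer arithmetic, so their values on a small machine are easy to misjudge. The sketch below reproduces only that arithmetic so it can be evaluated without starting a run. The function name ltp_load_args is hypothetical, and the 8000000 KiB figure is a round illustrative value, not the MemTotal of the test machine (the report gives only "7.3Gi").

```shell
#!/bin/bash
# Hypothetical helper: print the -c/-m/-D arguments that load.sh would
# compute for a given CPU count and MemTotal (in KiB). This mirrors the
# arithmetic in load.sh and does not invoke runltp.
ltp_load_args() {
    local nr_cpu=$1 mem_kb=$2
    echo "-c $((nr_cpu / 2))" \
         "-m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1" \
         "-D $((nr_cpu / 10)),1,0,1"
}

# Two CPUs as on the test machine; the memory size is an illustrative assumption.
ltp_load_args 2 8000000
# prints: -c 1 -m 0,4,2048000000,1 -D 0,1,0,1
```

With two CPUs, integer division makes the first field of both -m and -D evaluate to 0, which is worth keeping in mind when judging which load generators were actually active at the time of the crash.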