Bug 19640 - [ANCK-6.6-3][6.6.71-3_rc2.an23.aarch64] Crash during ltp-stress stress testing: filemap_alloc_folio+0x3c/0xc0
Status: NEW
Alias: None
Product: Antest
Classification: Infrastructures
Component: Test cases
Version: unspecified
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: shuancue
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-03-19 14:42 UTC by wangpingping
Modified: 2025-03-19 16:47 UTC
CC List: 7 users

Description wangpingping alibaba_cloud_group 2025-03-19 14:42:12 UTC
[Defect description]:
While running the ltp-stress stress test, the kernel crashed and the machine hung; no complete vmcore was produced.

Partial vmcore-dmesg output:
[41230.985762] Call trace:
[41230.985763]  machine_kexec+0x40/0x200
[41230.985766]  __crash_kexec+0x70/0xd8
[41230.985768]  panic+0x308/0x388
[41230.985771]  watchdog_timer_fn+0x2cc/0x2d8
[41230.985773]  __hrtimer_run_queues+0x19c/0x370
[41230.985775]  hrtimer_interrupt+0xec/0x248
[41230.985776]  arch_timer_handler_phys+0x30/0x50
[41230.985779]  handle_percpu_devid_irq+0x8c/0x230
[41230.985782]  generic_handle_domain_irq+0x30/0x50
[41230.985783]  __gic_handle_irq_from_irqson.isra.0+0x140/0x260
[41230.985786]  gic_handle_irq+0x2c/0xa0
[41230.985787]  call_on_irq_stack+0x24/0x30
[41230.985789]  do_interrupt_handler+0x80/0x90
[41230.985791]  el1_interrupt+0x44/0xa8
[41230.985793]  el1h_64_irq_handler+0x14/0x20
[41230.985794]  el1h_64_irq+0x78/0x80
[41230.985795]  arch_counter_get_cntpct+0x14/0x18
[41230.985797]  ktime_get+0x48/0xa8
[41230.985799]  memcg_lat_stat_start+0x24/0x50
[41230.985801]  __alloc_pages_direct_compact+0x58/0x388
[41230.985804]  __alloc_pages_slowpath+0x6b8/0x918
[41230.985805]  __alloc_pages+0x34c/0x428
[41230.985807]  alloc_pages+0x98/0x138
[41230.985809]  folio_alloc+0x1c/0x40
[41230.985812]  filemap_alloc_folio+0x3c/0xc0
[41230.985814]  __filemap_get_folio+0x1e8/0x470
[41230.985816]  iomap_get_folio+0x6c/0x88
[41230.985818]  iomap_write_begin+0x1c0/0x308
[41230.985820]  iomap_write_iter+0xf4/0x280
[41230.985822]  iomap_file_buffered_write+0x88/0xf0
[41230.985823]  xfs_file_buffered_write+0x98/0x2d0 [xfs]
[41230.985868]  xfs_file_write_iter+0x104/0x150 [xfs]
[41230.985915]  vfs_write+0x1a4/0x2f8
[41230.985918]  ksys_write+0x70/0x108
[41230.985920]  __arm64_sys_write+0x20/0x30
[41230.985923]  el0_svc_common.constprop.0+0x60/0x138
[41230.985925]  do_el0_svc+0x20/0x30
[41230.985928]  el0_svc+0x44/0x1a8
[41230.985929]  el0t_64_sync_handler+0xf8/0x128
[41230.985931]  el0t_64_sync+0x17c/0x180
[41230.985932] ---[ end trace 0000000000000000 ]---
[41230.985934] Bye!
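
The trace shows panic() reached from watchdog_timer_fn() in interrupt context, on top of a write() path that was in __alloc_pages_direct_compact(). To look up source lines for these frames, the symbol+offset tokens can be pulled out of the log text first. The following is a minimal sketch; the function name trace_frames is hypothetical, and it assumes frames have the "symbol+0xoff/0xsize" form seen above. The resulting tokens could then be passed to the kernel tree's scripts/faddr2line together with a vmlinux carrying debug info, assuming one matching this build is available.

```shell
#!/bin/bash
# Read kernel log text on stdin and print one "symbol+offset/size"
# token per call-trace frame, dropping timestamps and module tags
# such as "[xfs]". Lines that are not frames are skipped.
trace_frames() {
    sed -nE 's#^ *\[ *[0-9]+\.[0-9]+\] +([A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+).*#\1#p'
}

# Sample input: three lines copied from the trace above.
trace_frames <<'EOF'
[41230.985812]  filemap_alloc_folio+0x3c/0xc0
[41230.985823]  xfs_file_buffered_write+0x98/0x2d0 [xfs]
[41230.985934] Bye!
EOF
```

Only the first two sample lines are frames, so only those two tokens are printed.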


[Reproduction rate]
Seen only once so far.

[Reproduction environment]
Kernel:
6.6.71-3_rc2.an23.aarch64


# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.6.71-3_rc1.an23.aarch64 root=UUID=bedec06f-d570-431d-bce1-749030567aeb ro rhgb selinux=0 console=tty0 cgroup.memory=nokmem iommu.passthrough=1 iommu.strict=0 nospectre_bhb ssbd=force-off no_hash_pointers crashkernel=0M-2G:0M,2G-64G:256M,64G-:512M

# cat /etc/os-release
NAME="Anolis OS"
VERSION="23.2"
ID="anolis"
VERSION_ID="23.2"
PLATFORM_ID="platform:an23"
PRETTY_NAME="Anolis OS 23.2"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
BUG_REPORT_URL="https://bugzilla.openanolis.cn/"


Memory information:
# free -h
               total        used        free      shared  buff/cache   available
Mem:           7.3Gi       290Mi       7.0Gi       716Ki       231Mi       7.0Gi
Swap:             0B          0B          0B

CPU information:
# lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                ARM
  BIOS Vendor ID:         Alibaba Cloud
  Model name:             Neoverse-N2
    BIOS Model name:      virt-rhel7.6.0  CPU @ 3.0GHz
    BIOS CPU family:      1
    Model:                0
    Thread(s) per core:   1
    Core(s) per socket:   2
    Socket(s):            1
    Stepping:             r0p0
    Frequency boost:      disabled
    CPU(s) scaling MHz:   100%
    CPU max MHz:          3000.0000
    CPU min MHz:          3000.0000
    BogoMIPS:             100.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma
                          lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb dcpodp
                          sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16
                          dgh
Caches (sum of all):
  L1d:                    128 KiB (2 instances)
  L1i:                    128 KiB (2 instances)
  L2:                     2 MiB (2 instances)
  L3:                     64 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0,1
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Vulnerable
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Mitigation; CSV2, but not BHB
  Srbds:                  Not affected
  Tsx async abort:        Not affected


[Steps to reproduce]:
1. Install the test kernel and reboot.
2. Download and build the test suite:
git clone http://code.alibaba-inc.com/alikernel/ltp.git 
export CFLAGS="-fcommon"               # needed with gcc 10
make autotools
./configure
make
make install

Environment setup:
echo 1 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/hardlockup_panic
echo 1 > /proc/sys/kernel/softlockup_panic
echo 60 > /proc/sys/kernel/watchdog_thresh
echo 150 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo '0 4 0 7' > /proc/sys/kernel/printk
echo 1 > /proc/sys/kernel/sched_group_balancer
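
Because watchdog_thresh is written twice above (60, then 150), it can help to read the values back and record which settings were actually in effect before the run. A minimal sketch follows; the function name show_kernel_sysctls and its optional directory argument are hypothetical conveniences, and the list of files is simply the set written above.

```shell
#!/bin/bash
# Print the current value of each sysctl written in the setup step.
# The optional directory argument is only a convenience for testing;
# it defaults to /proc/sys/kernel.
show_kernel_sysctls() {
    local dir="${1:-/proc/sys/kernel}"
    local name
    for name in panic hardlockup_panic softlockup_panic watchdog_thresh \
                hung_task_timeout_secs hung_task_panic printk \
                sched_group_balancer; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            # Not every kernel exposes every file (some are vendor-specific).
            printf '%s = <not available>\n' "$name"
        fi
    done
}

show_kernel_sysctls
```

Saving this output next to ltp-stress.log keeps the effective watchdog settings with the test record.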

# Prepare the test script
cat <<-'EOF' > /opt/ltp/load.sh
#!/bin/bash
nr_cpu=$(nproc)
mem_kb=$(grep ^MemTotal /proc/meminfo | awk '{print $2}')
./runltp \
 -c $((nr_cpu / 2)) \
 -m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 \
 -D $((nr_cpu / 10)),1,0,1 \
 -i 2 \
 -B ext4 \
 -R -p -q \
 -t 72h \
 -d /disk1/tmpdir/ltp \
 -b /dev/vdb1 -B ext4 -z /dev/vdb2 -Z ext4
EOF
chmod a+x /opt/ltp/load.sh

# Run the test
nohup ./load.sh &> ltp-stress.log &
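
The arithmetic in load.sh is integer-only, so it is worth checking what it evaluates to on a small machine like this one. The sketch below reproduces just those expressions; the function name ltp_load_args is hypothetical, and the memory figure is an assumed round number close to the 7.3Gi shown by free -h, not the machine's real MemTotal.

```shell
#!/bin/bash
# Evaluate the arithmetic that load.sh passes to runltp for a given
# CPU count and MemTotal (in kB). Shell arithmetic is integer-only,
# so every division truncates toward zero.
ltp_load_args() {
    local nr_cpu=$1 mem_kb=$2
    printf -- '-c %d -m %d,4,%d,1 -D %d,1,0,1\n' \
        "$((nr_cpu / 2))" \
        "$((nr_cpu / 4))" \
        "$((mem_kb / nr_cpu / 2 * 1024))" \
        "$((nr_cpu / 10))"
}

# 2 CPUs as in the lscpu output; 7600000 kB is an assumed value.
ltp_load_args 2 7600000
```

With these inputs the function prints `-c 1 -m 0,4,1945600000,1 -D 0,1,0,1`: on a 2-CPU machine the process counts given to -m and -D both truncate to 0.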


[Expected result]:
The ltp-stress run completes normally.

[Actual result]:
A crash occurs while ltp-stress is running.