Bug 19414 - [ANCK-6.6-3][aarch64][Yitian] Crash after ~48h of ltp-stress: Kernel panic - not syncing: softlockup: hung tasks
Summary: [ANCK-6.6-3][aarch64][Yitian] Crash after ~48h of ltp-stress: Kernel panic - not syncing: softlockup: hung tasks
Status: NEW
Alias: None
Product: Antest
Classification: Infrastructures
Component: 测试用例 (test cases)
Version: unspecified
Hardware: All
OS: Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: shuancue
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-03-10 10:32 UTC by Janos
Modified: 2025-03-17 21:10 UTC
CC List: 12 users

See Also:


Attachments
vmcore_dmesg (1.04 MB, text/plain)
2025-03-10 10:32 UTC, Janos
Details

Description Janos alibaba_cloud_group 2025-03-10 10:32:27 UTC
Created attachment 1322 [details]
vmcore_dmesg

[Defect description]:
After running ltp-stress on the Yitian machine for about 48h, the system crashed with: Kernel panic - not syncing: softlockup: hung tasks


[Machine info]:
Environment: physical machine
Model: Yitian


Kernel version:
# uname -r
6.6.71-3_rc2.al8.aarch64

Memory info:
# free -h
              total        used        free      shared  buff/cache   available
Mem:          503Gi       3.0Gi       476Gi        12Mi        23Gi       497Gi
Swap:         2.0Gi          0B       2.0Gi

CPU info:
# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  1
Core(s) per socket:  128
Socket(s):           1
NUMA node(s):        2
Vendor ID:           ARM
BIOS Vendor ID:      T-HEAD
Model:               0
Model name:          Neoverse-N2
BIOS Model name:     Yitian710-128
Stepping:            r0p0
CPU MHz:             2750.001
BogoMIPS:            100.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            65536K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh


CMDLINE:
# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-6.6.71-3_rc2.al8.aarch64 root=UUID=1e1d9fc1-be93-4b6b-bb50-9f86448f8a4d ro biosdevname=0 rd.driver.pre=ahci console=ttyS0,115200 fsck.repair=yes cgroup.memory=nokmem crashkernel=0M-2G:0M,2G-64G:256M,64G-:384M iommu.passthrough=1 iommu.strict=0 ssbd=force-off nospectre_bhb no_hash_pointers transparent_hugepage_tmpfs=always thp_shmem=64K:always thp_anon=64K:always thp_file=2M:always+exec

[Reproduction steps]:
# Stability-test preconfiguration:
echo 1 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/hardlockup_panic
echo 1 > /proc/sys/kernel/softlockup_panic
echo 150 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo '0 4 0 7' > /proc/sys/kernel/printk
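
# For reference: with these values, a detected soft/hard lockup panics the box
# immediately, the effective softlockup threshold is 2 * watchdog_thresh = 300s,
# and hung-task detection (1200s) only warns. A minimal sanity check of the
# effective values, assuming the same procfs paths:
for f in panic hardlockup_panic softlockup_panic watchdog_thresh \
         hung_task_timeout_secs hung_task_panic; do
    printf '%-24s %s\n' "$f" "$(cat /proc/sys/kernel/$f)"
done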

# Initialize the data disk
[ -d /disk1 ] || mkdir /disk1
wipefs -a --force /dev/nvme1n1p1
mkfs -t ext4 -q -F /dev/nvme1n1p1
mount -t ext4 /dev/nvme1n1p1 /disk1
mkdir -p /disk1/tmpdir/ltp


# Download and build the test suite
git clone http://code.alibaba-inc.com/alikernel/ltp.git --branch LTP-20240417-6_6   # 6.6 branch
export CFLAGS="-fcommon"               # required for gcc 10
make autotools
./configure
make
make install
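# (make install defaults to installing under /opt/ltp, which is where the load script below is placed)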


# Prepare the test script. The heredoc delimiter is quoted so that $(nproc),
# awk's $2, and the $((...)) arithmetic expand when load.sh runs, not when
# this file is written.
cat <<'EOF' > /opt/ltp/load.sh
#!/bin/bash
nr_cpu=$(nproc)
mem_kb=$(grep ^MemTotal /proc/meminfo | awk '{print $2}')
./runltp \
 -c $((nr_cpu / 2)) \
 -m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 \
 -D $((nr_cpu / 10)),1,0,1 \
 -i 2 \
 -R -p -q \
 -t 72h \
 -d /disk1/tmpdir/ltp \
 -b /dev/vdb1 -B ext4 -z /dev/vdb2 -Z ext4
EOF
chmod a+x /opt/ltp/load.sh

# Run the test (load.sh invokes ./runltp relative to the current directory, so run it from /opt/ltp)
cd /opt/ltp
nohup ./load.sh &> ltp-stress.log &

[Expected result]:
The ltp-stress run completes normally and the system does not crash.

[Actual result]:
A crash occurred after about 48h of execution. The crash-utility analysis is below; see the attachment for the full dmesg:
crash /usr/lib/debug/usr/lib/modules/6.6.71-3_rc2.al8.aarch64/vmlinux vmcore
      KERNEL: /usr/lib/debug/usr/lib/modules/6.6.71-3_rc2.al8.aarch64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 128
        DATE: Mon Mar 10 00:30:30 CST 2025
      UPTIME: 2 days, 08:43:02
LOAD AVERAGE: 142.03, 135.11, 121.11
       TASKS: 1613
    NODENAME: v43g11200.sqa.na131
     RELEASE: 6.6.71-3_rc2.al8.aarch64
     VERSION: #1 SMP PREEMPT_DYNAMIC Fri Mar  7 12:41:15 CST 2025
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 512 GB
       PANIC: "Kernel panic - not syncing: softlockup: hung tasks"
         PID: 670149
     COMMAND: "BackgroundWorke"
        TASK: ffff04000742a800  [THREAD_INFO: ffff04000742a800]
         CPU: 127
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 670149   TASK: ffff04000742a800  CPU: 127  COMMAND: "BackgroundWorke"
 #0 [ffff800082a43bb0] crash_setup_regs at ffff80008015a9b8
 #1 [ffff800082a43d30] panic at ffff8000800543bc
 #2 [ffff800082a43e10] watchdog_timer_fn at ffff80008019da18
 #3 [ffff800082a43e60] __hrtimer_run_queues at ffff800080138e30
 #4 [ffff800082a43ef0] hrtimer_interrupt at ffff800080139d40
 #5 [ffff800082a43f50] arch_timer_handler_phys at ffff800080a4d1b4
 #6 [ffff800082a43f60] handle_percpu_devid_irq at ffff8000800fde18
 #7 [ffff800082a43fa0] generic_handle_domain_irq at ffff8000800f5aa4
 #8 [ffff800082a43fb0] __gic_handle_irq_from_irqson at ffff8000806b2c1c
 #9 [ffff800082a43fe0] gic_handle_irq at ffff800080010100
--- <IRQ stack> ---
#10 [ffff80009fddba40] call_on_irq_stack at ffff8000800164f8
#11 [ffff80009fddba50] do_interrupt_handler at ffff80008001884c
#12 [ffff80009fddba70] el1_interrupt at ffff800080d2fe60
#13 [ffff80009fddba90] el1h_64_irq_handler at ffff800080d31ec0
#14 [ffff80009fddbbd0] el1h_64_irq at ffff800080011384
#15 [ffff80009fddbbf0] local_daif_inherit at ffff800080018794
#16 [ffff80009fddbc20] el1h_64_sync_handler at ffff800080d31ea4
#17 [ffff80009fddbd60] el1h_64_sync at ffff800080011304
#18 [ffff80009fddbd80] get_user_arg_ptr at ffff8000803dfa08
#19 [ffff80009fddbdc0] do_execveat_common at ffff8000803e0f3c
#20 [ffff80009fddbe20] __arm64_sys_execve at ffff8000803e1248
#21 [ffff80009fddbe40] el0_svc_common.constprop.0 at ffff80008002881c
#22 [ffff80009fddbe70] do_el0_svc at ffff800080028914
#23 [ffff80009fddbe80] el0_svc at ffff800080d317f8
#24 [ffff80009fddbea0] el0t_64_sync_handler at ffff800080d3204c
#25 [ffff80009fddbfe0] el0t_64_sync at ffff800080011608
     PC: 0000ffff9f3e7d4c   LR: 0000ffff9f46b7f8   SP: 0000ffff9c9fe8a0
    X29: 0000ffff9c9fe8b0  X28: 0000000000000000  X27: 000000000000000a
    X26: 0000ffff9c9fef98  X25: 0000000000000000  X24: 0000000000000001
    X23: aaaaaaaaaaaaaaab  X22: 000000000be7c380  X21: 000000000be9e8f0
    X20: 000000000bf44a98  X19: 0000ffff9f53f000  X18: 00a3d70a3d70a3d6
    X17: 00000000032a0180  X16: 0000ffff9f46b450  X15: 051eb851eb851eb0
    X14: 0000000000000001  X13: 0000000000000000  X12: 0000ffff9f3c7320
    X11: ffffffffffffffff  X10: 0000000000000000   X9: 0000000000000005
     X8: 00000000000000dd   X7: 0000000000000003   X6: 0000000000000003
     X5: 0000ffff9f6eb5d8   X4: 000000000000002f   X3: 000000002fd12a60
     X2: 000000000be7c380   X1: 000000000be9e8f0   X0: 000000000bf44a98
    ORIG_X0: 000000000bf44a98  SYSCALLNO: dd  PSTATE: 40001000
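
Note: SYSCALLNO dd is 221 decimal, i.e. __NR_execve on arm64, consistent with the __arm64_sys_execve frame above (this can be double-checked in-session with "crash> eval 0xdd").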
Comment 1 XueShuai alibaba_cloud_group 2025-03-13 10:35:15 UTC
There are two classes of working CPUs:
genload: the memory stress load
BackgroundWorke: the hung CPU, which is copying memory from user space to kernel space

$ grep CPU bt.log
PID: 16716    TASK: ffff000828278000  CPU: 0    COMMAND: "genload"
PID: 16649    TASK: ffff040dfc080000  CPU: 1    COMMAND: "genload"
PID: 16645    TASK: ffff040dfc081400  CPU: 2    COMMAND: "genload"
PID: 16650    TASK: ffff040dfc086400  CPU: 3    COMMAND: "genload"
PID: 16602    TASK: ffff04000f10d000  CPU: 4    COMMAND: "genload"
PID: 16613    TASK: ffff040027b98000  CPU: 5    COMMAND: "genload"
PID: 16719    TASK: ffff04003819bc00  CPU: 6    COMMAND: "genload"
PID: 16595    TASK: ffff04000eb8a800  CPU: 7    COMMAND: "genload"
PID: 16599    TASK: ffff04000eb89400  CPU: 8    COMMAND: "genload"
PID: 16590    TASK: ffff0400193f3c00  CPU: 9    COMMAND: "genload"
PID: 16624    TASK: ffff040dfdce2800  CPU: 10   COMMAND: "genload"
PID: 16587    TASK: ffff0400193f6400  CPU: 11   COMMAND: "genload"
PID: 16637    TASK: ffff040dfdcf3c00  CPU: 12   COMMAND: "genload"
PID: 16653    TASK: ffff040dfc08e400  CPU: 13   COMMAND: "genload"
PID: 16619    TASK: ffff040dfdce3c00  CPU: 14   COMMAND: "genload"
PID: 16592    TASK: ffff0400077f3c00  CPU: 15   COMMAND: "genload"
PID: 16639    TASK: ffff040dfdcfbc00  CPU: 16   COMMAND: "genload"
PID: 16616    TASK: ffff040027b9e400  CPU: 17   COMMAND: "genload"
PID: 16630    TASK: ffff040dfdce8000  CPU: 18   COMMAND: "genload"
PID: 16617    TASK: ffff040027b99400  CPU: 19   COMMAND: "genload"
PID: 16621    TASK: ffff040dfdce0000  CPU: 20   COMMAND: "genload"
PID: 16635    TASK: ffff040dfdcf2800  CPU: 21   COMMAND: "genload"
PID: 16611    TASK: ffff0400426aa800  CPU: 22   COMMAND: "genload"
PID: 16591    TASK: ffff0400193f2800  CPU: 23   COMMAND: "genload"
PID: 16636    TASK: ffff040dfdcf5000  CPU: 24   COMMAND: "genload"
PID: 16714    TASK: ffff00082827a800  CPU: 25   COMMAND: "genload"
PID: 16615    TASK: ffff040027b9d000  CPU: 26   COMMAND: "genload"
PID: 16652    TASK: ffff040dfc088000  CPU: 27   COMMAND: "genload"
PID: 16628    TASK: ffff040dfdced000  CPU: 28   COMMAND: "genload"
PID: 16717    TASK: ffff00082827d000  CPU: 29   COMMAND: "genload"
PID: 16625    TASK: ffff040dfdce5000  CPU: 30   COMMAND: "genload"
PID: 0        TASK: ffff0008074c2800  CPU: 31   COMMAND: "swapper/31"
PID: 16601    TASK: ffff04000f109400  CPU: 32   COMMAND: "genload"
PID: 16646    TASK: ffff040dfc082800  CPU: 33   COMMAND: "genload"
PID: 16638    TASK: ffff040dfdcfd000  CPU: 34   COMMAND: "genload"
PID: 16623    TASK: ffff040dfdce1400  CPU: 35   COMMAND: "genload"
PID: 16648    TASK: ffff040dfc083c00  CPU: 36   COMMAND: "genload"
PID: 16610    TASK: ffff0400426a9400  CPU: 37   COMMAND: "genload"
PID: 16643    TASK: ffff040dfdcfa800  CPU: 38   COMMAND: "genload"
PID: 16627    TASK: ffff040dfdcea800  CPU: 39   COMMAND: "genload"
PID: 16633    TASK: ffff040dfdcf6400  CPU: 40   COMMAND: "genload"
PID: 16712    TASK: ffff00082827bc00  CPU: 41   COMMAND: "genload"
PID: 16641    TASK: ffff040dfdcfe400  CPU: 42   COMMAND: "genload"
PID: 16594    TASK: ffff04000eb8e400  CPU: 43   COMMAND: "genload"
PID: 16604    TASK: ffff04000f10a800  CPU: 44   COMMAND: "genload"
PID: 16607    TASK: ffff040015a8bc00  CPU: 45   COMMAND: "genload"
PID: 16632    TASK: ffff040dfdcf0000  CPU: 46   COMMAND: "genload"
PID: 16718    TASK: ffff040038198000  CPU: 47   COMMAND: "genload"
PID: 16631    TASK: ffff040dfdcee400  CPU: 48   COMMAND: "genload"
PID: 16618    TASK: ffff040027b9bc00  CPU: 49   COMMAND: "genload"
PID: 16634    TASK: ffff040dfdcf1400  CPU: 50   COMMAND: "genload"
PID: 16720    TASK: ffff04003a4f3c00  CPU: 51   COMMAND: "genload"
PID: 16622    TASK: ffff040dfdce6400  CPU: 52   COMMAND: "genload"
PID: 16606    TASK: ffff040015a89400  CPU: 53   COMMAND: "genload"
PID: 16600    TASK: ffff04000f10bc00  CPU: 54   COMMAND: "genload"
PID: 16597    TASK: ffff04000eb88000  CPU: 55   COMMAND: "genload"
PID: 16612    TASK: ffff0400426abc00  CPU: 56   COMMAND: "genload"
PID: 16651    TASK: ffff040dfc08bc00  CPU: 57   COMMAND: "genload"
PID: 16609    TASK: ffff0400426a8000  CPU: 58   COMMAND: "genload"
PID: 16647    TASK: ffff040dfc085000  CPU: 59   COMMAND: "genload"
PID: 16626    TASK: ffff040dfdce9400  CPU: 60   COMMAND: "genload"
PID: 16608    TASK: ffff0400426ad000  CPU: 61   COMMAND: "genload"
PID: 16723    TASK: ffff04003a4f0000  CPU: 62   COMMAND: "genload"
PID: 16721    TASK: ffff04003a4f1400  CPU: 63   COMMAND: "genload"
PID: 16588    TASK: ffff0400193f5000  CPU: 64   COMMAND: "genload"
PID: 0        TASK: ffff040006665000  CPU: 65   COMMAND: "swapper/65"
PID: 0        TASK: ffff040006663c00  CPU: 66   COMMAND: "swapper/66"
PID: 0        TASK: ffff040006660000  CPU: 67   COMMAND: "swapper/67"
PID: 0        TASK: ffff040006666400  CPU: 68   COMMAND: "swapper/68"
PID: 0        TASK: ffff040006661400  CPU: 69   COMMAND: "swapper/69"
PID: 0        TASK: ffff040006668000  CPU: 70   COMMAND: "swapper/70"
PID: 0        TASK: ffff04000666e400  CPU: 71   COMMAND: "swapper/71"
PID: 0        TASK: ffff040006669400  CPU: 72   COMMAND: "swapper/72"
PID: 0        TASK: ffff04000666a800  CPU: 73   COMMAND: "swapper/73"
PID: 0        TASK: ffff04000666d000  CPU: 74   COMMAND: "swapper/74"
PID: 0        TASK: ffff04000666bc00  CPU: 75   COMMAND: "swapper/75"
PID: 0        TASK: ffff040006679400  CPU: 76   COMMAND: "swapper/76"
PID: 0        TASK: ffff04000667a800  CPU: 77   COMMAND: "swapper/77"
PID: 0        TASK: ffff04000667d000  CPU: 78   COMMAND: "swapper/78"
PID: 0        TASK: ffff04000667bc00  CPU: 79   COMMAND: "swapper/79"
PID: 0        TASK: ffff040006678000  CPU: 80   COMMAND: "swapper/80"
PID: 0        TASK: ffff04000667e400  CPU: 81   COMMAND: "swapper/81"
PID: 0        TASK: ffff040006682800  CPU: 82   COMMAND: "swapper/82"
PID: 0        TASK: ffff040006685000  CPU: 83   COMMAND: "swapper/83"
PID: 0        TASK: ffff040006683c00  CPU: 84   COMMAND: "swapper/84"
PID: 0        TASK: ffff040006680000  CPU: 85   COMMAND: "swapper/85"
PID: 0        TASK: ffff040006686400  CPU: 86   COMMAND: "swapper/86"
PID: 0        TASK: ffff040006681400  CPU: 87   COMMAND: "swapper/87"
PID: 0        TASK: ffff040006696400  CPU: 88   COMMAND: "swapper/88"
PID: 0        TASK: ffff040006691400  CPU: 89   COMMAND: "swapper/89"
PID: 0        TASK: ffff040006692800  CPU: 90   COMMAND: "swapper/90"
PID: 0        TASK: ffff040006695000  CPU: 91   COMMAND: "swapper/91"
PID: 0        TASK: ffff040006693c00  CPU: 92   COMMAND: "swapper/92"
PID: 0        TASK: ffff040006690000  CPU: 93   COMMAND: "swapper/93"
PID: 0        TASK: ffff0400066ae400  CPU: 94   COMMAND: "swapper/94"
PID: 0        TASK: ffff0400066a9400  CPU: 95   COMMAND: "swapper/95"
PID: 0        TASK: ffff0400066aa800  CPU: 96   COMMAND: "swapper/96"
PID: 0        TASK: ffff0400066ad000  CPU: 97   COMMAND: "swapper/97"
PID: 0        TASK: ffff0400066abc00  CPU: 98   COMMAND: "swapper/98"
PID: 0        TASK: ffff0400066a8000  CPU: 99   COMMAND: "swapper/99"
PID: 0        TASK: ffff0400066b2800  CPU: 100  COMMAND: "swapper/100"
PID: 0        TASK: ffff0400066b5000  CPU: 101  COMMAND: "swapper/101"
PID: 0        TASK: ffff0400066b3c00  CPU: 102  COMMAND: "swapper/102"
PID: 0        TASK: ffff0400066b0000  CPU: 103  COMMAND: "swapper/103"
PID: 0        TASK: ffff0400066b6400  CPU: 104  COMMAND: "swapper/104"
PID: 0        TASK: ffff0400066b1400  CPU: 105  COMMAND: "swapper/105"
PID: 0        TASK: ffff0400066b8000  CPU: 106  COMMAND: "swapper/106"
PID: 0        TASK: ffff0400066be400  CPU: 107  COMMAND: "swapper/107"
PID: 0        TASK: ffff0400066b9400  CPU: 108  COMMAND: "swapper/108"
PID: 16713    TASK: ffff000828279400  CPU: 109  COMMAND: "genload"
PID: 0        TASK: ffff0400066bd000  CPU: 110  COMMAND: "swapper/110"
PID: 0        TASK: ffff0400066bbc00  CPU: 111  COMMAND: "swapper/111"
PID: 0        TASK: ffff0400066c0000  CPU: 112  COMMAND: "swapper/112"
PID: 0        TASK: ffff0400066c6400  CPU: 113  COMMAND: "swapper/113"
PID: 0        TASK: ffff0400066c1400  CPU: 114  COMMAND: "swapper/114"
PID: 16722    TASK: ffff04003a4f6400  CPU: 115  COMMAND: "genload"
PID: 16642    TASK: ffff040dfdcf9400  CPU: 116  COMMAND: "genload"
PID: 0        TASK: ffff0400066c3c00  CPU: 117  COMMAND: "swapper/117"
PID: 16589    TASK: ffff0400193f0000  CPU: 118  COMMAND: "genload"
PID: 16640    TASK: ffff040dfdcf8000  CPU: 119  COMMAND: "genload"
PID: 16629    TASK: ffff040dfdcebc00  CPU: 120  COMMAND: "genload"
PID: 16614    TASK: ffff040027b9a800  CPU: 121  COMMAND: "genload"
PID: 16593    TASK: ffff04000eb8bc00  CPU: 122  COMMAND: "genload"
PID: 0        TASK: ffff0400066ca800  CPU: 123  COMMAND: "swapper/123"
PID: 16605    TASK: ffff04000f10e400  CPU: 124  COMMAND: "genload"
PID: 16603    TASK: ffff04000f108000  CPU: 125  COMMAND: "genload"
PID: 16598    TASK: ffff04000eb8d000  CPU: 126  COMMAND: "genload"
PID: 670149   TASK: ffff04000742a800  CPU: 127  COMMAND: "BackgroundWorke"
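
A quick tally of the above (a sketch over the same bt.log):

$ grep CPU bt.log | awk -F'"' '{print $2}' | sed 's#/[0-9]*$##' | sort | uniq -c

which comes to 75 genload, 52 idle swapper CPUs, and the one hung BackgroundWorke on CPU 127.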

The genload call stack is as follows (frames #0-#2 are the crash-stop IPI, i.e. this is simply where the task happened to be running when CPU 127 panicked):
PID: 16716    TASK: ffff000828278000  CPU: 0    COMMAND: "genload"
 #0 [ffff800080003d40] crash_save_cpu at ffff80008015c1dc
 #1 [ffff800080003ef0] ipi_cpu_crash_stop at ffff800080026a10
 #2 [ffff800080003f10] do_handle_IPI at ffff800080026e48
 #3 [ffff800080003f50] ipi_handler at ffff800080026f4c
 #4 [ffff800080003f60] handle_percpu_devid_irq at ffff8000800fde18
 #5 [ffff800080003fa0] generic_handle_domain_irq at ffff8000800f5aa4
 #6 [ffff800080003fb0] __gic_handle_irq_from_irqson at ffff8000806b2c1c
 #7 [ffff800080003fe0] gic_handle_irq at ffff800080010100
--- <IRQ stack> ---
 #8 [ffff8000afa8b550] call_on_irq_stack at ffff8000800164f8
 #9 [ffff8000afa8b560] do_interrupt_handler at ffff80008001884c
#10 [ffff8000afa8b580] el1_interrupt at ffff800080d2fe60
#11 [ffff8000afa8b5a0] el1h_64_irq_handler at ffff800080d31ec0
#12 [ffff8000afa8b6e0] el1h_64_irq at ffff800080011384
#13 [ffff8000afa8b700] _raw_spin_unlock_irqrestore at ffff800080d402e0
#14 [ffff8000afa8b770] __kfence_alloc at ffff800080384450
#15 [ffff8000afa8ba00] kmem_cache_alloc at ffff80008037cde8
#16 [ffff8000afa8ba70] alloc_buffer_head at ffff80008042c0e8
#17 [ffff8000afa8ba90] folio_alloc_buffers at ffff80008042d7d0
#18 [ffff8000afa8bae0] folio_create_empty_buffers at ffff80008042d9b4
#19 [ffff8000afa8bb10] folio_create_buffers at ffff80008042dbb4
#20 [ffff8000afa8bb30] __block_write_begin_int at ffff80008042f638
#21 [ffff8000afa8bbe0] __block_write_begin at ffff80008042fa84
#22 [ffff8000afa8bbf0] ext4_da_write_begin at ffff8000804cae64
#23 [ffff8000afa8bc70] generic_perform_write at ffff8000802ac200
#24 [ffff8000afa8bd10] ext4_buffered_write_iter at ffff8000804b4704
#25 [ffff8000afa8bd40] ext4_file_write_iter at ffff8000804b4bb8
#26 [ffff8000afa8bd50] vfs_write at ffff8000803d67d8
#27 [ffff8000afa8bdf0] ksys_write at ffff8000803d6ad4
#28 [ffff8000afa8be30] __arm64_sys_write at ffff8000803d6b8c
#29 [ffff8000afa8be40] el0_svc_common.constprop.0 at ffff80008002881c
#30 [ffff8000afa8be70] do_el0_svc at ffff800080028914
#31 [ffff8000afa8be80] el0_svc at ffff800080d317f8
#32 [ffff8000afa8bea0] el0t_64_sync_handler at ffff800080d3204c
#33 [ffff8000afa8bfe0] el0t_64_sync at ffff800080011608


Could the memory (mm) team please take a first look at this?
Comment 2 banye97 alibaba_cloud_group 2025-03-13 15:51:54 UTC
Call trace analysis:
1. A user-space process invokes the execve syscall.
2. While do_execveat_common->count->get_user_arg_ptr is parsing the user-space argument array (argv), the kernel takes a synchronous exception (data abort).

/*
 * count() counts the number of strings in array ARGV.
 */
static int count(struct user_arg_ptr argv, int max)
{
	int i = 0;

	if (argv.ptr.native != NULL) {
		for (;;) {
			const char __user *p = get_user_arg_ptr(argv, i);

			if (!p)
				break;

			if (IS_ERR(p))
				return -EFAULT;

			if (i >= max)
				return -E2BIG;
			++i;

			if (fatal_signal_pending(current))
				return -ERESTARTNOHAND;
			cond_resched();
		}
	}
	return i;
}
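
For reference, get_user_arg_ptr() (quoted from fs/exec.c around upstream v6.6; the ANCK tree may differ slightly) is where the faulting user access is made:

static const char __user *get_user_arg_ptr(struct user_arg_ptr argv, int nr)
{
	const char __user *native;

#ifdef CONFIG_COMPAT
	if (unlikely(argv.is_compat)) {
		compat_uptr_t compat;

		if (get_user(compat, argv.ptr.compat + nr))
			return ERR_PTR(-EFAULT);

		return compat_ptr(compat);
	}
#endif

	/* get_user() on the user pointer raises the EL1 data abort seen in the backtrace */
	if (get_user(native, argv.ptr.native + nr))
		return ERR_PTR(-EFAULT);

	return native;
}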

3. The kernel's synchronous exception handler path, el1h_64_sync_handler->el1_abort, handles the abort:
  392   static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
  393   {
* 394           unsigned long far = read_sysreg(far_el1);
  395
  396           enter_from_kernel_mode(regs);
  397           local_daif_inherit(regs);
  398           do_mem_abort(far, esr, regs);
  399           local_daif_mask();
  400           exit_to_kernel_mode(regs);
  401   }
4. The softlockup was reported while local_daif_inherit was executing.

On the critical path, only count() contains a loop, but that loop calls cond_resched(), so it is unlikely to be what triggers the softlockup.
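
One way to check how long CPU 127 had actually been stuck is to compare the per-CPU watchdog timestamps in the dump; a sketch, assuming the v6.6 per-cpu variable names from kernel/watchdog.c:

crash> p watchdog_touch_ts:127
crash> p watchdog_report_ts:127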
Comment 3 banye97 alibaba_cloud_group 2025-03-13 17:10:19 UTC
(In reply to banye97 from comment #2)
> ...
> On the critical path, only count() contains a loop, but that loop calls
> cond_resched(), so it is unlikely to be what triggers the softlockup.


Correction: the softlockup did not occur while parsing the argv array, but while parsing the envp array.
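
This is consistent with the code: do_execveat_common() runs the same count() loop over both arrays, so the backtrace looks identical either way. From fs/exec.c (upstream v6.6, abbreviated):

	bprm->argc = count(argv, MAX_ARG_STRINGS);
	...
	bprm->envc = count(envp, MAX_ARG_STRINGS);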
Comment 4 banye97 alibaba_cloud_group 2025-03-13 17:11:27 UTC
Meanwhile the system log contains a large number of ext4 filesystem errors, which also appeared before the softlockup:

[202920.365639] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm statvfs01: lblock 0 mapped to illegal pblock 9279 (length 1)
[202920.400001] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cgroup_fj_stres: lblock 0 mapped to illegal pblock 9279 (length 1)
[202920.527839] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mktemp: lblock 0 mapped to illegal pblock 9279 (length 1)
[203033.405731] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cgroup_fj_stres: lblock 0 mapped to illegal pblock 9279 (length 1)
[203033.523268] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_bind_rbind18: lblock 0 mapped to illegal pblock 9279 (length 1)
[203033.950514] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mktemp: lblock 0 mapped to illegal pblock 9279 (length 1)
[203034.401607] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_bind_rbind18: lblock 0 mapped to illegal pblock 9279 (length 1)
[203034.541679] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_fill: lblock 0 mapped to illegal pblock 9279 (length 1)
[203034.553754] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cgroup_fj_stres: lblock 0 mapped to illegal pblock 9279 (length 1)
[203034.652233] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mktemp: lblock 0 mapped to illegal pblock 9279 (length 1)
[203034.716704] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm tst_cgctl: lblock 0 mapped to illegal pblock 9279 (length 1)
[203035.066671] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cgroup_fj_stres: lblock 0 mapped to illegal pblock 9279 (length 1)
[203035.265391] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm sync01: lblock 0 mapped to illegal pblock 9279 (length 1)
[203042.853874] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm symlink01: lblock 0 mapped to illegal pblock 9279 (length 1)
[203043.302905] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm diotest3: lblock 0 mapped to illegal pblock 9279 (length 1)
[203043.514959] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_bind_move21.: lblock 0 mapped to illegal pblock 9279 (length 1)
[203043.589833] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mktemp: lblock 0 mapped to illegal pblock 9279 (length 1)
[203043.770971] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_bind_move21.: lblock 0 mapped to illegal pblock 9279 (length 1)
[203043.920844] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm pids.sh: lblock 0 mapped to illegal pblock 9279 (length 1)
[203046.002426] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm pids.sh: lblock 0 mapped to illegal pblock 9279 (length 1)
[203046.430482] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fsopen01: lblock 0 mapped to illegal pblock 9279 (length 1)
[203046.648957] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_perms: lblock 0 mapped to illegal pblock 9279 (length 1)
[203047.098539] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm msgstress03: lblock 0 mapped to illegal pblock 9279 (length 1)
[203049.812385] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cgroup_fj_stres: lblock 0 mapped to illegal pblock 9279 (length 1)
[203050.211071] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm epoll_wait05: lblock 0 mapped to illegal pblock 9279 (length 1)
[203050.653564] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm msgrcv02: lblock 0 mapped to illegal pblock 9279 (length 1)
[203050.999817] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm pidfd_getfd01: lblock 0 mapped to illegal pblock 9279 (length 1)
[203051.226434] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm prctl03: lblock 0 mapped to illegal pblock 9279 (length 1)
[203051.434233] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm sh: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.565032] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mkdir: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.565224] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mkdir: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.576675] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.576881] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.587317] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.587372] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.589696] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_di: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.597525] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.597685] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.817837] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm getcwd04: lblock 0 mapped to illegal pblock 9279 (length 1)
[203303.814415]  ext4_da_write_begin+0xa4/0x2b8
[203303.814421]  ext4_buffered_write_iter+0x70/0x140
[203303.814424]  ext4_file_write_iter+0x3c/0x68
[203823.464947]  ext4_da_write_begin+0xa4/0x2b8
[203823.464951]  ext4_buffered_write_iter+0x70/0x140
[203823.464954]  ext4_file_write_iter+0x3c/0x68
[203830.911854] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm pwritev01: lblock 0 mapped to illegal pblock 9279 (length 1)
[203831.176277] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cpuhotplug06.sh: lblock 0 mapped to illegal pblock 9279 (length 1)
[203831.919233] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cpuhotplug06.sh: lblock 0 mapped to illegal pblock 9279 (length 1)

Comment 5 Ferry Meng alibaba_cloud_group 2025-03-14 16:05:28 UTC
BackgroundWorker's parent process is logagent; it has nothing to do with LTP, but it is unclear which component this process belongs to.
It looks like the softlockup was caused by logagent's disk I/O getting no response; the trigger is hard to pin down. The fs errors are "possibly" the trigger. The disk needs an fsck repair.
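
A minimal repair sketch, assuming nvme0n1p1 (apparently the system disk) can be taken offline, e.g. from a rescue environment, or by forcing a check on the next boot:

# offline repair (or schedule a check at reboot with: touch /forcefsck)
e2fsck -fy /dev/nvme0n1p1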

Can this be reproduced reliably?
Comment 6 Janos alibaba_cloud_group 2025-03-17 15:48:05 UTC
(In reply to Ferry Meng from comment #5)
> BackgroundWorker's parent process is logagent; it has nothing to do with LTP, but it is unclear which component this process belongs to.
> It looks like the softlockup was caused by logagent's disk I/O getting no response; the trigger is hard to pin down. The fs errors are "possibly" the trigger. The disk needs an fsck repair.
> 
> Can this be reproduced reliably?

So far it has occurred only once; we have not been able to reproduce it yet.