Created attachment 1322 [details] vmcore_dmesg

[Defect description]:
Running ltp-stress on a Yitian machine for about 48 hours triggers a crash: Kernel panic - not syncing: softlockup: hung tasks

[Machine information]:
Environment: bare metal
Model: Yitian

Kernel version:
# uname -r
6.6.71-3_rc2.al8.aarch64

Memory:
# free -h
              total        used        free      shared  buff/cache   available
Mem:          503Gi       3.0Gi       476Gi        12Mi        23Gi       497Gi
Swap:         2.0Gi          0B       2.0Gi

CPU:
# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  1
Core(s) per socket:  128
Socket(s):           1
NUMA node(s):        2
Vendor ID:           ARM
BIOS Vendor ID:      T-HEAD
Model:               0
Model name:          Neoverse-N2
BIOS Model name:     Yitian710-128
Stepping:            r0p0
CPU MHz:             2750.001
BogoMIPS:            100.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            65536K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh

CMDLINE:
# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-6.6.71-3_rc2.al8.aarch64 root=UUID=1e1d9fc1-be93-4b6b-bb50-9f86448f8a4d ro biosdevname=0 rd.driver.pre=ahci console=ttyS0,115200 fsck.repair=yes cgroup.memory=nokmem crashkernel=0M-2G:0M,2G-64G:256M,64G-:384M iommu.passthrough=1 iommu.strict=0 ssbd=force-off nospectre_bhb no_hash_pointers transparent_hugepage_tmpfs=always thp_shmem=64K:always thp_anon=64K:always thp_file=2M:always+exec

[Steps to reproduce]:
# Stability-test pre-configuration:
echo 1 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/hardlockup_panic
echo 1 > /proc/sys/kernel/softlockup_panic
echo 150 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo '0 4 0 7' > /proc/sys/kernel/printk

# Initialize the data disk
[ -d /disk1 ] || mkdir /disk1
wipefs -a --force /dev/nvme1n1p1
mkfs -t ext4 -q -F /dev/nvme1n1p1
mount -t ext4 /dev/nvme1n1p1 /disk1
mkdir -p /disk1/tmpdir/ltp

# Download and build the test suite
git clone http://code.alibaba-inc.com/alikernel/ltp.git --branch LTP-20240417-6_6 # 6.6
export CFLAGS="-fcommon" # needed for gcc 10
make autotools
./configure
make
make install

# Prepare the test script (heredoc delimiter quoted so the variables are
# expanded when load.sh runs, not when the file is written)
cat <<-'EOF' > /opt/ltp/load.sh
#!/bin/bash
nr_cpu=$(nproc)
mem_kb=$(grep ^MemTotal /proc/meminfo | awk '{print $2}')
./runltp \
    -c $((nr_cpu / 2)) \
    -m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 \
    -D $((nr_cpu / 10)),1,0,1 \
    -i 2 \
    -R -p -q \
    -t 72h \
    -d /disk1/tmpdir/ltp \
    -b /dev/vdb1 -B ext4 -z /dev/vdb2 -Z ext4
EOF
chmod a+x /opt/ltp/load.sh

# Run the test
nohup ./load.sh &> ltp-stress.log &

[Expected result]:
The ltp-stress test runs normally and the system does not crash.

[Actual result]:
The crash occurs after roughly 48 hours of running. Crash analysis follows; dmesg is attached:

crash /usr/lib/debug/usr/lib/modules/6.6.71-3_rc2.al8.aarch64/vmlinux vmcore
      KERNEL: /usr/lib/debug/usr/lib/modules/6.6.71-3_rc2.al8.aarch64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 128
        DATE: Mon Mar 10 00:30:30 CST 2025
      UPTIME: 2 days, 08:43:02
LOAD AVERAGE: 142.03, 135.11, 121.11
       TASKS: 1613
    NODENAME: v43g11200.sqa.na131
     RELEASE: 6.6.71-3_rc2.al8.aarch64
     VERSION: #1 SMP PREEMPT_DYNAMIC Fri Mar 7 12:41:15 CST 2025
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 512 GB
       PANIC: "Kernel panic - not syncing: softlockup: hung tasks"
         PID: 670149
     COMMAND: "BackgroundWorke"
        TASK: ffff04000742a800  [THREAD_INFO: ffff04000742a800]
         CPU: 127
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 670149  TASK: ffff04000742a800  CPU: 127  COMMAND: "BackgroundWorke"
 #0 [ffff800082a43bb0] crash_setup_regs at ffff80008015a9b8
 #1 [ffff800082a43d30] panic at ffff8000800543bc
 #2 [ffff800082a43e10] watchdog_timer_fn at ffff80008019da18
 #3 [ffff800082a43e60] __hrtimer_run_queues at ffff800080138e30
 #4 [ffff800082a43ef0] hrtimer_interrupt at ffff800080139d40
 #5 [ffff800082a43f50] arch_timer_handler_phys at ffff800080a4d1b4
 #6 [ffff800082a43f60] handle_percpu_devid_irq at ffff8000800fde18
 #7 [ffff800082a43fa0] generic_handle_domain_irq at ffff8000800f5aa4
 #8 [ffff800082a43fb0] __gic_handle_irq_from_irqson at ffff8000806b2c1c
 #9 [ffff800082a43fe0] gic_handle_irq at ffff800080010100
--- <IRQ stack> ---
#10 [ffff80009fddba40] call_on_irq_stack at ffff8000800164f8
#11 [ffff80009fddba50] do_interrupt_handler at ffff80008001884c
#12 [ffff80009fddba70] el1_interrupt at ffff800080d2fe60
#13 [ffff80009fddba90] el1h_64_irq_handler at ffff800080d31ec0
#14 [ffff80009fddbbd0] el1h_64_irq at ffff800080011384
#15 [ffff80009fddbbf0] local_daif_inherit at ffff800080018794
#16 [ffff80009fddbc20] el1h_64_sync_handler at ffff800080d31ea4
#17 [ffff80009fddbd60] el1h_64_sync at ffff800080011304
#18 [ffff80009fddbd80] get_user_arg_ptr at ffff8000803dfa08
#19 [ffff80009fddbdc0] do_execveat_common at ffff8000803e0f3c
#20 [ffff80009fddbe20] __arm64_sys_execve at ffff8000803e1248
#21 [ffff80009fddbe40] el0_svc_common.constprop.0 at ffff80008002881c
#22 [ffff80009fddbe70] do_el0_svc at ffff800080028914
#23 [ffff80009fddbe80] el0_svc at ffff800080d317f8
#24 [ffff80009fddbea0] el0t_64_sync_handler at ffff800080d3204c
#25 [ffff80009fddbfe0] el0t_64_sync at ffff800080011608
     PC: 0000ffff9f3e7d4c   LR: 0000ffff9f46b7f8   SP: 0000ffff9c9fe8a0
    X29: 0000ffff9c9fe8b0  X28: 0000000000000000  X27: 000000000000000a
    X26: 0000ffff9c9fef98  X25: 0000000000000000  X24: 0000000000000001
    X23: aaaaaaaaaaaaaaab  X22: 000000000be7c380  X21: 000000000be9e8f0
    X20: 000000000bf44a98  X19: 0000ffff9f53f000  X18: 00a3d70a3d70a3d6
    X17: 00000000032a0180  X16: 0000ffff9f46b450  X15: 051eb851eb851eb0
    X14: 0000000000000001  X13: 0000000000000000  X12: 0000ffff9f3c7320
    X11: ffffffffffffffff  X10: 0000000000000000   X9: 0000000000000005
     X8: 00000000000000dd   X7: 0000000000000003   X6: 0000000000000003
     X5: 0000ffff9f6eb5d8   X4: 000000000000002f   X3: 000000002fd12a60
     X2: 000000000be7c380   X1: 000000000be9e8f0   X0: 000000000bf44a98
    ORIG_X0: 000000000bf44a98  SYSCALLNO: dd  PSTATE: 40001000
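Note: SYSCALLNO: dd (0xdd = 221) is __NR_execve in the arm64/generic syscall table, consistent with the __arm64_sys_execve frame above; the hung task was in the middle of an execve call.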
There are two kinds of busy CPUs:
genload: the memory stress load
BackgroundWorke: the hung CPU, copying memory from user space into kernel space

$ cat bt.log | grep CPU
PID: 16716  TASK: ffff000828278000  CPU: 0    COMMAND: "genload"
PID: 16649  TASK: ffff040dfc080000  CPU: 1    COMMAND: "genload"
PID: 16645  TASK: ffff040dfc081400  CPU: 2    COMMAND: "genload"
PID: 16650  TASK: ffff040dfc086400  CPU: 3    COMMAND: "genload"
PID: 16602  TASK: ffff04000f10d000  CPU: 4    COMMAND: "genload"
PID: 16613  TASK: ffff040027b98000  CPU: 5    COMMAND: "genload"
PID: 16719  TASK: ffff04003819bc00  CPU: 6    COMMAND: "genload"
PID: 16595  TASK: ffff04000eb8a800  CPU: 7    COMMAND: "genload"
PID: 16599  TASK: ffff04000eb89400  CPU: 8    COMMAND: "genload"
PID: 16590  TASK: ffff0400193f3c00  CPU: 9    COMMAND: "genload"
PID: 16624  TASK: ffff040dfdce2800  CPU: 10   COMMAND: "genload"
PID: 16587  TASK: ffff0400193f6400  CPU: 11   COMMAND: "genload"
PID: 16637  TASK: ffff040dfdcf3c00  CPU: 12   COMMAND: "genload"
PID: 16653  TASK: ffff040dfc08e400  CPU: 13   COMMAND: "genload"
PID: 16619  TASK: ffff040dfdce3c00  CPU: 14   COMMAND: "genload"
PID: 16592  TASK: ffff0400077f3c00  CPU: 15   COMMAND: "genload"
PID: 16639  TASK: ffff040dfdcfbc00  CPU: 16   COMMAND: "genload"
PID: 16616  TASK: ffff040027b9e400  CPU: 17   COMMAND: "genload"
PID: 16630  TASK: ffff040dfdce8000  CPU: 18   COMMAND: "genload"
PID: 16617  TASK: ffff040027b99400  CPU: 19   COMMAND: "genload"
PID: 16621  TASK: ffff040dfdce0000  CPU: 20   COMMAND: "genload"
PID: 16635  TASK: ffff040dfdcf2800  CPU: 21   COMMAND: "genload"
PID: 16611  TASK: ffff0400426aa800  CPU: 22   COMMAND: "genload"
PID: 16591  TASK: ffff0400193f2800  CPU: 23   COMMAND: "genload"
PID: 16636  TASK: ffff040dfdcf5000  CPU: 24   COMMAND: "genload"
PID: 16714  TASK: ffff00082827a800  CPU: 25   COMMAND: "genload"
PID: 16615  TASK: ffff040027b9d000  CPU: 26   COMMAND: "genload"
PID: 16652  TASK: ffff040dfc088000  CPU: 27   COMMAND: "genload"
PID: 16628  TASK: ffff040dfdced000  CPU: 28   COMMAND: "genload"
PID: 16717  TASK: ffff00082827d000  CPU: 29   COMMAND: "genload"
PID: 16625  TASK: ffff040dfdce5000  CPU: 30   COMMAND: "genload"
PID: 0      TASK: ffff0008074c2800  CPU: 31   COMMAND: "swapper/31"
PID: 16601  TASK: ffff04000f109400  CPU: 32   COMMAND: "genload"
PID: 16646  TASK: ffff040dfc082800  CPU: 33   COMMAND: "genload"
PID: 16638  TASK: ffff040dfdcfd000  CPU: 34   COMMAND: "genload"
PID: 16623  TASK: ffff040dfdce1400  CPU: 35   COMMAND: "genload"
PID: 16648  TASK: ffff040dfc083c00  CPU: 36   COMMAND: "genload"
PID: 16610  TASK: ffff0400426a9400  CPU: 37   COMMAND: "genload"
PID: 16643  TASK: ffff040dfdcfa800  CPU: 38   COMMAND: "genload"
PID: 16627  TASK: ffff040dfdcea800  CPU: 39   COMMAND: "genload"
PID: 16633  TASK: ffff040dfdcf6400  CPU: 40   COMMAND: "genload"
PID: 16712  TASK: ffff00082827bc00  CPU: 41   COMMAND: "genload"
PID: 16641  TASK: ffff040dfdcfe400  CPU: 42   COMMAND: "genload"
PID: 16594  TASK: ffff04000eb8e400  CPU: 43   COMMAND: "genload"
PID: 16604  TASK: ffff04000f10a800  CPU: 44   COMMAND: "genload"
PID: 16607  TASK: ffff040015a8bc00  CPU: 45   COMMAND: "genload"
PID: 16632  TASK: ffff040dfdcf0000  CPU: 46   COMMAND: "genload"
PID: 16718  TASK: ffff040038198000  CPU: 47   COMMAND: "genload"
PID: 16631  TASK: ffff040dfdcee400  CPU: 48   COMMAND: "genload"
PID: 16618  TASK: ffff040027b9bc00  CPU: 49   COMMAND: "genload"
PID: 16634  TASK: ffff040dfdcf1400  CPU: 50   COMMAND: "genload"
PID: 16720  TASK: ffff04003a4f3c00  CPU: 51   COMMAND: "genload"
PID: 16622  TASK: ffff040dfdce6400  CPU: 52   COMMAND: "genload"
PID: 16606  TASK: ffff040015a89400  CPU: 53   COMMAND: "genload"
PID: 16600  TASK: ffff04000f10bc00  CPU: 54   COMMAND: "genload"
PID: 16597  TASK: ffff04000eb88000  CPU: 55   COMMAND: "genload"
PID: 16612  TASK: ffff0400426abc00  CPU: 56   COMMAND: "genload"
PID: 16651  TASK: ffff040dfc08bc00  CPU: 57   COMMAND: "genload"
PID: 16609  TASK: ffff0400426a8000  CPU: 58   COMMAND: "genload"
PID: 16647  TASK: ffff040dfc085000  CPU: 59   COMMAND: "genload"
PID: 16626  TASK: ffff040dfdce9400  CPU: 60   COMMAND: "genload"
PID: 16608  TASK: ffff0400426ad000  CPU: 61   COMMAND: "genload"
PID: 16723  TASK: ffff04003a4f0000  CPU: 62   COMMAND: "genload"
PID: 16721  TASK: ffff04003a4f1400  CPU: 63   COMMAND: "genload"
PID: 16588  TASK: ffff0400193f5000  CPU: 64   COMMAND: "genload"
PID: 0      TASK: ffff040006665000  CPU: 65   COMMAND: "swapper/65"
PID: 0      TASK: ffff040006663c00  CPU: 66   COMMAND: "swapper/66"
PID: 0      TASK: ffff040006660000  CPU: 67   COMMAND: "swapper/67"
PID: 0      TASK: ffff040006666400  CPU: 68   COMMAND: "swapper/68"
PID: 0      TASK: ffff040006661400  CPU: 69   COMMAND: "swapper/69"
PID: 0      TASK: ffff040006668000  CPU: 70   COMMAND: "swapper/70"
PID: 0      TASK: ffff04000666e400  CPU: 71   COMMAND: "swapper/71"
PID: 0      TASK: ffff040006669400  CPU: 72   COMMAND: "swapper/72"
PID: 0      TASK: ffff04000666a800  CPU: 73   COMMAND: "swapper/73"
PID: 0      TASK: ffff04000666d000  CPU: 74   COMMAND: "swapper/74"
PID: 0      TASK: ffff04000666bc00  CPU: 75   COMMAND: "swapper/75"
PID: 0      TASK: ffff040006679400  CPU: 76   COMMAND: "swapper/76"
PID: 0      TASK: ffff04000667a800  CPU: 77   COMMAND: "swapper/77"
PID: 0      TASK: ffff04000667d000  CPU: 78   COMMAND: "swapper/78"
PID: 0      TASK: ffff04000667bc00  CPU: 79   COMMAND: "swapper/79"
PID: 0      TASK: ffff040006678000  CPU: 80   COMMAND: "swapper/80"
PID: 0      TASK: ffff04000667e400  CPU: 81   COMMAND: "swapper/81"
PID: 0      TASK: ffff040006682800  CPU: 82   COMMAND: "swapper/82"
PID: 0      TASK: ffff040006685000  CPU: 83   COMMAND: "swapper/83"
PID: 0      TASK: ffff040006683c00  CPU: 84   COMMAND: "swapper/84"
PID: 0      TASK: ffff040006680000  CPU: 85   COMMAND: "swapper/85"
PID: 0      TASK: ffff040006686400  CPU: 86   COMMAND: "swapper/86"
PID: 0      TASK: ffff040006681400  CPU: 87   COMMAND: "swapper/87"
PID: 0      TASK: ffff040006696400  CPU: 88   COMMAND: "swapper/88"
PID: 0      TASK: ffff040006691400  CPU: 89   COMMAND: "swapper/89"
PID: 0      TASK: ffff040006692800  CPU: 90   COMMAND: "swapper/90"
PID: 0      TASK: ffff040006695000  CPU: 91   COMMAND: "swapper/91"
PID: 0      TASK: ffff040006693c00  CPU: 92   COMMAND: "swapper/92"
PID: 0      TASK: ffff040006690000  CPU: 93   COMMAND: "swapper/93"
PID: 0      TASK: ffff0400066ae400  CPU: 94   COMMAND: "swapper/94"
PID: 0      TASK: ffff0400066a9400  CPU: 95   COMMAND: "swapper/95"
PID: 0      TASK: ffff0400066aa800  CPU: 96   COMMAND: "swapper/96"
PID: 0      TASK: ffff0400066ad000  CPU: 97   COMMAND: "swapper/97"
PID: 0      TASK: ffff0400066abc00  CPU: 98   COMMAND: "swapper/98"
PID: 0      TASK: ffff0400066a8000  CPU: 99   COMMAND: "swapper/99"
PID: 0      TASK: ffff0400066b2800  CPU: 100  COMMAND: "swapper/100"
PID: 0      TASK: ffff0400066b5000  CPU: 101  COMMAND: "swapper/101"
PID: 0      TASK: ffff0400066b3c00  CPU: 102  COMMAND: "swapper/102"
PID: 0      TASK: ffff0400066b0000  CPU: 103  COMMAND: "swapper/103"
PID: 0      TASK: ffff0400066b6400  CPU: 104  COMMAND: "swapper/104"
PID: 0      TASK: ffff0400066b1400  CPU: 105  COMMAND: "swapper/105"
PID: 0      TASK: ffff0400066b8000  CPU: 106  COMMAND: "swapper/106"
PID: 0      TASK: ffff0400066be400  CPU: 107  COMMAND: "swapper/107"
PID: 0      TASK: ffff0400066b9400  CPU: 108  COMMAND: "swapper/108"
PID: 16713  TASK: ffff000828279400  CPU: 109  COMMAND: "genload"
PID: 0      TASK: ffff0400066bd000  CPU: 110  COMMAND: "swapper/110"
PID: 0      TASK: ffff0400066bbc00  CPU: 111  COMMAND: "swapper/111"
PID: 0      TASK: ffff0400066c0000  CPU: 112  COMMAND: "swapper/112"
PID: 0      TASK: ffff0400066c6400  CPU: 113  COMMAND: "swapper/113"
PID: 0      TASK: ffff0400066c1400  CPU: 114  COMMAND: "swapper/114"
PID: 16722  TASK: ffff04003a4f6400  CPU: 115  COMMAND: "genload"
PID: 16642  TASK: ffff040dfdcf9400  CPU: 116  COMMAND: "genload"
PID: 0      TASK: ffff0400066c3c00  CPU: 117  COMMAND: "swapper/117"
PID: 16589  TASK: ffff0400193f0000  CPU: 118  COMMAND: "genload"
PID: 16640  TASK: ffff040dfdcf8000  CPU: 119  COMMAND: "genload"
PID: 16629  TASK: ffff040dfdcebc00  CPU: 120  COMMAND: "genload"
PID: 16614  TASK: ffff040027b9a800  CPU: 121  COMMAND: "genload"
PID: 16593  TASK: ffff04000eb8bc00  CPU: 122  COMMAND: "genload"
PID: 0      TASK: ffff0400066ca800  CPU: 123  COMMAND: "swapper/123"
PID: 16605  TASK: ffff04000f10e400  CPU: 124  COMMAND: "genload"
PID: 16603  TASK: ffff04000f108000  CPU: 125  COMMAND: "genload"
PID: 16598  TASK: ffff04000eb8d000  CPU: 126  COMMAND: "genload"
PID: 670149 TASK: ffff04000742a800  CPU: 127  COMMAND: "BackgroundWorke"

The genload call stack is as follows:

PID: 16716  TASK: ffff000828278000  CPU: 0  COMMAND: "genload"
 #0 [ffff800080003d40] crash_save_cpu at ffff80008015c1dc
 #1 [ffff800080003ef0] ipi_cpu_crash_stop at ffff800080026a10
 #2 [ffff800080003f10] do_handle_IPI at ffff800080026e48
 #3 [ffff800080003f50] ipi_handler at ffff800080026f4c
 #4 [ffff800080003f60] handle_percpu_devid_irq at ffff8000800fde18
 #5 [ffff800080003fa0] generic_handle_domain_irq at ffff8000800f5aa4
 #6 [ffff800080003fb0] __gic_handle_irq_from_irqson at ffff8000806b2c1c
 #7 [ffff800080003fe0] gic_handle_irq at ffff800080010100
--- <IRQ stack> ---
 #8 [ffff8000afa8b550] call_on_irq_stack at ffff8000800164f8
 #9 [ffff8000afa8b560] do_interrupt_handler at ffff80008001884c
#10 [ffff8000afa8b580] el1_interrupt at ffff800080d2fe60
#11 [ffff8000afa8b5a0] el1h_64_irq_handler at ffff800080d31ec0
#12 [ffff8000afa8b6e0] el1h_64_irq at ffff800080011384
#13 [ffff8000afa8b700] _raw_spin_unlock_irqrestore at ffff800080d402e0
#14 [ffff8000afa8b770] __kfence_alloc at ffff800080384450
#15 [ffff8000afa8ba00] kmem_cache_alloc at ffff80008037cde8
#16 [ffff8000afa8ba70] alloc_buffer_head at ffff80008042c0e8
#17 [ffff8000afa8ba90] folio_alloc_buffers at ffff80008042d7d0
#18 [ffff8000afa8bae0] folio_create_empty_buffers at ffff80008042d9b4
#19 [ffff8000afa8bb10] folio_create_buffers at ffff80008042dbb4
#20 [ffff8000afa8bb30] __block_write_begin_int at ffff80008042f638
#21 [ffff8000afa8bbe0] __block_write_begin at ffff80008042fa84
#22 [ffff8000afa8bbf0] ext4_da_write_begin at ffff8000804cae64
#23 [ffff8000afa8bc70] generic_perform_write at ffff8000802ac200
#24 [ffff8000afa8bd10] ext4_buffered_write_iter at ffff8000804b4704
#25 [ffff8000afa8bd40] ext4_file_write_iter at ffff8000804b4bb8
#26 [ffff8000afa8bd50] vfs_write at ffff8000803d67d8
#27 [ffff8000afa8bdf0] ksys_write at ffff8000803d6ad4
#28 [ffff8000afa8be30] __arm64_sys_write at ffff8000803d6b8c
#29 [ffff8000afa8be40] el0_svc_common.constprop.0 at ffff80008002881c
#30 [ffff8000afa8be70] do_el0_svc at ffff800080028914
#31 [ffff8000afa8be80] el0_svc at ffff800080d317f8
#32 [ffff8000afa8bea0] el0t_64_sync_handler at ffff800080d3204c
#33 [ffff8000afa8bfe0] el0t_64_sync at ffff800080011608

Could the memory team please take a first look at this?
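For reference, a per-CPU task list like the one above can be produced inside crash; "bt -a" dumps the backtrace of the active task on every CPU, and crash supports redirecting command output to a file (the bt.log name is the reporter's):

crash> bt -a > bt.log
$ grep CPU bt.log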
Call trace analysis:
1. A user-space process invokes the execve syscall.
2. While the kernel parses the user-space arg array in do_execveat_common -> count -> get_user_arg_ptr, a synchronous exception (data abort) is triggered.

/*
 * count() counts the number of strings in array ARGV.
 */
static int count(struct user_arg_ptr argv, int max)
{
	int i = 0;

	if (argv.ptr.native != NULL) {
		for (;;) {
			const char __user *p = get_user_arg_ptr(argv, i);

			if (!p)
				break;

			if (IS_ERR(p))
				return -EFAULT;

			if (i >= max)
				return -E2BIG;
			++i;

			if (fatal_signal_pending(current))
				return -ERESTARTNOHAND;
			cond_resched();
		}
	}
	return i;
}

3. The kernel's synchronous exception handler path, el1h_64_sync_handler -> el1_abort, handles the abort.

static void noinstr el1_abort(struct pt_regs *regs, unsigned long esr)
{
	unsigned long far = read_sysreg(far_el1);

	enter_from_kernel_mode(regs);
	local_daif_inherit(regs);
	do_mem_abort(far, esr, regs);
	local_daif_mask();
	exit_to_kernel_mode(regs);
}

4. The softlockup was reported while local_daif_inherit was executing.

The only loop on the critical path is the one in count(), but that loop calls cond_resched(), so it should not be able to trigger a softlockup by itself.
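For context, get_user_arg_ptr (the frame in which the data abort was taken) is a thin wrapper around get_user over the user-supplied argv/envp array; the following is an excerpt from fs/exec.c in v6.6 with the CONFIG_COMPAT branch elided:

static const char __user *get_user_arg_ptr(struct user_arg_ptr argv, int nr)
{
	const char __user *native;

	/* reading argv[nr]/envp[nr] from user memory can fault; this is
	 * what raises the EL1 data abort seen in the backtrace */
	if (get_user(native, argv.ptr.native + nr))
		return ERR_PTR(-EFAULT);

	return native;
}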
(In reply to banye97 from comment #2)
> Call trace analysis:
> 1. A user-space process invokes the execve syscall.
> 2. While the kernel parses the user-space arg array in do_execveat_common
>    -> count -> get_user_arg_ptr, a synchronous exception (data abort) is
>    triggered.
> [...]
> 4. The softlockup was reported while local_daif_inherit was executing.
>
> The only loop on the critical path is the one in count(), but that loop
> calls cond_resched(), so it should not be able to trigger a softlockup by
> itself.

The softlockup did not occur while parsing arg; it occurred while parsing env.
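That reading is consistent with the structure of do_execveat_common, which walks argv with count() first and then walks envp through the exact same count()/get_user_arg_ptr path, so the faulting frames look identical in both phases. Abridged from fs/exec.c in v6.6, error handling trimmed:

	/* inside do_execveat_common() */
	retval = count(argv, MAX_ARG_STRINGS);
	/* ... error handling ... */
	bprm->argc = retval;

	retval = count(envp, MAX_ARG_STRINGS);	/* env parsing, same loop */
	/* ... error handling ... */
	bprm->envc = retval;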
The system log also contains a large number of ext4 filesystem errors; the same error appeared before the softlockup as well:

[202920.365639] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm statvfs01: lblock 0 mapped to illegal pblock 9279 (length 1)
[202920.400001] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cgroup_fj_stres: lblock 0 mapped to illegal pblock 9279 (length 1)
[202920.527839] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mktemp: lblock 0 mapped to illegal pblock 9279 (length 1)
[203033.405731] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cgroup_fj_stres: lblock 0 mapped to illegal pblock 9279 (length 1)
[203033.523268] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_bind_rbind18: lblock 0 mapped to illegal pblock 9279 (length 1)
[203033.950514] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mktemp: lblock 0 mapped to illegal pblock 9279 (length 1)
[203034.401607] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_bind_rbind18: lblock 0 mapped to illegal pblock 9279 (length 1)
[203034.541679] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_fill: lblock 0 mapped to illegal pblock 9279 (length 1)
[203034.553754] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cgroup_fj_stres: lblock 0 mapped to illegal pblock 9279 (length 1)
[203034.652233] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mktemp: lblock 0 mapped to illegal pblock 9279 (length 1)
[203034.716704] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm tst_cgctl: lblock 0 mapped to illegal pblock 9279 (length 1)
[203035.066671] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cgroup_fj_stres: lblock 0 mapped to illegal pblock 9279 (length 1)
[203035.265391] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm sync01: lblock 0 mapped to illegal pblock 9279 (length 1)
[203042.853874] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm symlink01: lblock 0 mapped to illegal pblock 9279 (length 1)
[203043.302905] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm diotest3: lblock 0 mapped to illegal pblock 9279 (length 1)
[203043.514959] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_bind_move21.: lblock 0 mapped to illegal pblock 9279 (length 1)
[203043.589833] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mktemp: lblock 0 mapped to illegal pblock 9279 (length 1)
[203043.770971] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_bind_move21.: lblock 0 mapped to illegal pblock 9279 (length 1)
[203043.920844] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm pids.sh: lblock 0 mapped to illegal pblock 9279 (length 1)
[203046.002426] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm pids.sh: lblock 0 mapped to illegal pblock 9279 (length 1)
[203046.430482] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fsopen01: lblock 0 mapped to illegal pblock 9279 (length 1)
[203046.648957] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_perms: lblock 0 mapped to illegal pblock 9279 (length 1)
[203047.098539] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm msgstress03: lblock 0 mapped to illegal pblock 9279 (length 1)
[203049.812385] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cgroup_fj_stres: lblock 0 mapped to illegal pblock 9279 (length 1)
[203050.211071] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm epoll_wait05: lblock 0 mapped to illegal pblock 9279 (length 1)
[203050.653564] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm msgrcv02: lblock 0 mapped to illegal pblock 9279 (length 1)
[203050.999817] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm pidfd_getfd01: lblock 0 mapped to illegal pblock 9279 (length 1)
[203051.226434] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm prctl03: lblock 0 mapped to illegal pblock 9279 (length 1)
[203051.434233] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm sh: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.565032] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mkdir: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.565224] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm mkdir: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.576675] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.576881] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.587317] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.587372] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.589696] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm fs_di: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.597525] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.597685] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm rm: lblock 0 mapped to illegal pblock 9279 (length 1)
[203056.817837] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm getcwd04: lblock 0 mapped to illegal pblock 9279 (length 1)
[203303.814415]  ext4_da_write_begin+0xa4/0x2b8
[203303.814421]  ext4_buffered_write_iter+0x70/0x140
[203303.814424]  ext4_file_write_iter+0x3c/0x68
[203823.464947]  ext4_da_write_begin+0xa4/0x2b8
[203823.464951]  ext4_buffered_write_iter+0x70/0x140
[203823.464954]  ext4_file_write_iter+0x3c/0x68
[203830.911854] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm pwritev01: lblock 0 mapped to illegal pblock 9279 (length 1)
[203831.176277] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cpuhotplug06.sh: lblock 0 mapped to illegal pblock 9279 (length 1)
[203831.919233] EXT4-fs error (device nvme0n1p1): ext4_map_blocks:606: inode #2: block 9279: comm cpuhotplug06.sh: lblock 0 mapped to illegal pblock 9279 (length 1)
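For reference, this message is printed when a mapping returned by ext4_map_blocks fails ext4's block-validity check, i.e. the physical block overlaps filesystem metadata; note that inode #2 is the root directory of the filesystem. Lightly abridged from fs/ext4/inode.c (exact line numbers vary by kernel):

static int __check_block_validity(struct inode *inode, const char *func,
				  unsigned int line,
				  struct ext4_map_blocks *map)
{
	/* the journal inode is allowed to own "system" blocks */
	if (ext4_has_feature_journal(inode->i_sb) &&
	    (inode->i_ino ==
	     le32_to_cpu(EXT4_SB(inode->i_sb)->s_es->s_journal_inum)))
		return 0;
	/* reject mappings that land inside metadata/system zones */
	if (!ext4_inode_block_valid(inode, map->m_pblk, map->m_len)) {
		ext4_error_inode(inode, func, line, map->m_pblk,
				 "lblock %lu mapped to illegal pblock %llu "
				 "(length %d)", (unsigned long) map->m_lblk,
				 map->m_pblk, map->m_len);
		return -EFSCORRUPTED;
	}
	return 0;
}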
BackgroundWorker's parent process is logagent, which has nothing to do with LTP, but it is not clear which component that process belongs to.
It looks like the softlockup was caused by logagent's disk reads/writes not getting a response; the underlying trigger is hard to pin down. The fs errors are a "possible" trigger. The disk needs to be repaired with fsck.

Can this be reproduced reliably?
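Since nvme0n1p1 hosts the root filesystem on this machine, one possible shape of the repair, assuming the check must run while the filesystem is unmounted (device name taken from the logs above; adjust to the actual setup):

# from a rescue/maintenance shell where /dev/nvme0n1p1 is unmounted:
e2fsck -f -y /dev/nvme0n1p1

# or schedule a forced check on the next boot:
touch /forcefsck
# alternatively, add fsck.mode=force to the kernel command line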
(In reply to Ferry Meng from comment #5)
> BackgroundWorker's parent process is logagent, which has nothing to do with
> LTP, [...]
>
> Can this be reproduced reliably?

So far it has only happened this one time; it has not been reproduced yet.