Created attachment 715 [details]
vmcore-dmesg

[Defect description]:
The stress-ng OS subsystem test produced a vmcore:
Kernel panic - not syncing: softlockup: hung tasks - io_wqe_cancel_pending_work at ffff80001053d22c

The crash analysis follows (the vmcore-dmesg log is attached):

      KERNEL: /usr/lib/debug/lib/modules/5.10.134-14.al8.aarch64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 64
        DATE: Wed Apr 26 01:29:51 CST 2023
      UPTIME: 12:03:33
LOAD AVERAGE: 77631.70, 77217.19, 76083.22
       TASKS: 78684
    NODENAME: qibo-anck014-al3-zx-g8y
     RELEASE: 5.10.134-14.al8.aarch64
     VERSION: #1 SMP Thu Apr 6 16:20:35 CST 2023
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 256 GB
       PANIC: "Kernel panic - not syncing: softlockup: hung tasks"
         PID: 3294
     COMMAND: "stress-ng-io-ur"
        TASK: ffff00015d605c80  [THREAD_INFO: ffff00015d605c80]
         CPU: 7
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 3294   TASK: ffff00015d605c80  CPU: 7   COMMAND: "stress-ng-io-ur"
 #0 [ffff80001003bb60] __crash_kexec at ffff80001028a450
 #1 [ffff80001003bd00] panic at ffff800010cae79c
 #2 [ffff80001003bde0] watchdog_timer_fn at ffff8000102cd7e8
 #3 [ffff80001003be30] __run_hrtimer at ffff80001025e7f4
 #4 [ffff80001003be80] __hrtimer_run_queues at ffff80001025ead8
 #5 [ffff80001003bef0] hrtimer_interrupt at ffff80001025f684
 #6 [ffff80001003bf60] arch_timer_handler_virt at ffff800010a2a7b4
 #7 [ffff80001003bf70] handle_percpu_devid_irq at ffff800010234968
 #8 [ffff80001003bfa0] __handle_domain_irq at ffff80001022c37c
 #9 [ffff80001003bfe0] gic_handle_irq at ffff80001011011c
--- <IRQ stack> ---
#10 [ffff80001944bbc0] el1_irq at ffff800010111bb8
#11 [ffff80001944bbe0] io_wqe_cancel_pending_work at ffff80001053d22c
#12 [ffff80001944bc20] io_wq_cancel_cb at ffff80001053e764
#13 [ffff80001944bca0] io_uring_cancel_files at ffff800010533584
#14 [ffff80001944bd50] io_uring_cancel_task_requests at ffff800010536d5c
#15 [ffff80001944bd80] __io_uring_files_cancel at ffff80001053b5f0
#16 [ffff80001944bdc0] do_exit at ffff80001019bb8c
#17 [ffff80001944bdf0] do_group_exit at ffff80001019bf98
#18 [ffff80001944be20] __arm64_sys_exit_group at ffff80001019c038
#19 [ffff80001944be30] el0_svc_common at ffff800010128c4c
#20 [ffff80001944be70] do_el0_svc at ffff800010128e88
#21 [ffff80001944be80] el0_svc at ffff800010cc5d08
#22 [ffff80001944bea0] el0_sync_handler at ffff800010cc65b4
#23 [ffff80001944bfe0] el0_sync at ffff800010111da4
     PC: 0000400003ac8dcc   LR: 000000000050bcfc   SP: 0000ffffd900c730
    X29: 0000ffffd900c730  X28: 00000000000186a0  X27: 0000000000000001
    X26: 0000000000000000  X25: 00000000005be0e0  X24: 0000400003d1fbb0
    X23: 000000000daffc80  X22: 00000000005b20c0  X21: 0000000000000cde
    X20: 0000000000000000  X19: 0000ffffd900c958  X18: 000000000dafb020
    X17: 00000000005b0048  X16: 0000400003ac8db0  X15: 0000ffffd900b337
    X14: 0000000000000000  X13: 0000000000000000  X12: 0000400003a27320
    X11: 0000051bbda52aa1  X10: 0000000000000000   X9: 0000000000000018
     X8: 000000000000005e   X7: 0000400003907000   X6: 000000000138c4ea
     X5: 0000400003907000   X4: 0000000000000020   X3: 0000400003913bb0
     X2: 0000000000000000   X1: 0000000000402cbf   X0: 0000000000000000
    ORIG_X0: 0000000000000000  SYSCALLNO: 5e  PSTATE: 80001000
crash>

[Reproduction environment]:
Environment: Yitian ECS
IP: 101.37.89.159

OS:
# cat /etc/os-release
NAME="Alibaba Cloud Linux"
VERSION="3 (Soaring Falcon)"
ID="alinux"
ID_LIKE="rhel fedora centos anolis"
VERSION_ID="3"
PLATFORM_ID="platform:al8"
PRETTY_NAME="Alibaba Cloud Linux 3 (Soaring Falcon)"
ANSI_COLOR="0;31"
HOME_URL="https://www.aliyun.com/"

Kernel version:
# uname -r
5.10.134-14.al8.aarch64

Memory:
# free -h
              total        used        free      shared  buff/cache   available
Mem:          245Gi       887Mi       240Gi       0.0Ki       3.7Gi       242Gi
Swap:         1.0Gi          0B       1.0Gi

CPU:
# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
BIOS Vendor ID:      Alibaba Cloud
Model:               0
Model name:          Neoverse-N2
BIOS Model name:     virt-rhel7.6.0
Stepping:            r0p0
CPU MHz:             2750.000
CPU max MHz:         2750.0000
CPU min MHz:         2750.0000
BogoMIPS:            100.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            65536K
NUMA node0 CPU(s):   0-63
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh

[Steps to reproduce]:
# Download and build stress-ng
git clone https://github.com/ColinIanKing/stress-ng.git
cd stress-ng
make && make install

# Initialize the data disk
[ -d /disk1 ] || mkdir /disk1
wipefs -a --force /dev/nvme1n1p1
mkfs -t ext4 -q -F /dev/nvme1n1p1
mount -t ext4 /dev/nvme1n1p1 /disk1
mkdir -p /disk1/tmpdir/stress-ng

# Set the prerequisite parameters
echo 1 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/hardlockup_panic
echo 1 > /proc/sys/kernel/softlockup_panic
echo 60 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo 3 > /sys/kernel/mm/transparent_hugepage/hugetext_enabled
echo 1 > /sys/kernel/mm/duptext/enabled
echo 1 > /sys/fs/cgroup/memory/memory.allow_duptext
echo 1 > /proc/sys/kernel/sched_group_identity_enabled
ulimit -s unlimited

# Run the test command
nohup stress-ng -a 1 --class os -t 12h --metrics -x rlimit --times --verify -v --log-file /disk1/tmpdir/stress-ng/stress-logfile-11.txt --temp-path /disk1/tmpdir/stress-ng/ &

[Expected result]:
stress-ng --class os runs normally and no crash occurs.

[Actual result]:
A crash is triggered while stress-ng --class os is running.
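
For anyone re-running the steps, the following is a minimal sketch (not part of the original report) for confirming the watchdog settings before starting stress-ng. The kernel documents the soft lockup threshold as 2 * watchdog_thresh, so watchdog_thresh=60 means a CPU has to be stuck in kernel mode for roughly 120 seconds before the panic in this report fires. The function names softlockup_threshold_secs and show_knob are illustrative helpers, not existing tools.

```shell
#!/bin/sh
# Soft lockup threshold in seconds for a given watchdog_thresh value.
# The kernel sysctl documentation states the threshold is 2 * watchdog_thresh.
softlockup_threshold_secs() {
    echo $(( $1 * 2 ))
}

# Print the current value of one knob, or "unreadable" when the file is
# missing or not readable (some knobs in the steps are distribution-specific).
show_knob() {
    if [ -r "$1" ]; then
        printf '%s = %s\n' "$1" "$(cat "$1")"
    else
        printf '%s = unreadable\n' "$1"
    fi
}

# Knobs taken from the reproduction steps in this report.
for knob in \
    /proc/sys/kernel/panic \
    /proc/sys/kernel/hardlockup_panic \
    /proc/sys/kernel/softlockup_panic \
    /proc/sys/kernel/watchdog_thresh \
    /proc/sys/kernel/hung_task_timeout_secs \
    /proc/sys/kernel/hung_task_panic
do
    show_knob "$knob"
done

echo "soft lockup threshold for watchdog_thresh=60: $(softlockup_threshold_secs 60) s"
```

Running it on both the failing and the non-reproducing host would show whether the two machines were actually configured identically before the test.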
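Reproduced twice in the same environment described above. On another 64-core Yitian ECS instance of the same specification (IP: 121.40.160.10), a 12-hour run did not reproduce the issue.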
Not reproduced on a smaller 8-core Yitian ECS instance.
(In reply to zhixin01 from comment #1)
> Reproduced twice in the same environment described above. On another 64-core
> Yitian ECS instance of the same specification (IP: 121.40.160.10), a 12-hour
> run did not reproduce the issue.

Started the stress-ng OS subsystem test again on the same-specification 64-core Yitian ECS instance (IP: 121.40.160.10); a 12-hour run still did not reproduce the issue.

PS: The root filesystem on the failing environment (101.37.89.159) is 99G, while on the non-reproducing environment (121.40.160.10) it is 40G, so the issue may be related to the size of the root filesystem.

Environment information for 121.40.160.10:

# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
BIOS Vendor ID:      Alibaba Cloud
Model:               0
BIOS Model name:     virt-rhel7.6.0
Stepping:            r0p0
CPU max MHz:         2750.0000
CPU min MHz:         2750.0000
BogoMIPS:            100.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            65536K
NUMA node0 CPU(s):   0-63
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh

# df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs  123G     0  123G   0% /dev
tmpfs          tmpfs     123G     0  123G   0% /dev/shm
tmpfs          tmpfs     123G  964K  123G   1% /run
tmpfs          tmpfs     123G     0  123G   0% /sys/fs/cgroup
/dev/nvme0n1p2 ext4       40G   17G   21G  46% /
/dev/nvme1n1p2 ext4       20G  1.1G   18G   6% /swap
/dev/nvme1n1p1 ext4       79G  220K   75G   1% /disk1
/dev/nvme0n1p1 vfat      200M  6.8M  194M   4% /boot/efi
tmpfs          tmpfs      25G     0   25G   0% /run/user/0

# free -h
              total        used        free      shared  buff/cache   available
Mem:          245Gi       1.5Gi       240Gi       0.0Ki       3.5Gi       241Gi
Swap:         1.0Gi       136Mi       887Mi

Additional environment information for 101.37.89.159:

# df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs  123G     0  123G   0% /dev
tmpfs          tmpfs     123G     0  123G   0% /dev/shm
tmpfs          tmpfs     123G  960K  123G   1% /run
tmpfs          tmpfs     123G     0  123G   0% /sys/fs/cgroup
/dev/nvme0n1p2 ext4       99G   17G   78G  18% /
/dev/nvme1n1p2 ext4       20G  1.1G   18G   6% /swap
/dev/nvme0n1p1 vfat      200M  6.8M  194M   4% /boot/efi
/dev/nvme1n1p1 ext4       79G  149M   75G   1% /disk1
tmpfs          tmpfs      25G     0   25G   0% /run/user/0
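
To make the root filesystem comparison repeatable across hosts, here is a small sketch that extracts the Size column of the filesystem mounted at / from `df -Th` output. The function name root_fs_size is a hypothetical helper, and the sample rows are copied from the df output in this comment. It only reports the sizes; it does not test whether root filesystem size actually causes the soft lockup.

```shell
#!/bin/sh
# Print the Size column for the filesystem mounted at "/" from `df -Th`
# output read on stdin. With -T the columns are:
# Filesystem Type Size Used Avail Use% Mounted-on, so Size is field 3.
root_fs_size() {
    awk '$NF == "/" { print $3 }'
}

# Sample row from 101.37.89.159 (the failing host).
printf '%s\n' \
    'Filesystem Type Size Used Avail Use% Mounted on' \
    '/dev/nvme0n1p2 ext4 99G 17G 78G 18% /' \
    '/dev/nvme1n1p1 ext4 79G 149M 75G 1% /disk1' | root_fs_size

# Sample row from 121.40.160.10 (the non-reproducing host).
printf '%s\n' \
    'Filesystem Type Size Used Avail Use% Mounted on' \
    '/dev/nvme0n1p2 ext4 40G 17G 21G 46% /' \
    '/dev/nvme1n1p1 ext4 79G 220K 75G 1% /disk1' | root_fs_size
```

On a live host the same function can be used as `df -Th | root_fs_size`.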