Bug 4838 - [ANCK-5.10-14][release][alinux3][release kernel][Yitian ECS] stress-ng os subsystem test produces a vmcore, Kernel panic - not syncing: softlockup: hung tasks - io_wqe_cancel_pending_work at ffff80001053d22c
Summary: [ANCK-5.10-14][release][alinux3][release kernel][Yitian ECS] stress-ng os subsystem test produces a vmcore, K...
Status: NEW
Alias: None
Product: Antest
Classification: Infrastructures
Component: Test cases
Version: unspecified
Hardware: aarch64 Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: ZiyangZhang
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-04-26 14:23 UTC by zhixin01
Modified: 2023-04-27 16:16 UTC (History)
7 users

See Also:


Attachments
vmcore-dmesg (1.08 MB, text/plain)
2023-04-26 14:23 UTC, zhixin01

Description zhixin01 alibaba_cloud_group 2023-04-26 14:23:39 UTC
Created attachment 715
vmcore-dmesg

[Defect description]:
The stress-ng os subsystem test produces a vmcore: Kernel panic - not syncing: softlockup: hung tasks - io_wqe_cancel_pending_work at ffff80001053d22c

crash analysis follows (the vmcore-dmesg log is attached):
      KERNEL: /usr/lib/debug/lib/modules/5.10.134-14.al8.aarch64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 64
        DATE: Wed Apr 26 01:29:51 CST 2023
      UPTIME: 12:03:33
LOAD AVERAGE: 77631.70, 77217.19, 76083.22
       TASKS: 78684
    NODENAME: qibo-anck014-al3-zx-g8y
     RELEASE: 5.10.134-14.al8.aarch64
     VERSION: #1 SMP Thu Apr 6 16:20:35 CST 2023
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 256 GB
       PANIC: "Kernel panic - not syncing: softlockup: hung tasks"
         PID: 3294
     COMMAND: "stress-ng-io-ur"
        TASK: ffff00015d605c80  [THREAD_INFO: ffff00015d605c80]
         CPU: 7
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 3294   TASK: ffff00015d605c80  CPU: 7   COMMAND: "stress-ng-io-ur"
 #0 [ffff80001003bb60] __crash_kexec at ffff80001028a450
 #1 [ffff80001003bd00] panic at ffff800010cae79c
 #2 [ffff80001003bde0] watchdog_timer_fn at ffff8000102cd7e8
 #3 [ffff80001003be30] __run_hrtimer at ffff80001025e7f4
 #4 [ffff80001003be80] __hrtimer_run_queues at ffff80001025ead8
 #5 [ffff80001003bef0] hrtimer_interrupt at ffff80001025f684
 #6 [ffff80001003bf60] arch_timer_handler_virt at ffff800010a2a7b4
 #7 [ffff80001003bf70] handle_percpu_devid_irq at ffff800010234968
 #8 [ffff80001003bfa0] __handle_domain_irq at ffff80001022c37c
 #9 [ffff80001003bfe0] gic_handle_irq at ffff80001011011c
--- <IRQ stack> ---
#10 [ffff80001944bbc0] el1_irq at ffff800010111bb8
#11 [ffff80001944bbe0] io_wqe_cancel_pending_work at ffff80001053d22c
#12 [ffff80001944bc20] io_wq_cancel_cb at ffff80001053e764
#13 [ffff80001944bca0] io_uring_cancel_files at ffff800010533584
#14 [ffff80001944bd50] io_uring_cancel_task_requests at ffff800010536d5c
#15 [ffff80001944bd80] __io_uring_files_cancel at ffff80001053b5f0
#16 [ffff80001944bdc0] do_exit at ffff80001019bb8c
#17 [ffff80001944bdf0] do_group_exit at ffff80001019bf98
#18 [ffff80001944be20] __arm64_sys_exit_group at ffff80001019c038
#19 [ffff80001944be30] el0_svc_common at ffff800010128c4c
#20 [ffff80001944be70] do_el0_svc at ffff800010128e88
#21 [ffff80001944be80] el0_svc at ffff800010cc5d08
#22 [ffff80001944bea0] el0_sync_handler at ffff800010cc65b4
#23 [ffff80001944bfe0] el0_sync at ffff800010111da4
     PC: 0000400003ac8dcc   LR: 000000000050bcfc   SP: 0000ffffd900c730
    X29: 0000ffffd900c730  X28: 00000000000186a0  X27: 0000000000000001
    X26: 0000000000000000  X25: 00000000005be0e0  X24: 0000400003d1fbb0
    X23: 000000000daffc80  X22: 00000000005b20c0  X21: 0000000000000cde
    X20: 0000000000000000  X19: 0000ffffd900c958  X18: 000000000dafb020
    X17: 00000000005b0048  X16: 0000400003ac8db0  X15: 0000ffffd900b337
    X14: 0000000000000000  X13: 0000000000000000  X12: 0000400003a27320
    X11: 0000051bbda52aa1  X10: 0000000000000000   X9: 0000000000000018
     X8: 000000000000005e   X7: 0000400003907000   X6: 000000000138c4ea
     X5: 0000400003907000   X4: 0000000000000020   X3: 0000400003913bb0
     X2: 0000000000000000   X1: 0000000000402cbf   X0: 0000000000000000
    ORIG_X0: 0000000000000000  SYSCALLNO: 5e  PSTATE: 80001000
crash>   
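
As a reading aid for the register dump above: SYSCALLNO and X8 both hold 0x5e. A minimal sketch of converting that value to decimal follows; `hex_to_dec` is a hypothetical helper name used only for illustration and is not part of crash. The result, 94, is the exit_group number in the generic syscall table used by arm64, which agrees with frame #18 (__arm64_sys_exit_group).

```shell
# hex_to_dec: print the decimal value of a hex number given without the 0x prefix.
# (Hypothetical helper, not a crash command.)
hex_to_dec() {
    echo $((0x$1))
}

hex_to_dec 5e   # prints 94
```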

[Reproduction environment]:
Environment: Yitian ECS
IP: 101.37.89.159

OS:
# cat /etc/os-release
NAME="Alibaba Cloud Linux"
VERSION="3 (Soaring Falcon)"
ID="alinux"
ID_LIKE="rhel fedora centos anolis"
VERSION_ID="3"
PLATFORM_ID="platform:al8"
PRETTY_NAME="Alibaba Cloud Linux 3 (Soaring Falcon)"
ANSI_COLOR="0;31"
HOME_URL="https://www.aliyun.com/"

Kernel version:
# uname -r
5.10.134-14.al8.aarch64

Memory information:
# free -h
              total        used        free      shared  buff/cache   available
Mem:          245Gi       887Mi       240Gi       0.0Ki       3.7Gi       242Gi
Swap:         1.0Gi          0B       1.0Gi

CPU information:
# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
BIOS Vendor ID:      Alibaba Cloud
Model:               0
Model name:          Neoverse-N2
BIOS Model name:     virt-rhel7.6.0
Stepping:            r0p0
CPU MHz:             2750.000
CPU max MHz:         2750.0000
CPU min MHz:         2750.0000
BogoMIPS:            100.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            65536K
NUMA node0 CPU(s):   0-63
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh

[Steps to reproduce]:
# Download and build stress-ng
git clone https://github.com/ColinIanKing/stress-ng.git
cd stress-ng
make && make install

# Initialize the data disk
[ -d /disk1 ] || mkdir /disk1
wipefs -a --force /dev/nvme1n1p1
mkfs -t ext4 -q -F /dev/nvme1n1p1
mount -t ext4 /dev/nvme1n1p1 /disk1
mkdir -p /disk1/tmpdir/stress-ng

# Set the prerequisite parameters
echo 1 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/hardlockup_panic
echo 1 > /proc/sys/kernel/softlockup_panic
echo 60 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic

echo 3 >/sys/kernel/mm/transparent_hugepage/hugetext_enabled
echo 1 >/sys/kernel/mm/duptext/enabled
echo 1 >/sys/fs/cgroup/memory/memory.allow_duptext
echo 1 > /proc/sys/kernel/sched_group_identity_enabled
ulimit -s unlimited

# Run the test command
nohup stress-ng -a 1 --class os -t 12h --metrics -x rlimit --times --verify -v --log-file /disk1/tmpdir/stress-ng/stress-logfile-11.txt --temp-path /disk1/tmpdir/stress-ng/ &
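
For context on the watchdog settings above: the kernel sysctl documentation gives the soft lockup threshold as 2 * watchdog_thresh, so with watchdog_thresh set to 60 a CPU has to be stuck for roughly 120 seconds before the detector fires, and softlockup_panic=1 then turns the detection into a panic. The sketch below only does that arithmetic; `softlockup_threshold` is a hypothetical helper name introduced for illustration.

```shell
# softlockup_threshold: print the soft lockup threshold in seconds,
# computed as 2 * watchdog_thresh. (Hypothetical helper for illustration.)
softlockup_threshold() {
    echo $((2 * $1))
}

softlockup_threshold 60   # prints 120

# On the test machine the live value could be used instead:
# softlockup_threshold "$(cat /proc/sys/kernel/watchdog_thresh)"
```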

[Expected result]:
stress-ng --class os runs normally and no crash occurs

[Actual result]:
stress-ng --class os triggers a crash while running
Comment 1 zhixin01 alibaba_cloud_group 2023-04-26 14:27:38 UTC
Reproduced twice in the environment above, but on a Yitian 64c ECS of the same spec (IP: 121.40.160.10) a 12h run did not reproduce it.
Comment 2 zhixin01 alibaba_cloud_group 2023-04-26 14:28:21 UTC
Not reproduced on a small-spec 8c Yitian ECS.
Comment 3 zhixin01 alibaba_cloud_group 2023-04-27 10:49:24 UTC
(In reply to zhixin01 from comment #1)
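> Reproduced twice in the environment above, but on a Yitian 64c ECS of the same spec (IP: 121.40.160.10) a 12h run did not reproduce it.

Started the stress-ng os subsystem test again on the same-spec Yitian 64c ECS (IP: 121.40.160.10); a 12h run did not reproduce it.

PS: the root filesystem on the failing environment 101.37.89.159 is 99G, whereas on the non-reproducing environment 121.40.160.10 it is 40G, so the issue appears to be related to root filesystem size.

Environment information for 121.40.160.10: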
# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
BIOS Vendor ID:      Alibaba Cloud
Model:               0
BIOS Model name:     virt-rhel7.6.0
Stepping:            r0p0
CPU max MHz:         2750.0000
CPU min MHz:         2750.0000
BogoMIPS:            100.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            65536K
NUMA node0 CPU(s):   0-63
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh

# df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs  123G     0  123G   0% /dev
tmpfs          tmpfs     123G     0  123G   0% /dev/shm
tmpfs          tmpfs     123G  964K  123G   1% /run
tmpfs          tmpfs     123G     0  123G   0% /sys/fs/cgroup
/dev/nvme0n1p2 ext4       40G   17G   21G  46% /
/dev/nvme1n1p2 ext4       20G  1.1G   18G   6% /swap
/dev/nvme1n1p1 ext4       79G  220K   75G   1% /disk1
/dev/nvme0n1p1 vfat      200M  6.8M  194M   4% /boot/efi
tmpfs          tmpfs      25G     0   25G   0% /run/user/0

# free -h
              total        used        free      shared  buff/cache   available
Mem:          245Gi       1.5Gi       240Gi       0.0Ki       3.5Gi       241Gi
Swap:         1.0Gi       136Mi       887Mi

Additional environment information for 101.37.89.159:
# df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs  123G     0  123G   0% /dev
tmpfs          tmpfs     123G     0  123G   0% /dev/shm
tmpfs          tmpfs     123G  960K  123G   1% /run
tmpfs          tmpfs     123G     0  123G   0% /sys/fs/cgroup
/dev/nvme0n1p2 ext4       99G   17G   78G  18% /
/dev/nvme1n1p2 ext4       20G  1.1G   18G   6% /swap
/dev/nvme0n1p1 vfat      200M  6.8M  194M   4% /boot/efi
/dev/nvme1n1p1 ext4       79G  149M   75G   1% /disk1
tmpfs          tmpfs      25G     0   25G   0% /run/user/0
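
To make the comparison between the two hosts easier to repeat, the same facts could be captured in one labelled dump per machine and then diffed. This is only a sketch: `collect_env` is a hypothetical helper name, and the command list simply mirrors what was pasted by hand in this comment.

```shell
# collect_env: print kernel version, root filesystem, memory and CPU details,
# each under a labelled header. (Hypothetical helper for illustration.)
collect_env() {
    for cmd in "uname -r" "df -Th /" "free -h" "lscpu"; do
        echo "== $cmd =="
        # $cmd is intentionally unquoted so it splits into command and arguments
        $cmd 2>&1 || echo "(command failed)"
    done
}

collect_env
```

Running `collect_env > env.txt` on 101.37.89.159 and on 121.40.160.10 and then diffing the two files would show the root filesystem size difference alongside any other discrepancy.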