Description of problem:
After installing the Anolis 8.2 x64 ANCK image and running a long-duration ltp-stress test, the instance hangs.

Version-Release number of selected component (if applicable):
# cat /etc/image-id
image_name="Anolis OS 8.2 ANCK 64 bit"
image_id="anolisos_8_2_x64_20G_anck_alibase_20220413.vhd"
release_date="20220413192232"

# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.19.91-25.8.an8.x86_64 root=UUID=5dd29192-4c3c-4d3c-8027-d5ad8a736d20 ro crashkernel=0M-2G:0M,2G-8G:192M,8G-:256M cryptomgr.notests cgroup.memory=nokmem rcupdate.rcu_cpu_stall_timeout=300 vring_force_dma_api rhgb quiet biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200n8 noibrs nvme_core.io_timeout=4294967295 nvme_core.admin_timeout=4294967295 crashkernel=0M-2G:0M,2G-8G:192M,8G-:256M

# uname -a
Linux iZbp135go40q5dwxe76ax7Z 4.19.91-25.8.an8.x86_64 #1 SMP Tue Apr 12 16:14:51 CST 2022 x86_64 x86_64 x86_64 GNU/Linux

Partial system log:
[41457.695939] INFO: task genload:272885 blocked for more than 1200 seconds.
[41457.698252] Tainted: G OE 4.19.91-25.8.an8.x86_64 #1
[41457.700571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[41457.703071] genload D 0 272885 272866 0x00000000
[41457.705297] Call Trace:
[41457.707114] ? __schedule+0x29f/0x6c0
[41457.709093] schedule+0x29/0xc0
[41457.710972] wait_transaction_locked+0x76/0xa0
[41457.713003] ? wait_woken+0x80/0x80
[41457.714915] add_transaction_credits+0x106/0x280
[41457.716977] ? ext4_da_get_block_prep+0x232/0x3e0
[41457.719055] start_this_handle+0xf2/0x3a0
[41457.721041] ? account_page_dirtied+0x113/0x1e0
[41457.723075] ? kmem_cache_alloc+0x188/0x190
[41457.725041] jbd2__journal_start+0xab/0x1b0
[41457.726980] ext4_dirty_inode+0x2d/0x60
[41457.728843] __mark_inode_dirty+0x3f/0x380
[41457.730732] generic_write_end+0x30/0x90
[41457.732604] generic_perform_write+0xf5/0x190
[41457.734537] ext4_buffered_write_iter+0x8d/0x120
[41457.736505] ext4_file_write_iter+0x5c/0x650
[41457.738423] new_sync_write+0xf4/0x140
[41457.740263] vfs_write+0xa9/0x1a0
[41457.742037] ksys_write+0x43/0xb0
[41457.743804] do_syscall_64+0x5f/0x190
[41457.745610] ? async_page_fault+0x8/0x30
[41457.747449] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[41457.749463] RIP: 0033:0x7fe248aed648
[41457.751271] Code: Bad RIP value.
[41457.753028] RSP: 002b:00007fff0a3b3558 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[41457.755386] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fe248aed648
[41457.757710] RDX: 00000000000fffff RSI: 00007fff0a3b3560 RDI: 0000000000000003
[41457.760060] RBP: 00007fff0a4b35d0 R08: 00007fff0a50d000 R09: 0000000000000000
[41457.762399] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000040000000
[41457.764739] R13: 00007fff0a3b3560 R14: 000000003fffffff R15: 000000001e3ffe1c

How reproducible:

Steps to Reproduce:
1. Install the Anolis 8.2 x64 ANCK image.
2. Deploy LTP:
git clone https://github.com/linux-test-project/ltp.git
yum install -y gcc gcc-c++ make automake crash
cd ltp/
make autotools
./configure
make
make install
cd /opt/ltp

[root@iZbp1dybdli87m10c6lbe3Z ltp]# cat load.sh
#!/bin/bash
echo 1 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/hardlockup_panic
echo 1 > /proc/sys/kernel/softlockup_panic
echo 60 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
#echo 0 > /proc/sys/kernel/panic_on_fatal_event
#echo 1 > /proc/sys/kernel/panic_on_rcu_stall
nr_cpu=$(nproc)
mem_kb=$(grep ^MemTotal /proc/meminfo | awk '{print $2}')
./runltp \
    -c $((nr_cpu / 2)) \
    -m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 \
    -D $((nr_cpu / 10)),1,0,1 \
    -i 2 \
    -B ext4 \
    -R -p -q \
    -t 24h \
    -d /disk1/tmpdir/ltp

3. Run LTP: nohup sh ./load.sh &

Actual results:
The instance cannot be pinged and appears to be hung.

Expected results:
LTP runs normally for 24h and finishes cleanly, with no vmcore generated and no hang.

Additional info:
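The load values that load.sh passes to runltp depend on the CPU count and memory size of the instance, so a short sketch of the arithmetic may help when comparing runs across instance types. The function name `ltp_stress_args` and the example machine (8 CPUs, 16 GiB) are assumptions made for illustration; only the expressions themselves are taken from load.sh.

```shell
#!/bin/bash
# Hypothetical helper, not part of LTP or the original load.sh.
# It reproduces the integer arithmetic load.sh uses for -c, -m and -D.
ltp_stress_args() {
    local nr_cpu=$1 mem_kb=$2
    local c_val=$((nr_cpu / 2))                      # value passed to -c
    local m_first=$((nr_cpu / 4))                    # first field of -m
    local m_third=$((mem_kb / nr_cpu / 2 * 1024))    # third field of -m
    local d_first=$((nr_cpu / 10))                   # first field of -D
    # Use a '%s' format so the leading "-c" is not parsed as a printf option.
    printf '%s\n' "-c ${c_val} -m ${m_first},4,${m_third},1 -D ${d_first},1,0,1"
}

# Assumed example machine: 8 CPUs, 16 GiB (16777216 kB) of memory.
ltp_stress_args 8 16777216
# prints: -c 4 -m 2,4,1073741824,1 -D 0,1,0,1
```

All divisions are integer divisions, so on an instance with fewer than 10 CPUs the first field of -D evaluates to 0. On the real system, nr_cpu comes from nproc and mem_kb from the MemTotal line of /proc/meminfo, as in load.sh.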
The June monthly image anolisos_7_9_x64_20G_anck_alibase_20220701.vhd shows the same problem on instance types ecs.ebmg6.26xlarge and ecs.ebmhfc7.48xlarge. The steps are the same: starting a long-duration ltp-stress run hangs the instance, and it can no longer be pinged.

Partial dmesg err and warn output follows:
[root@iZbp15wq760jcf3kbawseaZ ~]# dmesg -l err -T
[Thu Jul 14 04:48:44 2022] cgroup: cgroup2: unknown option "memory_recursiveprot"
[Thu Jul 14 06:09:10 2022] INFO: task jbd2/vdb-8:3376 blocked for more than 1200 seconds.
[Thu Jul 14 06:09:10 2022] Tainted: G OE 4.19.91-26.an7.x86_64 #1
[Thu Jul 14 06:09:10 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jul 14 06:09:10 2022] INFO: task genload:3723 blocked for more than 1200 seconds.
[Thu Jul 14 06:09:10 2022] Tainted: G OE 4.19.91-26.an7.x86_64 #1
[Thu Jul 14 06:09:10 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[root@iZbp15wq760jcf3kbawseaZ ~]# dmesg -l warn -T | head -n 50
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Thu Jul 14 04:48:57 2022] Swap area shorter than signature indicates
[Thu Jul 14 06:09:10 2022] Call Trace:
[Thu Jul 14 06:09:10 2022] ? __schedule+0x303/0x6a0
[Thu Jul 14 06:09:10 2022] ? bit_wait+0x60/0x60
[Thu Jul 14 06:09:10 2022] schedule+0x33/0xd0
[Thu Jul 14 06:09:10 2022] io_schedule+0x12/0x40
[Thu Jul 14 06:09:10 2022] bit_wait_io+0xd/0x60
[Thu Jul 14 06:09:10 2022] __wait_on_bit+0x63/0x90
[Thu Jul 14 06:09:10 2022] out_of_line_wait_on_bit+0x80/0x90
[Thu Jul 14 06:09:10 2022] ? init_wait_var_entry+0x40/0x40
[Thu Jul 14 06:09:10 2022] jbd2_journal_commit_transaction+0x1040/0x1d00
[Thu Jul 14 06:09:10 2022] ? __switch_to_asm+0x40/0x70
[Thu Jul 14 06:09:10 2022] ? __switch_to_asm+0x34/0x70
[Thu Jul 14 06:09:10 2022] kjournald2+0xb5/0x220
[Thu Jul 14 06:09:10 2022] ? wait_woken+0x80/0x80
[Thu Jul 14 06:09:10 2022] kthread+0xf8/0x130
[Thu Jul 14 06:09:10 2022] ? jbd2_checkpoint_thread+0xe0/0xe0
[Thu Jul 14 06:09:10 2022] ? kthread_park+0xb0/0xb0
[Thu Jul 14 06:09:10 2022] ret_from_fork+0x35/0x40
[Thu Jul 14 06:09:10 2022] Call Trace:
[Thu Jul 14 06:09:10 2022] ? __schedule+0x303/0x6a0
[Thu Jul 14 06:09:10 2022] ? __wake_up_common_lock+0x77/0x90
[Thu Jul 14 06:09:10 2022] schedule+0x33/0xd0
[Thu Jul 14 06:09:10 2022] jbd2_log_wait_commit+0x7e/0xf0
[Thu Jul 14 06:09:10 2022] ? wait_woken+0x80/0x80
[Thu Jul 14 06:09:10 2022] ? __ia32_sys_fdatasync+0x20/0x20
[Thu Jul 14 06:09:10 2022] ext4_sync_fs+0x192/0x1b0
[Thu Jul 14 06:09:10 2022] iterate_supers+0xa3/0x100
[Thu Jul 14 06:09:10 2022] ksys_sync+0x50/0x90
[Thu Jul 14 06:09:10 2022] __ia32_sys_sync+0xa/0x10
[Thu Jul 14 06:09:10 2022] do_syscall_64+0x5b/0x1d0
[Thu Jul 14 06:09:10 2022] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Thu Jul 14 06:09:10 2022] RIP: 0033:0x7fb7698ddd27
[Thu Jul 14 06:09:10 2022] Code: Bad RIP value.
[Thu Jul 14 06:09:10 2022] RSP: 002b:00007ffc7f11d768 EFLAGS: 00000217 ORIG_RAX: 00000000000000a2
[Thu Jul 14 06:09:10 2022] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb7698ddd27
[Thu Jul 14 06:09:10 2022] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007ffc7f11d750
[Thu Jul 14 06:09:10 2022] RBP: 0000000000000000 R08: 00007fb76a0cb740 R09: 0000000000000e8a
[Thu Jul 14 06:09:10 2022] R10: 00007ffc7f11cb60 R11: 0000000000000217 R12: 0000000000001770
[Thu Jul 14 06:09:10 2022] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
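When comparing hung-task reports across several instances, it can be convenient to reduce the dmesg output to one line per blocked task. The following is a minimal sketch; the function name `hung_tasks` is hypothetical, and the pattern assumes the "INFO: task NAME:PID blocked for more than N seconds." wording seen in the logs above.

```shell
#!/bin/bash
# Hypothetical helper: read kernel log text on stdin and print
# "task-name pid timeout-seconds" for every hung-task report line.
hung_tasks() {
    sed -n 's/.*INFO: task \([^:]*\):\([0-9]*\) blocked for more than \([0-9]*\) seconds\..*/\1 \2 \3/p'
}

# Example using two lines taken from the dmesg output in this report.
printf '%s\n' \
    '[Thu Jul 14 06:09:10 2022] INFO: task jbd2/vdb-8:3376 blocked for more than 1200 seconds.' \
    '[Thu Jul 14 06:09:10 2022] INFO: task genload:3723 blocked for more than 1200 seconds.' \
    | hung_tasks
# prints:
# jbd2/vdb-8 3376 1200
# genload 3723 1200
```

On a live system the same function could be fed with `dmesg -l err | hung_tasks`. Lines that do not match the pattern, such as the Tainted and Call Trace lines, are dropped.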