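Bug 893 - After installing the Anolis OS 8.2 x64 ANCK image, running a long-duration ltp-stress test causes the instance to hang
Summary: After installing the Anolis OS 8.2 x64 ANCK image, running a long-duration ltp-stress test causes the instance to hang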
Status: CONFIRMED
Alias: None
Product: Anolis OS 8
Classification: Anolis OS
Component: Images&Installations
Version: 8.2
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: Jacob
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-04-24 14:16 UTC by chuyang_94
Modified: 2022-07-14 16:29 UTC
CC List: 1 user

See Also:


Attachments

Description chuyang_94 alibaba_cloud_group 2022-04-24 14:16:26 UTC
Description of problem:
After installing the Anolis OS 8.2 x64 ANCK image and running a long-duration ltp-stress test, the instance hangs.

Version-Release number of selected component (if applicable):

# cat /etc/image-id
image_name="Anolis OS 8.2 ANCK 64 bit"
image_id="anolisos_8_2_x64_20G_anck_alibase_20220413.vhd"
release_date="20220413192232"

# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.19.91-25.8.an8.x86_64 root=UUID=5dd29192-4c3c-4d3c-8027-d5ad8a736d20 ro crashkernel=0M-2G:0M,2G-8G:192M,8G-:256M cryptomgr.notests cgroup.memory=nokmem rcupdate.rcu_cpu_stall_timeout=300 vring_force_dma_api rhgb quiet biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200n8 noibrs nvme_core.io_timeout=4294967295 nvme_core.admin_timeout=4294967295 crashkernel=0M-2G:0M,2G-8G:192M,8G-:256M

# uname -a
Linux iZbp135go40q5dwxe76ax7Z 4.19.91-25.8.an8.x86_64 #1 SMP Tue Apr 12 16:14:51 CST 2022 x86_64 x86_64 x86_64 GNU/Linux

Partial system log:
[41457.695939] INFO: task genload:272885 blocked for more than 1200 seconds.
[41457.698252]       Tainted: G           OE     4.19.91-25.8.an8.x86_64 #1
[41457.700571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[41457.703071] genload         D    0 272885 272866 0x00000000
[41457.705297] Call Trace:
[41457.707114]  ? __schedule+0x29f/0x6c0
[41457.709093]  schedule+0x29/0xc0
[41457.710972]  wait_transaction_locked+0x76/0xa0
[41457.713003]  ? wait_woken+0x80/0x80
[41457.714915]  add_transaction_credits+0x106/0x280
[41457.716977]  ? ext4_da_get_block_prep+0x232/0x3e0
[41457.719055]  start_this_handle+0xf2/0x3a0
[41457.721041]  ? account_page_dirtied+0x113/0x1e0
[41457.723075]  ? kmem_cache_alloc+0x188/0x190
[41457.725041]  jbd2__journal_start+0xab/0x1b0
[41457.726980]  ext4_dirty_inode+0x2d/0x60
[41457.728843]  __mark_inode_dirty+0x3f/0x380
[41457.730732]  generic_write_end+0x30/0x90
[41457.732604]  generic_perform_write+0xf5/0x190
[41457.734537]  ext4_buffered_write_iter+0x8d/0x120
[41457.736505]  ext4_file_write_iter+0x5c/0x650
[41457.738423]  new_sync_write+0xf4/0x140
[41457.740263]  vfs_write+0xa9/0x1a0
[41457.742037]  ksys_write+0x43/0xb0
[41457.743804]  do_syscall_64+0x5f/0x190
[41457.745610]  ? async_page_fault+0x8/0x30
[41457.747449]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[41457.749463] RIP: 0033:0x7fe248aed648
[41457.751271] Code: Bad RIP value.
[41457.753028] RSP: 002b:00007fff0a3b3558 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[41457.755386] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fe248aed648
[41457.757710] RDX: 00000000000fffff RSI: 00007fff0a3b3560 RDI: 0000000000000003
[41457.760060] RBP: 00007fff0a4b35d0 R08: 00007fff0a50d000 R09: 0000000000000000
[41457.762399] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000040000000
[41457.764739] R13: 00007fff0a3b3560 R14: 000000003fffffff R15: 000000001e3ffe1c
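The trace shows genload in state D (uninterruptible sleep) inside wait_transaction_locked, i.e. waiting for the running jbd2 transaction, and the 1200-second threshold matches the hung_task_timeout_secs value that load.sh sets. While the instance is still reachable, a snapshot of every task in state D can show whether other tasks (for example the jbd2 thread) are stuck at the same time. The sketch below reads /proc/<pid>/status directly; the function name `list_d_state` and its optional proc-root argument are hypothetical conveniences, not part of LTP or the kernel.

```shell
#!/bin/bash
# Hypothetical helper: list tasks in uninterruptible sleep (state D), the
# state the hung-task detector reports on. Reads /proc/<pid>/status directly;
# an alternative proc root may be passed as $1 (e.g. a fake tree for testing).
list_d_state() {
    local root=${1:-/proc} status
    for status in "$root"/[0-9]*/status; do
        # A task may exit between the glob and the read; ignore such errors.
        awk '/^Name:/  { sub(/^Name:[ \t]*/, ""); name = $0 }
             /^State:/ { state = $2 }
             /^Pid:/   { pid = $2 }
             END       { if (state == "D") print pid, name }' "$status" 2>/dev/null
    done
    return 0
}

# Print "<pid> <name>" for every task currently in state D.
list_d_state
```

If kernel.sysrq allows it, `echo w > /proc/sysrq-trigger` additionally asks the kernel to log the stacks of blocked tasks, which complements this list.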

How reproducible:


Steps to Reproduce:
1. Install the Anolis OS 8.2 x64 ANCK image
2. Deploy LTP:
git clone https://github.com/linux-test-project/ltp.git
yum install -y gcc gcc-c++ make automake crash
cd ltp/
make autotools
./configure
make
make install
cd /opt/ltp
[root@iZbp1dybdli87m10c6lbe3Z ltp]# cat load.sh
#!/bin/bash
echo 1  > /proc/sys/kernel/panic
echo 1  > /proc/sys/kernel/hardlockup_panic
echo 1  > /proc/sys/kernel/softlockup_panic
echo 60 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0   > /proc/sys/kernel/hung_task_panic
#echo 0  > /proc/sys/kernel/panic_on_fatal_event
#echo 1  > /proc/sys/kernel/panic_on_rcu_stall
nr_cpu=$(nproc)
mem_kb=$(grep ^MemTotal /proc/meminfo | awk '{print $2}')
./runltp \
 -c $((nr_cpu / 2)) \
 -m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 \
 -D $((nr_cpu / 10)),1,0,1 \
 -i 2 \
 -B ext4 \
 -R -p -q \
 -t 24h \
 -d /disk1/tmpdir/ltp
3. Run LTP: nohup sh ./load.sh &
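The background-load options in load.sh scale with the CPU count and MemTotal using truncating shell integer arithmetic. The sketch below reproduces that arithmetic for a given machine so the requested load can be checked before a run; `ltp_args` is a hypothetical helper and the example sizes are illustrative only. Per runltp's usage text (worth confirming with `./runltp -h` on the installed version), -c takes NUM_PROCS, -m takes NUM_PROCS,CHUNKS,BYTES,HANGUP_FLAG and -D takes NUM_PROCS,NUM_FILES,NUM_BYTES,CLEAN_FLAG.

```shell
#!/bin/bash
# Hypothetical helper: print the runltp load options that load.sh would pass
# for a given CPU count and MemTotal (in kB), using the same truncating
# integer arithmetic as the script.
ltp_args() {
    local nr_cpu=$1 mem_kb=$2
    printf '%s\n' "-c $((nr_cpu / 2))"
    printf '%s\n' "-m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1"
    printf '%s\n' "-D $((nr_cpu / 10)),1,0,1"
}

# Illustrative sizes only: 104 CPUs with 384 GiB, and 8 CPUs with 16 GiB.
ltp_args 104 402653184   # -c 52 / -m 26,4,1982291968,1 / -D 10,1,0,1
ltp_args 8 16777216      # -c 4 / -m 2,4,1073741824,1 / -D 0,1,0,1
```

Note that `$((nr_cpu / 10))` truncates to 0 on machines with fewer than 10 CPUs, so -D then requests no disk-load processes.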

Actual results:
The instance stops responding to ping and hangs.

Expected results:
LTP runs normally for 24h and completes normally, with no vmcore generated and no hang.

Additional info:
Comment 1 liuyaqing alibaba_cloud_group 2022-07-14 16:29:58 UTC
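The June monthly image anolisos_7_9_x64_20G_anck_alibase_20220701.vhd shows the same behavior on the ecs.ebmg6.26xlarge and ecs.ebmhfc7.48xlarge instance types: following the same steps, the long-running ltp-stress test hangs the instance and it can no longer be pinged.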

Partial dmesg err and warn output:

[root@iZbp15wq760jcf3kbawseaZ ~]# dmesg -l err -T
[Thu Jul 14 04:48:44 2022] cgroup: cgroup2: unknown option "memory_recursiveprot"
[Thu Jul 14 06:09:10 2022] INFO: task jbd2/vdb-8:3376 blocked for more than 1200 seconds.
[Thu Jul 14 06:09:10 2022]       Tainted: G           OE     4.19.91-26.an7.x86_64 #1
[Thu Jul 14 06:09:10 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jul 14 06:09:10 2022] INFO: task genload:3723 blocked for more than 1200 seconds.
[Thu Jul 14 06:09:10 2022]       Tainted: G           OE     4.19.91-26.an7.x86_64 #1
[Thu Jul 14 06:09:10 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
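The err-level output above reports two blocked tasks (jbd2/vdb-8 and genload). When comparing kernel logs from several runs or instances, it can help to reduce such reports to one line per task. The following is a minimal sed-based sketch; `hung_tasks` is a hypothetical helper and assumes the "INFO: task NAME:PID blocked for more than N seconds." format shown above.

```shell
#!/bin/bash
# Hypothetical helper: read kernel log text on stdin and print
# "<task> <pid> <seconds>" for each hung-task report line.
# The greedy name match backtracks from the right, so task names that
# themselves contain ':' (e.g. kworker/u8:2) are kept whole.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds\..*/\1 \2 \3/p'
}

# Example: count reports per task across the whole log.
# dmesg -T | hung_tasks | sort | uniq -c
```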



[root@iZbp15wq760jcf3kbawseaZ ~]# dmesg -l warn -T | head -n 50
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Wed Jul 13 23:59:42 2022] systemd-journald[1180]: /dev/kmsg buffer overrun, some messages lost.
[Thu Jul 14 04:48:57 2022] Swap area shorter than signature indicates
[Thu Jul 14 06:09:10 2022] Call Trace:
[Thu Jul 14 06:09:10 2022]  ? __schedule+0x303/0x6a0
[Thu Jul 14 06:09:10 2022]  ? bit_wait+0x60/0x60
[Thu Jul 14 06:09:10 2022]  schedule+0x33/0xd0
[Thu Jul 14 06:09:10 2022]  io_schedule+0x12/0x40
[Thu Jul 14 06:09:10 2022]  bit_wait_io+0xd/0x60
[Thu Jul 14 06:09:10 2022]  __wait_on_bit+0x63/0x90
[Thu Jul 14 06:09:10 2022]  out_of_line_wait_on_bit+0x80/0x90
[Thu Jul 14 06:09:10 2022]  ? init_wait_var_entry+0x40/0x40
[Thu Jul 14 06:09:10 2022]  jbd2_journal_commit_transaction+0x1040/0x1d00
[Thu Jul 14 06:09:10 2022]  ? __switch_to_asm+0x40/0x70
[Thu Jul 14 06:09:10 2022]  ? __switch_to_asm+0x34/0x70
[Thu Jul 14 06:09:10 2022]  kjournald2+0xb5/0x220
[Thu Jul 14 06:09:10 2022]  ? wait_woken+0x80/0x80
[Thu Jul 14 06:09:10 2022]  kthread+0xf8/0x130
[Thu Jul 14 06:09:10 2022]  ? jbd2_checkpoint_thread+0xe0/0xe0
[Thu Jul 14 06:09:10 2022]  ? kthread_park+0xb0/0xb0
[Thu Jul 14 06:09:10 2022]  ret_from_fork+0x35/0x40
[Thu Jul 14 06:09:10 2022] Call Trace:
[Thu Jul 14 06:09:10 2022]  ? __schedule+0x303/0x6a0
[Thu Jul 14 06:09:10 2022]  ? __wake_up_common_lock+0x77/0x90
[Thu Jul 14 06:09:10 2022]  schedule+0x33/0xd0
[Thu Jul 14 06:09:10 2022]  jbd2_log_wait_commit+0x7e/0xf0
[Thu Jul 14 06:09:10 2022]  ? wait_woken+0x80/0x80
[Thu Jul 14 06:09:10 2022]  ? __ia32_sys_fdatasync+0x20/0x20
[Thu Jul 14 06:09:10 2022]  ext4_sync_fs+0x192/0x1b0
[Thu Jul 14 06:09:10 2022]  iterate_supers+0xa3/0x100
[Thu Jul 14 06:09:10 2022]  ksys_sync+0x50/0x90
[Thu Jul 14 06:09:10 2022]  __ia32_sys_sync+0xa/0x10
[Thu Jul 14 06:09:10 2022]  do_syscall_64+0x5b/0x1d0
[Thu Jul 14 06:09:10 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Thu Jul 14 06:09:10 2022] RIP: 0033:0x7fb7698ddd27
[Thu Jul 14 06:09:10 2022] Code: Bad RIP value.
[Thu Jul 14 06:09:10 2022] RSP: 002b:00007ffc7f11d768 EFLAGS: 00000217 ORIG_RAX: 00000000000000a2
[Thu Jul 14 06:09:10 2022] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb7698ddd27
[Thu Jul 14 06:09:10 2022] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007ffc7f11d750
[Thu Jul 14 06:09:10 2022] RBP: 0000000000000000 R08: 00007fb76a0cb740 R09: 0000000000000e8a
[Thu Jul 14 06:09:10 2022] R10: 00007ffc7f11cb60 R11: 0000000000000217 R12: 0000000000001770
[Thu Jul 14 06:09:10 2022] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000