Bug 894 - Instance hangs during long-running ltp-stress test after installing the Anolis 8.2 x64 ANCK image
Summary: Instance hangs during long-running ltp-stress test after installing the Anolis 8.2 x64 ANCK image
Status: RESOLVED FIXED
Alias: None
Product: Anolis OS 8
Classification: Anolis OS
Component: Images&Installations
Version: 8.2
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: xiaoguangwang
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-04-24 14:17 UTC by chuyang_94
Modified: 2022-09-29 11:18 UTC
CC List: 2 users

See Also:


Attachments

Description chuyang_94 alibaba_cloud_group 2022-04-24 14:17:10 UTC
Description of problem:
After installing the Anolis 8.2 x64 ANCK image and running the long-duration ltp-stress test, the instance hangs.

Version-Release number of selected component (if applicable):

# cat /etc/image-id
image_name="Anolis OS 8.2 ANCK 64 bit"
image_id="anolisos_8_2_x64_20G_anck_alibase_20220413.vhd"
release_date="20220413192232"

# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-4.19.91-25.8.an8.x86_64 root=UUID=5dd29192-4c3c-4d3c-8027-d5ad8a736d20 ro crashkernel=0M-2G:0M,2G-8G:192M,8G-:256M cryptomgr.notests cgroup.memory=nokmem rcupdate.rcu_cpu_stall_timeout=300 vring_force_dma_api rhgb quiet biosdevname=0 net.ifnames=0 console=tty0 console=ttyS0,115200n8 noibrs nvme_core.io_timeout=4294967295 nvme_core.admin_timeout=4294967295 crashkernel=0M-2G:0M,2G-8G:192M,8G-:256M

# uname -a
Linux iZbp135go40q5dwxe76ax7Z 4.19.91-25.8.an8.x86_64 #1 SMP Tue Apr 12 16:14:51 CST 2022 x86_64 x86_64 x86_64 GNU/Linux

Partial system log:
[41457.695939] INFO: task genload:272885 blocked for more than 1200 seconds.
[41457.698252]       Tainted: G           OE     4.19.91-25.8.an8.x86_64 #1
[41457.700571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[41457.703071] genload         D    0 272885 272866 0x00000000
[41457.705297] Call Trace:
[41457.707114]  ? __schedule+0x29f/0x6c0
[41457.709093]  schedule+0x29/0xc0
[41457.710972]  wait_transaction_locked+0x76/0xa0
[41457.713003]  ? wait_woken+0x80/0x80
[41457.714915]  add_transaction_credits+0x106/0x280
[41457.716977]  ? ext4_da_get_block_prep+0x232/0x3e0
[41457.719055]  start_this_handle+0xf2/0x3a0
[41457.721041]  ? account_page_dirtied+0x113/0x1e0
[41457.723075]  ? kmem_cache_alloc+0x188/0x190
[41457.725041]  jbd2__journal_start+0xab/0x1b0
[41457.726980]  ext4_dirty_inode+0x2d/0x60
[41457.728843]  __mark_inode_dirty+0x3f/0x380
[41457.730732]  generic_write_end+0x30/0x90
[41457.732604]  generic_perform_write+0xf5/0x190
[41457.734537]  ext4_buffered_write_iter+0x8d/0x120
[41457.736505]  ext4_file_write_iter+0x5c/0x650
[41457.738423]  new_sync_write+0xf4/0x140
[41457.740263]  vfs_write+0xa9/0x1a0
[41457.742037]  ksys_write+0x43/0xb0
[41457.743804]  do_syscall_64+0x5f/0x190
[41457.745610]  ? async_page_fault+0x8/0x30
[41457.747449]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[41457.749463] RIP: 0033:0x7fe248aed648
[41457.751271] Code: Bad RIP value.
[41457.753028] RSP: 002b:00007fff0a3b3558 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[41457.755386] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fe248aed648
[41457.757710] RDX: 00000000000fffff RSI: 00007fff0a3b3560 RDI: 0000000000000003
[41457.760060] RBP: 00007fff0a4b35d0 R08: 00007fff0a50d000 R09: 0000000000000000
[41457.762399] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000040000000
[41457.764739] R13: 00007fff0a3b3560 R14: 000000003fffffff R15: 000000001e3ffe1c
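
When a run produces many such reports, it can help to reduce the log to one line per hung task. The following is a minimal sketch, not part of the original report: hung_tasks is a hypothetical helper name, and it assumes the message wording shown in the log above.

```shell
#!/bin/bash
# hung_tasks (hypothetical helper): read kernel log text on stdin and print
# "<task name> <pid> <seconds>" for every hung-task report it finds.
# Assumes the "INFO: task NAME:PID blocked for more than N seconds." wording
# seen in the log excerpt above.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}
```

Typical use would be `dmesg | hung_tasks`, or feeding it a saved serial console log.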

How reproducible:


Steps to Reproduce:
1. Install the Anolis 8.2 x64 ANCK image.
2. Deploy LTP:
git clone https://github.com/linux-test-project/ltp.git
yum install -y gcc gcc-c++ make automake crash
cd ltp/
make autotools
./configure
make
make install
cd /opt/ltp
[root@iZbp1dybdli87m10c6lbe3Z ltp]# cat load.sh
#!/bin/bash
echo 1  > /proc/sys/kernel/panic
echo 1  > /proc/sys/kernel/hardlockup_panic
echo 1  > /proc/sys/kernel/softlockup_panic
echo 60 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0   > /proc/sys/kernel/hung_task_panic
#echo 0  > /proc/sys/kernel/panic_on_fatal_event
#echo 1  > /proc/sys/kernel/panic_on_rcu_stall
nr_cpu=$(nproc)
mem_kb=$(grep ^MemTotal /proc/meminfo | awk '{print $2}')
./runltp \
 -c $((nr_cpu / 2)) \
 -m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 \
 -D $((nr_cpu / 10)),1,0,1 \
 -i 2 \
 -B ext4 \
 -R -p -q \
 -t 24h \
 -d /disk1/tmpdir/ltp
3. Run LTP: nohup sh ./load.sh &
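
To make the background load concrete, the arithmetic from load.sh can be evaluated for a given machine size. This is a sketch only: ltp_args is a hypothetical helper, and the 104 CPUs and 384 GiB used in the example call are assumed figures for illustration (the report does not state them, and MemTotal on a real system is somewhat below the nominal memory size). Per runltp's usage text, -c sets the number of CPU load processes, -m takes NUM_PROCS,CHUNKS,BYTES,HANGUP_SECONDS, and -D takes NUM_PROCS,NUM_FILES,NUM_BYTES,CLEAN_FLAG.

```shell
#!/bin/bash
# ltp_args NR_CPU MEM_KB (hypothetical helper): print the -c, -m and -D
# options that load.sh would derive, using the same integer arithmetic.
ltp_args() {
    local nr_cpu=$1 mem_kb=$2
    echo "-c $((nr_cpu / 2)) -m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 -D $((nr_cpu / 10)),1,0,1"
}

# Example with assumed values: 104 CPUs, 384 GiB expressed in kB.
ltp_args 104 $((384 * 1024 * 1024))
```

Because bash arithmetic truncates, a machine with fewer than 10 CPUs would get `-D 0,...` from this formula, so the disk load only appears on larger instances.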

Actual results:
The instance stops responding to ping and appears hung.

Expected results:
LTP runs for the full 24 hours and exits normally, with no vmcore generated and no hang.

Additional info:
Comment 1 chuyang_94 alibaba_cloud_group 2022-04-25 10:23:15 UTC
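This occurs only on ecs.g6.26xlarge; the same test scenario passes on other images and instance types. It has been run three times so far, and the instance hung within 12 hours each time.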
Comment 2 Shiloong admin 2022-05-05 14:18:39 UTC
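Please retest with the latest CK26 kernel. Since only one instance type is affected, this will not block the monthly image release for now.
Comment 3 Shiloong admin 2022-05-05 14:19:02 UTC
Please retest with the latest CK26 kernel. Since only one instance type is affected, this will not block the monthly image release for now.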
Comment 4 xiaoguangwang alibaba_cloud_group 2022-09-29 11:18:50 UTC
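The hung task here is caused by waiting on the jbd2 commit thread.
The next time this reproduces, run echo 1 > /proc/sys/kernel/hung_task_panic

so that the hung task triggers a panic and a vmcore is generated. From the vmcore we can determine why the jbd2 commit thread is blocked, and in turn which jbd2 handles have not completed and are holding up the commit thread.

This Bugzilla entry has no vmcore, and a single hung-task stack is not enough for further analysis.

I am marking this bug as resolved for now; we will analyze it again once a run produces a vmcore.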
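
For the next reproduction attempt, the setting could be wrapped in a small helper that also checks for a crash-kernel reservation, since a panic without one would not leave a vmcore. This is a sketch under assumptions: arm_hung_task_panic and its PROC_ROOT argument are hypothetical names introduced here (the argument exists only so the function can be tried against a scratch directory), and the function must run as root on the real /proc. Note that load.sh as posted writes 0 to hung_task_panic, so that line would need to be changed or the helper called after it.

```shell
#!/bin/bash
# arm_hung_task_panic [PROC_ROOT] (hypothetical helper)
# Sets hung_task_panic to 1 so that a hung task triggers a panic, warns if
# no crashkernel= reservation is on the kernel command line, and prints the
# resulting value of the knob. PROC_ROOT defaults to /proc.
arm_hung_task_panic() {
    local proc_root=${1:-/proc}
    local knob="$proc_root/sys/kernel/hung_task_panic"

    if [ ! -w "$knob" ]; then
        echo "cannot write $knob (root and hung-task detection are required)" >&2
        return 1
    fi
    echo 1 > "$knob"

    # Without a crashkernel= reservation, kdump cannot capture a vmcore.
    if ! grep -q 'crashkernel=' "$proc_root/cmdline" 2>/dev/null; then
        echo "warning: no crashkernel= on the command line; a panic would not produce a vmcore" >&2
    fi
    cat "$knob"
}
```

The command line shown in the description already carries a crashkernel= reservation, so on that image the warning should not fire; whether the kdump service itself is running is a separate check that this sketch does not cover.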