Bug 1880 - [ANCK-5.10 2208-rc1][anolis8][x86_64][debug kernel] VM reports errors and hangs right after the stress-ng stress test starts
Status: RESOLVED WONTFIX
Alias: None
Product: ANCK 5.10 Dev
Classification: ANCK
Component: general/others
Version: 5.10.y-12
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: Alierwei
QA Contact: shuming
 
Reported: 2022-08-11 14:51 UTC by zhixin01
Modified: 2022-09-05 09:45 UTC

Attachments
Serial console error log from stress-ng startup (5.43 KB, text/plain)
2022-08-11 14:51 UTC, zhixin01

Description zhixin01 2022-08-11 14:51:49 UTC
Created attachment 363
Serial console error log from stress-ng startup

Description of problem:
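On an offline VM, right after the stress-ng stress test starts, the serial console logs errors and the system hangs.
Part of the serial console log follows: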
[root@VM20210305-16 ~]# [  447.000900] tun: Universal TUN/TAP device driver, 1.6
[  447.316570] signal: stress-ng[3973] overflowed sigaltstack
[  449.935907] sched: DL replenish lagged too much
[  452.448001] vsock: module verification failed: signature and/or required key missing - tainting kernel
[  452.789333] NET: Registered protocol family 40
[  463.131848] hrtimer: interrupt took 32658 ns

[root@VM20210305-16 ~]# [  559.644416] rcu: INFO: rcu_sched self-detected stall on CPU
[  559.645619] rcu:   3-...!: (65001 ticks this GP) idle=0a6/1/0x4000000000000000 softirq=119518/119518 fqs=0
[  559.647489]  (t=65004 jiffies g=112557 q=55306)
[  559.648305] rcu: rcu_sched kthread starved for 65004 jiffies! g112557 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=3
[  559.650061] rcu:   Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  559.651590] rcu: RCU grace-period kthread stack dump:
[  559.652457] task:rcu_sched       state:I stack:28576 pid:   15 ppid:     2 flags:0x00004000
[  559.653894] Call Trace:
[  559.654366]  __schedule+0xada/0x1cc0
[  559.655017]  ? __sched_text_start+0x8/0x8
[  559.655745]  schedule+0xc4/0x280
[  559.656322]  schedule_timeout+0x32e/0x5c0
[  559.657038]  ? usleep_range+0x120/0x120
[  559.657741]  ? lockdep_hardirqs_on_prepare+0x293/0x3e0
[  559.658639]  ? _raw_spin_unlock_irqrestore+0x3d/0x40
[  559.659505]  ? trace_hardirqs_on+0x1c/0x150
[  559.660242]  ? __next_timer_interrupt+0x1e0/0x1e0
[  559.661091]  ? prepare_to_swait_exclusive+0x120/0x120
[  559.662002]  rcu_gp_kthread+0x9f8/0x1e00
[  559.662734]  ? force_qs_rnp+0x5b0/0x5b0
[  559.663419]  ? __kthread_parkme+0x52/0x1a0
[  559.664139]  ? lockdep_hardirqs_on_prepare+0x293/0x3e0
[  559.665027]  ? _raw_spin_unlock_irqrestore+0x3d/0x40
[  559.665892]  ? trace_hardirqs_on+0x1c/0x150
[  559.666664]  ? __kthread_parkme+0xd1/0x1a0
[  559.667372]  ? force_qs_rnp+0x5b0/0x5b0
[  559.668032]  kthread+0x35d/0x430
[  559.668601]  ? __kthread_cancel_work+0x170/0x170
[  559.669399]  ret_from_fork+0x1f/0x30
[  559.670109] NMI backtrace for cpu 3
[  559.670735] CPU: 3 PID: 4452 Comm: stress-ng Kdump: loaded Tainted: G            E     5.10.134-12_rc1.an8.x86_64+debug #1
[  559.672557] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS e623647 04/01/2014
[  559.673831] Call Trace:
[  559.674272]  <IRQ>
[  559.674669]  dump_stack+0x99/0xcf
[  559.675263]  nmi_cpu_backtrace.cold.10+0x13/0xdb
[  559.676067]  ? lapic_can_unplug_cpu+0x80/0x80
[  559.676809]  nmi_trigger_cpumask_backtrace+0x239/0x2a0
[  559.679535]  arch_trigger_cpumask_backtrace+0x15/0x20
[  559.682236]  rcu_dump_cpu_stacks+0x20f/0x26d
[  559.684827]  rcu_sched_clock_irq.cold.118+0x2ce/0x9ee
[  559.687540]  ? __raise_softirq_irqoff+0x1cb/0x260
[  559.690159]  ? tick_sched_do_timer+0x1a0/0x1a0
[  559.692698]  update_process_times+0x7a/0xb0
[  559.695202]  tick_sched_handle.isra.17+0x6a/0x130
[  559.697730]  tick_sched_timer+0xd1/0x100
[  559.700085]  __hrtimer_run_queues+0x4ed/0xb50
[  559.702516]  ? enqueue_hrtimer+0x360/0x360
[  559.704905]  ? ktime_get_update_offsets_now+0xdb/0x2c0
[  559.707453]  hrtimer_interrupt+0x2c7/0x770
[  559.709810]  __sysvec_apic_timer_interrupt+0x13e/0x540
[  559.712252]  asm_call_irq_on_stack+0xf/0x20
[  559.714510]  </IRQ>
[  559.716379]  sysvec_apic_timer_interrupt+0x85/0xa0
[  559.718668]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  559.720968] RIP: 0010:_raw_spin_unlock_irq+0x2c/0x40
[  559.723256] Code: 44 00 00 53 48 8b 74 24 08 48 89 fb 48 8d 7f 18 e8 09 b0 04 fe 48 89 df e8 01 87 05 fe e8 cc e1 28 fe fb 65 ff 0d a4 65 d2 5b <5b> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f
[  559.729508] RSP: 0018:ffff888105a67db0 EFLAGS: 00000286
[  559.732062] RAX: 0000000002530a4b RBX: ffff8881284e9540 RCX: ffffffffa23613a4
[  559.734916] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa4308c54
[  559.737791] RBP: 0000000200000002 R08: 0000000000000001 R09: 0000000000000001
[  559.740646] R10: ffff8881284e9543 R11: ffffed102509d2a8 R12: 0000000000000000
[  559.743528] R13: ffff888105a58040 R14: 0000000000000000 R15: 0000000000000022
[  559.746449]  ? do_raw_spin_unlock+0x54/0x260
[  559.748916]  ? _raw_spin_unlock_irq+0x24/0x40
[  559.751375]  signal_setup_done+0x1a7/0x230
[  559.753815]  ? force_sigsegv+0xf0/0xf0
[  559.756166]  ? fpu__clear_user_states+0xfd/0x190
[  559.758673]  ? fpu__clear_user_states+0xfd/0x190
[  559.761158]  ? __local_bh_enable_ip+0xa5/0x100
[  559.763630]  ? fpu__clear_user_states+0x113/0x190
[  559.766133]  arch_do_signal+0x3a5/0x6f0
[  559.768521]  ? get_sigframe_size+0x20/0x20
[  559.770979]  exit_to_user_mode_prepare+0x109/0x1b0
[  559.773513]  syscall_exit_to_user_mode+0x3d/0x280
[  559.776032]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[  559.778611] RIP: 0033:0x667580
[  559.780833] Code: 00 c6 05 72 80 49 00 00 c3 90 48 83 ec 08 be 01 00 00 00 bf 20 cb ce 00 e8 7d 3a da ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 <53> 48 81 ec a0 00 00 00 e8 d3 00 db ff 84 c0 75 37 c6 05 38 80 49
[  559.787417] RSP: 002b:00001486a4743378 EFLAGS: 00000246
[  559.790166] RAX: 0000000000000000 RBX: ffffffffffffff78 RCX: 00001486a38fa628
[  559.793215] RDX: 00001486a4743380 RSI: 00001486a47434b0 RDI: 0000000000000022
[  559.796281] RBP: 0000000000000004 R08: 00007ffc6830b080 R09: 00000000000be5e8
[  559.799370] R10: 0000000000000009 R11: 0000000000000246 R12: 0000000000000001
[  559.802445] R13: 00001486a472b000 R14: 0000000000000000 R15: 0000000000000001
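
As a rough aid for reading the log above, the following sketch pulls the stall length out of the "rcu_sched kthread starved" line and converts it to seconds. The regular expression, the function name, and the tick rate of 1000 Hz are assumptions made for illustration; the report does not state the CONFIG_HZ value of this kernel.

```python
import re

# Assumed tick rate; the bug report does not state CONFIG_HZ for this kernel.
ASSUMED_HZ = 1000

# Matches lines such as:
# "[  559.648305] rcu: rcu_sched kthread starved for 65004 jiffies! ..."
STARVED_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+rcu:\s+(?P<flavor>\w+) kthread starved "
    r"for (?P<jiffies>\d+) jiffies"
)


def parse_starved_line(line, hz=ASSUMED_HZ):
    """Return (timestamp_s, flavor, starved_seconds), or None if no match."""
    m = STARVED_RE.search(line)
    if m is None:
        return None
    return float(m.group("ts")), m.group("flavor"), int(m.group("jiffies")) / hz


# Line copied from the serial log in this report.
sample = ("[  559.648305] rcu: rcu_sched kthread starved for 65004 jiffies! "
          "g112557 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=3")
ts, flavor, seconds = parse_starved_line(sample)
print(f"{flavor} starved for about {seconds:.1f} s (logged at {ts:.1f} s after boot)")
```

If the 1000 Hz assumption holds, the grace-period kthread went roughly 65 seconds without running, which fits inside the roughly 112 seconds between the first stress-ng message (447 s) and the stall report (559 s).
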
Version-Release number of selected component (if applicable):
[root@VM20210305-16 ~]# cat /etc/os-release
NAME="Anolis OS"
VERSION="8.4"
ID="anolis"
ID_LIKE="rhel fedora centos"
VERSION_ID="8.4"
PLATFORM_ID="platform:an8"
PRETTY_NAME="Anolis OS 8.4"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"

[root@VM20210305-16 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-5.10.134-12_rc1.an8.x86_64+debug root=UUID=169a0746-c62d-49a2-bd6b-0eaec098d42c ro crashkernel=2G crash_kexec_post_notifiers rhgb slub_debug=FPZU kmemleak=on console=tty0 console=ttyS0,115200 console=ttyAMA0,115200n8

How reproducible:
Start the stress-ng parallel stress test, then check the serial console log.

Steps to Reproduce:
1. git clone https://github.com/ColinIanKing/stress-ng.git
2. cd stress-ng && make && make install
3. Start the stress-ng parallel stress test: nohup stress-ng -a 1 -x softlockup,resources -t 72h --metrics --times --verify -v -Y /disk1/tmpdir/stress-ng/stress-statistic-12.yaml --log-file /disk1/tmpdir/stress-ng/stress-logfile-12.txt --temp-path /disk1/tmpdir/stress-ng/ &
4. Check the serial console log and verify whether the system still responds to ping.

Actual results:
After the concurrent stress test starts, the serial console reports errors and the system no longer responds to ping.

Expected results:
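After the concurrent stress test starts, the system keeps running normally, the serial console shows no abnormal errors, and the system responds to ping.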

Additional info:
See the attached serial console log for details.
Comment 1 wuyihao66 alibaba_cloud_group 2022-08-24 18:27:54 UTC
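No deadlock occurred. The system was temporarily unresponsive because the sys% (kernel-mode CPU) load was too high.

stress-ng -a 1 -x softlockup,resources runs all stress-ng tests at the same time and starts 19463 stress threads in total. With these 19463 threads running on only 8 CPUs, the load is excessive, and most of it falls on sys%.

If the test duration is shortened, the system returns to normal once the test finishes.
Alternatively, if the CPU count is raised to 64 or more, the problem no longer reproduces.
Running all stress-ng tests at the same time on a small machine is not recommended.

So this case does not need a fix.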
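
To put the figures in this comment side by side, the sketch below computes the average number of stress threads competing for each CPU. The helper name is hypothetical, and the calculation treats the thread count as fixed while the CPU count changes, which is an assumption rather than something the comment states.

```python
def threads_per_cpu(threads, cpus):
    """Average number of stress threads competing for each CPU."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return threads / cpus


STRESS_THREADS = 19463  # thread count quoted in this comment

# 8 CPUs is the VM in this report; 64 is the size at which the
# comment says the problem no longer reproduces.
for cpus in (8, 64):
    ratio = threads_per_cpu(STRESS_THREADS, cpus)
    print(f"{cpus:>2} CPUs: about {ratio:.0f} threads per CPU")
```

On these numbers, the 8-CPU VM carries roughly eight times as many threads per CPU as a 64-CPU machine would, which is in line with the observation that the problem disappears on larger machines.
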
Comment 2 wuyihao66 alibaba_cloud_group 2022-08-25 14:49:02 UTC
Based on the analysis above, setting the status to WONTFIX.