Created attachment 363 [details]
Serial console error log at stress-ng startup

Description of problem:
On an offline VM, right after the stress-ng stress test starts, the serial console log reports errors and the system hangs.
Part of the serial console log follows:

[root@VM20210305-16 ~]#
[ 447.000900] tun: Universal TUN/TAP device driver, 1.6
[ 447.316570] signal: stress-ng[3973] overflowed sigaltstack
[ 449.935907] sched: DL replenish lagged too much
[ 452.448001] vsock: module verification failed: signature and/or required key missing - tainting kernel
[ 452.789333] NET: Registered protocol family 40
[ 463.131848] hrtimer: interrupt took 32658 ns
[root@VM20210305-16 ~]#
[ 559.644416] rcu: INFO: rcu_sched self-detected stall on CPU
[ 559.645619] rcu: 3-...!: (65001 ticks this GP) idle=0a6/1/0x4000000000000000 softirq=119518/119518 fqs=0
[ 559.647489] (t=65004 jiffies g=112557 q=55306)
[ 559.648305] rcu: rcu_sched kthread starved for 65004 jiffies! g112557 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=3
[ 559.650061] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 559.651590] rcu: RCU grace-period kthread stack dump:
[ 559.652457] task:rcu_sched state:I stack:28576 pid: 15 ppid: 2 flags:0x00004000
[ 559.653894] Call Trace:
[ 559.654366] __schedule+0xada/0x1cc0
[ 559.655017] ? __sched_text_start+0x8/0x8
[ 559.655745] schedule+0xc4/0x280
[ 559.656322] schedule_timeout+0x32e/0x5c0
[ 559.657038] ? usleep_range+0x120/0x120
[ 559.657741] ? lockdep_hardirqs_on_prepare+0x293/0x3e0
[ 559.658639] ? _raw_spin_unlock_irqrestore+0x3d/0x40
[ 559.659505] ? trace_hardirqs_on+0x1c/0x150
[ 559.660242] ? __next_timer_interrupt+0x1e0/0x1e0
[ 559.661091] ? prepare_to_swait_exclusive+0x120/0x120
[ 559.662002] rcu_gp_kthread+0x9f8/0x1e00
[ 559.662734] ? force_qs_rnp+0x5b0/0x5b0
[ 559.663419] ? __kthread_parkme+0x52/0x1a0
[ 559.664139] ? lockdep_hardirqs_on_prepare+0x293/0x3e0
[ 559.665027] ? _raw_spin_unlock_irqrestore+0x3d/0x40
[ 559.665892] ? trace_hardirqs_on+0x1c/0x150
[ 559.666664] ? __kthread_parkme+0xd1/0x1a0
[ 559.667372] ? force_qs_rnp+0x5b0/0x5b0
[ 559.668032] kthread+0x35d/0x430
[ 559.668601] ? __kthread_cancel_work+0x170/0x170
[ 559.669399] ret_from_fork+0x1f/0x30
[ 559.670109] NMI backtrace for cpu 3
[ 559.670735] CPU: 3 PID: 4452 Comm: stress-ng Kdump: loaded Tainted: G E 5.10.134-12_rc1.an8.x86_64+debug #1
[ 559.672557] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS e623647 04/01/2014
[ 559.673831] Call Trace:
[ 559.674272] <IRQ>
[ 559.674669] dump_stack+0x99/0xcf
[ 559.675263] nmi_cpu_backtrace.cold.10+0x13/0xdb
[ 559.676067] ? lapic_can_unplug_cpu+0x80/0x80
[ 559.676809] nmi_trigger_cpumask_backtrace+0x239/0x2a0
[ 559.679535] arch_trigger_cpumask_backtrace+0x15/0x20
[ 559.682236] rcu_dump_cpu_stacks+0x20f/0x26d
[ 559.684827] rcu_sched_clock_irq.cold.118+0x2ce/0x9ee
[ 559.687540] ? __raise_softirq_irqoff+0x1cb/0x260
[ 559.690159] ? tick_sched_do_timer+0x1a0/0x1a0
[ 559.692698] update_process_times+0x7a/0xb0
[ 559.695202] tick_sched_handle.isra.17+0x6a/0x130
[ 559.697730] tick_sched_timer+0xd1/0x100
[ 559.700085] __hrtimer_run_queues+0x4ed/0xb50
[ 559.702516] ? enqueue_hrtimer+0x360/0x360
[ 559.704905] ? ktime_get_update_offsets_now+0xdb/0x2c0
[ 559.707453] hrtimer_interrupt+0x2c7/0x770
[ 559.709810] __sysvec_apic_timer_interrupt+0x13e/0x540
[ 559.712252] asm_call_irq_on_stack+0xf/0x20
[ 559.714510] </IRQ>
[ 559.716379] sysvec_apic_timer_interrupt+0x85/0xa0
[ 559.718668] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 559.720968] RIP: 0010:_raw_spin_unlock_irq+0x2c/0x40
[ 559.723256] Code: 44 00 00 53 48 8b 74 24 08 48 89 fb 48 8d 7f 18 e8 09 b0 04 fe 48 89 df e8 01 87 05 fe e8 cc e1 28 fe fb 65 ff 0d a4 65 d2 5b <5b> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f
[ 559.729508] RSP: 0018:ffff888105a67db0 EFLAGS: 00000286
[ 559.732062] RAX: 0000000002530a4b RBX: ffff8881284e9540 RCX: ffffffffa23613a4
[ 559.734916] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa4308c54
[ 559.737791] RBP: 0000000200000002 R08: 0000000000000001 R09: 0000000000000001
[ 559.740646] R10: ffff8881284e9543 R11: ffffed102509d2a8 R12: 0000000000000000
[ 559.743528] R13: ffff888105a58040 R14: 0000000000000000 R15: 0000000000000022
[ 559.746449] ? do_raw_spin_unlock+0x54/0x260
[ 559.748916] ? _raw_spin_unlock_irq+0x24/0x40
[ 559.751375] signal_setup_done+0x1a7/0x230
[ 559.753815] ? force_sigsegv+0xf0/0xf0
[ 559.756166] ? fpu__clear_user_states+0xfd/0x190
[ 559.758673] ? fpu__clear_user_states+0xfd/0x190
[ 559.761158] ? __local_bh_enable_ip+0xa5/0x100
[ 559.763630] ? fpu__clear_user_states+0x113/0x190
[ 559.766133] arch_do_signal+0x3a5/0x6f0
[ 559.768521] ? get_sigframe_size+0x20/0x20
[ 559.770979] exit_to_user_mode_prepare+0x109/0x1b0
[ 559.773513] syscall_exit_to_user_mode+0x3d/0x280
[ 559.776032] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 559.778611] RIP: 0033:0x667580
[ 559.780833] Code: 00 c6 05 72 80 49 00 00 c3 90 48 83 ec 08 be 01 00 00 00 bf 20 cb ce 00 e8 7d 3a da ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 <53> 48 81 ec a0 00 00 00 e8 d3 00 db ff 84 c0 75 37 c6 05 38 80 49
[ 559.787417] RSP: 002b:00001486a4743378 EFLAGS: 00000246
[ 559.790166] RAX: 0000000000000000 RBX: ffffffffffffff78 RCX: 00001486a38fa628
[ 559.793215] RDX: 00001486a4743380 RSI: 00001486a47434b0 RDI: 0000000000000022
[ 559.796281] RBP: 0000000000000004 R08: 00007ffc6830b080 R09: 00000000000be5e8
[ 559.799370] R10: 0000000000000009 R11: 0000000000000246 R12: 0000000000000001
[ 559.802445] R13: 00001486a472b000 R14: 0000000000000000 R15: 0000000000000001

Version-Release number of selected component (if applicable):
[root@VM20210305-16 ~]# cat /etc/os-release
NAME="Anolis OS"
VERSION="8.4"
ID="anolis"
ID_LIKE="rhel fedora centos"
VERSION_ID="8.4"
PLATFORM_ID="platform:an8"
PRETTY_NAME="Anolis OS 8.4"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
[root@VM20210305-16 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-5.10.134-12_rc1.an8.x86_64+debug root=UUID=169a0746-c62d-49a2-bd6b-0eaec098d42c ro crashkernel=2G crash_kexec_post_notifiers rhgb slub_debug=FPZU kmemleak=on console=tty0 console=ttyS0,115200 console=ttyAMA0,115200n8

How reproducible:
Start the stress-ng parallel stress test, then check the serial console log.

Steps to Reproduce:
1. git clone https://github.com/ColinIanKing/stress-ng.git
2. make && make install
3. Start the stress-ng parallel stress test: nohup stress-ng -a 1 -x softlockup,resources -t 72h --metrics --times --verify -v -Y /disk1/tmpdir/stress-ng/stress-statistic-12.yaml --log-file /disk1/tmpdir/stress-ng/stress-logfile-12.txt --temp-path /disk1/tmpdir/stress-ng/ &
4. Check the serial console log and confirm whether the system responds to ping.

Actual results:
After the concurrent stress test starts, the serial console reports errors and the system cannot be pinged.

Expected results:
After the concurrent stress test starts, the system runs normally, the serial console shows no abnormal errors, and the system can be pinged.

Additional info:
See the attached serial console log for details.
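No deadlock occurred. The system was temporarily unresponsive because the sys% (kernel time) load was too high.
stress-ng -a 1 -x softlockup,resources runs all of stress-ng's tests at the same time and starts 19463 stress threads in total. Those 19463 threads run on only 8 CPUs, which is far too much load, and most of it is concentrated in sys%.
If the test duration is shortened, the system returns to normal once the test finishes.
Alternatively, if the number of CPUs is increased to 64 or more, the problem no longer reproduces.
We recommend not running all stress-ng tests simultaneously on a small machine.
So this case does not need a fix.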
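To make the oversubscription in the analysis above concrete, here is a rough shell sketch of the arithmetic. The helper names threads_per_cpu and suggest_mode and the MAX_PER_CPU cut-off are hypothetical and not part of stress-ng; the cut-off is only inferred from the observation that 64 CPUs did not reproduce the stall, so treat it as an assumption rather than a tuned limit.

```shell
#!/bin/sh
# Sketch only: estimate how many stress threads each CPU would have to run.

# threads_per_cpu THREADS CPUS
# Prints THREADS / CPUS rounded up to a whole number.
threads_per_cpu() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical cut-off (an assumption): 19463 threads on 64 CPUs is about
# 305 threads per CPU, and that configuration did not reproduce the stall.
MAX_PER_CPU=305

# suggest_mode THREADS CPUS
# Prints "sequential" when the per-CPU load exceeds MAX_PER_CPU,
# otherwise prints "parallel".
suggest_mode() {
    if [ "$(threads_per_cpu "$1" "$2")" -gt "$MAX_PER_CPU" ]; then
        echo sequential
    else
        echo parallel
    fi
}

threads_per_cpu 19463 8     # 8-CPU VM from this report: prints 2433
threads_per_cpu 19463 64    # 64-CPU case: prints 305
suggest_mode 19463 8        # prints sequential
suggest_mode 19463 64       # prints parallel
```

On a small machine, one possible alternative to -a 1 is to run the stressors one after another, for example with stress-ng's --seq option, or to restrict the run to selected stressors. Check stress-ng --help for the options available in the installed version.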
Based on the analysis above, setting this to WONTFIX.