Bug 8774 - Debug kernel fails to boot normally after reboot; mlx5 driver misbehaves
Summary: Debug kernel fails to boot normally after reboot; mlx5 driver misbehaves
Status: RESOLVED FIXED
Alias: None
Product: ANCK 4.19 Dev
Classification: ANCK
Component: drivers
Version: 4.19-028
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: dust.li
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-17 14:11 UTC by dust.li
Modified: 2024-04-17 18:55 UTC

See Also:


Attachments
mlx5_eq_cq_get bug (3.22 MB, image/png)
2024-04-17 18:47 UTC, dust.li

Description dust.li alibaba_cloud_group 2024-04-17 14:11:26 UTC
Description of problem:

On a machine with the mlx5 driver, booting fails after installing the debug kernel.

[  468.238342] CPU: 23 PID: 0 Comm: swapper/23 Kdump: loaded Tainted: G        W   E     4.19.91-28_rc1.an8.x86_64+debug #1
[  468.238391] softirq: huh, entered softirq 7 SCHED 0000000092b6f6a6 with preempt_count 00000100, exited with 00000102?
[  468.250540] Hardware name: Foxconn AliServer Thor04-2U/Thunder, BIOS GB1A168F 12/16/2020
[  468.250541] Call Trace:
[  468.250548]  dump_stack+0xb7/0x110
[  468.280608]  __schedule_bug.cold.9+0x3a/0x60
[  468.286260]  __schedule+0x144b/0x1b60
[  468.291294]  ? firmware_map_remove+0x16e/0x16e
[  468.297076]  ? sched_set_stop_task+0x320/0x320
[  468.302813]  schedule_idle+0x45/0x80
[  468.307653]  do_idle+0x29d/0x480
[  468.312139]  ? lock_downgrade+0x630/0x630
[  468.317414]  ? arch_cpu_idle_exit+0x40/0x40
[  468.322870]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[  468.329108]  cpu_startup_entry+0xcb/0xd4
[  468.334305]  ? cpu_in_idle+0x20/0x20
[  468.339152]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[  468.345404]  ? lockdep_hardirqs_on+0x39a/0x580
[  468.351136]  start_secondary+0x462/0x5e0
[  468.356351]  ? set_cpu_sibling_map+0x3120/0x3120
[  468.362268]  secondary_startup_64+0xb5/0xc0




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Comment 1 小龙 admin 2024-04-17 14:54:59 UTC
The PR Link: https://gitee.com/anolis/cloud-kernel/pulls/3061
Comment 2 dust.li alibaba_cloud_group 2024-04-17 18:47:45 UTC
Created attachment 1132
mlx5_eq_cq_get bug
Comment 3 dust.li alibaba_cloud_group 2024-04-17 18:49:13 UTC
On machines with the mlx5 driver, this reproduces every time the driver is loaded under the debug kernel:


[  120.345761] INFO: lockdep is turned off.
[  120.346277]  do_idle+0x2ab/0x500
[  120.346730] WARNING: bad unlock balance detected!
[  120.346733] 4.19.91+ #57 Tainted: G            E
[  120.347304]  ? arch_cpu_idle_exit+0x40/0x40
[  120.347711] -------------------------------------
[  120.347714] systemd-udevd/987 is trying to release lock (&mm->mmap_sem) at:
[  120.348245]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[  120.349075] [<ffffffff87176f1c>] __do_page_fault+0x48c/0xa90
[  120.349077] but there are no more locks to release!
[  120.349892]  cpu_startup_entry+0xcb/0xd4
[  120.350339]
[  120.350339] other info that might help us debug this:
[  120.350342] 3 locks held by systemd-udevd/987:
[  120.350721]  ? cpu_in_idle+0x20/0x20
[  120.351258]  #0: 0000000048e8a29f (&mm->mmap_sem){++++}, at: __do_page_fault+0x313/0xa90
[  120.351833]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[  120.352307]  #1: 000000003dc069f3 (rcu_read_lock){....}, at: mlx5_eq_cq_get+0x5/0x170 [mlx5_core]
[  120.352855]  ? lockdep_hardirqs_on+0x394/0x590
[  120.353643]  #2: 000000003dc069f3 (rcu_read_lock){....}, at: mlx5_eq_cq_get+0xa1/0x170 [mlx5_core]
[  120.354214]  start_secondary+0x449/0x5f0
[  120.354856]
[  120.354856] stack backtrace:
[  120.362396]  ? set_cpu_sibling_map+0x2f10/0x2f10
[  120.362931]  secondary_startup_64+0xb5/0xc0
[  120.363414] CPU: 6 PID: 987 Comm: systemd-udevd Tainted: G            E     4.19.91+ #57
Comment 4 dust.li alibaba_cloud_group 2024-04-17 18:52:38 UTC
After applying the patch, the problem no longer occurs.
Comment 5 dust.li alibaba_cloud_group 2024-04-17 18:55:45 UTC
https://gitee.com/anolis/cloud-kernel/pulls/3061

The patch has been merged; with the patch applied, the issue is resolved.