Description of problem:
On a machine with the mlx5 driver, booting fails after the debug kernel is installed.

[ 468.238342] CPU: 23 PID: 0 Comm: swapper/23 Kdump: loaded Tainted: G W E 4.19.91-28_rc1.an8.x86_64+debug #1
[ 468.238391] softirq: huh, entered softirq 7 SCHED 0000000092b6f6a6 with preempt_count 00000100, exited with 00000102?
[ 468.250540] Hardware name: Foxconn AliServer Thor04-2U/Thunder, BIOS GB1A168F 12/16/2020
[ 468.250541] Call Trace:
[ 468.250548] dump_stack+0xb7/0x110
[ 468.280608] __schedule_bug.cold.9+0x3a/0x60
[ 468.286260] __schedule+0x144b/0x1b60
[ 468.291294] ? firmware_map_remove+0x16e/0x16e
[ 468.297076] ? sched_set_stop_task+0x320/0x320
[ 468.302813] schedule_idle+0x45/0x80
[ 468.307653] do_idle+0x29d/0x480
[ 468.312139] ? lock_downgrade+0x630/0x630
[ 468.317414] ? arch_cpu_idle_exit+0x40/0x40
[ 468.322870] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 468.329108] cpu_startup_entry+0xcb/0xd4
[ 468.334305] ? cpu_in_idle+0x20/0x20
[ 468.339152] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 468.345404] ? lockdep_hardirqs_on+0x39a/0x580
[ 468.351136] start_secondary+0x462/0x5e0
[ 468.356351] ? set_cpu_sibling_map+0x3120/0x3120
[ 468.362268] secondary_startup_64+0xb5/0xc0

Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
The PR Link: https://gitee.com/anolis/cloud-kernel/pulls/3061
Created attachment 1132 [details] mlx5_eq_cq_get bug
On a machine with the mlx5 driver, this reproduces every time the driver is loaded under the debug kernel:

[ 120.345761] INFO: lockdep is turned off.
[ 120.346277] do_idle+0x2ab/0x500
[ 120.346730] WARNING: bad unlock balance detected!
[ 120.346733] 4.19.91+ #57 Tainted: G E
[ 120.347304] ? arch_cpu_idle_exit+0x40/0x40
[ 120.347711] -------------------------------------
[ 120.347714] systemd-udevd/987 is trying to release lock (&mm->mmap_sem) at:
[ 120.348245] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 120.349075] [<ffffffff87176f1c>] __do_page_fault+0x48c/0xa90
[ 120.349077] but there are no more locks to release!
[ 120.349892] cpu_startup_entry+0xcb/0xd4
[ 120.350339]
[ 120.350339] other info that might help us debug this:
[ 120.350342] 3 locks held by systemd-udevd/987:
[ 120.350721] ? cpu_in_idle+0x20/0x20
[ 120.351258] #0: 0000000048e8a29f (&mm->mmap_sem){++++}, at: __do_page_fault+0x313/0xa90
[ 120.351833] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 120.352307] #1: 000000003dc069f3 (rcu_read_lock){....}, at: mlx5_eq_cq_get+0x5/0x170 [mlx5_core]
[ 120.352855] ? lockdep_hardirqs_on+0x394/0x590
[ 120.353643] #2: 000000003dc069f3 (rcu_read_lock){....}, at: mlx5_eq_cq_get+0xa1/0x170 [mlx5_core]
[ 120.354214] start_secondary+0x449/0x5f0
[ 120.354856]
[ 120.354856] stack backtrace:
[ 120.362396] ? set_cpu_sibling_map+0x2f10/0x2f10
[ 120.362931] secondary_startup_64+0xb5/0xc0
[ 120.363414] CPU: 6 PID: 987 Comm: systemd-udevd Tainted: G E 4.19.91+ #57
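As a reading aid for the softirq warning in the description ("entered ... with preempt_count 00000100, exited with 00000102"), here is a minimal sketch of how those two values decode. It assumes the preempt_count field layout from include/linux/preempt.h in 4.19-era kernels (8 preempt bits, 8 softirq bits, 4 hardirq bits, 1 NMI bit). The helper name decode_preempt_count is hypothetical and is not part of any kernel tooling.

```python
# Field masks assumed from include/linux/preempt.h (4.19-era layout).
PREEMPT_MASK = 0x000000FF  # preempt_disable() nesting depth
SOFTIRQ_MASK = 0x0000FF00  # softirq processing / bh-disable count
HARDIRQ_MASK = 0x000F0000  # hardirq nesting
NMI_MASK = 0x00100000      # in NMI


def decode_preempt_count(value):
    """Split a raw preempt_count value into its per-field counters."""
    return {
        "preempt": value & PREEMPT_MASK,
        "softirq": (value & SOFTIRQ_MASK) >> 8,
        "hardirq": (value & HARDIRQ_MASK) >> 16,
        "nmi": (value & NMI_MASK) >> 20,
    }


# Values taken from the "softirq: huh, entered softirq 7 SCHED ..." line.
entered = decode_preempt_count(0x00000100)
exited = decode_preempt_count(0x00000102)

# Preempt-disable levels that were taken but not released in between.
leaked = exited["preempt"] - entered["preempt"]
print("entered:", entered)
print("exited: ", exited)
print("leaked preempt levels:", leaked)
```

Under that assumed layout, the preempt depth grows by 2 between entry and exit. That may correspond to the two rcu_read_lock entries that lockdep reports as held in mlx5_eq_cq_get, but this link is an inference and the logs do not state it directly.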
After applying the patch, the problem no longer occurs.
The patch at https://gitee.com/anolis/cloud-kernel/pulls/3061 has been merged; with the patch applied, the problem is resolved.