Description of problem:
Our production servers hold a very large amount of data, so reinstalling the operating system is too costly. We are therefore testing an in-place kernel upgrade on CentOS 7 using the 龙蜥 (Anolis) rpm packages. One server crashes frequently after being upgraded to the 5.10.134-14.an8.x86_64 kernel; if we manually select the 3.10 kernel at boot, the problem does not occur. Please help analyze whether the problem lies in the Anolis kernel or elsewhere.

KERNEL: /usr/lib/debug/lib/modules/5.10.134-14.an8.x86_64/vmlinux [TAINTED]
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 48
DATE: Wed Oct 25 18:37:28 CST 2023
UPTIME: 00:02:34
LOAD AVERAGE: 5.03, 2.02, 0.78
TASKS: 1542
NODENAME:
RELEASE: 5.10.134-14.an8.x86_64
VERSION: #1 SMP Thu Apr 27 16:42:03 CST 2023
MACHINE: x86_64 (2200 Mhz)
MEMORY: 255.7 GB
PANIC: "Kernel panic - not syncing: Hard LOCKUP"
PID: 268
COMMAND: "khugepaged"
TASK: ffff8bf2cda50000 [THREAD_INFO: ffff8bf2cda50000]
CPU: 16
STATE: TASK_RUNNING (PANIC)

crash8> bt
PID: 268 TASK: ffff8bf2cda50000 CPU: 16 COMMAND: "khugepaged"
#0 [fffffe046257cab0] machine_kexec at ffffffffaa05c917
#1 [fffffe046257caf8] __crash_kexec at ffffffffaa19c10a
#2 [fffffe046257cbb8] panic at ffffffffaaa13b6b
#3 [fffffe046257cc38] watchdog_hardlockup_check.cold.8 at ffffffffaaa1d6a1
#4 [fffffe046257cc48] __perf_event_overflow at ffffffffaa26916f
#5 [fffffe046257cc78] handle_pmi_common at ffffffffaa00ee48
#6 [fffffe046257ce08] intel_pmu_handle_irq at ffffffffaa00efc9
#7 [fffffe046257ce48] perf_event_nmi_handler at ffffffffaa005004
#8 [fffffe046257ce60] nmi_handle at ffffffffaa025452
#9 [fffffe046257cea8] default_do_nmi at ffffffffaaa53039
#10 [fffffe046257cec8] exc_nmi at ffffffffaaa53214
#11 [fffffe046257cef0] end_repeat_nmi at ffffffffaac01508
[exception RIP: queued_spin_lock_slowpath+96]
RIP: ffffffffaa149570 RSP: ffffb7588d24bdb0 RFLAGS: 00000002
RAX: 0000000000000101 RBX: ffff8c11c00bce00 RCX: 0000000000030fa0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c11c00b48c0
RBP: 00000000000318b0 R8: 0000000000000000 R9: 0000000000000021
R10: 01ee7f2e998b15b8 R11: 0000000000000000 R12: 0000000000000021
R13: ffff8bf2c0241e00 R14: ffffffffab421b00 R15: ffff8c11c00b0f60
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#12 [ffffb7588d24bdb0] queued_spin_lock_slowpath at ffffffffaa149570
#13 [ffffb7588d24bdb0] __queue_work at ffffffffaa0fdf86
#14 [ffffb7588d24be00] queue_work_on at ffffffffaa0fe290
#15 [ffffb7588d24be10] lru_add_drain_all at ffffffffaa293c70
#16 [ffffb7588d24be48] khugepaged at ffffffffaa32b43d
#17 [ffffb7588d24bf10] kthread at ffffffffaa105c74
#18 [ffffb7588d24bf50] ret_from_fork at ffffffffaa0033bf

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
The direct cause of the deadlock is that CPU 22 hit an error during scheduling and then, inside printk, tried to take a lock belonging to that same CPU again. This is a classic printk deadlock. To find the root cause, we still need to determine what error occurred in perf_cgroup_switch.

PID: 23799 TASK: ffff9787583fc500 CPU: 22 COMMAND: "runc:[2:INIT]"
#0 [fffffe003c1cce58] crash_nmi_callback at ffffffffad04fbde
#1 [fffffe003c1cce60] nmi_handle at ffffffffad025452
#2 [fffffe003c1ccea8] default_do_nmi at ffffffffada53039
#3 [fffffe003c1ccec8] exc_nmi at ffffffffada53214
#4 [fffffe003c1ccef0] end_repeat_nmi at ffffffffadc01508
[exception RIP: queued_spin_lock_slowpath+94]
RIP: ffffffffad14956e RSP: ffffbac2604579a8 RFLAGS: 00000002
RAX: 0000000000000101 RBX: ffff97874232c500 RCX: 0000000000000016
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97a6bf734fc0
RBP: ffff97a6bf734fc0 R8: ffff97a6bf7348e0 R9: ffff9767c0402238
R10: 0000000000000000 R11: ffffffffaea5c278 R12: 0000000000000000
R13: ffff97874232d37c R14: 0000000000000087 R15: 0000000000000016
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#5 [ffffbac2604579a8] queued_spin_lock_slowpath at ffffffffad14956e
#6 [ffffbac2604579a8] try_to_wake_up at ffffffffad117894
#7 [ffffbac260457a00] __queue_work at ffffffffad0fdfec
#8 [ffffbac260457a50] queue_work_on at ffffffffad0fe290
#9 [ffffbac260457a60] soft_cursor at ffffffffad6125c1
#10 [ffffbac260457ab8] bit_cursor at ffffffffad6121b6
#11 [ffffbac260457b80] hide_cursor at ffffffffad6a4477
#12 [ffffbac260457b90] vt_console_print at ffffffffad6a6b7b
#13 [ffffbac260457be8] console_unlock at ffffffffad15442b
#14 [ffffbac260457ca8] vprintk_emit at ffffffffad15565b
#15 [ffffbac260457cf8] printk at ffffffffada19302
#16 [ffffbac260457d50] report_bug.cold.1 at ffffffffada29416
#17 [ffffbac260457d88] handle_bug at ffffffffada51c4f
#18 [ffffbac260457d98] exc_invalid_op at ffffffffada51dc3
#19 [ffffbac260457db0] asm_exc_invalid_op at ffffffffadc00a92
#20 [ffffbac260457e38] perf_cgroup_switch at ffffffffad270968
#21 [ffffbac260457ea8] __perf_event_task_sched_in at ffffffffad270f93
#22 [ffffbac260457f00] finish_task_switch at ffffffffad111e34
#23 [ffffbac260457f38] schedule_tail at ffffffffad118c8c
#24 [ffffbac260457f50] ret_from_fork at ffffffffad0033a8
RIP: 00007fe8b0550ad1 RSP: 00007fe88913ffb0 RFLAGS: 00000202
RAX: 0000000000000000 RBX: 00007fe889140700 RCX: 00007fe8b0550ad1
RDX: 00007fe8891409d0 RSI: 00007fe88913ffb0 RDI: 00000000003d0f00
RBP: 00007ffcb74ce3b0 R8: 00007fe889140700 R9: 00007fe889140700
R10: 00007fe8891409d0 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000801000 R14: 0000000000000000 R15: 00007fe889140700
ORIG_RAX: 0000000000000038 CS: 0033 SS: 002b
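
For readers less familiar with this failure pattern, the self-deadlock described in the comment above can be modeled in user space. The following is only a rough Python sketch, not kernel code. The function names mirror frames from the CPU 22 backtrace, but their bodies are hypothetical stand-ins, and the lock is assumed here to be the per-CPU runqueue lock (the comment only says "a lock belonging to that same CPU"). A timeout replaces the endless spin so that the script terminates.

```python
import threading

# Hypothetical stand-in for the lock that CPU 22 already holds while it is
# switching tasks (assumed to be the runqueue lock). Like a kernel spinlock,
# threading.Lock is not re-entrant.
rq_lock = threading.Lock()


def try_to_wake_up(timeout=0.1):
    """Model of the wakeup path: it must take the lock before proceeding."""
    acquired = rq_lock.acquire(timeout=timeout)
    if acquired:
        rq_lock.release()
    return acquired


def printk(timeout=0.1):
    """Model of printk -> console output -> queue_work -> try_to_wake_up."""
    return try_to_wake_up(timeout)


def context_switch(hit_warning):
    """Model of a task switch that runs with the lock held.

    Returns True if the switch completes, False if the warning path would
    have spun forever on the lock that this same context already holds.
    """
    with rq_lock:
        if hit_warning:
            # The warning is printed while the lock is still held.
            return printk()
    return True


if __name__ == "__main__":
    print("normal switch completes:", context_switch(False))
    print("switch with warning completes:", context_switch(True))
```

In the real kernel there is no timeout: the CPU spins in queued_spin_lock_slowpath with interrupts disabled until the hard-lockup watchdog fires, which matches the panic message in the report.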