Description of problem:
Starting a KVM guest whose cpu0 is assigned to a NUMA node other than node0 hangs the physical host.
When the guest's cpu0 is assigned to NUMA node0, both the host and the guest run normally.

Version-Release number of selected component (if applicable):
Kernel version: kernel-4.19.190-7.11.an8.loongarch64
qemu version: qemu-kvm-6.2.0-53.0.3.module+an8.9.0+11292+334bc2d1.2.loongarch64

How reproducible:

Steps to Reproduce:
1. Start the virtual machine with the following parameters:
# /usr/libexec/qemu-kvm \
    -machine loongson7a \
    -cpu 'Loongson-3A5000' \
    -m 4096 \
    -object memory-backend-ram,size=1024M,id=mem-mem0 \
    -object memory-backend-ram,size=3072M,id=mem-mem1 \
    -smp 2,maxcpus=2 \
    -numa node,memdev=mem-mem0,cpus=1 \
    -numa node,memdev=mem-mem1,cpus=0 \
    -bios loongarch_bios.bin \
    -drive file=AnolisOS-8.10-loongarch64.qcow2,if=virtio \
    -nographic \
    -enable-kvm \
    -serial stdio \
    -monitor telnet:localhost:4444,server,nowait
2. Observe the guest's state and check whether its NUMA node information matches the preset configuration.

Actual results:
Starting a KVM guest whose cpu0 is assigned to a NUMA node other than node0 hangs the physical host. The host reports the following errors:
[ 1348.730826] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1348.736736] rcu: 22-...0: (0 ticks this GP) idle=bf2/1/0x4000000000000000 softirq=6400/6400 fqs=1088
[ 1348.745985] rcu: (detected by 21, t=5269 jiffies, g=40025, q=2196)
[ 1348.752223] Sending NMI from CPU 21 to CPUs 22:
[ 1356.222814] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [gnome-shell:3947]
[ 1356.230256] Modules linked in: tun(E) ib_core(E) xt_CHECKSUM(E) ipt_MASQUERADE(E) xt_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) nft_compat(E) nft_chain_route_ipv6(E) nft_chain_nat_ipv6(E) nf_nat_ipv6(E) nft_counter(E) nft_chain_route_ipv4(E) nft_chain_nat_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) nfnetlink(E) bridge(E) rfkill(E) vfat(E) fat(E) ipmi_ssif(E) efi_pstore(E) kvm(E) efivars(E) snd_hda_loongson(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) nvme_fabrics(E) dm_multipath(E) r8168(E) txgbe(E) megaraid_sas(E) sw_se_echip_drv(E) cdc_ether(E) usbnet(E) 8021q(E) garp(E) mrp(E) stp(E) llc(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) libcxgbi(E) libcxgb(E) qla4xxx(E) iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E)
[ 1356.300722] fuse(E) scsi_transport_iscsi(E)
[ 1356.304966] CPU: 31 PID: 3947 Comm: gnome-shell Kdump: loaded Tainted: G E 4.19.190-7.11.an8.loongarch64 #1
[ 1356.315943] Hardware name: LOONGSON Dabieshan/Loongson-LS2C50C6, BIOS Loongson UEFI (3C50007A2000_C6) V4.0.19-Dual 07/09/24 11:20:02
[ 1356.327783] pc 90000000002e7a30 ra 90000000002e7a00 tp 90001010249e8000 sp 90001010249eb9a0
[ 1356.336081] a0 0000000000000016 a1 0000000000000016 a2 0000000000000016 a3 9000100081e105c0
[ 1356.344378] a4 9000100081e105c0 a5 00000000003d26af a6 0000000000000030 a7 0000000000000001
[ 1356.352676] t0 0000000000000001 t1 9000100080c14aa0 t2 ffffffffffffffc0 t3 0000000000001040
[ 1356.360973] t4 ffffffffffffffff t5 0000000000000001 t6 ffffffff80000000 t7 fffffffffe000000
[ 1356.369271] t8 000000ff4c2f8000 u0 90000000003b9d90 s9 0000000000000100 s0 9000100081e10580
[ 1356.377568] s1 900000000160e110 s2 9000100081e10588 s3 90000000015f7690 s4 0000000000000001
[ 1356.385866] s5 9000000000211460 s6 90001010249eba38 s7 9000000001d30540 s8 9000100081e105a8
[ 1356.394179] ra: 90000000002e7a00 smp_call_function_many+0x2d0/0x3b0
[ 1356.400663] ERA: 90000000002e7a30 smp_call_function_many+0x300/0x3b0
[ 1356.407146] CRMD: 00000001 (PLV1 -IE -DA -PG DACF=SUC DACM=SUC -WE)
[ 1356.413460] PRMD: 00000004 (PPLV0 +PIE -PWE)
[ 1356.417784] EUEN: 9000100081e10588 (-FPE -SXE -ASXE +BTE)
[ 1356.423232] ECFG: 900000000160e110 (LIE=4,8 VS=0)
[ 1356.427987] ESTAT: 9000100081e10580
[ 1356.431446] ExcCode : 21 (SubCode 7)
[ 1356.434991] PRID: 0014c011 (Loongson-64bit, Loongson-3C5000)
[ 1356.440698] CPU: 31 PID: 3947 Comm: gnome-shell Kdump: loaded Tainted: G E 4.19.190-7.11.an8.loongarch64 #1
[ 1356.451674] Hardware name: LOONGSON Dabieshan/Loongson-LS2C50C6, BIOS Loongson UEFI (3C50007A2000_C6) V4.0.19-Dual 07/09/24 11:20:02
[ 1356.463514] Stack : 0000000000000000 900000000110a188 90001010249e8000 900010103c043c10
[ 1356.471468] 0000000000000000 900010103c043c10 0000000000000000 9000000000aedce0
[ 1356.479420] 90000000060359e8 9000000001e79aa0 6572617764726148 32303a30323a3131
[ 1356.487372] 900000000110a188 0000000000000007 0000000000000006 0000000000000007
[ 1356.495324] 9000100081e00950 0000000000000000 00000000000004e3 6f73676e6f6f4c20
[ 1356.503277] 000000000006e45a 00001000800e0000 9000100081e06f40 ffff800000000000
[ 1356.511230] 90000000017402b0 0000000000000000 0000000000000000 0000000000000000
[ 1356.519182] 90001010249eb860 900000000160e268 900000000174d1a8 0000000000000000
[ 1356.527135] 900000000020a4c4 000000ff4c2f8600 00000000000000b0 0000000000000004
[ 1356.535087] 0000000000000001 000000000007141c 0000000000000000 90000000017402b0
[ 1356.543040] ...
[ 1356.545463] Call Trace:
[ 1356.547893] [<900000000020a4c4>] show_stack+0x34/0x140
[ 1356.553007] [<900000000110a184>] dump_stack+0xac/0xe8
[ 1356.558028] [<900000000031be10>] watchdog_timer_fn+0x2d0/0x330
[ 1356.563824] [<90000000002ce984>] __hrtimer_run_queues+0x194/0x400
[ 1356.569877] [<90000000002cfc70>] hrtimer_interrupt+0x140/0x380
[ 1356.575672] [<9000000000209844>] constant_timer_interrupt+0x34/0x50
[ 1356.581901] [<90000000002a4c68>] __handle_irq_event_percpu+0x88/0x280
[ 1356.588298] [<90000000002a4e84>] handle_irq_event_percpu+0x24/0x90
[ 1356.594439] [<90000000002aad60>] handle_percpu_irq+0x60/0xa0
[ 1356.600059] [<90000000002a37e8>] generic_handle_irq+0x28/0x50
[ 1356.605771] [<900000000111877c>] do_IRQ+0x1c/0x30
[ 1356.610442] [<90000000002034b0>] except_vec_vi_handler+0xac/0xdc
[ 1356.616408] [<90000000002e7a30>] smp_call_function_many+0x300/0x3b0
[ 1356.622634] [<90000000002e7c20>] on_each_cpu_mask+0x30/0xb0
[ 1356.628170] [<90000000002113ac>] flush_tlb_page+0x6c/0x120
[ 1356.633622] [<9000000000409300>] ptep_clear_flush+0x60/0x80
[ 1356.639157] [<900000000040b1d0>] try_to_unmap_one+0x230/0x810
[ 1356.644863] [<900000000040a100>] rmap_walk_anon+0x110/0x2c0
[ 1356.650397] [<900000000040c678>] try_to_unmap+0xb8/0x130
[ 1356.655673] [<9000000000440a48>] migrate_pages+0x828/0xa30
[ 1356.661121] [<90000000004415fc>] migrate_misplaced_page+0x19c/0x2d0
[ 1356.667351] [<90000000003fd030>] __handle_mm_fault+0x750/0x1500
[ 1356.673230] [<90000000003fdef0>] handle_mm_fault+0x110/0x250
[ 1356.678850] [<900000000111853c>] do_page_fault+0x17c/0x3a0
[ 1356.684300] [<9000000000219b00>] tlb_do_page_fault_0+0x110/0x128
[ 1356.690268] Kernel panic - not syncing: softlockup: hung tasks
[ 1356.696061] CPU: 31 PID: 3947 Comm: gnome-shell Kdump: loaded Tainted: G EL 4.19.190-7.11.an8.loongarch64 #1
[ 1356.707036] Hardware name: LOONGSON Dabieshan/Loongson-LS2C50C6, BIOS Loongson UEFI (3C50007A2000_C6) V4.0.19-Dual 07/09/24 11:20:02
[ 1356.718876] Stack : 0000000000000000 900000000110a188 90001010249e8000 900010103c043b80
[ 1356.726828] 0000000000000000 900010103c043b80 0000000000000000 9000000000aedce0
[ 1356.734780] 90000000060365e0 9000000001e79aa0 6572617764726148 32303a30323a3131
[ 1356.742733] 900000000110a188 0000000000000007 0000000000000006 0000000000000007
[ 1356.750685] 9000100081e00950 0000000000000000 000000000000050b 6f73676e6f6f4c20
[ 1356.758638] 00000000000ac9dc 00001000800e0000 9000100081e06f40 ffff800000000000
[ 1356.766590] 90000000017402b0 0000000000000000 0000000000000000 0000000000000000
[ 1356.774543] 90001010249eb860 900000000160e268 900000000174d1a8 0000000000000000
[ 1356.782495] 900000000020a4c4 000000ff4c2f8600 00000000000000b0 0000000000000004
[ 1356.790447] 0000000000000001 000000000007141c 0000000000000000 90000000017402b0
[ 1356.798400] ...
[ 1356.800822] Call Trace:
[ 1356.803246] [<900000000020a4c4>] show_stack+0x34/0x140
[ 1356.808348] [<900000000110a184>] dump_stack+0xac/0xe8
[ 1356.813367] [<90000000010fe97c>] panic+0x120/0x288
[ 1356.818124] [<900000000031be5c>] watchdog_timer_fn+0x31c/0x330
[ 1356.823917] [<90000000002ce984>] __hrtimer_run_queues+0x194/0x400
[ 1356.829969] [<90000000002cfc70>] hrtimer_interrupt+0x140/0x380
[ 1356.835762] [<9000000000209844>] constant_timer_interrupt+0x34/0x50
[ 1356.841986] [<90000000002a4c68>] __handle_irq_event_percpu+0x88/0x280
[ 1356.848383] [<90000000002a4e84>] handle_irq_event_percpu+0x24/0x90
[ 1356.854521] [<90000000002aad60>] handle_percpu_irq+0x60/0xa0
[ 1356.860141] [<90000000002a37e8>] generic_handle_irq+0x28/0x50
[ 1356.865847] [<900000000111877c>] do_IRQ+0x1c/0x30
[ 1356.870516] [<90000000002034b0>] except_vec_vi_handler+0xac/0xdc
[ 1356.876482] [<90000000002e7a30>] smp_call_function_many+0x300/0x3b0
[ 1356.882707] [<90000000002e7c20>] on_each_cpu_mask+0x30/0xb0
[ 1356.888240] [<90000000002113ac>] flush_tlb_page+0x6c/0x120
[ 1356.893687] [<9000000000409300>] ptep_clear_flush+0x60/0x80
[ 1356.899221] [<900000000040b1d0>] try_to_unmap_one+0x230/0x810
[ 1356.904927] [<900000000040a100>] rmap_walk_anon+0x110/0x2c0
[ 1356.910460] [<900000000040c678>] try_to_unmap+0xb8/0x130
[ 1356.915734] [<9000000000440a48>] migrate_pages+0x828/0xa30
[ 1356.921180] [<90000000004415fc>] migrate_misplaced_page+0x19c/0x2d0
[ 1356.927405] [<90000000003fd030>] __handle_mm_fault+0x750/0x1500
[ 1356.933284] [<90000000003fdef0>] handle_mm_fault+0x110/0x250
[ 1356.938904] [<900000000111853c>] do_page_fault+0x17c/0x3a0
[ 1356.944351] [<9000000000219b00>] tlb_do_page_fault_0+0x110/0x128
[ 1358.741094] rcu: rcu_sched kthread starved for 2490 jiffies! g40025 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=0
[ 1358.751206] rcu: RCU grace-period kthread stack dump:
[ 1358.756223] rcu_sched I 0 11 2 0x00004000
[ 1358.761671] Stack : ffffffffffffffc0 9000000001121470 9000000001112470 900000000600f380
[ 1358.769625] 900000000161c2f0 0000000000000004 900000103939fd40 0000000000000000
[ 1358.777579] ffffffffffffffff 90000000015f5940 900000000161c2f0 0000000000000000
[ 1358.785531] 900000103939fd40 9000000006004a00 00000000000000b0 00000001000409ce
[ 1358.793484] 900000000160e034 900000000111206c 00000001000409ce 90000000011169ac
[ 1358.801437] 900000000160df48 00000000000000b4 0000000000000000 9000000006004ae8
[ 1358.809389] 00000001000409ce 90000000002cc6b0 0000000003c00000 9000001039375400
[ 1358.817341] 900000000161c2f0 0000000000000000 0000000000000001 900000000160e030
[ 1358.825294] 0000000000000005 90000000015f5940 0000000000000006 900000000161c2f0
[ 1358.833248] 9000000001634c00 9000000001632c00 900000000160e034 90000000002bc024
[ 1358.841201] ...
[ 1358.843624] Call Trace:
[ 1358.846049] [<9000000001111840>] __schedule+0x4f0/0xcf0
[ 1358.851237] [<9000000001112068>] schedule+0x28/0x80
[ 1358.856080] [<90000000011169a8>] schedule_timeout+0x208/0x520
[ 1358.861792] [<90000000002bc020>] rcu_gp_kthread+0x9a0/0xab0
[ 1358.867329] [<900000000024f18c>] kthread+0x12c/0x140
[ 1358.872261] [<9000000000203248>] ret_from_kernel_thread+0x8/0x10

Expected results:
The host runs normally, the KVM guest runs normally, and the guest's NUMA node information matches the preset configuration.

Additional info:
When the guest's cpu0 is assigned to NUMA node0, both the host and the guest run normally. The startup parameters are as follows:
# /usr/libexec/qemu-kvm \
    -machine loongson7a \
    -cpu 'Loongson-3A5000' \
    -m 4096 \
    -object memory-backend-ram,size=1024M,id=mem-mem0 \
    -object memory-backend-ram,size=3072M,id=mem-mem1 \
    -smp 2,maxcpus=2 \
    -numa node,memdev=mem-mem0,cpus=0 \
    -numa node,memdev=mem-mem1,cpus=1 \
    -bios loongarch_bios.bin \
    -drive file=AnolisOS-8.10-loongarch64.qcow2,if=virtio \
    -nographic \
    -enable-kvm \
    -serial stdio \
    -monitor telnet:localhost:4444,server,nowait
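For step 2 (checking whether the guest's NUMA node information matches the preset), one possible way to read the CPU-to-node mapping inside the guest is sketched below. It assumes the guest kernel exposes the standard Linux sysfs files under /sys/devices/system/node/nodeN/cpulist; the function name `node_of_cpu` and its optional second argument are hypothetical helpers introduced only for this illustration and are not part of the original report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the NUMA node ID that owns a given CPU,
# based on the per-node "cpulist" files in sysfs.
# Usage: node_of_cpu CPU [NODE_DIR]
#   NODE_DIR defaults to /sys/devices/system/node (assumed to exist in the guest).
node_of_cpu() {
    local cpu=$1
    local root=${2:-/sys/devices/system/node}
    local dir list part lo hi
    local -a parts

    for dir in "$root"/node[0-9]*; do
        [ -r "$dir/cpulist" ] || continue
        list=$(cat "$dir/cpulist")

        # A cpulist looks like "0-3,8,10-11"; split it on commas.
        IFS=, read -r -a parts <<< "$list"
        for part in "${parts[@]}"; do
            [ -n "$part" ] || continue
            lo=${part%-*}   # "0-3" -> "0", "8" -> "8"
            hi=${part#*-}   # "0-3" -> "3", "8" -> "8"
            if [ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ]; then
                echo "${dir##*/node}"   # strip the path up to "node", leaving the ID
                return 0
            fi
        done
    done
    return 1   # CPU not found on any node
}
```

Running `node_of_cpu 0` inside the guest prints the node that holds cpu0. Assuming QEMU numbers the `-numa node` options in the order given, the parameters in step 1 would be expected to place cpu0 on node 1, and the parameters under Additional info on node 0.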
The problem has been located and fixed; the patch has been submitted to the internal rd: http://rd.loongson.cn:8081/c/kernel/linux-4.19-anolis/+/62657
The PR has been merged, and the Loongson kernel has been rebuilt as kernel-4.19.190-7.12.an8.
The internal code review raised objections. The root cause needs further analysis, so the issue remains open for now.