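Bug 19066 - [Anolis 8.10][RC1][loongarch64] Starting a KVM guest with cpu0 placed on a NUMA node other than node0 hangs the physical host.
Summary: [Anolis 8.10][RC1][loongarch64] Starting a KVM guest with cpu0 placed on a NUMA node other than node0 hangs the physical host.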
Status: NEW
Alias: None
Product: Anolis OS 8
Classification: Anolis OS
Component: kernel - anck-4.19
Version: 8.10
Hardware: loongarch Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: wenlong
QA Contact: shuming
Reported: 2025-02-27 09:55 UTC by wuzhiguo
Modified: 2025-03-19 16:16 UTC

Description wuzhiguo loongson_group 2025-02-27 09:55:21 UTC
Description of problem:
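When a KVM guest is started with cpu0 placed on a NUMA node other than node0, the physical host hangs.
When the guest is started with cpu0 placed on NUMA node0, neither the physical host nor the guest shows any problem.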

Version-Release number of selected component (if applicable):
Kernel version: kernel-4.19.190-7.11.an8.loongarch64
QEMU version: qemu-kvm-6.2.0-53.0.3.module+an8.9.0+11292+334bc2d1.2.loongarch64

How reproducible:

Steps to Reproduce:
1. Start the virtual machine with the following parameters:
# /usr/libexec/qemu-kvm \
    -machine loongson7a \
    -cpu 'Loongson-3A5000' \
    -m 4096 \
    -object memory-backend-ram,size=1024M,id=mem-mem0 \
    -object memory-backend-ram,size=3072M,id=mem-mem1  \
    -smp 2,maxcpus=2  \
    -numa node,memdev=mem-mem0,cpus=1  \
    -numa node,memdev=mem-mem1,cpus=0  \
    -bios loongarch_bios.bin \
    -drive file=AnolisOS-8.10-loongarch64.qcow2,if=virtio \
    -nographic \
    -enable-kvm \
    -serial stdio \
    -monitor telnet:localhost:4444,server,nowait
2. Observe the state of the virtual machine and check whether the guest's NUMA node information matches the configured topology.
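
One way to carry out the check in step 2 from inside the guest is to read the per-node `cpulist` files under `/sys/devices/system/node` and compare them with the topology requested on the command line. The following is a minimal sketch, not part of the original report; the helper names and the `EXPECTED` mapping (node0 holds cpu 1, node1 holds cpu 0, as in step 1) are illustrative assumptions.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a sysfs cpulist such as '0-2,5' into a sorted list of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist means the node has no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_node_cpus(sysfs_root="/sys/devices/system/node"):
    """Map each NUMA node id to the CPUs that sysfs reports for it."""
    topology = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if not match:
            continue
        with open(os.path.join(path, "cpulist")) as f:
            topology[int(match.group(1))] = parse_cpulist(f.read())
    return topology


# Assumed expectation, taken from the two -numa options in step 1:
# the first node gets cpu 1, the second node gets cpu 0.
EXPECTED = {0: [1], 1: [0]}
```

Inside the guest, `read_node_cpus() == EXPECTED` would indicate that the guest sees the requested layout; `numactl --hardware` gives the same information interactively.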

Actual results:
Starting the KVM guest with cpu0 placed on a NUMA node other than node0 hangs the physical host. The host reports the following errors:
[ 1348.730826] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1348.736736] rcu: 	22-...0: (0 ticks this GP) idle=bf2/1/0x4000000000000000 softirq=6400/6400 fqs=1088 
[ 1348.745985] rcu: 	(detected by 21, t=5269 jiffies, g=40025, q=2196)
[ 1348.752223] Sending NMI from CPU 21 to CPUs 22:
[ 1356.222814] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [gnome-shell:3947]
[ 1356.230256] Modules linked in: tun(E) ib_core(E) xt_CHECKSUM(E) ipt_MASQUERADE(E) xt_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) nft_compat(E) nft_chain_route_ipv6(E) nft_chain_nat_ipv6(E) nf_nat_ipv6(E) nft_counter(E) nft_chain_route_ipv4(E) nft_chain_nat_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) nfnetlink(E) bridge(E) rfkill(E) vfat(E) fat(E) ipmi_ssif(E) efi_pstore(E) kvm(E) efivars(E) snd_hda_loongson(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) nvme_fabrics(E) dm_multipath(E) r8168(E) txgbe(E) megaraid_sas(E) sw_se_echip_drv(E) cdc_ether(E) usbnet(E) 8021q(E) garp(E) mrp(E) stp(E) llc(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) libcxgbi(E) libcxgb(E) qla4xxx(E) iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E)
[ 1356.300722]  fuse(E) scsi_transport_iscsi(E)
[ 1356.304966] CPU: 31 PID: 3947 Comm: gnome-shell Kdump: loaded Tainted: G            E     4.19.190-7.11.an8.loongarch64 #1
[ 1356.315943] Hardware name: LOONGSON Dabieshan/Loongson-LS2C50C6, BIOS Loongson UEFI (3C50007A2000_C6) V4.0.19-Dual 07/09/24 11:20:02
[ 1356.327783] pc 90000000002e7a30 ra 90000000002e7a00 tp 90001010249e8000 sp 90001010249eb9a0
[ 1356.336081] a0 0000000000000016 a1 0000000000000016 a2 0000000000000016 a3 9000100081e105c0
[ 1356.344378] a4 9000100081e105c0 a5 00000000003d26af a6 0000000000000030 a7 0000000000000001
[ 1356.352676] t0 0000000000000001 t1 9000100080c14aa0 t2 ffffffffffffffc0 t3 0000000000001040
[ 1356.360973] t4 ffffffffffffffff t5 0000000000000001 t6 ffffffff80000000 t7 fffffffffe000000
[ 1356.369271] t8 000000ff4c2f8000 u0 90000000003b9d90 s9 0000000000000100 s0 9000100081e10580
[ 1356.377568] s1 900000000160e110 s2 9000100081e10588 s3 90000000015f7690 s4 0000000000000001
[ 1356.385866] s5 9000000000211460 s6 90001010249eba38 s7 9000000001d30540 s8 9000100081e105a8
[ 1356.394179]    ra: 90000000002e7a00 smp_call_function_many+0x2d0/0x3b0
[ 1356.400663]   ERA: 90000000002e7a30 smp_call_function_many+0x300/0x3b0
[ 1356.407146]  CRMD: 00000001 (PLV1 -IE -DA -PG DACF=SUC DACM=SUC -WE)
[ 1356.413460]  PRMD: 00000004 (PPLV0 +PIE -PWE)
[ 1356.417784]  EUEN: 9000100081e10588 (-FPE -SXE -ASXE +BTE)
[ 1356.423232]  ECFG: 900000000160e110 (LIE=4,8 VS=0)
[ 1356.427987] ESTAT: 9000100081e10580
[ 1356.431446] ExcCode : 21 (SubCode 7)
[ 1356.434991]  PRID: 0014c011 (Loongson-64bit, Loongson-3C5000)
[ 1356.440698] CPU: 31 PID: 3947 Comm: gnome-shell Kdump: loaded Tainted: G            E     4.19.190-7.11.an8.loongarch64 #1
[ 1356.451674] Hardware name: LOONGSON Dabieshan/Loongson-LS2C50C6, BIOS Loongson UEFI (3C50007A2000_C6) V4.0.19-Dual 07/09/24 11:20:02
[ 1356.463514] Stack : 0000000000000000 900000000110a188 90001010249e8000 900010103c043c10
[ 1356.471468]         0000000000000000 900010103c043c10 0000000000000000 9000000000aedce0
[ 1356.479420]         90000000060359e8 9000000001e79aa0 6572617764726148 32303a30323a3131
[ 1356.487372]         900000000110a188 0000000000000007 0000000000000006 0000000000000007
[ 1356.495324]         9000100081e00950 0000000000000000 00000000000004e3 6f73676e6f6f4c20
[ 1356.503277]         000000000006e45a 00001000800e0000 9000100081e06f40 ffff800000000000
[ 1356.511230]         90000000017402b0 0000000000000000 0000000000000000 0000000000000000
[ 1356.519182]         90001010249eb860 900000000160e268 900000000174d1a8 0000000000000000
[ 1356.527135]         900000000020a4c4 000000ff4c2f8600 00000000000000b0 0000000000000004
[ 1356.535087]         0000000000000001 000000000007141c 0000000000000000 90000000017402b0
[ 1356.543040]         ...
[ 1356.545463] Call Trace:
[ 1356.547893] [<900000000020a4c4>] show_stack+0x34/0x140
[ 1356.553007] [<900000000110a184>] dump_stack+0xac/0xe8
[ 1356.558028] [<900000000031be10>] watchdog_timer_fn+0x2d0/0x330
[ 1356.563824] [<90000000002ce984>] __hrtimer_run_queues+0x194/0x400
[ 1356.569877] [<90000000002cfc70>] hrtimer_interrupt+0x140/0x380
[ 1356.575672] [<9000000000209844>] constant_timer_interrupt+0x34/0x50
[ 1356.581901] [<90000000002a4c68>] __handle_irq_event_percpu+0x88/0x280
[ 1356.588298] [<90000000002a4e84>] handle_irq_event_percpu+0x24/0x90
[ 1356.594439] [<90000000002aad60>] handle_percpu_irq+0x60/0xa0
[ 1356.600059] [<90000000002a37e8>] generic_handle_irq+0x28/0x50
[ 1356.605771] [<900000000111877c>] do_IRQ+0x1c/0x30
[ 1356.610442] [<90000000002034b0>] except_vec_vi_handler+0xac/0xdc
[ 1356.616408] [<90000000002e7a30>] smp_call_function_many+0x300/0x3b0
[ 1356.622634] [<90000000002e7c20>] on_each_cpu_mask+0x30/0xb0
[ 1356.628170] [<90000000002113ac>] flush_tlb_page+0x6c/0x120
[ 1356.633622] [<9000000000409300>] ptep_clear_flush+0x60/0x80
[ 1356.639157] [<900000000040b1d0>] try_to_unmap_one+0x230/0x810
[ 1356.644863] [<900000000040a100>] rmap_walk_anon+0x110/0x2c0
[ 1356.650397] [<900000000040c678>] try_to_unmap+0xb8/0x130
[ 1356.655673] [<9000000000440a48>] migrate_pages+0x828/0xa30
[ 1356.661121] [<90000000004415fc>] migrate_misplaced_page+0x19c/0x2d0
[ 1356.667351] [<90000000003fd030>] __handle_mm_fault+0x750/0x1500
[ 1356.673230] [<90000000003fdef0>] handle_mm_fault+0x110/0x250
[ 1356.678850] [<900000000111853c>] do_page_fault+0x17c/0x3a0
[ 1356.684300] [<9000000000219b00>] tlb_do_page_fault_0+0x110/0x128
[ 1356.690268] Kernel panic - not syncing: softlockup: hung tasks
[ 1356.696061] CPU: 31 PID: 3947 Comm: gnome-shell Kdump: loaded Tainted: G            EL    4.19.190-7.11.an8.loongarch64 #1
[ 1356.707036] Hardware name: LOONGSON Dabieshan/Loongson-LS2C50C6, BIOS Loongson UEFI (3C50007A2000_C6) V4.0.19-Dual 07/09/24 11:20:02
[ 1356.718876] Stack : 0000000000000000 900000000110a188 90001010249e8000 900010103c043b80
[ 1356.726828]         0000000000000000 900010103c043b80 0000000000000000 9000000000aedce0
[ 1356.734780]         90000000060365e0 9000000001e79aa0 6572617764726148 32303a30323a3131
[ 1356.742733]         900000000110a188 0000000000000007 0000000000000006 0000000000000007
[ 1356.750685]         9000100081e00950 0000000000000000 000000000000050b 6f73676e6f6f4c20
[ 1356.758638]         00000000000ac9dc 00001000800e0000 9000100081e06f40 ffff800000000000
[ 1356.766590]         90000000017402b0 0000000000000000 0000000000000000 0000000000000000
[ 1356.774543]         90001010249eb860 900000000160e268 900000000174d1a8 0000000000000000
[ 1356.782495]         900000000020a4c4 000000ff4c2f8600 00000000000000b0 0000000000000004
[ 1356.790447]         0000000000000001 000000000007141c 0000000000000000 90000000017402b0
[ 1356.798400]         ...
[ 1356.800822] Call Trace:
[ 1356.803246] [<900000000020a4c4>] show_stack+0x34/0x140
[ 1356.808348] [<900000000110a184>] dump_stack+0xac/0xe8
[ 1356.813367] [<90000000010fe97c>] panic+0x120/0x288
[ 1356.818124] [<900000000031be5c>] watchdog_timer_fn+0x31c/0x330
[ 1356.823917] [<90000000002ce984>] __hrtimer_run_queues+0x194/0x400
[ 1356.829969] [<90000000002cfc70>] hrtimer_interrupt+0x140/0x380
[ 1356.835762] [<9000000000209844>] constant_timer_interrupt+0x34/0x50
[ 1356.841986] [<90000000002a4c68>] __handle_irq_event_percpu+0x88/0x280
[ 1356.848383] [<90000000002a4e84>] handle_irq_event_percpu+0x24/0x90
[ 1356.854521] [<90000000002aad60>] handle_percpu_irq+0x60/0xa0
[ 1356.860141] [<90000000002a37e8>] generic_handle_irq+0x28/0x50
[ 1356.865847] [<900000000111877c>] do_IRQ+0x1c/0x30
[ 1356.870516] [<90000000002034b0>] except_vec_vi_handler+0xac/0xdc
[ 1356.876482] [<90000000002e7a30>] smp_call_function_many+0x300/0x3b0
[ 1356.882707] [<90000000002e7c20>] on_each_cpu_mask+0x30/0xb0
[ 1356.888240] [<90000000002113ac>] flush_tlb_page+0x6c/0x120
[ 1356.893687] [<9000000000409300>] ptep_clear_flush+0x60/0x80
[ 1356.899221] [<900000000040b1d0>] try_to_unmap_one+0x230/0x810
[ 1356.904927] [<900000000040a100>] rmap_walk_anon+0x110/0x2c0
[ 1356.910460] [<900000000040c678>] try_to_unmap+0xb8/0x130
[ 1356.915734] [<9000000000440a48>] migrate_pages+0x828/0xa30
[ 1356.921180] [<90000000004415fc>] migrate_misplaced_page+0x19c/0x2d0
[ 1356.927405] [<90000000003fd030>] __handle_mm_fault+0x750/0x1500
[ 1356.933284] [<90000000003fdef0>] handle_mm_fault+0x110/0x250
[ 1356.938904] [<900000000111853c>] do_page_fault+0x17c/0x3a0
[ 1356.944351] [<9000000000219b00>] tlb_do_page_fault_0+0x110/0x128
[ 1358.741094] rcu: rcu_sched kthread starved for 2490 jiffies! g40025 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=0
[ 1358.751206] rcu: RCU grace-period kthread stack dump:
[ 1358.756223] rcu_sched       I    0    11      2 0x00004000
[ 1358.761671] Stack : ffffffffffffffc0 9000000001121470 9000000001112470 900000000600f380
[ 1358.769625]         900000000161c2f0 0000000000000004 900000103939fd40 0000000000000000
[ 1358.777579]         ffffffffffffffff 90000000015f5940 900000000161c2f0 0000000000000000
[ 1358.785531]         900000103939fd40 9000000006004a00 00000000000000b0 00000001000409ce
[ 1358.793484]         900000000160e034 900000000111206c 00000001000409ce 90000000011169ac
[ 1358.801437]         900000000160df48 00000000000000b4 0000000000000000 9000000006004ae8
[ 1358.809389]         00000001000409ce 90000000002cc6b0 0000000003c00000 9000001039375400
[ 1358.817341]         900000000161c2f0 0000000000000000 0000000000000001 900000000160e030
[ 1358.825294]         0000000000000005 90000000015f5940 0000000000000006 900000000161c2f0
[ 1358.833248]         9000000001634c00 9000000001632c00 900000000160e034 90000000002bc024
[ 1358.841201]         ...
[ 1358.843624] Call Trace:
[ 1358.846049] [<9000000001111840>] __schedule+0x4f0/0xcf0
[ 1358.851237] [<9000000001112068>] schedule+0x28/0x80
[ 1358.856080] [<90000000011169a8>] schedule_timeout+0x208/0x520
[ 1358.861792] [<90000000002bc020>] rcu_gp_kthread+0x9a0/0xab0
[ 1358.867329] [<900000000024f18c>] kthread+0x12c/0x140
[ 1358.872261] [<9000000000203248>] ret_from_kernel_thread+0x8/0x10
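
The key facts in the log above are that RCU reports CPU 22 as stalled, and that CPU 31 (gnome-shell, PID 3947) hits a soft lockup with its ERA inside `smp_call_function_many`, reached from `flush_tlb_page`. When comparing several host logs, those fields can be extracted mechanically. The following is a minimal sketch, not part of the original report; the function name and the regular expressions are illustrative assumptions matched only against the message formats shown in this log.

```python
import re

# Patterns written against the message formats seen in the log above.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]")
RCU_STALLED = re.compile(r"rcu:\s+(\d+)-\S+: \(")
ERA_SYMBOL = re.compile(r"ERA: [0-9a-f]+ (\w+)\+0x")


def summarize(log_text):
    """Collect soft-lockup reports, RCU-stalled CPUs and ERA symbols from a kernel log."""
    summary = {"soft_lockups": [], "rcu_stalled_cpus": [], "era_symbols": []}
    for line in log_text.splitlines():
        m = SOFT_LOCKUP.search(line)
        if m:
            summary["soft_lockups"].append({
                "cpu": int(m.group(1)),
                "seconds": int(m.group(2)),
                "comm": m.group(3),
                "pid": int(m.group(4)),
            })
            continue
        m = RCU_STALLED.search(line)
        if m:
            summary["rcu_stalled_cpus"].append(int(m.group(1)))
            continue
        m = ERA_SYMBOL.search(line)
        if m:
            # Function in which the locked-up CPU was executing.
            summary["era_symbols"].append(m.group(1))
    return summary
```

Passing the saved console output of the host to `summarize` returns a dictionary with one entry per soft-lockup report, the list of CPUs named in the RCU stall report, and the symbols found on `ERA:` lines.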

Expected results:
The physical host and the KVM guest both run normally, and the guest's NUMA node information matches the configured topology.

Additional info:
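When the KVM guest is started with cpu0 placed on NUMA node0, neither the physical host nor the guest shows any problem. The startup parameters are as follows: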
# /usr/libexec/qemu-kvm \
    -machine loongson7a \
    -cpu 'Loongson-3A5000' \
    -m 4096 \
    -object memory-backend-ram,size=1024M,id=mem-mem0 \
    -object memory-backend-ram,size=3072M,id=mem-mem1  \
    -smp 2,maxcpus=2  \
    -numa node,memdev=mem-mem0,cpus=0  \
    -numa node,memdev=mem-mem1,cpus=1  \
    -bios loongarch_bios.bin \
    -drive file=AnolisOS-8.10-loongarch64.qcow2,if=virtio \
    -nographic \
    -enable-kvm \
    -serial stdio \
    -monitor telnet:localhost:4444,server,nowait
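
The only difference between the failing and the working command line is which `-numa node` option carries `cpus=0`. Until a fix is available, guest command lines could be screened for the failing layout. The following is a minimal sketch, not part of the original report; the function names are hypothetical, and it assumes that a `-numa node` option without an explicit `nodeid=` takes the next node id in command-line order.

```python
def parse_numa_nodes(qemu_args):
    """Return {node_id: [cpus]} for the '-numa node,...' options in an argv list."""
    nodes = {}
    args = iter(qemu_args)
    for arg in args:
        if arg != "-numa":
            continue
        fields = next(args, "").split(",")
        if fields[0] != "node":
            continue
        # Assumption: without nodeid=, nodes are numbered in order of appearance.
        node_id = len(nodes)
        cpus = set()
        for field in fields[1:]:
            key, _, value = field.partition("=")
            if key == "nodeid":
                node_id = int(value)
            elif key == "cpus":
                lo, _, hi = value.partition("-")
                cpus.update(range(int(lo), int(hi or lo) + 1))
        nodes[node_id] = sorted(cpus)
    return nodes


def node_of_cpu0(qemu_args):
    """Return the NUMA node id that holds guest cpu0, or None if it is unassigned."""
    for node_id, cpus in parse_numa_nodes(qemu_args).items():
        if 0 in cpus:
            return node_id
    return None
```

For the command line in the steps to reproduce, `node_of_cpu0` returns 1, which is the layout that hung the host; for the command line above it returns 0.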
Comment 1 lixianglai loongson_group 2025-03-07 16:04:30 UTC
The problem has been located and fixed; the patch has been submitted to the internal rd:
http://rd.loongson.cn:8081/c/kernel/linux-4.19-anolis/+/62657
Comment 2 wangzhe 2025-03-16 18:30:17 UTC
The PR has been merged, and the Loongson kernel build has been updated:
kernel-4.19.190-7.12.an8
Comment 3 lixianglai loongson_group 2025-03-19 16:16:46 UTC
Internal code review raised objections, and the root cause needs further analysis, so the issue remains open for now.