Bug 2412 - nmi-watchdog reset behaves abnormally on Hygon virtual machines
Summary: nmi-watchdog reset behaves abnormally on Hygon virtual machines
Status: NEW
Alias: None
Product: ANCK 4.19 Dev
Classification: ANCK
Component: X86
Version: 4.19-026.x
Hardware: x86_64 Linux
Importance: P3-Medium S2-major
Target Milestone: ---
Assignee: Artie Ding
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-10-17 16:09 UTC by shenbw
Modified: 2022-10-24 14:11 UTC (History)

See Also:


Description shenbw 2022-10-17 16:09:22 UTC
Description of problem:
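When a long interrupts-disabled period is simulated on a Hygon virtual machine, the nmi-watchdog interrupt misbehaves: the reset information shows that the reset was not triggered by an NMI.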


Version-Release number of selected component (if applicable):
kernel:4.19.91-26.an8.x86_64

How reproducible:
Always reproducible.

Steps to Reproduce:
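1. Run the interrupts-disabled test on a Hygon virtual machine.
2. Check the reset log.
3. Observe that nmi-watchdog misbehaves: no hardlockup is detected, and the reset information reads softlockup: hung task.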

Actual results:
A softlockup reset occurs.

Expected results:
A hardlockup is detected and the reset is triggered by the NMI.

Additional info:
Comment 1 shenbw 2022-10-17 16:11:44 UTC
@
Comment 2 shenbw 2022-10-19 16:20:28 UTC
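Host: HygonGenuine, Hygon C86 7265 24-core Processor, running a kernel based on kernel 4.19 (mainly Euler). Guest: both Euler and Anolis were tried (kernel 4.19).
The main cmdline parameters on both host and guest are: panic=3 nmi_watchdog=1 softlockup_panic=1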
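To confirm which of these parameters the running kernel actually booted with, the command line can be parsed directly. A minimal sketch, assuming a POSIX shell with tr and awk; cmdline_param is a hypothetical helper name, and returning the last occurrence of a repeated parameter is an assumption about how the kernel resolves duplicates.

```shell
#!/bin/sh
# Print the value of a kernel command-line parameter, or nothing if absent.
# Usage: cmdline_param NAME [FILE]   (FILE defaults to /proc/cmdline)
cmdline_param() {
    name=$1
    file=${2:-/proc/cmdline}
    # Put each whitespace-separated token on its own line, then match
    # tokens of the form NAME=value. The last match is printed
    # (assumption: a later duplicate overrides an earlier one).
    tr -s ' \t' '\n\n' < "$file" |
        awk -F= -v n="$name" '
            $1 == n { v = substr($0, length(n) + 2); found = 1 }
            END     { if (found) print v }'
}
```

For example, `cmdline_param softlockup_panic` inside the guest should print 1 for the configuration described above, and `cmdline_param hardlockup_panic` should print nothing, since that parameter is not on the reported cmdline.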
Comment 3 shenbw 2022-10-19 16:27:02 UTC
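Tried backporting the following patches. Afterwards the problem went from always reproducible to intermittent; the patches did not resolve it: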

KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog
https://github.com/torvalds/linux/commit/75189d1de1b377e580ebd2d2c55914631eac9c64

Prerequisite patches:
perf/core: Provide a kernel-internal interface to recalibrate event period
https://github.com/torvalds/linux/commit/3ca270fc9edb258d5bfa271bcf851614e9e6e7d4

KVM: x86: Fix perfctr WRMSR for running counters
https://github.com/torvalds/linux/commit/4400cf546b4bb62d49198f6642add01bf6e9b34d

KVM: x86: Adjust counter sample period after a wrmsr
https://github.com/torvalds/linux/commit/168d918f2643d7d3f0240e768d40b4f8aba3540a
Comment 4 shenbw 2022-10-19 16:30:34 UTC
(In reply to shenbw from comment #3)
> Tried backporting the following patches. After backporting them, upgrading the host, and testing in the guest, the problem went from always reproducible to intermittent; the patches did not resolve it:
> 
> KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog
> https://github.com/torvalds/linux/commit/75189d1de1b377e580ebd2d2c55914631eac9c64
> 
> Prerequisite patches:
> perf/core: Provide a kernel-internal interface to recalibrate event period
> https://github.com/torvalds/linux/commit/3ca270fc9edb258d5bfa271bcf851614e9e6e7d4
> 
> KVM: x86: Fix perfctr WRMSR for running counters
> https://github.com/torvalds/linux/commit/4400cf546b4bb62d49198f6642add01bf6e9b34d
> 
> KVM: x86: Adjust counter sample period after a wrmsr
> https://github.com/torvalds/linux/commit/168d918f2643d7d3f0240e768d40b4f8aba3540a
Comment 5 likexu 2022-10-19 17:24:39 UTC
try to check "cat /proc/sys/kernel/nmi_watchdog"
Comment 6 shenbw 2022-10-19 18:30:13 UTC
(In reply to likexu from comment #5)
> try to check "cat /proc/sys/kernel/nmi_watchdog"

value is 1
Comment 7 likexu 2022-10-19 18:58:17 UTC
(In reply to shenbw from comment #6)
> (In reply to likexu from comment #5)
> > try to check "cat /proc/sys/kernel/nmi_watchdog"
> 
> value is 1

# inside the guest
echo 1 > /proc/sys/kernel/nmi_watchdog
cat /proc/sys/kernel/nmi_watchdog
watch -n 1 -d "cat /proc/interrupts| grep NMI"
// Check if the output numbers increase periodically
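The same check can be scripted instead of watched by eye. A minimal sketch, assuming a POSIX shell with awk; nmi_total and nmi_is_ticking are hypothetical helper names, and the 5-second default interval is an arbitrary choice that may need to be longer than one watchdog period.

```shell
#!/bin/sh
# Sum the per-CPU counters on the "NMI:" line of /proc/interrupts
# (or of a file in the same format passed as $1). Prints 0 if no
# NMI line is present.
nmi_total() {
    awk '$1 == "NMI:" {
             # Only the purely numeric fields are per-CPU counts; the
             # trailing description ("Non-maskable interrupts") is skipped.
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) sum += $i
         }
         END { print sum + 0 }' "${1:-/proc/interrupts}"
}

# Succeed (exit status 0) if the total NMI count grows over an interval.
# Usage: nmi_is_ticking [SECONDS]   (default 5)
nmi_is_ticking() {
    before=$(nmi_total)
    sleep "${1:-5}"
    after=$(nmi_total)
    echo "NMI total: $before -> $after"
    [ "$after" -gt "$before" ]
}
```

Running `nmi_is_ticking 30 && echo ok` inside the guest reports whether any NMI was delivered during the interval.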
Comment 8 shenbw 2022-10-20 09:56:13 UTC
(In reply to likexu from comment #7)
> (In reply to shenbw from comment #6)
> > (In reply to likexu from comment #5)
> > > try to check "cat /proc/sys/kernel/nmi_watchdog"
> > 
> > value is 1
> 
> # inside the guest
> echo 1 > /proc/sys/kernel/nmi_watchdog
> cat /proc/sys/kernel/nmi_watchdog
> watch -n 1 -d "cat /proc/interrupts| grep NMI"
> // Check if the output numbers increase periodically

The value of nmi_watchdog is now 1.
Without the above patches the NMI count does not increase; with the patches applied it does.
Comment 9 likexu 2022-10-20 11:14:16 UTC
(In reply to shenbw from comment #8)
> (In reply to likexu from comment #7)
> > (In reply to shenbw from comment #6)
> > > (In reply to likexu from comment #5)
> > > > try to check "cat /proc/sys/kernel/nmi_watchdog"
> > > 
> > > value is 1
> > 
> > # inside the guest
> > echo 1 > /proc/sys/kernel/nmi_watchdog
> > cat /proc/sys/kernel/nmi_watchdog
> > watch -n 1 -d "cat /proc/interrupts| grep NMI"
> > // Check if the output numbers increase periodically
> 
> Now the value of nmi_watchdog is 1
> The interrupts number will not increase if the above patch is not merged,
> but will increase after the patch is merged

So the above patches fix at least part of your issue:
the NMI watchdog now works, but hardlockup_panic does not.

You could re-test on a regular AMD guest with the same configuration,
or look for code path differences between AMD and Hygon.

I would first suspect the "softlockup_panic=1" or "hardlockup_panic=1" setting.
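To see which of these settings are in effect in the guest, the corresponding sysctl files can be dumped in one go. A minimal sketch, assuming a POSIX shell; show_watchdog_settings is a hypothetical helper name, and a knob is reported as not available when its file is missing (for example, hardlockup_panic on a kernel built without the hard lockup detector).

```shell
#!/bin/sh
# Print the watchdog-related sysctls as knob=value lines.
# Usage: show_watchdog_settings [DIR]   (DIR defaults to /proc/sys/kernel)
show_watchdog_settings() {
    dir=${1:-/proc/sys/kernel}
    for knob in nmi_watchdog watchdog_thresh hardlockup_panic softlockup_panic panic; do
        if [ -r "$dir/$knob" ]; then
            printf '%s=%s\n' "$knob" "$(cat "$dir/$knob")"
        else
            # The file is absent or unreadable on this kernel.
            printf '%s=<not available>\n' "$knob"
        fi
    done
}
```

With the reported cmdline (panic=3 nmi_watchdog=1 softlockup_panic=1), hardlockup_panic is not set explicitly, so its value here shows whatever the kernel defaulted to.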
Comment 10 shenbw 2022-10-20 11:57:55 UTC
(In reply to likexu from comment #9)
> (In reply to shenbw from comment #8)
> > (In reply to likexu from comment #7)
> > > (In reply to shenbw from comment #6)
> > > > (In reply to likexu from comment #5)
> > > > > try to check "cat /proc/sys/kernel/nmi_watchdog"
> > > > 
> > > > value is 1
> > > 
> > > # inside the guest
> > > echo 1 > /proc/sys/kernel/nmi_watchdog
> > > cat /proc/sys/kernel/nmi_watchdog
> > > watch -n 1 -d "cat /proc/interrupts| grep NMI"
> > > // Check if the output numbers increase periodically
> > 
> > Now the value of nmi_watchdog is 1
> > The interrupts number will not increase if the above patch is not merged,
> > but will increase after the patch is merged
> 
> At least the above patch helps fix part of your issue,
> which means the NMI watchdog works, but the hardlockup_panic doesn't.
> 
> You may re-test on a normal AMD guest w/ same configurations,
> or figure out the code path difference between AMD and hygon.
> 
> I may suspect "softlockup_panic=1" or "hardlockup_panic=1" first.

With the patches applied and "softlockup_panic=0", the hardlockup is detected, but the time until the guest resets is not constant and is much longer than the configured time (10s).
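For comparison with the observed reset times: per the kernel sysctl documentation, watchdog_thresh (default 10) sets the hard lockup threshold in seconds, and the soft lockup threshold is twice that value. A minimal sketch of that arithmetic, assuming the "configured time (10s)" refers to kernel.watchdog_thresh; lockup_thresholds is a hypothetical helper name.

```shell
#!/bin/sh
# Print the nominal lockup thresholds derived from watchdog_thresh.
# Usage: lockup_thresholds [SECONDS]
# Without an argument the value is read from /proc/sys/kernel/watchdog_thresh.
lockup_thresholds() {
    thresh=${1:-$(cat /proc/sys/kernel/watchdog_thresh)}
    echo "hardlockup_threshold_s=$thresh"
    # The soft lockup threshold is documented as 2 * watchdog_thresh.
    echo "softlockup_threshold_s=$((thresh * 2))"
}
```

These are nominal values; the time between the lockup and the actual guest reset also includes the panic=3 delay and may be stretched further in a virtualized environment.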
Comment 11 shenbw 2022-10-24 09:39:34 UTC
(In reply to shenbw from comment #2)
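> Host: HygonGenuine, Hygon C86 7265 24-core Processor, running a kernel
> based on kernel 4.19 (mainly Euler). Guest: both Euler and Anolis were tried (kernel 4.19).
> The main cmdline parameters on both host and guest are: panic=3 nmi_watchdog=1 softlockup_panic=1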

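With the host running Anolis (kernel 4.19) and the guest also running Anolis (kernel 4.19), disabling interrupts for a long time in the guest still results in abnormal NMI behavior and incorrect reset information.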