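Description of problem:
When a long interrupts-disabled period is simulated in a Hygon virtual machine, the NMI watchdog misbehaves: the reset information shows that the reset was not triggered by an NMI.

Version-Release number of selected component (if applicable):
kernel: 4.19.91-26.an8.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Run an interrupts-disabled test in a Hygon virtual machine.
2. Check the reset log.
3. Observe that the NMI watchdog misbehaves: no hard lockup is detected, and the reset information reads "softlockup: hung task".

Actual results:
A softlockup reset occurs.

Expected results:
A hard lockup is detected and the reset is triggered by the NMI.

Additional info: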
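Physical host: HygonGenuine, Hygon C86 7265 24-core Processor, running a kernel based on kernel-419 (mainly Euler (欧拉)).
Virtual machines: both Euler (欧拉) and Anolis (龙蜥) were tried (kernel-419).
The main kernel command-line parameters on both the physical host and the virtual machine are: panic=3 nmi_watchdog=1 softlockup_panic=1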
I tried backporting the patches below. With them applied, the problem went from always reproducible to intermittent; the patches do not resolve it:

KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog
https://github.com/torvalds/linux/commit/75189d1de1b377e580ebd2d2c55914631eac9c64

Prerequisite patches:
perf/core: Provide a kernel-internal interface to recalibrate event period
https://github.com/torvalds/linux/commit/3ca270fc9edb258d5bfa271bcf851614e9e6e7d4

KVM: x86: Fix perfctr WRMSR for running counters
https://github.com/torvalds/linux/commit/4400cf546b4bb62d49198f6642add01bf6e9b34d

KVM: x86: Adjust counter sample period after a wrmsr
https://github.com/torvalds/linux/commit/168d918f2643d7d3f0240e768d40b4f8aba3540a
(In reply to shenbw from comment #3)
> I tried backporting the patches below. After applying them and upgrading the
> physical host, tests in the virtual machine show the problem going from
> always reproducible to intermittent; the patches do not resolve it:
>
> KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog
> https://github.com/torvalds/linux/commit/75189d1de1b377e580ebd2d2c55914631eac9c64
>
> Prerequisite patches:
> perf/core: Provide a kernel-internal interface to recalibrate event period
> https://github.com/torvalds/linux/commit/3ca270fc9edb258d5bfa271bcf851614e9e6e7d4
>
> KVM: x86: Fix perfctr WRMSR for running counters
> https://github.com/torvalds/linux/commit/4400cf546b4bb62d49198f6642add01bf6e9b34d
>
> KVM: x86: Adjust counter sample period after a wrmsr
> https://github.com/torvalds/linux/commit/168d918f2643d7d3f0240e768d40b4f8aba3540a
Please check the output of "cat /proc/sys/kernel/nmi_watchdog".
(In reply to likexu from comment #5)
> try to check "cat /proc/sys/kernel/nmi_watchdog"

The value is 1.
(In reply to shenbw from comment #6)
> value is 1

# inside the guest
echo 1 > /proc/sys/kernel/nmi_watchdog
cat /proc/sys/kernel/nmi_watchdog
watch -n 1 -d "cat /proc/interrupts | grep NMI"

Check whether the NMI counts in the output increase periodically.
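A non-interactive variant of the check above may be easier to paste into a report. This is only a sketch: nmi_total and nmi_delta are hypothetical helper names, and it assumes the usual /proc/interrupts layout in which the per-CPU NMI counts sit on a line starting with "NMI:".

```shell
#!/bin/sh
# Sum the per-CPU NMI counts from a /proc/interrupts-style file.
# Hypothetical helper; defaults to /proc/interrupts when no file is given.
nmi_total() {
    awk '$1 == "NMI:" {
             s = 0
             # Only the numeric fields are counts; the trailing words are the description.
             for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i
             print s
         }' "${1:-/proc/interrupts}"
}

# Print how much the NMI total grew over an interval (default: 10 seconds).
nmi_delta() {
    interval="${1:-10}"
    file="${2:-/proc/interrupts}"
    before=$(nmi_total "$file")
    sleep "$interval"
    after=$(nmi_total "$file")
    # Treat a missing NMI line as a count of 0.
    echo $(( ${after:-0} - ${before:-0} ))
}

# Example (inside the guest): nmi_delta 10
```

A delta of 0 over the interval would match the behavior reported without the patch; a positive delta indicates that NMIs are being delivered to the guest.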
(In reply to likexu from comment #7)
> # inside the guest
> echo 1 > /proc/sys/kernel/nmi_watchdog
> cat /proc/sys/kernel/nmi_watchdog
> watch -n 1 -d "cat /proc/interrupts| grep NMI"
> // Check if the output numbers increase periodically

The value of nmi_watchdog is now 1.
Without the above patch, the NMI counts do not increase; with the patch applied, they do.
(In reply to shenbw from comment #8)
> Now the value of nmi_watchdog is 1
> The interrupts number will not increase if the above patch is not merged,
> but will increase after the patch is merged

So the above patch fixes at least part of your issue: the NMI watchdog now works, but hardlockup_panic does not.

You could re-test on a regular AMD guest with the same configuration, or work out how the code path differs between AMD and Hygon.

I would suspect "softlockup_panic=1" or "hardlockup_panic=1" first.
(In reply to likexu from comment #9)
> At least the above patch helps fix part of your issue,
> which means the NMI watchdog works, but the hardlockup_panic doesn't.
>
> You may re-test on a normal AMD guest w/ same configurations,
> or figure out the code path difference between AMD and hygon.
>
> I may suspect "softlockup_panic=1" or "hardlockup_panic=1" first.

With the patch applied and "softlockup_panic=0", the hard lockup is triggered, but the time until the guest resets is not constant and is much longer than the configured time (10 s).
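When comparing runs like the one above, it may help to record the guest's watchdog settings next to each result. The following is a sketch only: show_watchdog_settings is a hypothetical helper name, and it assumes the standard sysctl files under /proc/sys/kernel (hardlockup_panic is present only when the kernel is built with the hard lockup detector). For reference, watchdog_thresh defaults to 10 seconds and is the hard lockup threshold; the soft lockup threshold is twice that value.

```shell
#!/bin/sh
# Print the watchdog-related sysctls from a directory (default: /proc/sys/kernel).
# show_watchdog_settings is a hypothetical helper name, not an existing tool.
show_watchdog_settings() {
    dir="${1:-/proc/sys/kernel}"
    for name in nmi_watchdog watchdog_thresh softlockup_panic hardlockup_panic panic; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        else
            # The file is missing, e.g. the feature is not built into this kernel.
            printf '%s=<not available>\n' "$name"
        fi
    done
}

# Example (inside the guest): show_watchdog_settings
```

Attaching this output for both the "softlockup_panic=1" and "softlockup_panic=0" runs would make the two results easier to compare.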
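(In reply to shenbw from comment #2)
> Physical host: HygonGenuine, Hygon C86 7265 24-core Processor, running a
> kernel based on kernel-419 (mainly Euler (欧拉)).
> Virtual machines: both Euler (欧拉) and Anolis (龙蜥) were tried (kernel-419).
> The main kernel command-line parameters on both the physical host and the
> virtual machine are: panic=3 nmi_watchdog=1 softlockup_panic=1

When the physical host runs Anolis (kernel-419) and the test is run in an Anolis (kernel-419) virtual machine, disabling interrupts for a long time in the virtual machine produces abnormal NMI behavior and incorrect reset information.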