Bug 21644 - SMC-R bond NIC-down test triggers kernel crash
Summary: SMC-R bond NIC-down test triggers kernel crash
Status: NEW
Alias: None
Product: ANCK 5.10 Dev
Classification: ANCK
Component: net
Version: 5.10.y-19
Hardware: All Linux
Importance: P2-High S1-blocker
Target Milestone: ---
Assignee: XuanZhuo
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-06-04 11:27 UTC by antli1001
Modified: 2025-06-04 11:49 UTC
0 users

See Also:


Attachments

Description antli1001 2025-06-04 11:27:58 UTC
Description of problem:
SMC-R test: with a mode 4 (802.3ad) bond, taking one NIC down and then running a stress test triggers a soft lockup.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
 
      KERNEL: /usr/lib/debug/lib/modules/5.10.134-19.an8.x86_64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 24
        DATE: Fri May 30 16:12:48 CST 2025
      UPTIME: 00:04:41
LOAD AVERAGE: 74.17, 17.67, 5.91
       TASKS: 1028
    NODENAME: rdma-test-001
     RELEASE: 5.10.134-19.an8.x86_64
     VERSION: #1 SMP Wed May 21 14:39:39 CST 2025
     MACHINE: x86_64  (2599 Mhz)
      MEMORY: 128 GB
       PANIC: "Kernel panic - not syncing: softlockup: hung tasks"
         PID: 142
     COMMAND: "kworker/1:1"
        TASK: ffff8d0c06fd8000  [THREAD_INFO: ffff8d0c06fd8000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)
 
crash> bt
PID: 142    TASK: ffff8d0c06fd8000  CPU: 1   COMMAND: "kworker/1:1"
 #0 [ffffb94946504d20] machine_kexec at ffffffff98069c70
 #1 [ffffb94946504d70] __crash_kexec at ffffffff981cd5bd
 #2 [ffffb94946504e38] panic at ffffffff980f4f02
 #3 [ffffb94946504eb8] watchdog_timer_fn at ffffffff98207753
 #4 [ffffb94946504f20] __hrtimer_run_queues at ffffffff981aa25c
 #5 [ffffb94946504f78] hrtimer_interrupt at ffffffff981aaa50
 #6 [ffffb94946504fd8] __sysvec_apic_timer_interrupt at ffffffff980607aa
 #7 [ffffb94946504ff0] asm_call_sysvec_on_stack at ffffffff98c0113f
--- <IRQ stack> ---
 #8 [ffffb9494699bd48] asm_call_sysvec_on_stack at ffffffff98c0113f
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: 0000000000000000  RFLAGS: 00000101
    RAX: ffff8d0c6f828bc8  RBX: 0000000000000000  RCX: ffff8d0c689a0010
    RDX: ffff8d0c689a0000  RSI: ffff8d0c6f828000  RDI: 0000000000000000
    RBP: 0000000000000000   R8: ffffffff98c00d42   R9: 0000000000000000
    R10: 0000000000000000  R11: ffffffff98b54a13  R12: ffffb9494699bd98
    R13: ffffffff980fc189  R14: 0000000000000000  R15: ffffb9494699bd88
    ORIG_RAX: ffff8d0c6f828000  CS: 0000  SS: 0000
bt: WARNING: possibly bogus exception frame
 #9 [ffffb9494699be18] smc_close_active_abort at ffffffffc0cec9c5 [smc]
#10 [ffffb9494699be58] __smc_lgr_terminate at ffffffffc0cdc228 [smc]
#11 [ffffb9494699be98] process_one_work at ffffffff98115213
#12 [ffffb9494699bed8] worker_thread at ffffffff98115420
#13 [ffffb9494699bf10] kthread at ffffffff9811b8c4
#14 [ffffb9494699bf50] ret_from_fork at ffffffff9800502f


[  267.632834] smc_llc_link_active: 103 callbacks suppressed
[  267.632836] smc: SMC-R lg 00850000 link added: id 00008501, peerid 00009b01, ibdev mlx5_bond_0, ibport 1
[  267.632840] smcr_lgr_set_type: 103 callbacks suppressed
[  267.632841] smc: SMC-R lg 00850000 state changed: SINGLE, pnetid
[  267.637917] smc: SMC-R lg 00980000 link added: id 00009801, peerid 00009c01, ibdev mlx5_bond_0, ibport 1
[  267.637921] smc: SMC-R lg 00980000 state changed: SINGLE, pnetid
[  267.642821] smc: SMC-R lg 009d0000 link added: id 00009d01, peerid 00009d01, ibdev mlx5_bond_0, ibport 1
[  267.642824] smc: SMC-R lg 009d0000 state changed: SINGLE, pnetid
[  267.647736] smc: SMC-R lg 00930000 link added: id 00009301, peerid 00009e01, ibdev mlx5_bond_0, ibport 1
[  267.647738] smc: SMC-R lg 00930000 state changed: SINGLE, pnetid
[  267.652607] smc: SMC-R lg 009b0000 link added: id 00009b01, peerid 00009f01, ibdev mlx5_bond_0, ibport 1
[  267.652611] smc: SMC-R lg 009b0000 state changed: SINGLE, pnetid
[  267.657679] smc: SMC-R lg 008e0000 link added: id 00008e01, peerid 0000a001, ibdev mlx5_bond_0, ibport 1
[  267.657682] smc: SMC-R lg 008e0000 state changed: SINGLE, pnetid
[  267.663052] smc: SMC-R lg 008c0000 link added: id 00008c01, peerid 0000a101, ibdev mlx5_bond_0, ibport 1
[  267.663054] smc: SMC-R lg 008c0000 state changed: SINGLE, pnetid
[  267.668030] smc: SMC-R lg 00960000 link added: id 00009601, peerid 0000a201, ibdev mlx5_bond_0, ibport 1
[  267.668032] smc: SMC-R lg 00960000 state changed: SINGLE, pnetid
[  267.672942] smc: SMC-R lg 00970000 link added: id 00009701, peerid 0000a301, ibdev mlx5_bond_0, ibport 1
[  267.672944] smc: SMC-R lg 00970000 state changed: SINGLE, pnetid
[  267.677853] smc: SMC-R lg 00a10000 link added: id 0000a101, peerid 0000a401, ibdev mlx5_bond_0, ibport 1
[  267.677856] smc: SMC-R lg 00a10000 state changed: SINGLE, pnetid
[  282.373627] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:142]
[  282.373711] CPU#1 Utilization every 4s during lockup:
[  282.373785]  #1: 100% system,          0% softirq,     0% hardirq,     0% idle
[  282.373861]  #2: 101% system,          0% softirq,     0% hardirq,     0% idle
[  282.373937]  #3: 100% system,          0% softirq,     0% hardirq,     0% idle
[  282.374014]  #4: 101% system,          0% softirq,     0% hardirq,     0% idle
[  282.374089]  #5: 100% system,          0% softirq,     0% hardirq,     0% idle
[  282.374165] Modules linked in: smc(E) mlx5_ib(E) ib_uverbs(E) ib_core(E) udp_diag(E) tcp_diag(E) inet_diag(E) 8021q(E) garp(E) mrp(E) stp(E) llc(E) rfkill(E) intel_rapl_msr(E) intel_rapl_common(E) sb_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) intel_cstate(E) intel_uncore(E) pcspkr(E) joydev(E) mei_me(E) mei(E) ses(E) i2c_i801(E) ioatdma(E) lpc_ich(E) i2c_smbus(E) enclosure(E) xfs(E) sd_mod(E) sg(E) nvme(E) nvme_core(E) isci(E) t10_pi(E) libcrc32c(E) ahci(E) crc32c_intel(E) libsas(E) libahci(E) scsi_transport_sas(E) libata(E) mlx5_core(E) megaraid_sas(E) igb(E) ixgbe(E) i2c_algo_bit(E) mlxfw(E) mdio(E) pci_hyperv_intf(E) dca(E) i2c_core(E) wmi(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) fuse(E) bonding(E)
[  282.374214] CPU: 1 PID: 142 Comm: kworker/1:1 Kdump: loaded Tainted: G S          E     5.10.134-19.an8.x86_64 #1
[  282.374215] Hardware name: Huawei Technologies Co., Ltd. RH2288A V2/BC11SRSI0, BIOS RMIBV512 08/27/2015
[  282.374223] Workqueue: events smc_lgr_terminate_work [smc]
[  282.374232] RIP: 0010:smc_close_active_abort+0x175/0x360 [smc]
[  282.374234] Code: d3 e2 f7 c2 00 00 00 09 0f 84 ce fe ff ff c6 83 e0 0a 00 00 1a eb 4e 0f b6 43 12 3c 1a 0f 84 48 01 00 00 31 ed f0 80 4b 60 01 <48> 8b 83 d0 02 00 00 48 89 df ff d0 0f 1f 00 40 84 ed 75 07 5b 5d
[  282.374235] RSP: 0018:ffffb9494699be48 EFLAGS: 00000206
[  282.374237] RAX: 0000000000000000 RBX: ffff8d0c6f828000 RCX: 0000000000000007
[  282.374238] RDX: 0000000000000080 RSI: 0000000000000001 RDI: ffff8d0c6f828000
[  282.374239] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000101
[  282.374240] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d0c6f828000
[  282.374240] R13: ffff8d0c689a0000 R14: ffff8d0c689a0010 R15: ffff8d0c6f828bc8
[  282.374242] FS:  0000000000000000(0000) GS:ffff8d1b3f880000(0000) knlGS:0000000000000000
[  282.374243] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  282.374244] CR2: 000055812070b730 CR3: 00000018eba12006 CR4: 00000000001706e0
[  282.374244] Call Trace:
[  282.374247]  <IRQ>
[  282.374253]  ? watchdog_timer_fn+0x324/0x480
[  282.374256]  ? report_softlockup+0x1b0/0x1b0
[  282.374260]  ? __hrtimer_run_queues+0xfc/0x250
[  282.374262]  ? hrtimer_interrupt+0x100/0x240
[  282.374265]  ? __sysvec_apic_timer_interrupt+0x5a/0x100
[  282.374271]  ? asm_call_irq_on_stack+0xf/0x20
[  282.374272]  </IRQ>
[  282.374275]  ? sysvec_apic_timer_interrupt+0x73/0x80
[  282.374277]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[  282.374285]  ? smc_close_active_abort+0x175/0x360 [smc]
[  282.374292]  __smc_lgr_terminate.part.38+0xc8/0x180 [smc]
[  282.374297]  process_one_work+0x1a3/0x380
[  282.374298]  worker_thread+0x30/0x380
[  282.374300]  ? process_one_work+0x380/0x380
[  282.374302]  kthread+0x114/0x130
[  282.374304]  ? __kthread_cancel_work+0x50/0x50
[  282.374309]  ret_from_fork+0x1f/0x30
[  282.374312] Kernel panic - not syncing: softlockup: hung tasks
[  282.374388] CPU: 1 PID: 142 Comm: kworker/1:1 Kdump: loaded Tainted: G S          EL    5.10.134-19.an8.x86_64 #1
[  282.374472] Hardware name: Huawei Technologies Co., Ltd. RH2288A V2/BC11SRSI0, BIOS RMIBV512 08/27/2015
[  282.374561] Workqueue: events smc_lgr_terminate_work [smc]
[  282.374635] Call Trace:
[  282.374703]  <IRQ>
[  282.374774]  dump_stack+0x5c/0x90
[  282.374845]  panic+0x390/0x3a0
[  282.374917]  watchdog_timer_fn+0x353/0x480
[  282.374989]  ? report_softlockup+0x1b0/0x1b0
[  282.375061]  __hrtimer_run_queues+0xfc/0x250
[  282.375152]  hrtimer_interrupt+0x100/0x240
[  282.375242]  __sysvec_apic_timer_interrupt+0x5a/0x100
[  282.375336]  asm_call_irq_on_stack+0xf/0x20
[  282.375425]  </IRQ>
[  282.375512]  sysvec_apic_timer_interrupt+0x73/0x80
[  282.375610]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  282.375710] RIP: 0010:smc_close_active_abort+0x175/0x360 [smc]
[  282.375805] Code: d3 e2 f7 c2 00 00 00 09 0f 84 ce fe ff ff c6 83 e0 0a 00 00 1a eb 4e 0f b6 43 12 3c 1a 0f 84 48 01 00 00 31 ed f0 80 4b 60 01 <48> 8b 83 d0 02 00 00 48 89 df ff d0 0f 1f 00 40 84 ed 75 07 5b 5d
[  282.375963] RSP: 0018:ffffb9494699be48 EFLAGS: 00000206
[  282.376058] RAX: 0000000000000000 RBX: ffff8d0c6f828000 RCX: 0000000000000007
[  282.376155] RDX: 0000000000000080 RSI: 0000000000000001 RDI: ffff8d0c6f828000
[  282.376252] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000101
[  282.376348] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d0c6f828000
[  282.376445] R13: ffff8d0c689a0000 R14: ffff8d0c689a0010 R15: ffff8d0c6f828bc8
[  282.376549]  __smc_lgr_terminate.part.38+0xc8/0x180 [smc]
[  282.376642]  process_one_work+0x1a3/0x380
[  282.376734]  worker_thread+0x30/0x380
[  282.376824]  ? process_one_work+0x380/0x380
[  282.376914]  kthread+0x114/0x130
[  282.377003]  ? __kthread_cancel_work+0x50/0x50
[  282.377095]  ret_from_fork+0x1f/0x30
[  282.669632] Kernel Offset: 0x17000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Version-Release number of selected component (if applicable):

1. Anolis OS release 8.8
2. 5.10.134-19.an8.x86_64 
3. NIC: mlnx-cx6-lx, bond 802.3ad
4. redis-6.0.9


How reproducible:

server:
smc_run ./redis-6.0.9/src/redis-server --protected-mode no

client:
[root@rdma-test-001 ~]# ifdown eth1
WARN      : [ifdown] You are using 'ifdown' script provided by 'network-scripts', which are now deprecated.
WARN      : [ifdown] 'network-scripts' will be removed in one of the next major releases of RHEL.
WARN      : [ifdown] It is advised to switch to 'NetworkManager' instead - it provides 'ifup/ifdown' scripts as well.
Device 'eth1' successfully disconnected.
Then immediately run:
smc_run  redis-benchmark -h 10.199.36.23  -p 6379 -c 300 -n 3000000 -d 30 --threads 300 -r 10 -t GET -k 1
ERROR: failed to fetch CONFIG from 10.199.36.23:6379
WARN: could not fetch server CONFIG
Error: Server closed the connection

The crash occurs with very high probability when cycling the bond members:
ifdown / ifup on eth0 and eth1

Steps to Reproduce:
1. With the NICs bonded, take one of the member NICs down and immediately start the stress test
2. The down/up cycle can be repeated on eth0 and eth1
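
The steps above could be scripted roughly as follows. This is only a sketch, not the reporter's actual test harness: it assumes eth0 and eth1 are the bond members and that the redis server from the report is reachable at 10.199.36.23:6379. The DRY_RUN, ROUNDS and SERVER variables and the function names are hypothetical, added for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Reproduction sketch: take a bond member down, start the benchmark at once,
# then bring the member back up. Run on the client host.
# DRY_RUN, ROUNDS and SERVER are illustrative knobs, not part of the report.

DRY_RUN="${DRY_RUN:-1}"          # 1 = print commands only (default), 0 = execute
ROUNDS="${ROUNDS:-1}"            # number of down/up cycles over both NICs
SERVER="${SERVER:-10.199.36.23}" # redis-server address used in the report

run() {
    # Echo the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

repro_once() {
    # $1 is the bond member to take down.
    run ifdown "$1"
    # Start the stress test immediately after the link goes down.
    run smc_run redis-benchmark -h "$SERVER" -p 6379 -c 300 -n 3000000 \
        -d 30 --threads 300 -r 10 -t GET -k 1
    run ifup "$1"
}

repro_all() {
    round=0
    while [ "$round" -lt "$ROUNDS" ]; do
        for nic in eth0 eth1; do
            repro_once "$nic"
        done
        round=$((round + 1))
    done
}
```

To use it, export DRY_RUN=0 (and optionally ROUNDS) before sourcing the file, then call repro_all. Keeping dry-run as the default avoids taking a NIC down by accident on a shared test machine.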

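Actual results:
Soft lockup on CPU 1 in smc_close_active_abort (kworker running smc_lgr_terminate_work), followed by "Kernel panic - not syncing: softlockup: hung tasks".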


Expected results:
No soft lockup or kernel panic when one bond member is taken down during the SMC-R stress test.


Additional info: