Description of problem:
During SMC-R testing on a mode 4 (802.3ad) bond, taking one slave NIC down and then running a stress test triggers a soft lockup.

      KERNEL: /usr/lib/debug/lib/modules/5.10.134-19.an8.x86_64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 24
        DATE: Fri May 30 16:12:48 CST 2025
      UPTIME: 00:04:41
LOAD AVERAGE: 74.17, 17.67, 5.91
       TASKS: 1028
    NODENAME: rdma-test-001
     RELEASE: 5.10.134-19.an8.x86_64
     VERSION: #1 SMP Wed May 21 14:39:39 CST 2025
     MACHINE: x86_64  (2599 Mhz)
      MEMORY: 128 GB
       PANIC: "Kernel panic - not syncing: softlockup: hung tasks"
         PID: 142
     COMMAND: "kworker/1:1"
        TASK: ffff8d0c06fd8000  [THREAD_INFO: ffff8d0c06fd8000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 142  TASK: ffff8d0c06fd8000  CPU: 1  COMMAND: "kworker/1:1"
 #0 [ffffb94946504d20] machine_kexec at ffffffff98069c70
 #1 [ffffb94946504d70] __crash_kexec at ffffffff981cd5bd
 #2 [ffffb94946504e38] panic at ffffffff980f4f02
 #3 [ffffb94946504eb8] watchdog_timer_fn at ffffffff98207753
 #4 [ffffb94946504f20] __hrtimer_run_queues at ffffffff981aa25c
 #5 [ffffb94946504f78] hrtimer_interrupt at ffffffff981aaa50
 #6 [ffffb94946504fd8] __sysvec_apic_timer_interrupt at ffffffff980607aa
 #7 [ffffb94946504ff0] asm_call_sysvec_on_stack at ffffffff98c0113f
--- <IRQ stack> ---
 #8 [ffffb9494699bd48] asm_call_sysvec_on_stack at ffffffff98c0113f
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: 0000000000000000  RFLAGS: 00000101
    RAX: ffff8d0c6f828bc8  RBX: 0000000000000000  RCX: ffff8d0c689a0010
    RDX: ffff8d0c689a0000  RSI: ffff8d0c6f828000  RDI: 0000000000000000
    RBP: 0000000000000000   R8: ffffffff98c00d42   R9: 0000000000000000
    R10: 0000000000000000  R11: ffffffff98b54a13  R12: ffffb9494699bd98
    R13: ffffffff980fc189  R14: 0000000000000000  R15: ffffb9494699bd88
    ORIG_RAX: ffff8d0c6f828000  CS: 0000  SS: 0000
bt: WARNING: possibly bogus exception frame
 #9 [ffffb9494699be18] smc_close_active_abort at ffffffffc0cec9c5 [smc]
#10 [ffffb9494699be58] __smc_lgr_terminate at ffffffffc0cdc228 [smc]
#11 [ffffb9494699be98] process_one_work at ffffffff98115213
#12 [ffffb9494699bed8] worker_thread at ffffffff98115420
#13 [ffffb9494699bf10] kthread at ffffffff9811b8c4
#14 [ffffb9494699bf50] ret_from_fork at ffffffff9800502f

[  267.632834] smc_llc_link_active: 103 callbacks suppressed
[  267.632836] smc: SMC-R lg 00850000 link added: id 00008501, peerid 00009b01, ibdev mlx5_bond_0, ibport 1
[  267.632840] smcr_lgr_set_type: 103 callbacks suppressed
[  267.632841] smc: SMC-R lg 00850000 state changed: SINGLE, pnetid
[  267.637917] smc: SMC-R lg 00980000 link added: id 00009801, peerid 00009c01, ibdev mlx5_bond_0, ibport 1
[  267.637921] smc: SMC-R lg 00980000 state changed: SINGLE, pnetid
[  267.642821] smc: SMC-R lg 009d0000 link added: id 00009d01, peerid 00009d01, ibdev mlx5_bond_0, ibport 1
[  267.642824] smc: SMC-R lg 009d0000 state changed: SINGLE, pnetid
[  267.647736] smc: SMC-R lg 00930000 link added: id 00009301, peerid 00009e01, ibdev mlx5_bond_0, ibport 1
[  267.647738] smc: SMC-R lg 00930000 state changed: SINGLE, pnetid
[  267.652607] smc: SMC-R lg 009b0000 link added: id 00009b01, peerid 00009f01, ibdev mlx5_bond_0, ibport 1
[  267.652611] smc: SMC-R lg 009b0000 state changed: SINGLE, pnetid
[  267.657679] smc: SMC-R lg 008e0000 link added: id 00008e01, peerid 0000a001, ibdev mlx5_bond_0, ibport 1
[  267.657682] smc: SMC-R lg 008e0000 state changed: SINGLE, pnetid
[  267.663052] smc: SMC-R lg 008c0000 link added: id 00008c01, peerid 0000a101, ibdev mlx5_bond_0, ibport 1
[  267.663054] smc: SMC-R lg 008c0000 state changed: SINGLE, pnetid
[  267.668030] smc: SMC-R lg 00960000 link added: id 00009601, peerid 0000a201, ibdev mlx5_bond_0, ibport 1
[  267.668032] smc: SMC-R lg 00960000 state changed: SINGLE, pnetid
[  267.672942] smc: SMC-R lg 00970000 link added: id 00009701, peerid 0000a301, ibdev mlx5_bond_0, ibport 1
[  267.672944] smc: SMC-R lg 00970000 state changed: SINGLE, pnetid
[  267.677853] smc: SMC-R lg 00a10000 link added: id 0000a101, peerid 0000a401, ibdev mlx5_bond_0, ibport 1
[  267.677856] smc: SMC-R lg 00a10000 state changed: SINGLE, pnetid
[  282.373627] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:142]
[  282.373711] CPU#1 Utilization every 4s during lockup:
[  282.373785] #1: 100% system, 0% softirq, 0% hardirq, 0% idle
[  282.373861] #2: 101% system, 0% softirq, 0% hardirq, 0% idle
[  282.373937] #3: 100% system, 0% softirq, 0% hardirq, 0% idle
[  282.374014] #4: 101% system, 0% softirq, 0% hardirq, 0% idle
[  282.374089] #5: 100% system, 0% softirq, 0% hardirq, 0% idle
[  282.374165] Modules linked in: smc(E) mlx5_ib(E) ib_uverbs(E) ib_core(E) udp_diag(E) tcp_diag(E) inet_diag(E) 8021q(E) garp(E) mrp(E) stp(E) llc(E) rfkill(E) intel_rapl_msr(E) intel_rapl_common(E) sb_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) rapl(E) intel_cstate(E) intel_uncore(E) pcspkr(E) joydev(E) mei_me(E) mei(E) ses(E) i2c_i801(E) ioatdma(E) lpc_ich(E) i2c_smbus(E) enclosure(E) xfs(E) sd_mod(E) sg(E) nvme(E) nvme_core(E) isci(E) t10_pi(E) libcrc32c(E) ahci(E) crc32c_intel(E) libsas(E) libahci(E) scsi_transport_sas(E) libata(E) mlx5_core(E) megaraid_sas(E) igb(E) ixgbe(E) i2c_algo_bit(E) mlxfw(E) mdio(E) pci_hyperv_intf(E) dca(E) i2c_core(E) wmi(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) fuse(E) bonding(E)
[  282.374214] CPU: 1 PID: 142 Comm: kworker/1:1 Kdump: loaded Tainted: G S E 5.10.134-19.an8.x86_64 #1
[  282.374215] Hardware name: Huawei Technologies Co., Ltd. RH2288A V2/BC11SRSI0, BIOS RMIBV512 08/27/2015
[  282.374223] Workqueue: events smc_lgr_terminate_work [smc]
[  282.374232] RIP: 0010:smc_close_active_abort+0x175/0x360 [smc]
[  282.374234] Code: d3 e2 f7 c2 00 00 00 09 0f 84 ce fe ff ff c6 83 e0 0a 00 00 1a eb 4e 0f b6 43 12 3c 1a 0f 84 48 01 00 00 31 ed f0 80 4b 60 01 <48> 8b 83 d0 02 00 00 48 89 df ff d0 0f 1f 00 40 84 ed 75 07 5b 5d
[  282.374235] RSP: 0018:ffffb9494699be48 EFLAGS: 00000206
[  282.374237] RAX: 0000000000000000 RBX: ffff8d0c6f828000 RCX: 0000000000000007
[  282.374238] RDX: 0000000000000080 RSI: 0000000000000001 RDI: ffff8d0c6f828000
[  282.374239] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000101
[  282.374240] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d0c6f828000
[  282.374240] R13: ffff8d0c689a0000 R14: ffff8d0c689a0010 R15: ffff8d0c6f828bc8
[  282.374242] FS: 0000000000000000(0000) GS:ffff8d1b3f880000(0000) knlGS:0000000000000000
[  282.374243] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  282.374244] CR2: 000055812070b730 CR3: 00000018eba12006 CR4: 00000000001706e0
[  282.374244] Call Trace:
[  282.374247] <IRQ>
[  282.374253] ? watchdog_timer_fn+0x324/0x480
[  282.374256] ? report_softlockup+0x1b0/0x1b0
[  282.374260] ? __hrtimer_run_queues+0xfc/0x250
[  282.374262] ? hrtimer_interrupt+0x100/0x240
[  282.374265] ? __sysvec_apic_timer_interrupt+0x5a/0x100
[  282.374271] ? asm_call_irq_on_stack+0xf/0x20
[  282.374272] </IRQ>
[  282.374275] ? sysvec_apic_timer_interrupt+0x73/0x80
[  282.374277] ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[  282.374285] ? smc_close_active_abort+0x175/0x360 [smc]
[  282.374292] __smc_lgr_terminate.part.38+0xc8/0x180 [smc]
[  282.374297] process_one_work+0x1a3/0x380
[  282.374298] worker_thread+0x30/0x380
[  282.374300] ? process_one_work+0x380/0x380
[  282.374302] kthread+0x114/0x130
[  282.374304] ? __kthread_cancel_work+0x50/0x50
[  282.374309] ret_from_fork+0x1f/0x30
[  282.374312] Kernel panic - not syncing: softlockup: hung tasks
[  282.374388] CPU: 1 PID: 142 Comm: kworker/1:1 Kdump: loaded Tainted: G S EL 5.10.134-19.an8.x86_64 #1
[  282.374472] Hardware name: Huawei Technologies Co., Ltd. RH2288A V2/BC11SRSI0, BIOS RMIBV512 08/27/2015
[  282.374561] Workqueue: events smc_lgr_terminate_work [smc]
[  282.374635] Call Trace:
[  282.374703] <IRQ>
[  282.374774] dump_stack+0x5c/0x90
[  282.374845] panic+0x390/0x3a0
[  282.374917] watchdog_timer_fn+0x353/0x480
[  282.374989] ? report_softlockup+0x1b0/0x1b0
[  282.375061] __hrtimer_run_queues+0xfc/0x250
[  282.375152] hrtimer_interrupt+0x100/0x240
[  282.375242] __sysvec_apic_timer_interrupt+0x5a/0x100
[  282.375336] asm_call_irq_on_stack+0xf/0x20
[  282.375425] </IRQ>
[  282.375512] sysvec_apic_timer_interrupt+0x73/0x80
[  282.375610] asm_sysvec_apic_timer_interrupt+0x12/0x20
[  282.375710] RIP: 0010:smc_close_active_abort+0x175/0x360 [smc]
[  282.375805] Code: d3 e2 f7 c2 00 00 00 09 0f 84 ce fe ff ff c6 83 e0 0a 00 00 1a eb 4e 0f b6 43 12 3c 1a 0f 84 48 01 00 00 31 ed f0 80 4b 60 01 <48> 8b 83 d0 02 00 00 48 89 df ff d0 0f 1f 00 40 84 ed 75 07 5b 5d
[  282.375963] RSP: 0018:ffffb9494699be48 EFLAGS: 00000206
[  282.376058] RAX: 0000000000000000 RBX: ffff8d0c6f828000 RCX: 0000000000000007
[  282.376155] RDX: 0000000000000080 RSI: 0000000000000001 RDI: ffff8d0c6f828000
[  282.376252] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000101
[  282.376348] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d0c6f828000
[  282.376445] R13: ffff8d0c689a0000 R14: ffff8d0c689a0010 R15: ffff8d0c6f828bc8
[  282.376549] __smc_lgr_terminate.part.38+0xc8/0x180 [smc]
[  282.376642] process_one_work+0x1a3/0x380
[  282.376734] worker_thread+0x30/0x380
[  282.376824] ? process_one_work+0x380/0x380
[  282.376914] kthread+0x114/0x130
[  282.377003] ? __kthread_cancel_work+0x50/0x50
[  282.377095] ret_from_fork+0x1f/0x30
[  282.669632] Kernel Offset: 0x17000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Version-Release number of selected component (if applicable):
1. Anolis OS release 8.8
2. 5.10.134-19.an8.x86_64
3. NIC: mlnx-cx6-lx, bond 802.3ad
4. redis-6.0.9

How reproducible:
server:
smc_run ./redis-6.0.9/src/redis-server --protected-mode no

client:
[root@rdma-test-001 ~]# ifdown eth1
WARN : [ifdown] You are using 'ifdown' script provided by 'network-scripts', which are now deprecated.
WARN : [ifdown] 'network-scripts' will be removed in one of the next major releases of RHEL.
WARN : [ifdown] It is advised to switch to 'NetworkManager' instead - it provides 'ifup/ifdown' scripts as well.
Device 'eth1' successfully disconnected.

Then immediately run:
smc_run redis-benchmark -h 10.199.36.23 -p 6379 -c 300 -n 3000000 -d 30 --threads 300 -r 10 -t GET -k 1
ERROR: failed to fetch CONFIG from 10.199.36.23:6379
WARN: could not fetch server CONFIG
Error: Server closed the connection

The problem reproduces with very high probability when running ifdown/ifup on eth0 and eth1.

Steps to Reproduce:
1. With the NICs bonded, take one of the slave NICs down and immediately start the stress test.
2. The test can be repeated by taking eth0 and eth1 down and bringing them back up.

Actual results:

Expected results:

Additional info:
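
The reproduction steps above could be scripted on the client roughly as follows. This is only a sketch, not the script used for the original test: the helper names (run, repro_once, repro_all) and the DRY_RUN, SERVER_IP and SLAVES variables are hypothetical, and bringing the slave back up right after the benchmark is an assumption about how the down/up cycle was repeated. The commands themselves are the ones quoted in this report. By default the script only prints what it would run.

```shell
#!/bin/sh
# Hypothetical reproduction helper for the soft lockup described above.
# DRY_RUN=1 (the default) prints the commands instead of executing them.

SERVER_IP="${SERVER_IP:-10.199.36.23}"   # redis-server address from the report
SLAVES="${SLAVES:-eth0 eth1}"            # bond slaves named in the report
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Take one slave down, start the load immediately, then restore the slave.
# Restoring the slave here is an assumption; the report only says that the
# down/up cycle can be repeated on eth0 and eth1.
repro_once() {
    run ifdown "$1"
    run smc_run redis-benchmark -h "$SERVER_IP" -p 6379 -c 300 -n 3000000 \
        -d 30 --threads 300 -r 10 -t GET -k 1
    run ifup "$1"
}

# Cycle through every slave once.
repro_all() {
    for nic in $SLAVES; do
        repro_once "$nic"
    done
}

repro_all
```

To run the commands for real, set DRY_RUN=0 on the client after starting the server side with smc_run as shown in the report.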