Bug 9006 - After the peer's lgr is destroyed, the local host crashes with low probability.
Summary: After the peer's lgr is destroyed, the local host crashes with low probability.
Status: NEW
Alias: None
Product: ANCK 5.10 Dev
Classification: ANCK
Component: net
Version: unspecified
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: XuanZhuo
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-05-11 16:40 UTC by zhan_zhaozeng
Modified: 2024-05-11 16:40 UTC
CC List: 2 users

See Also:


Description zhan_zhaozeng 2024-05-11 16:40:10 UTC
Description of problem:
After the peer's lgr is destroyed, the local host crashes with very low probability. The relevant stack trace is:
[133.963677] Call Trace:
[133.963685] smc_ism_unset_conn+0x24/0x60 [smc]
[133.963702] smc_conn_kill+0x8b/0xf0 [smc]
[133.963716] __smc_lgr_terminate.part.36+0x98/0x110 [smc]
[133.963731] process_one_work+0x1a7/0x360
[133.963741] ? create_worker+0x1a0/0x1a0
[133.963749] worker_thread+0x30/0x390
[133.963758] ? create_worker+0x1a0/0x1a0
[133.963766] kthread+0x10a/0x120
[133.963775] ? set_kthread_struct+0x50/0x50
[133.963785] ret_from_fork+0x1f/0x40

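After the peer's lgr is destroyed, the local host triggers smcr_link_down. Inside smcr_link_down, the flow started from smc_switch_conns runs asynchronously with respect to smcr_link_clear. If smcr_link_clear runs before the __smc_lgr_terminate that smc_switch_conns scheduled asynchronously, the lgr is freed in smcr_link_clear, and the later call to smc_conn_kill(conn, soft) inside __smc_lgr_terminate crashes.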
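The lifetime problem described above can be modelled in user space. The sketch below is only an illustration under stated assumptions: it is not kernel code and not necessarily the fix adopted for this bug. struct model_lgr and every model_* function are hypothetical names, and the model is single-threaded, so it replays the two orderings deterministically instead of racing real threads. It shows one common technique, reference counting, in which the queued terminate work pins the object so that link cleanup cannot free it early.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a link group (not struct smc_link_group). */
struct model_lgr {
	int refcnt;       /* users that may still dereference the object */
	bool terminated;  /* written by the terminate work */
	bool *freed;      /* external flag so callers can observe the free */
};

static struct model_lgr *model_lgr_new(bool *freed)
{
	struct model_lgr *lgr = calloc(1, sizeof(*lgr));

	if (!lgr)
		return NULL;
	lgr->refcnt = 1;  /* reference owned by the link */
	lgr->freed = freed;
	*freed = false;
	return lgr;
}

static void model_lgr_hold(struct model_lgr *lgr)
{
	lgr->refcnt++;
}

static void model_lgr_put(struct model_lgr *lgr)
{
	if (--lgr->refcnt == 0) {
		*lgr->freed = true;
		free(lgr);
	}
}

/* Queueing the terminate work pins the object until the work has run. */
static void model_schedule_terminate(struct model_lgr *lgr)
{
	model_lgr_hold(lgr);
}

/* Deferred work: touches the object, then drops the work's reference. */
static void model_terminate_work(struct model_lgr *lgr)
{
	lgr->terminated = true;
	model_lgr_put(lgr);
}

/* Link cleanup: drops the link's reference instead of freeing directly. */
static void model_link_clear(struct model_lgr *lgr)
{
	model_lgr_put(lgr);
}

/*
 * Replays one ordering. Returns true if the object was still alive when
 * the terminate work started and was freed once both sides had finished.
 */
static bool model_replay(bool clear_runs_first)
{
	bool freed;
	bool alive_at_work;
	struct model_lgr *lgr = model_lgr_new(&freed);

	if (!lgr)
		return false;
	model_schedule_terminate(lgr);
	if (clear_runs_first) {
		model_link_clear(lgr);
		alive_at_work = !freed;
		model_terminate_work(lgr);
	} else {
		alive_at_work = !freed;
		model_terminate_work(lgr);
		model_link_clear(lgr);
	}
	return alive_at_work && freed;
}
```

Calling model_replay(true) replays the problematic ordering (link cleanup first) and model_replay(false) the expected one. In this model both return true, because the object is released only after the last user has dropped its reference. A real kernel implementation would need an atomic counter such as refcount_t.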

Version-Release number of selected component (if applicable):


How reproducible:
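The issue rarely reproduces under normal conditions. To reproduce it more reliably, add a delay before the __smc_lgr_terminate call in smc_lgr_terminate_work, then destroy the lgr on the peer side that is associated with this node.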

Steps to Reproduce:

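1. On the local node, add a delay before the __smc_lgr_terminate call in smc_lgr_terminate_work in smc_core.c to raise the reproduction probability.
2. On the peer, which already has an SMC-R connection established with the local node, destroy the link group.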
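The instrumentation in step 1 might look like the fragment below. This is a debug-only sketch, not a tested patch: it assumes smc_lgr_terminate_work has the same shape as in the upstream 5.10 tree (a work item embedded in the link group as terminate_work, calling __smc_lgr_terminate(lgr, true)), and the 2000 ms delay is an arbitrary, hypothetical value. The fragment is kernel code and only builds inside the kernel tree.

```c
/* net/smc/smc_core.c, debug-only change for reproduction; do not ship */
#include <linux/delay.h>	/* msleep(); may already be included indirectly */

static void smc_lgr_terminate_work(struct work_struct *work)
{
	/* Assumed to match the upstream 5.10 layout of this function. */
	struct smc_link_group *lgr = container_of(work, struct smc_link_group,
						  terminate_work);

	msleep(2000);	/* hypothetical delay that widens the race window */
	__smc_lgr_terminate(lgr, true);
}
```

Sleeping is acceptable here because work items run in process context. With the delay in place, smcr_link_clear has much more time to finish before __smc_lgr_terminate touches the lgr.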


Actual results:

smcr_link_clear runs before the __smc_lgr_terminate that smc_switch_conns scheduled asynchronously, so the lgr is freed in smcr_link_clear. The later call to smc_conn_kill(conn, soft) inside __smc_lgr_terminate then crashes:
[133.963677] Call Trace:
[133.963685] smc_ism_unset_conn+0x24/0x60 [smc]
[133.963702] smc_conn_kill+0x8b/0xf0 [smc]
[133.963716] __smc_lgr_terminate.part.36+0x98/0x110 [smc]
[133.963731] process_one_work+0x1a7/0x360
[133.963741] ? create_worker+0x1a0/0x1a0
[133.963749] worker_thread+0x30/0x390
[133.963758] ? create_worker+0x1a0/0x1a0
[133.963766] kthread+0x10a/0x120
[133.963775] ? set_kthread_struct+0x50/0x50
[133.963785] ret_from_fork+0x1f/0x40

Expected results:

The local host triggers smcr_link_down. The asynchronous flow started from smc_switch_conns destroys the lgr first, and only then does smcr_link_clear, called from smcr_link_down, clean up the link.
Additional info: