9006 – After the peer's lgr is destroyed, the local host crashes with low probability.

Summary: After the peer's lgr is destroyed, the local host crashes with low probability.

Status:	NEW

Alias:	None

Product:	ANCK 5.10 Dev
Classification:	ANCK
Component:	net
Sub Component:
Version:	unspecified
Hardware:	All Linux

Importance:	P3-Medium S3-normal
Target Milestone:	---
Assignee:	XuanZhuo
QA Contact:	shuming

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2024-05-11 16:40 UTC by zhan_zhaozeng
Modified:	2024-05-11 16:40 UTC
CC List:	2 users

See Also:

Description zhan_zhaozeng 2024-05-11 16:40:10 UTC

Description of problem:
After the peer's lgr is destroyed, the local host crashes with very low probability. The relevant stack trace is:
[133.963677] Call Trace:
[133.963685] smc_ism_unset_conn+0x24/0x60 [smc]
[133.963702] smc_conn_kill+0x8b/0xf0 [smc]
[133.963716] smc_lgr_terminate.part.36+0x98/0x110 [smc]
[133.963731] process_one_work+0x1a7/0x360
[133.963741] ? create_worker+0x1a0/0x1a0
[133.963749] worker_thread+0x30/0x390
[133.963758] ? create_worker+0x1a0/0x1a0
[133.963766] kthread+0x10a/0x120
[133.963775] ? set_kthread_struct+0x50/0x50
[133.963785] ret_from_fork+0x1f/0x40

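After the peer's lgr is destroyed, the local host triggers smcr_link_down. Inside smcr_link_down, the flow started from smc_switch_conns runs asynchronously with respect to smcr_link_clear. If smcr_link_clear runs before the __smc_lgr_terminate call that smc_switch_conns scheduled asynchronously, the lgr is freed in smcr_link_clear. The later smc_conn_kill(conn, soft) call in __smc_lgr_terminate then operates on freed memory and crashes.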
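
The ordering problem described above can be illustrated with a small user-space model. This is only a sketch and not kernel code: every name in it (race_lgr, race_link_clear, race_terminate_work, race_link_down) is a hypothetical stand-in, and the lgr is reduced to a single "freed" flag so that a use-after-free shows up as a boolean instead of a crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct smc_link_group; it only tracks liveness. */
struct race_lgr {
	bool freed;
};

/* Models smcr_link_clear() in the reported scenario: it ends up freeing the lgr. */
static void race_link_clear(struct race_lgr *lgr)
{
	assert(lgr != NULL);
	lgr->freed = true;
}

/*
 * Models the deferred __smc_lgr_terminate() work item.
 * Returns true if the work would touch an lgr that was already freed.
 */
static bool race_terminate_work(const struct race_lgr *lgr)
{
	assert(lgr != NULL);
	return lgr->freed;
}

/*
 * Models smcr_link_down(): the terminate work and the link cleanup can
 * complete in either order. Returns true if a use-after-free would occur.
 */
static bool race_link_down(bool work_runs_first)
{
	struct race_lgr lgr = { .freed = false };
	bool uaf;

	if (work_runs_first) {
		uaf = race_terminate_work(&lgr);
		race_link_clear(&lgr);
	} else {
		race_link_clear(&lgr);
		uaf = race_terminate_work(&lgr);
	}
	return uaf;
}
```

In this model only the "cleanup first" ordering reports a use-after-free, which corresponds to the rare case in which the crash is observed.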

Version-Release number of selected component (if applicable):


How reproducible:
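The probability of reproducing this under normal conditions is low. It can be raised by adding a delay before the __smc_lgr_terminate call in smc_lgr_terminate_work and then destroying, on the peer side, the lgr associated with this node.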

Steps to Reproduce:

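1. On the local node, add a delay before the __smc_lgr_terminate call in smc_lgr_terminate_work (smc_core.c) to raise the reproduction probability.
2. On the peer side of an established SMC-R connection, destroy the link group.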


Actual results:

smcr_link_clear runs before the __smc_lgr_terminate call that smc_switch_conns scheduled asynchronously, so the lgr is freed in smcr_link_clear. The later smc_conn_kill(conn, soft) call in __smc_lgr_terminate then crashes:
[133.963677] Call Trace:
[133.963685] smc_ism_unset_conn+0x24/0x60 [smc]
[133.963702] smc_conn_kill+0x8b/0xf0 [smc]
[133.963716] smc_lgr_terminate.part.36+0x98/0x110 [smc]
[133.963731] process_one_work+0x1a7/0x360
[133.963741] ? create_worker+0x1a0/0x1a0
[133.963749] worker_thread+0x30/0x390
[133.963758] ? create_worker+0x1a0/0x1a0
[133.963766] kthread+0x10a/0x120
[133.963775] ? set_kthread_struct+0x50/0x50
[133.963785] ret_from_fork+0x1f/0x40

Expected results:

The local host triggers smcr_link_down, which destroys the lgr through the asynchronous flow started in smc_switch_conns and only afterwards cleans up the link through smcr_link_clear.
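
One common way to make the result independent of which step finishes first is to have the scheduled work hold a reference on the lgr until it has run. The sketch below is a user-space model of that idea under stated assumptions, not a proposed patch: all names (ref_lgr, ref_lgr_hold, ref_lgr_put, and so on) are hypothetical, and the report does not say which fix will be chosen.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct smc_link_group with a plain reference count. */
struct ref_lgr {
	int refcnt;
	bool freed;
};

/* Take a reference; only legal while the lgr is still alive. */
static void ref_lgr_hold(struct ref_lgr *lgr)
{
	assert(!lgr->freed);
	lgr->refcnt++;
}

/* Drop a reference; the lgr is "freed" only when the last one goes away. */
static void ref_lgr_put(struct ref_lgr *lgr)
{
	assert(lgr->refcnt > 0);
	if (--lgr->refcnt == 0)
		lgr->freed = true;
}

/* Queueing the terminate work pins the lgr until the work has run. */
static void ref_schedule_terminate(struct ref_lgr *lgr)
{
	ref_lgr_hold(lgr);
}

/*
 * The deferred work: touches the lgr, then drops the reference taken at
 * schedule time. Returns true if the lgr was still alive when touched.
 */
static bool ref_run_terminate(struct ref_lgr *lgr)
{
	bool alive = !lgr->freed;

	ref_lgr_put(lgr);
	return alive;
}

/* Link cleanup drops the link's own reference on the lgr. */
static void ref_link_clear(struct ref_lgr *lgr)
{
	ref_lgr_put(lgr);
}

/* Worst-case ordering from the report: link cleanup runs before the work. */
static bool ref_clear_then_terminate(struct ref_lgr *lgr)
{
	lgr->refcnt = 1;	/* reference owned by the link */
	lgr->freed = false;
	ref_schedule_terminate(lgr);
	ref_link_clear(lgr);
	return ref_run_terminate(lgr);
}
```

With the extra reference, the cleanup step can no longer free the lgr while the terminate work is pending, and the lgr is released only after the work drops the last reference.
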
Additional info: