Bug 8612 - U.2 devices are lost after PCIe switch hot-plug on Anolis OS 8.9
Summary: U.2 devices are lost after PCIe switch hot-plug on Anolis OS 8.9
Status: NEW
Alias: None
Product: Anolis OS 8
Classification: Anolis OS
Component: kernel - anck-5.10
Version: 8.9
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: maqiao_mq
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-22 17:40 UTC by W521978
Modified: 2024-04-01 19:31 UTC

See Also:


Attachments
Hot-plug log (234.73 KB, application/x-compressed)
2024-03-22 17:40 UTC, W521978

Description W521978 2024-03-22 17:40:13 UTC
Created attachment 1104
Hot-plug log

Description of problem:
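The system is configured with a Broadcom PXE89104 switch. Thirteen U.2 drives and one I350 NIC chip are attached below the switch, and all of them work normally. After one U.2 drive is pulled, hot-plug stops working across the entire system, including on root ports that come directly from the CPU. Once hot-plug has failed, a newly inserted device does not appear in lspci, and dmesg reports "link down" and "card not present". The switch's internal state and link state look correct, and the PCIe register state seen from the CPU side also looks correct: the card is present. However, the interrupt that should reach the OS is never received, so the OS does not enumerate the device.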

Version-Release number of selected component (if applicable):
OS image: anolis-8.9-x86_64-dvd.iso

How reproducible:
Hot-plug a device below the PCIe switch while the OS is running.

Steps to Reproduce:
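1. Power on with a single drive installed in port 30.
2. Insert a drive into port 0; it is detected normally.
3. Remove the port 0 drive; lspci shows +-00.0-[14]--.
4. Reinsert the port 0 drive; lspci shows no change.
5. Insert a drive into port 4; lspci still shows no change.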
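
The steps above depend on comparing lspci output by eye after each insertion or removal. A minimal sketch of automating that comparison follows, assuming the standard Linux sysfs directory /sys/bus/pci/devices is available; the helper names `list_pci_devices` and `diff_devices` are hypothetical and not part of any existing tool.

```python
import os


def list_pci_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return the set of PCI addresses (e.g. '0000:14:00.0') currently enumerated.

    An empty set is returned if the directory does not exist, so the helper
    can also be exercised on a machine without PCI sysfs.
    """
    try:
        return set(os.listdir(sysfs_root))
    except FileNotFoundError:
        return set()


def diff_devices(before, after):
    """Return (added, removed) addresses as two sorted lists."""
    return sorted(after - before), sorted(before - after)
```

Taking one snapshot before step 3 and another after step 4, then passing both to `diff_devices`, should show whether the reinserted drive was enumerated again.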

Actual results:
After a U.2 drive is hot-plugged while the OS is running, lspci does not show the device.

Expected results:
After a U.2 drive is hot-plugged while the OS is running, lspci shows the device normally.

Additional info:
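1. If the I350 below the switch is disconnected, U.2 hot-plug works normally.
2. In the same environment, hot-plug works normally on the Kylin (麒麟) and Fangde (方德) operating systems.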
Comment 1 gumi alibaba_cloud_group 2024-04-01 19:31:15 UTC
The log shows the following abnormal messages:
[  591.037807] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:12:00.0
[  591.038180] pci 0000:12:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[  591.038189] pci 0000:12:00.0:   device [1000:c030] error status/mask=00100000/00400000
[  591.038191] pci 0000:12:00.0:    [20] UnsupReq               (First)
[  591.038196] pci 0000:12:00.0: AER:   TLP Header: 40001001 0000000f d0c0241c 01000000
[  591.070670] pci 0000:14:00.0: AER: can't recover (no error_detected callback)
[  591.070680] pci 0000:15:00.0: AER: can't recover (no error_detected callback)
[  591.070686] pci 0000:16:00.0: AER: can't recover (no error_detected callback)
[  591.070691] pci 0000:17:00.0: AER: can't recover (no error_detected callback)
[  591.070696] pci 0000:18:00.0: AER: can't recover (no error_detected callback)
[  591.070702] pci 0000:19:00.0: AER: can't recover (no error_detected callback)
[  591.070706] pci 0000:1a:00.0: AER: can't recover (no error_detected callback)
[  591.070711] pci 0000:1b:00.0: AER: can't recover (no error_detected callback)
[  591.070717] pci 0000:1c:00.0: AER: can't recover (no error_detected callback)
[  591.070722] pci 0000:1d:00.0: AER: can't recover (no error_detected callback)
[  591.070727] pci 0000:1e:00.0: AER: can't recover (no error_detected callback)
[  591.070741] pci 0000:1f:00.0: AER: can't recover (no error_detected callback)
[  591.154022] pci 0000:22:00.0: AER: can't recover (no error_detected callback)
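
The status/mask value and the TLP header in the log above can be decoded by hand, but a small script makes the reading repeatable. The sketch below assumes the bit layout of the AER Uncorrectable Error Status register and the 3-DW memory request header layout from the PCIe Base Specification, and assumes the kernel prints the header DWORDs in order with header byte 0 in the top byte of the first DWORD. The function names are hypothetical.

```python
import re

# Bit positions in the AER Uncorrectable Error Status / Mask registers.
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_bits(value):
    """Return the names of all known error bits set in value."""
    return [name for bit, name in sorted(UNCOR_BITS.items()) if value & (1 << bit)]


def parse_status_line(line):
    """Decode an 'error status/mask=XXXXXXXX/XXXXXXXX' dmesg line, or return None."""
    m = re.search(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})", line)
    if not m:
        return None
    return {"status": decode_bits(int(m.group(1), 16)),
            "masked": decode_bits(int(m.group(2), 16))}


def format_bdf(requester_id):
    """Format a 16-bit Requester ID as bus:device.function."""
    return f"{requester_id >> 8:02x}:{(requester_id >> 3) & 0x1F:02x}.{requester_id & 0x7}"


def decode_tlp_header(dwords):
    """Decode the logged header DWORDs; only 3-DW memory requests are interpreted."""
    dw0, dw1, dw2 = dwords[0], dwords[1], dwords[2]
    fmt = (dw0 >> 29) & 0x7
    tlp_type = (dw0 >> 24) & 0x1F
    # A Length field of 0 encodes 1024 DW.
    info = {"fmt": fmt, "type": tlp_type, "length_dw": (dw0 & 0x3FF) or 1024}
    if tlp_type == 0 and fmt in (0b000, 0b010):
        info["kind"] = "MWr32" if fmt == 0b010 else "MRd32"
        info["requester"] = format_bdf(dw1 >> 16)
        info["tag"] = (dw1 >> 8) & 0xFF
        info["first_be"] = dw1 & 0xF
        info["address"] = dw2 & 0xFFFFFFFC  # low two bits are not address bits
    return info
```

Under this reading, status 00100000 is Unsupported Request (bit 20) and mask 00400000 masks Uncorrectable Internal Error (bit 22). The header 40001001 0000000f d0c0241c decodes as a 1-DW 32-bit memory write to address 0xd0c0241c with Requester ID 00:00.0.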

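We understand that the operation was performed as a surprise (unnotified) hot-plug. Judging from the log, multiple uncorrectable errors (UCEs) that could not be handled occurred while PCI resources were being allocated. We suspect that some hardware register contents were lost after the link went down, so the device cannot be restored successfully when it is reinserted. We suggest checking whether firmware-level debugging can show if any hardware register values differ before and after the link down. Alternatively, try a non-surprise (notified) removal and compare the results.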
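
Both suggestions can be approximated from the OS side. This is a minimal sketch, assuming the standard Linux sysfs files `config`, `remove` and `rescan` under /sys/bus/pci; it does not replace firmware-level debugging. The helper names are hypothetical, and whether a sysfs `remove` matches the notified operation meant here is an assumption. Root privileges are normally needed to read the full config space and to write to `remove` or `rescan`.

```python
import os

SYSFS_PCI = "/sys/bus/pci"


def read_config(bdf, sysfs_pci=SYSFS_PCI):
    """Return the raw config space bytes of a device such as '0000:12:00.0'."""
    with open(os.path.join(sysfs_pci, "devices", bdf, "config"), "rb") as f:
        return f.read()


def diff_config(before, after):
    """Return (offset, old_byte, new_byte) for each byte that differs.

    Only the overlapping range of the two snapshots is compared.
    """
    return [(i, a, b) for i, (a, b) in enumerate(zip(before, after)) if a != b]


def remove_device(bdf, sysfs_pci=SYSFS_PCI):
    """Ask the kernel to detach a device before it is physically pulled.

    This removes the device from the kernel's view; it does not power off the slot.
    """
    with open(os.path.join(sysfs_pci, "devices", bdf, "remove"), "w") as f:
        f.write("1")


def rescan_bus(sysfs_pci=SYSFS_PCI):
    """Ask the kernel to rescan all PCI buses after a device is reinserted."""
    with open(os.path.join(sysfs_pci, "rescan"), "w") as f:
        f.write("1")
```

One possible comparison is to call `read_config` on a port such as 0000:12:00.0 (the device named in the AER report) before pulling the drive and again after reinserting it, then pass both snapshots to `diff_config`. For the notified case, calling `remove_device` with the drive's address before pulling it and `rescan_bus` after reseating it gives a second run to compare against the surprise removal.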