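Created attachment 1104
Hot-plug log

Description of problem:
The system is configured with a Broadcom PXE89104 switch. Thirteen U.2 drives and one I350 NIC chip are attached below the switch, and all of them work normally. After one U.2 drive is pulled, hot-plug stops working for the entire system, including the root ports that come directly from the CPU. Once hot-plug has failed, a newly inserted device does not appear in lspci, and dmesg reports "link down" and "card not present". The switch's internal state and link state look correct, and the PCIe register state seen from the CPU also looks correct: the card is present. However, the interrupt that should be reported to the OS is never received by the OS, so no enumeration takes place.

Version-Release number of selected component (if applicable):
OS image: anolis-8.9-x86_64-dvd.iso

How reproducible:
Hot-plug a device below the PCIe switch while the OS is running.

Steps to Reproduce:
1. Power on with a single drive installed in port30.
2. Insert a drive into port0; it is detected normally.
3. Remove the port0 drive; lspci shows +-00.0-[14]--
4. Reinsert the port0 drive; lspci does not change.
5. Insert a drive into port4; lspci still does not change.

Actual results:
After a U.2 drive is hot-plugged under the OS, lspci does not show the device.

Expected results:
After a U.2 drive is hot-plugged under the OS, lspci shows the device normally.

Additional info:
1. If the I350 below the switch is disconnected, U.2 hot-plug works normally.
2. In the same environment, hot-plug works normally on the Kylin (麒麟) and NFSChina (方德) operating systems.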
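To make the lspci comparison in the reproduction steps less error-prone, the set of devices the kernel has enumerated can be captured before and after each hot-plug step and diffed. The following is only a minimal sketch: the function names are hypothetical, and it assumes a Linux host where enumerated devices appear as entries under /sys/bus/pci/devices.

```python
import os


def list_pci_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return the set of PCI addresses (e.g. '0000:14:00.0') the OS has enumerated.

    Returns an empty set if the directory does not exist.
    """
    try:
        return set(os.listdir(sysfs_root))
    except FileNotFoundError:
        return set()


def diff_devices(before, after):
    """Return (added, removed) as sorted lists of PCI addresses."""
    return sorted(after - before), sorted(before - after)


# Usage (run manually around one hot-plug step):
#   before = list_pci_devices()
#   ... insert or remove the drive ...
#   after = list_pci_devices()
#   print(diff_devices(before, after))
```

An empty "added" list after step 4 or step 5 would correspond to the "lspci does not change" symptom reported above.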
The log shows the following errors:

[ 591.037807] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:12:00.0
[ 591.038180] pci 0000:12:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 591.038189] pci 0000:12:00.0: device [1000:c030] error status/mask=00100000/00400000
[ 591.038191] pci 0000:12:00.0: [20] UnsupReq (First)
[ 591.038196] pci 0000:12:00.0: AER: TLP Header: 40001001 0000000f d0c0241c 01000000
[ 591.070670] pci 0000:14:00.0: AER: can't recover (no error_detected callback)
[ 591.070680] pci 0000:15:00.0: AER: can't recover (no error_detected callback)
[ 591.070686] pci 0000:16:00.0: AER: can't recover (no error_detected callback)
[ 591.070691] pci 0000:17:00.0: AER: can't recover (no error_detected callback)
[ 591.070696] pci 0000:18:00.0: AER: can't recover (no error_detected callback)
[ 591.070702] pci 0000:19:00.0: AER: can't recover (no error_detected callback)
[ 591.070706] pci 0000:1a:00.0: AER: can't recover (no error_detected callback)
[ 591.070711] pci 0000:1b:00.0: AER: can't recover (no error_detected callback)
[ 591.070717] pci 0000:1c:00.0: AER: can't recover (no error_detected callback)
[ 591.070722] pci 0000:1d:00.0: AER: can't recover (no error_detected callback)
[ 591.070727] pci 0000:1e:00.0: AER: can't recover (no error_detected callback)
[ 591.070741] pci 0000:1f:00.0: AER: can't recover (no error_detected callback)
[ 591.154022] pci 0000:22:00.0: AER: can't recover (no error_detected callback)

We understand that the drives were removed by surprise (forced) hot-plug, with no prior notification to the OS. From the log, it looks like multiple uncorrectable errors (UCE) that could not be handled occurred while PCI resources were being assigned. We suspect that some hardware register contents were lost after the link went down, so the device could not be recovered when it was reinserted. Is there firmware-level debug available to check whether any hardware registers differ before and after link down? Alternatively, try a non-surprise (notified) removal and compare the results.
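As an aid to reading the AER lines quoted above, the status/mask values and the logged TLP header can be decoded by hand. The sketch below is illustrative only: the helper names are hypothetical, the bit names follow the short strings the Linux kernel prints in dmesg, and the header decoder handles only memory requests with a 3DW header. It assumes the first logged DW carries the Fmt/Type byte in its top bits.

```python
# Bit positions in the AER Uncorrectable Error Status/Mask registers,
# labelled with the short names the Linux kernel prints in dmesg.
AER_UNCOR_BITS = {
    4: "DLP", 5: "SDES", 12: "TLP", 13: "FCP", 14: "CmpltTO",
    15: "CmpltAbrt", 16: "UnxCmplt", 17: "RxOF", 18: "MalfTLP",
    19: "ECRC", 20: "UnsupReq", 21: "ACSViol", 22: "UncorrIntErr",
    23: "BlockedTLP", 24: "AtomicOpBlocked", 25: "TLPBlockedErr",
}


def decode_uncor(value):
    """Return the names of all bits set in an uncorrectable status or mask value."""
    return [AER_UNCOR_BITS.get(bit, f"bit{bit}")
            for bit in range(32) if (value >> bit) & 1]


def decode_mem_tlp_header(dw0, dw1, dw2):
    """Decode a memory request TLP with a 3DW header (32-bit address)."""
    fmt = (dw0 >> 29) & 0x7    # 0b000 = 3DW, no data; 0b010 = 3DW, with data
    typ = (dw0 >> 24) & 0x1F   # 0b00000 = memory request
    if typ != 0 or fmt not in (0b000, 0b010):
        raise ValueError("not a 3DW memory request")
    rid = dw1 >> 16            # Requester ID: bus[15:8], device[7:3], function[2:0]
    return {
        "kind": "MWr" if fmt == 0b010 else "MRd",
        "length_dw": dw0 & 0x3FF,
        "requester": f"{rid >> 8:02x}:{(rid >> 3) & 0x1F:02x}.{rid & 0x7}",
        "first_be": dw1 & 0xF,
        "address": dw2 & ~0x3,  # low two address bits are reserved
    }


print(decode_uncor(0x00100000))   # status value from the log
print(decode_uncor(0x00400000))   # mask value from the log
print(decode_mem_tlp_header(0x40001001, 0x0000000F, 0xD0C0241C))
```

Under these assumptions, the status value decodes to bit 20 (UnsupReq), which matches the "[20] UnsupReq (First)" line, and the header reads as a one-DW memory write from requester 00:00.0 to address 0xd0c0241c. That would be consistent with a host write to an address nothing below 0000:12:00.0 claims any longer, but it is an interpretation of the log, not a confirmed root cause.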