Bug 2511 - [ANCK 4.19] Possible NCQ incompatibility with direct I/O to SSDs on the 4.19 kernel
Summary: [ANCK 4.19] Possible NCQ incompatibility with direct I/O to SSDs on the 4.19 kernel
Status: RESOLVED FIXED
Alias: None
Product: ANCK 4.19 Dev
Classification: ANCK
Component: block/storage
Version: 4.19-026.x
Hardware: aarch64 Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: gumi
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-10-24 14:36 UTC by songkai
Modified: 2022-12-07 09:09 UTC
CC List: 4 users

See Also:


Attachments
dmesg log from just before the system crash (516.36 KB, image/jpeg)
2022-10-24 14:36 UTC, songkai

Description songkai inspur_group 2022-10-24 14:36:44 UTC
Created attachment 430 [details]
dmesg log from just before the system crash

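Problem description:
On ANCK 4.19, one of the LTP test suites is dio (direct I/O). During the run, direct I/O was exercised against the system disk (an SSD). After 30 minutes the screen went black and the system crashed. The serial console output showed that the system disk had hung, together with messages mentioning NCQ and DMA, so the following tests were carried out:
1. Switched to the RHEL-compatible 4.18 kernel: the test ran without problems.
2. Added the kernel parameter libata.force=noncq to disable NCQ on the disk: the test ran without problems.
   This is equivalent to setting the disk's I/O queue depth to 1.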
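
A minimal sketch of how the effective queue depth could be checked after booting with libata.force=noncq. It assumes the disk exposes the usual SCSI/ATA sysfs attribute device/queue_depth under /sys/block; the function names and the device name sda are illustrative assumptions, not taken from this report.

```python
from pathlib import Path


def read_queue_depth(dev: str, sysfs_root: str = "/sys/block") -> int:
    """Return the queue depth the kernel reports for a SCSI/ATA block device.

    `dev` is a device name such as "sda" (assumed; use the actual system disk).
    """
    path = Path(sysfs_root) / dev / "device" / "queue_depth"
    return int(path.read_text().strip())


def queuing_effectively_off(depth: int) -> bool:
    """A depth of 1 means only one command can be outstanding at a time."""
    return depth == 1
```

For example, `queuing_effectively_off(read_queue_depth("sda"))` would be expected to return True on a system where NCQ has been disabled for that disk, and False when the disk is running with a deeper queue.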

System environment:
Kernel:          4.19.91-26
CPU model:       Phytium 2500
SATA controller: ASMedia Technology Inc.
SSD model:       Intel SSDSC2KG48

Steps to reproduce:
1. Run the LTP test: /opt/ltp/runltp -f dio -l /home/dio.log -d /tmp/ -o /home/dio.out.log -t 7d &
2. Observe the machine.

dmesg log:
There are no errors earlier in the log; the messages that appeared just before the hang are in the attachment.
Comment 1 Joseph Qi alibaba_cloud_group 2022-10-24 14:51:51 UTC
liusong, please take a look at this bug, thanks.
Comment 2 落盘 alibaba_cloud_group 2022-10-25 09:47:45 UTC
(In reply to josephqi from comment #1)
> liusong, please take a look at this bug, thanks.

OK, this seems to be related to the specific environment. gumi will take it from here.
Comment 3 gumi alibaba_cloud_group 2022-10-28 11:44:08 UTC
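So far it looks like the INTEL SSDSC2KG48 SSD model is incompatible with NCQ. SSDs do not really need NCQ in the first place; NCQ is an optimization aimed at HDDs. So there may be some problem in this area. For a similar report, see: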
https://bugzilla.kernel.org/show_bug.cgi?id=203475#c15
Comment 4 gumi alibaba_cloud_group 2022-11-30 09:56:01 UTC
[Root cause]
An SMMU bug in the Phytium 2000 and Phytium 2500 chips. For these chips, iommu_passthrough is enabled by default.
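
A minimal sketch for checking whether IOMMU passthrough or the NCQ workaround was requested explicitly at boot, by parsing the kernel command line. It assumes the arm64 boot parameter iommu.passthrough=1 and the libata.force=noncq option mentioned in the description; the helper names are hypothetical. A passthrough default applied inside the kernel for a specific chip would not show up on the command line, so this only detects an explicit request.

```python
def parse_cmdline(text: str) -> dict:
    """Split a kernel command line into a {name: value} mapping.

    Parameters without '=' are stored with an empty string as value.
    Quoted values containing spaces are not handled in this sketch.
    """
    params = {}
    for token in text.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def passthrough_requested(params: dict) -> bool:
    """True if iommu.passthrough=1 was given explicitly on the command line."""
    return params.get("iommu.passthrough") == "1"


def ncq_disabled_by_cmdline(params: dict) -> bool:
    """True if 'noncq' appears among the comma-separated libata.force options."""
    return "noncq" in params.get("libata.force", "").split(",")
```

On a running system the input would typically come from /proc/cmdline, for example `parse_cmdline(open("/proc/cmdline").read())`.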
Comment 5 songkai inspur_group 2022-12-07 09:09:06 UTC
(In reply to gumi from comment #4)
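> [Root cause]
> An SMMU bug in the Phytium 2000 and Phytium 2500 chips. For these chips, iommu_passthrough is enabled by default.

Thanks to gumi for the support. The problem has been resolved.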