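Bug 8004 - Anolis OS 8.8 [5.10 kernel]: installation fails with an error on a 12th-gen Intel Core CPU
Summary: Anolis OS 8.8 [5.10 kernel]: installation fails with an error on a 12th-gen Intel Core CPU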
Status: IN_PROGRESS
Alias: None
Product: ANCK 5.10 Dev
Classification: ANCK
Component: ARCH(unspecified)
Version: 5.10.y-9
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: kun(llfl)
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-01-22 14:19 UTC by jaingzilin
Modified: 2024-02-02 17:12 UTC

See Also:


Attachments
8.8 5.10 installation error, CPU is 12th-gen Core (952.54 KB, image/jpeg)
2024-01-22 14:22 UTC, jaingzilin
lscpu (854.44 KB, image/jpeg)
2024-01-23 17:22 UTC, kun(llfl)

Description jaingzilin 2024-01-22 14:19:24 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Comment 1 jaingzilin 2024-01-22 14:22:36 UTC
Created attachment 971 [details]
8.8 5.10 installation error, CPU is 12th-gen Core
Comment 2 Guanjun alibaba_cloud_group 2024-01-22 15:48:03 UTC
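What is the exact kernel version? Please give the full 5.10.134-xx string, including the xx part.

Did you hit a NULL pointer dereference or a hung task?
From reading the code, there should not be a problem here.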
Comment 3 Guanjun alibaba_cloud_group 2024-01-22 16:29:19 UTC
Asking our resctrl expert @库恩 to take a look.
Comment 4 kun(llfl) alibaba_cloud_group 2024-01-23 11:22:51 UTC
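Root cause identified by static code analysis: the CPU in question does not support the CQM_LLC feature, so the x86_cache_max_rmid field in cpuinfo is initialized to -1. The resctrl initialization code does not handle this case correctly and passes an invalid argument to kmalloc, which crashes the kernel.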
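The failure mode can be illustrated with a small user-space C sketch. This is not the ANCK code: `struct rmid_entry_model`, `unchecked_count()` and `rmid_table_bytes()` are hypothetical names, and the sketch assumes the table is sized for RMIDs 0 through `max_rmid`. It only shows why a value of -1 has to be rejected before it reaches an allocation size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one per-RMID bookkeeping entry. */
struct rmid_entry_model {
    uint32_t rmid;
    int busy;
};

/* Unchecked conversion: a max_rmid of -1 turns into SIZE_MAX. */
size_t unchecked_count(int max_rmid)
{
    return (size_t)max_rmid;
}

/*
 * Checked sizing: stores the table size in *bytes and returns 0,
 * or returns -1 when the CPU reports no RMIDs (max_rmid < 0) or
 * when the multiplication would overflow.
 */
int rmid_table_bytes(int max_rmid, size_t *bytes)
{
    size_t count;

    if (max_rmid < 0)
        return -1;

    /* Assumption: valid RMIDs are 0..max_rmid, so there are max_rmid + 1. */
    count = (size_t)max_rmid + 1;
    if (count > SIZE_MAX / sizeof(struct rmid_entry_model))
        return -1;

    *bytes = count * sizeof(struct rmid_entry_model);
    return 0;
}
```

A caller that checks the return value of `rmid_table_bytes()` never hands a nonsensical size to its allocator, whereas `unchecked_count(-1)` silently produces the largest possible `size_t`.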
Comment 5 kun(llfl) alibaba_cloud_group 2024-01-23 17:22:24 UTC
Created attachment 974 [details]
lscpu
Comment 6 kun(llfl) alibaba_cloud_group 2024-01-23 17:29:24 UTC
The patch that introduced the problem is 0b3b8363ed1eeaa81fe952af3dc44f7d50fcf093.
That patch changed where dom_data_init() is called in the resctrl_arch_late_init() flow. The original call structure was:
resctrl_arch_late_init()
  -> get_rdt_resources()
    -> get_rdt_alloc_resources()
    -> get_rdt_mon_resources()
      -> dom_data_init()
  -> resctrl_init()
...
The patch changed it to:
resctrl_arch_late_init()
  -> get_rdt_resources()
    -> get_rdt_alloc_resources()
    -> get_rdt_mon_resources()
  -> resctrl_init()
    -> resctrl_mon_resource_init()
      -> dom_data_init()
...

When a machine has the CAT feature but lacks MBM, the new structure wrongly enters the resctrl_mon_resource_init() path, which causes a crash during kernel init.
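One plausible way to guard this path is to skip the monitoring setup whenever the hardware reports no monitoring capability. The sketch below is a user-space model under that assumption, not the kernel source and not necessarily what the eventual fix does: `struct rdt_caps_model`, the `*_model` functions and `dom_data_init_call_count()` are all hypothetical names made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the capabilities discovered at boot. */
struct rdt_caps_model {
    bool alloc_capable; /* allocation features such as CAT */
    bool mon_capable;   /* monitoring features such as CQM/MBM */
    int max_rmid;       /* -1 when monitoring is unsupported */
};

/* Counts calls so the control flow can be observed from outside. */
static int dom_data_init_calls;

/* Models the monitoring data setup; fails if there are no RMIDs. */
static int dom_data_init_model(const struct rdt_caps_model *caps)
{
    dom_data_init_calls++;
    return caps->max_rmid < 0 ? -1 : 0;
}

/* Guarded entry point: does nothing on allocation-only machines. */
int mon_resource_init_model(const struct rdt_caps_model *caps)
{
    if (!caps->mon_capable)
        return 0;
    return dom_data_init_model(caps);
}

/* Models the later init stage that now owns the monitoring setup. */
int resctrl_init_model(const struct rdt_caps_model *caps)
{
    if (!caps->alloc_capable && !caps->mon_capable)
        return -1;
    return mon_resource_init_model(caps);
}

int dom_data_init_call_count(void)
{
    return dom_data_init_calls;
}
```

With the capability check in place, a CAT-only machine completes `resctrl_init_model()` without ever reaching `dom_data_init_model()`, which is the behavior the original call structure provided implicitly.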
Comment 7 小龙 admin 2024-02-02 17:12:03 UTC
The PR Link: https://gitee.com/anolis/cloud-kernel/pulls/2749