Bug 4604 - kernel crash 2023-03-24
Summary: kernel crash 2023-03-24
Status: RESOLVED FIXED
Alias: None
Product: ANCK 4.19 Dev
Classification: ANCK
Component: bpf
Version: 4.19-026.x
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: dtcccc
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-03-24 14:57 UTC by sunwuhao
Modified: 2023-06-28 13:12 UTC

See Also:


Attachments
vmcore-dmesg.txt (101.18 KB, text/plain)
2023-03-24 14:57 UTC, sunwuhao
kexec-dmesg.log (106.77 KB, text/plain)
2023-03-24 14:58 UTC, sunwuhao
ebpf_test (19.16 MB, application/x-sharedlib)
2023-03-28 14:11 UTC, sunwuhao

Description sunwuhao 2023-03-24 14:57:45 UTC
Created attachment 684
vmcore-dmesg.txt

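Problem description: the system crashes frequently.

OS version: Anolis OS release 8.6

Kernel version: 4.19.91-26.iqiyi.1.git.6dd2a08dda3a.an8.x86_64 (based on 4.19.91-26.an8.x86_64 with features such as group identity enabled; it had been running stably for more than 8 months before this)

kexec-dmesg.log, vmcore, and vmcore-dmesg.txt are provided as attachments.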
Comment 1 sunwuhao 2023-03-24 14:58:44 UTC
Created attachment 685
kexec-dmesg.log
Comment 2 maqiao alibaba_cloud_group 2023-03-24 15:07:10 UTC
This looks related to the function being probed by eBPF. @dtcccc, could you please take a look?
Comment 3 sunwuhao 2023-03-24 15:12:41 UTC
The vmcore file has been uploaded to Baidu Netdisk.

Link: https://pan.baidu.com/s/1mQCNFXOFRO46YiT7bKRbeA
Extraction code: wqwk
Comment 4 sunwuhao 2023-03-24 15:33:26 UTC
I was testing this:

https://github.com/deepflowio/deepflow

deepflow-agent uses eBPF, and that triggered the crash.
Comment 5 dtcccc alibaba_cloud_group 2023-03-27 15:47:03 UTC
(In reply to sunwuhao from comment #4)
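> I was testing this:
> 
> https://github.com/deepflowio/deepflow
> 
> deepflow-agent uses eBPF, and that triggered the crash.

Is there a quick way to reproduce this, or an environment you can provide? I'm struggling just to build this...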
Comment 6 sunwuhao 2023-03-28 14:10:09 UTC
(In reply to dtcccc from comment #5)
> (In reply to sunwuhao from comment #4)
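> > I was testing this:
> > 
> > https://github.com/deepflowio/deepflow
> > 
> > deepflow-agent uses eBPF, and that triggered the crash.
> 
> Is there a quick way to reproduce this, or an environment you can provide? I'm struggling just to build this...

In the real environment, the kernel crashes as soon as a k8s node runs the deepflow DaemonSet agent.
You can use this example to simulate it: it starts an eBPF instance, then reads the relevant I/O data through eBPF hooks and prints it to standard output.
Source code: https://github.com/deepflowio/deepflow/tree/main/agent/src/ebpf/samples/rust
Executable: ebpf_test has been uploaded as an attachment; add execute permission and run it directly.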
Comment 7 sunwuhao 2023-03-28 14:11:13 UTC
Created attachment 688
ebpf_test
Comment 8 dtcccc alibaba_cloud_group 2023-03-28 15:28:52 UTC
(In reply to sunwuhao from comment #7)
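> Created attachment 688
> ebpf_test

It looks like this program uses a for loop, which the 4.19.91-26 kernel rejects:
back-edge from insn 1876 to 1851

I happen to have a 4.19 kernel with bounded loop support backported, and the program runs on it; no problems so far.

Also, in dmesg, the message about the probe read address is only a warning and is unrelated to the actual crash that follows.
The actual crash is caused by NMI re-entry; how it relates to BPF still needs further investigation.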
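
To illustrate the verifier message quoted in comment 8: a kernel without bounded loop support rejects any BPF program whose control-flow graph contains a jump back to an instruction that is still on the current path. Below is a minimal Python sketch of that idea, using a depth-first search over a toy instruction graph. It is not the kernel's verifier code; the `find_back_edge` name, the `succs` mapping, and the two toy programs are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# DFS colours: unvisited, on the current DFS path, fully explored.
WHITE, GREY, BLACK = 0, 1, 2


def find_back_edge(succs: Dict[int, List[int]],
                   entry: int = 0) -> Optional[Tuple[int, int]]:
    """Return the first (src, dst) back-edge found by an iterative DFS, or None."""
    state = {entry: GREY}
    stack = [(entry, iter(succs.get(entry, [])))]
    while stack:
        insn, successors = stack[-1]
        nxt = next(successors, None)
        if nxt is None:
            # All successors explored: this instruction leaves the current path.
            state[insn] = BLACK
            stack.pop()
        elif state.get(nxt, WHITE) == GREY:
            # Jump to an instruction still on the current path: a loop.
            return (insn, nxt)
        elif state.get(nxt, WHITE) == WHITE:
            state[nxt] = GREY
            stack.append((nxt, iter(succs.get(nxt, []))))
    return None


# Toy program (assumed): insn 3 conditionally jumps back to insn 1.
looping = {0: [1], 1: [2], 2: [3], 3: [4, 1], 4: []}
# Toy program (assumed): insn 0 conditionally jumps forward to insn 3.
forward_only = {0: [1, 3], 1: [2], 2: [3], 3: [4], 4: []}

edge = find_back_edge(looping)
if edge is not None:
    print(f"back-edge from insn {edge[0]} to {edge[1]}")
```

On the toy graphs, only `looping` yields a back-edge; `forward_only` passes because a forward jump never revisits an instruction on the current path.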
Comment 9 sunwuhao 2023-03-30 11:12:28 UTC
(In reply to dtcccc from comment #8)
> (In reply to sunwuhao from comment #7)
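> > Created attachment 688
> > ebpf_test
> 
> It looks like this program uses a for loop, which the 4.19.91-26 kernel rejects:
> back-edge from insn 1876 to 1851
> 
> I happen to have a 4.19 kernel with bounded loop support backported, and the program runs on it; no problems so far.
> 
> Also, in dmesg, the message about the probe read address is only a warning and is unrelated to the actual crash that follows.
> The actual crash is caused by NMI re-entry; how it relates to BPF still needs further investigation.

OK, thanks for the effort. The main thing is to find out why it crashed.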
Comment 10 dtcccc alibaba_cloud_group 2023-04-18 11:03:35 UTC
Hello, the NMI re-entry issue has been fixed in version 4.19.91-27.

Related PR: https://gitee.com/anolis/cloud-kernel/commit/97ee1061fc39e25f8ad568fe6fe76bd1bc9ce682
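
A quick way to check whether a given host is at or past the fixed release is to compare its kernel release string against 4.19.91-27. The following Python sketch assumes the release string starts with `<major>.<minor>.<patch>-<release>`, as in the strings quoted in this bug; the function names are hypothetical, and a custom rebuild (such as the iqiyi kernel above) may carry or lack the fix regardless of its number.

```python
import re
from typing import Optional, Tuple

# From comment 10: the NMI re-entry issue is fixed in 4.19.91-27.
FIXED_IN = (4, 19, 91, 27)


def parse_release(release: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch, release) from a kernel release string."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    return tuple(int(g) for g in m.groups()) if m else None


def has_nmi_fix(release: str) -> Optional[bool]:
    """True/False on the 4.19.91 line; None when this sketch cannot tell."""
    parsed = parse_release(release)
    if parsed is None or parsed[:3] != FIXED_IN[:3]:
        return None  # different base version or unparseable string
    return parsed[3] >= FIXED_IN[3]


# The reporter's kernel, as given in the description.
reported = "4.19.91-26.iqiyi.1.git.6dd2a08dda3a.an8.x86_64"
# Assumed example of a release string for the fixed version.
fixed_example = "4.19.91-27.an8.x86_64"
```

On a live system the string to test would typically come from `platform.release()` in the Python standard library.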
Comment 11 dtcccc alibaba_cloud_group 2023-04-18 11:04:49 UTC
(In reply to dtcccc from comment #10)
> Hello, the NMI re-entry issue has been fixed in version 4.19.91-27.
> 
> Related PR: https://gitee.com/anolis/cloud-kernel/commit/97ee1061fc39e25f8ad568fe6fe76bd1bc9ce682

https://gitee.com/anolis/cloud-kernel/pulls/867
Comment 12 sunwuhao 2023-04-19 11:32:51 UTC
Got it, thanks a lot.
Comment 13 dtcccc alibaba_cloud_group 2023-06-28 13:12:39 UTC
Closing this issue.