Bug 4338 - [Anolis 23][community nightly & ANCK-5.10-14-rc1][aarch64] LTP mm module: oom01 runs out of memory and interrupts the system, so the test run cannot continue
Status: RESOLVED WONTFIX
Alias: None
Product: Anolis OS 23
Classification: Anolis OS
Component: BaseOS Modules
Version: 23.0
Hardware: All
OS: Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: yunmeng365524
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-03-03 15:28 UTC by Banana
Modified: 2023-03-13 14:53 UTC
CC List: 4 users

See Also:


Attachments

Description Banana alibaba_cloud_group 2023-03-03 15:28:45 UTC
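[Problem description]: On Anolis 23, when the LTP mm module reaches ksm05, the system runs out of memory and is briefly interrupted (the SSH connection drops, LTP stops running, etc.), and the test run cannot continue.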

messages log:
2023-03-03T11:06:09.471894+08:00 qibo-anck014-an23-g6r-1 systemd-logind[790]: Removed session 9.
2023-03-03T11:06:09.471251+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Consumed 12min 23.395s CPU time.
2023-03-03T11:06:09.469958+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Failed with result 'oom-kill'.
2023-03-03T11:06:09.377890+08:00 qibo-anck014-an23-g6r-1 systemd-logind[790]: Session 9 logged out. Waiting for processes to exit.
2023-03-03T11:06:09.377867+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 1644914 (node) with signal SIGKILL.
2023-03-03T11:06:09.373685+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 2112707 (oom01) with signal SIGKILL.
2023-03-03T11:06:09.373656+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 2111537 (cpuUsage.sh) with signal SIGKILL.
2023-03-03T11:06:09.373635+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 1880907 (sleep) with signal SIGKILL.
2023-03-03T11:06:09.373611+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 1645418 (ltp-pan) with signal SIGKILL.
2023-03-03T11:06:09.373588+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 1645208 (runltp) with signal SIGKILL.
2023-03-03T11:06:09.373564+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 1645186 (run_test.sh) with signal SIGKILL.
2023-03-03T11:06:09.373542+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 1645150 (python) with signal SIGKILL.
2023-03-03T11:06:09.373517+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 1644967 (bash) with signal SIGKILL.
2023-03-03T11:06:09.373494+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 1644901 (node) with signal SIGKILL.
2023-03-03T11:06:09.373402+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 1644689 (sh) with signal SIGKILL.
2023-03-03T11:06:09.373378+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 1644649 (bash) with signal SIGKILL.
2023-03-03T11:06:09.366658+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: Killing process 1644633 (sshd) with signal SIGKILL.
2023-03-03T11:06:09.358146+08:00 qibo-anck014-an23-g6r-1 systemd[1]: session-9.scope: A process of this unit has been killed by the OOM killer.
2023-03-03T11:06:09.356904+08:00 qibo-anck014-an23-g6r-1 kernel: Out of memory: Killed process 2112715 (oom01) total-vm:44100812kB, anon-rss:29070676kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:56996kB oom_score_adj:0
2023-03-03T11:06:09.356889+08:00 qibo-anck014-an23-g6r-1 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-9.scope,task=oom01,pid=2112715,uid=0

[Environment]:
Machine type: ECS

Kernel:
[root@qibo-anck014-an23-g6r-1 2]# uname -r
5.10.134-14_rc1.an23.aarch64

OS:
[root@qibo-anck014-an23-g6r-1 2]# cat /etc/os-release
NAME="Anolis OS"
VERSION="23"
ID="anolis"
VERSION_ID="23"
PLATFORM_ID="platform:an23"
PRETTY_NAME="Anolis OS 23"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
BUG_REPORT_URL="https://bugzilla.openanolis.cn/"

[Reproducibility]: Always

[Steps to reproduce]:
1. Install and set up the LTP environment.
2. Run the LTP mm module test: ./runltp -f mm -s oom01
Comment 1 Banana alibaba_cloud_group 2023-03-03 17:35:53 UTC
ANCK-5.10-14 and other versions all have this problem.
Comment 2 yunmeng365524 2023-03-07 10:17:44 UTC
Code of this test case:
static void verify_oom(void)
{
#if __WORDSIZE == 32
        tst_brk(TCONF, "test is not designed for 32-bit system.");
#endif

        /* we expect mmap to fail before OOM is hit */
        set_sys_tune("overcommit_memory", 2, 1);
        oom(NORMAL, 0, ENOMEM, 0);

        /* with overcommit_memory set to 0 or 1 there's no
         * guarantee that mmap fails before OOM */
        set_sys_tune("overcommit_memory", 0, 1);
        oom(NORMAL, 0, ENOMEM, 1);

        set_sys_tune("overcommit_memory", 1, 1);
        testoom(0, 0, ENOMEM, 1);
}
The intent is to test how the system reacts to OOM with overcommit_memory set to 0, 1, and 2.
The three values mean:
#define OVERCOMMIT_GUESS                0
#define OVERCOMMIT_ALWAYS               1
#define OVERCOMMIT_NEVER                2
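The oom function allocates memory 3 GB at a time until the OOM killer is triggered. When overcommit_memory is set to 1, each 3 GB request is granted even though the remaining memory cannot back it, so processes in the system keep being killed until the 3 GB can be satisfied. The disconnection seen with this case should therefore be expected behavior.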
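I discussed this with 海洋 (Haiyang): after an OOM, the machine "briefly looks as if everything has dropped for a moment, and SSH disconnects for a moment." There are no other anomalies, so this is normal behavior.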