Bug 8822 - [Anolis23.1 GA][Beta][ANCK-6.6.25-2][x86_64/aarch64] kernel-selftests test cgroup.test_memcontrol fails
Summary: [Anolis23.1 GA][Beta][ANCK-6.6.25-2][x86_64/aarch64] kernel-selftests test cgroup...
Status: CLOSED BYDESIGN
Alias: None
Product: ANCK 6.6 Dev
Classification: ANCK
Component: generic
Version: 6.6.25-2
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: xuyu
QA Contact:
URL:
Whiteboard:
Keywords: Function
Depends on:
Blocks:
 
Reported: 2024-04-22 17:49 UTC by anolislw
Modified: 2024-05-21 16:52 UTC

See Also:


Description anolislw alibaba_cloud_group 2024-04-22 17:49:26 UTC
[Defect description]:
The kernel-selftests case cgroup.test_memcontrol fails: subtest test_memcg_sock reports "not ok".


[Reproduction rate]:
Always reproducible

[Steps to reproduce]
1. Download kernel-6.6.25-2_rc1.an23.src.rpm
2. rpm -i kernel-6.6.25-2_rc1.an23.src.rpm
3. yum-builddep -y /root/rpmbuild/SPECS/kernel.spec   
   rpmbuild -bp /root/rpmbuild/SPECS/kernel.spec
   cd /root/rpmbuild/BUILD/kernel-6.6.25-2_rc1.an23/linux-6.6.25-2_rc1.an23.aarch64/tools/testing/selftests/cgroup
4. make;./test_memcontrol

[Expected result]:
The test case passes.


[Actual result]:
[root@iZbp143ti4ccpaufkzata6Z cgroup]# ./test_memcontrol
ok 1 test_memcg_subtree_control
ok 2 test_memcg_current
ok 3 test_memcg_min
ok 4 test_memcg_low
ok 5 test_memcg_high
ok 6 test_memcg_high_sync
ok 7 test_memcg_max
ok 8 test_memcg_reclaim
ok 9 test_memcg_oom_events
ok 10 # SKIP test_memcg_swap_max
not ok 11 test_memcg_sock
ok 12 test_memcg_oom_group_leaf_events
ok 13 test_memcg_oom_group_parent_events
ok 14 test_memcg_oom_group_score_events



[Reproduction environment]:
Environment: cloud ECS instance

[root@iZbp143ti4ccpaufkzata6Z breakpoints]# uname -ra
Linux iZbp143ti4ccpaufkzata6Z 6.6.25-2_rc1.an23.aarch64 #1 SMP PREEMPT_DYNAMIC Thu Apr 11 15:02:38 CST 2024 aarch64 aarch64 aarch64 GNU/Linux
[root@iZbp143ti4ccpaufkzata6Z breakpoints]#
[root@iZbp143ti4ccpaufkzata6Z breakpoints]# cat /etc/os-release
NAME="Anolis OS"
VERSION="23"
ID="anolis"
VERSION_ID="23"
PLATFORM_ID="platform:an23"
PRETTY_NAME="Anolis OS 23"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
BUG_REPORT_URL="https://bugzilla.openanolis.cn/"

[root@iZbp143ti4ccpaufkzata6Z breakpoints]#
[root@iZbp143ti4ccpaufkzata6Z breakpoints]#
[root@iZbp143ti4ccpaufkzata6Z breakpoints]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs           6.1G  804K  6.1G   1% /run
efivarfs        256K   18K  239K   7% /sys/firmware/efi/efivars
/dev/nvme0n1p2   40G   13G   27G  33% /
tmpfs            16G  3.1M   16G   1% /tmp
/dev/nvme0n1p1  500M  6.5M  494M   2% /boot/efi
tmpfs           3.1G  4.0K  3.1G   1% /run/user/0
[root@iZbp143ti4ccpaufkzata6Z breakpoints]#
[root@iZbp143ti4ccpaufkzata6Z breakpoints]# free -g
               total        used        free      shared  buff/cache   available
Mem:              30           0          28           0           1          29
Swap:              0           0           0
[root@iZbp143ti4ccpaufkzata6Z breakpoints]#
[root@iZbp143ti4ccpaufkzata6Z breakpoints]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-6.6.25-2_rc1.an23.aarch64 root=UUID=6424d533-3c41-4ad9-89fa-1d3bf8c49fd3 ro rhgb crashkernel=0M-2G:0M,2G-64G:256M,64G-:384M iommu.passthrough=1 iommu.strict=0 cryptomgr.notests cgroup.memory=nokmem rcupdate.rcu_cpu_stall_timeout=300 quiet selinux=1 console=tty0 biosdevname=0 net.ifnames=0 console=ttyAMA0,115200n8 noibrs nvme_core.io_timeout=4294967295 nvme_core.admin_timeout=4294967295
[root@iZbp143ti4ccpaufkzata6Z breakpoints]#
[root@iZbp143ti4ccpaufkzata6Z breakpoints]# lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                ARM
  BIOS Vendor ID:         Alibaba Cloud
  Model name:             Neoverse-N2
    BIOS Model name:      virt-rhel7.6.0  CPU @ 2.0GHz
    BIOS CPU family:      1
    Model:                0
    Thread(s) per core:   1
    Core(s) per socket:   8
    Socket(s):            1
    Stepping:             r0p0
    BogoMIPS:             100.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm
                          3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesh
                          a3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh
Caches (sum of all):
  L1d:                    512 KiB (8 instances)
  L1i:                    512 KiB (8 instances)
  L2:                     8 MiB (8 instances)
  L3:                     64 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-7
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Mitigation; CSV2, BHB
  Srbds:                  Not affected
  Tsx async abort:        Not affected
Comment 1 anolislw alibaba_cloud_group 2024-04-22 17:55:20 UTC
Repeated runs show that this case fails intermittently: on x86_64, all 20 consecutive runs passed; on aarch64, only 1 of 20 consecutive runs passed.
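The repeated-run check described in this comment can be sketched as a small Python helper. This is only an illustration, not the script the reporter actually used: the names tap_failed and pass_count are hypothetical, and it assumes the binary prints TAP-style "ok" / "not ok" lines like the output in the description and exits nonzero on failure.

```python
import subprocess


def tap_failed(tap_output: str) -> bool:
    """Return True if any TAP result line starts with 'not ok'."""
    return any(line.startswith("not ok") for line in tap_output.splitlines())


def pass_count(cmd: list[str], runs: int = 20) -> int:
    """Run cmd `runs` times and count the runs that look like a pass.

    A run counts as a pass when the exit status is 0 and the captured
    output contains no 'not ok' line (both conditions are assumptions
    about how the test binary reports failure).
    """
    passed = 0
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0 and not tap_failed(result.stdout):
            passed += 1
    return passed
```

From the selftests/cgroup build directory, pass_count(["./test_memcontrol"], runs=20) would return the number of passing runs out of 20.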
Comment 2 yunmeng365524 2024-05-07 20:15:45 UTC
The failure also occurs on 5.10. For details, see Bugzilla: https://bugzilla.openanolis.cn/show_bug.cgi?id=4519

Could the developers please confirm why x86 and arm behave differently?
Comment 3 escape alibaba_cloud_group 2024-05-08 15:22:56 UTC
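test_memcg_sock fails because, after the client is closed, the sock value read from memory.stat is not 0. It is nonzero because this statistic is built on the rstat framework, which keeps per-CPU caches, and the caches are not flushed on every read (they are flushed periodically, or when a threshold is exceeded). In other words, even after the sock has been released, the statistic can still read as nonzero.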

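You can run the test case several times and treat a single pass as acceptable; alternatively, have the test case sleep briefly before reading, so that the periodic work flushes the cache first.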
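The second suggestion, waiting for the periodic flush before reading, might look roughly like the following. This is a Python sketch rather than a patch to the C selftest: parse_sock_bytes and wait_for_sock_zero are hypothetical helpers, and the timeout and interval defaults are arbitrary assumptions, not values taken from the kernel.

```python
import time
from typing import Callable


def parse_sock_bytes(memory_stat: str) -> int:
    """Extract the 'sock' counter from memory.stat-style 'key value' text."""
    for line in memory_stat.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "sock":
            return int(fields[1])
    raise KeyError("no 'sock' entry found")


def wait_for_sock_zero(read_stat: Callable[[], str],
                       timeout: float = 5.0,
                       interval: float = 0.5) -> bool:
    """Poll until 'sock' reads 0 or the timeout expires.

    read_stat is any callable returning the current memory.stat text,
    for example one that reads the test cgroup's memory.stat file
    (the path is left to the caller).
    """
    deadline = time.monotonic() + timeout
    while True:
        if parse_sock_bytes(read_stat()) == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Polling with a deadline avoids hard-coding a single sleep length, which would otherwise have to be guessed for each machine type.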
Comment 4 escape alibaba_cloud_group 2024-05-08 15:24:54 UTC
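For this statistic, the x86 and arm implementations do not really differ. However, the cache flush threshold depends on the number of NUMA nodes and CPUs, which vary between machine types. That is why some x86 and arm machines pass easily while others pass only some of the time.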
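To illustrate why identical code can read differently across machines, here is a toy Python model of a counter whose per-CPU pending deltas are folded into the readable total only once enough updates accumulate. It is not the kernel's rstat code: the class name, the batch parameter, and the rule threshold = batch * ncpus are simplifying assumptions made for illustration.

```python
class ToyPerCpuCounter:
    """Toy model: per-CPU deltas folded into a shared total on flush."""

    def __init__(self, ncpus: int, batch: int = 4):
        self.pending = [0] * ncpus      # unflushed per-CPU deltas
        self.total = 0                  # value a reader sees
        self.updates = 0                # updates since the last flush
        self.threshold = batch * ncpus  # assumed: scales with CPU count

    def add(self, cpu: int, delta: int) -> None:
        self.pending[cpu] += delta
        self.updates += 1
        if self.updates > self.threshold:
            self.flush()

    def flush(self) -> None:
        """Stands in for the periodic or threshold-triggered flush."""
        self.total += sum(self.pending)
        self.pending = [0] * len(self.pending)
        self.updates = 0

    def read(self) -> int:
        """Readers see only what has been flushed so far."""
        return self.total


def stale_after_release(ncpus: int, updates: int = 10) -> int:
    """Charge one unit `updates` times, release it all, then read."""
    counter = ToyPerCpuCounter(ncpus)
    for i in range(updates):
        counter.add(i % ncpus, 1)
    counter.add(0, -updates)            # all charged memory released
    return counter.read()
```

In this model stale_after_release(1) returns 10 while stale_after_release(8) returns 0, purely because of where the threshold-triggered flush fell relative to the release. Calling flush() explicitly before read() yields 0 in both cases.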
Comment 5 zhangjing alibaba_cloud_group 2024-05-09 15:20:54 UTC
Changing status to BYDESIGN.
Comment 6 yunmeng365524 2024-05-21 16:52:08 UTC
The developers have analyzed and confirmed this; QA will establish a corresponding baseline.