[Defect description]: In kernel-selftests, the breakpoints.step_after_suspend_test case fails with the error "Bail out! Failed to enter Suspend state"

[Reproduction probability]: Always reproducible

[Reproduction steps]
1. Download kernel-6.6.25-2_rc1.an23.src.rpm
2. rpm -i kernel-6.6.25-2_rc1.an23.src.rpm
3. yum-builddep -y /root/rpmbuild/SPECS/kernel.spec
   rpmbuild -bp /root/rpmbuild/SPECS/kernel.spec
   cd /root/rpmbuild/BUILD/kernel-6.6.25-2_rc1.an23/linux-6.6.25-2_rc1.an23.aarch64/tools/testing/selftests/cgroup
4. make;./test_memcontrol

[Expected result]: The test case passes (PASS)

[Actual result]:
[root@iZbp143ti4ccpaufkzata6Z cgroup]# ./test_memcontrol
ok 1 test_memcg_subtree_control
ok 2 test_memcg_current
ok 3 test_memcg_min
ok 4 test_memcg_low
ok 5 test_memcg_high
ok 6 test_memcg_high_sync
ok 7 test_memcg_max
ok 8 test_memcg_reclaim
ok 9 test_memcg_oom_events
ok 10 # SKIP test_memcg_swap_max
not ok 11 test_memcg_sock
ok 12 test_memcg_oom_group_leaf_events
ok 13 test_memcg_oom_group_parent_events
ok 14 test_memcg_oom_group_score_events

[Reproduction environment]:
Environment: cloud ECS instance

[root@iZbp143ti4ccpaufkzata6Z breakpoints]# uname -ra
Linux iZbp143ti4ccpaufkzata6Z 6.6.25-2_rc1.an23.aarch64 #1 SMP PREEMPT_DYNAMIC Thu Apr 11 15:02:38 CST 2024 aarch64 aarch64 aarch64 GNU/Linux

[root@iZbp143ti4ccpaufkzata6Z breakpoints]# cat /etc/os-release
NAME="Anolis OS"
VERSION="23"
ID="anolis"
VERSION_ID="23"
PLATFORM_ID="platform:an23"
PRETTY_NAME="Anolis OS 23"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
BUG_REPORT_URL="https://bugzilla.openanolis.cn/"

[root@iZbp143ti4ccpaufkzata6Z breakpoints]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs           6.1G  804K  6.1G   1% /run
efivarfs        256K   18K  239K   7% /sys/firmware/efi/efivars
/dev/nvme0n1p2   40G   13G   27G  33% /
tmpfs            16G  3.1M   16G   1% /tmp
/dev/nvme0n1p1  500M  6.5M  494M   2% /boot/efi
tmpfs           3.1G  4.0K  3.1G   1% /run/user/0

[root@iZbp143ti4ccpaufkzata6Z breakpoints]# free -g
        total  used  free  shared  buff/cache  available
Mem:       30     0    28       0           1         29
Swap:       0     0     0

[root@iZbp143ti4ccpaufkzata6Z breakpoints]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-6.6.25-2_rc1.an23.aarch64 root=UUID=6424d533-3c41-4ad9-89fa-1d3bf8c49fd3 ro rhgb crashkernel=0M-2G:0M,2G-64G:256M,64G-:384M iommu.passthrough=1 iommu.strict=0 cryptomgr.notests cgroup.memory=nokmem rcupdate.rcu_cpu_stall_timeout=300 quiet selinux=1 console=tty0 biosdevname=0 net.ifnames=0 console=ttyAMA0,115200n8 noibrs nvme_core.io_timeout=4294967295 nvme_core.admin_timeout=4294967295

[root@iZbp143ti4ccpaufkzata6Z breakpoints]# lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: ARM
BIOS Vendor ID: Alibaba Cloud
Model name: Neoverse-N2
BIOS Model name: virt-rhel7.6.0 CPU @ 2.0GHz
BIOS CPU family: 1
Model: 0
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
Stepping: r0p0
BogoMIPS: 100.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh
Caches (sum of all):
  L1d: 512 KiB (8 instances)
  L1i: 512 KiB (8 instances)
  L2: 8 MiB (8 instances)
  L3: 64 MiB (1 instance)
NUMA:
  NUMA node(s): 1
  NUMA node0 CPU(s): 0-7
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Mmio stale data: Not affected
  Reg file data sampling: Not affected
  Retbleed: Not affected
  Spec rstack overflow: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1: Mitigation; __user pointer sanitization
  Spectre v2: Mitigation; CSV2, BHB
  Srbds: Not affected
  Tsx async abort: Not affected
Running the test in a loop shows that this failure is intermittent: on x86_64, all 20 of 20 consecutive runs passed; on aarch64, only 1 of 20 runs passed.
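A loop check of this kind could be scripted roughly as follows. This is only a sketch: the helper names `case_passed` and `count_passes` are hypothetical, and it assumes the binary prints TAP-style lines such as "ok 11 test_memcg_sock" or "not ok 11 test_memcg_sock", as in the output in this report.

```python
import re
import subprocess


def case_passed(tap_output: str, name: str = "test_memcg_sock") -> bool:
    """Return True if the TAP output has a plain "ok N <name>" line.

    Lines starting with "not ok" or carrying a "# SKIP" directive do not
    match, because the pattern is anchored to the whole line.
    """
    pattern = re.compile(rf"^ok \d+ {re.escape(name)}\s*$", re.MULTILINE)
    return bool(pattern.search(tap_output))


def count_passes(binary: str = "./test_memcontrol", runs: int = 20) -> int:
    """Run the selftest binary `runs` times and count passing runs.

    Assumption: it is executed from the selftests/cgroup directory, as root.
    """
    passes = 0
    for _ in range(runs):
        result = subprocess.run([binary], capture_output=True, text=True)
        if case_passed(result.stdout):
            passes += 1
    return passes


# Example (on the test machine):
#   print(f"test_memcg_sock passed {count_passes(runs=20)}/20 runs")
```

Matching on the whole TAP line rather than on the exit code keeps the count specific to test_memcg_sock, so an unrelated failing case does not skew the result.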
The failure also occurs on 5.10. For details, see bugzilla: https://bugzilla.openanolis.cn/show_bug.cgi?id=4519 Could the developers please help confirm why x86 and arm behave differently?
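test_memcg_sock fails because, after the client is closed, the sock value read from memory.stat is not 0. It is nonzero because these statistics are built on the rstat framework, which keeps per-CPU caches. The caches are not flushed on every read; they are flushed periodically, or when a threshold is exceeded. In other words, even after the sock has been released, a read of the statistics can still return a nonzero value. You can run the test case several times and treat a single pass as good enough, or have the test case sleep briefly before reading so that the periodic work has flushed the caches first.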
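The "wait before reading" workaround can be sketched as a bounded poll instead of a single fixed sleep. This is an illustration only, not the selftest's actual code: `parse_stat` and `wait_for_zero` are hypothetical helpers, the timeout and interval values are arbitrary assumptions, and the memory.stat content is assumed to consist of "name value" lines.

```python
import time
from typing import Callable


def parse_stat(text: str, key: str) -> int:
    """Return the value for `key` from memory.stat-style "name value" lines."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == key:
            return int(fields[1])
    raise KeyError(key)


def wait_for_zero(read_stat: Callable[[], str], key: str = "sock",
                  timeout: float = 5.0, interval: float = 0.5) -> bool:
    """Poll until `key` reads 0, or give up once `timeout` seconds elapse.

    `read_stat` returns the current memory.stat text. The default timeout
    and interval are assumptions, chosen only to outlast a periodic flush.
    """
    deadline = time.monotonic() + timeout
    while True:
        if parse_stat(read_stat(), key) == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example (the cgroup path below is hypothetical):
#   path = "/sys/fs/cgroup/memcg_test/memory.stat"
#   ok = wait_for_zero(lambda: open(path).read())
```

Polling returns as soon as the cached counters have been flushed, so it costs less time than a fixed sleep on machines where the value settles quickly.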
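The implementation of these statistics does not differ between x86 and arm. However, the cache flush threshold depends on the number of NUMA nodes and CPUs, which vary between machine types. As a result, regardless of architecture, some machines pass easily while others pass only some of the time.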
Status changed to: by design
The developers have analyzed and confirmed the issue; QA will establish a corresponding baseline.