[Defect description]:
The nightly job built on the evening of 2024-01-31 (kernel: 5.10.134-1209.git.580b79aed91e.al8.x86_64+debug) passed. Starting with the nightly job built on the evening of 2024-02-01 (kernel: 5.10.134-1210.git.c0d2face0e45.al8.x86_64+debug), running the LTP hugetlb suite hangs the system and a vmcore is generated.

Partial vmcore-dmesg log:
[ 352.919799] ------------[ cut here ]------------
[ 352.923014] kernel BUG at include/linux/bootmem_info.h:39!
[ 352.926572] invalid opcode: 0000 [#1] SMP KASAN PTI
[ 352.929987] CPU: 1 PID: 12636 Comm: hugeshmat04 Kdump: loaded Tainted: G E 5.10.134-1210.git.c0d2face0e45.al8.x86_64+debug #1
[ 352.936833] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
[ 352.938772] RIP: 0010:free_vmemmap_page_list+0x221/0x320
[ 352.940493] Code: 3d f9 ff e9 f7 fe ff ff 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 48 c7 c6 80 5e b8 b5 48 89 ef e8 5f cb ef ff <0f> 0b 48 89 ef e8 e5 2a 02 00 eb b6 4c 89 ef e8 5b 2b 02 00 eb 83
[ 352.945751] RSP: 0018:ffffc9000a4f78f8 EFLAGS: 00010286
[ 352.947632] RAX: 0000000000000000 RBX: ffffea000430f480 RCX: 0000000000000000
[ 352.951086] RDX: 1ffffd4001e00c57 RSI: 0000000000000000 RDI: ffffea000f0062b8
[ 352.956758] RBP: ffffea000f006280 R08: 000000000000003e R09: ffff8883ee601c07
[ 352.958672] R10: ffffed107dcc0380 R11: 0000000000000001 R12: ffffc9000a4f7978
[ 352.961017] R13: ffffea000f0062b4 R14: dffffc0000000000 R15: dead000000000100
[ 352.963396] FS: 00007f2ca683f740(0000) GS:ffff8883ee400000(0000) knlGS:0000000000000000
[ 352.965983] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 352.968096] CR2: 0000000000a83000 CR3: 00000001196ea005 CR4: 00000000003706e0
[ 352.970490] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 352.973000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 352.975376] Call Trace:
[ 352.976765] ? vmemmap_remap_free+0xe8/0x170
[ 352.978343] ? lock_release+0x20d/0x2b0
[ 352.980060] vmemmap_remap_free+0xf0/0x170
[ 352.981842] ? vmemmap_restore_pte+0x600/0x600
[ 352.983711] ? __alloc_pages_slowpath+0x1380/0x1380
[ 352.985642] ? start_flush_work+0x860/0x860
[ 352.987446] ? __mutex_lock+0xae5/0x10c0
[ 352.989170] ? lock_acquire+0x21d/0x2d0
[ 352.990876] ? vmemmap_remap_range+0x290/0x290
[ 352.992627] ? rcu_read_lock_sched_held+0x12/0x80
[ 352.994338] ? hugetlb_vmemmap_free+0x2d/0x70
[ 352.995928] ? lock_release+0x20d/0x2b0
[ 352.997431] hugetlb_vmemmap_free+0x40/0x70

Partial vmcore (crash) output:
For help, type "help".
Type "apropos word" to search for commands related to "word"...

      KERNEL: /usr/lib/debug/usr/lib/modules/5.10.134-1210.git.c0d2face0e45.al8.x86_64+debug/vmlinux [TAINTED]
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 4
        DATE: Fri Feb 2 15:20:22 CST 2024
      UPTIME: 00:05:49
LOAD AVERAGE: 0.64, 1.37, 0.80
       TASKS: 690
    NODENAME: VM20201111-32
     RELEASE: 5.10.134-1210.git.c0d2face0e45.al8.x86_64+debug
     VERSION: #1 SMP Thu Feb 1 13:14:58 UTC 2024
     MACHINE: x86_64 (2499 Mhz)
      MEMORY: 16 GB
       PANIC: "kernel BUG at include/linux/bootmem_info.h:39!"
         PID: 12636
     COMMAND: "hugeshmat04"
        TASK: ffff8881163c8000 [THREAD_INFO: ffff8881163c8000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 12636 TASK: ffff8881163c8000 CPU: 1 COMMAND: "hugeshmat04"
 #0 [ffffc9000a4f7468] machine_kexec at ffffffffb311e06a
 #1 [ffffc9000a4f7548] __crash_kexec at ffffffffb34ab2a0
 #2 [ffffc9000a4f7680] panic at ffffffffb546ebc7
 #3 [ffffc9000a4f7760] do_trap at ffffffffb306b1cb
 #4 [ffffc9000a4f77c0] do_error_trap at ffffffffb306b81b
 #5 [ffffc9000a4f7810] handle_invalid_op at ffffffffb306b91c
 #6 [ffffc9000a4f7828] exc_invalid_op at ffffffffb558339b
 #7 [ffffc9000a4f7840] asm_exc_invalid_op at ffffffffb5600a92
    [exception RIP: free_vmemmap_page_list+545]
    RIP: ffffffffb39bd3b1 RSP: ffffc9000a4f78f8 RFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffffea000430f480 RCX: 0000000000000000
    RDX: 1ffffd4001e00c57 RSI: 0000000000000000 RDI: ffffea000f0062b8
    RBP: ffffea000f006280 R8: 000000000000003e R9: ffff8883ee601c07
    R10: ffffed107dcc0380 R11: 0000000000000001 R12: ffffc9000a4f7978
    R13: ffffea000f0062b4 R14: dffffc0000000000 R15: dead000000000100
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #8 [ffffc9000a4f7940] vmemmap_remap_free at ffffffffb39bf120
 #9 [ffffc9000a4f7a08] hugetlb_vmemmap_free at ffffffffb39afd10
#10 [ffffc9000a4f7a20] __prep_new_huge_page at ffffffffb3999883
#11 [ffffc9000a4f7a38] alloc_fresh_huge_page at ffffffffb399afaa
#12 [ffffc9000a4f7a90] alloc_pool_huge_page at ffffffffb399bed2
#13 [ffffc9000a4f7ac0] set_max_huge_pages at ffffffffb399c1ce
#14 [ffffc9000a4f7ba0] hugetlb_sysctl_handler_common at ffffffffb399cceb
#15 [ffffc9000a4f7ca8] proc_sys_call_handler at ffffffffb3cd7e51
#16 [ffffc9000a4f7d58] new_sync_write at ffffffffb3ad5ea5
#17 [ffffc9000a4f7e80] vfs_write at ffffffffb3ade44b
#18 [ffffc9000a4f7ec8] ksys_write at ffffffffb3adeeb9
#19 [ffffc9000a4f7f40] do_syscall_64 at ffffffffb5582f20
#20 [ffffc9000a4f7f50] entry_SYSCALL_64_after_hwframe at ffffffffb5600099
    RIP: 00007f2ca6934cc7 RSP: 00007ffc0b026068 RFLAGS: 00000246
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2ca6934cc7
    RDX: 0000000000000003 RSI: 0000000000a80500 RDI: 0000000000000006
    RBP: 0000000000a80500 R8: 0000000000000000 R9: 00007ffc0b0260e5

Partial serial console log captured while the system was hung:
[ 22.495080] kdump.sh[339]: kdump: done
[ 22.594187] kdump[347]: saving to /sysroot//var/crash/127.0.0.1-2024-02-03-00:34:13/
[ 22.768810] EXT4-fs (vda1): re-mounted. Opts: (null)
[ 22.706052] kdump[351]: saving vmcore-dmesg.txt to /sysroot//var/crash/127.0.0.1-2024-02-03-00:34:13/
[ 23.216024] kdump[357]: saving vmcore-dmesg.txt complete
[ 23.260653] kdump[359]: saving vmcore
[ 24.209155] rngd[205]: [jitter]: Enabling JITTER rng support
[ 24.211781] rngd[205]: [jitter]: Initialized
Copying data : [100.0 %] \ eta: 0s
[ 37.529594] kdump.sh[360]: The dumpfile is saved to /sysroot//var/crash/127.0.0.1-2024-02-03-00:34:13/vmcore-incomplete.
[ 37.532014] kdump.sh[360]: makedumpfile Completed.
[ 37.994273] kdump[364]: saving vmcore complete
[ 38.037479] kdump[366]: saving the /run/initramfs/kexec-dmesg.log to /sysroot//var/crash/127.0.0.1-2024-02-03-00:34:13/
[ 38.574906] kdump[372]: Executing final action systemctl reboot -f
[ 38.670792] systemd[1]: Shutting down.
[ 38.906710] printk: systemd-shutdow: 17 output lines suppressed due to ratelimiting
[ 40.070700] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 40.081640] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 40.312906] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 40.338174] systemd-shutdown[1]: Unmounting file systems.
[ 40.346768] [376]: Remounting '/sysroot' read-only in with options 'x-systemd.before=initrd-fs.target'.
[ 40.350095] EXT4-fs (vda1): Unrecognized mount option "x-systemd.before=initrd-fs.target" or missing value
[ 40.353141] [376]: Failed to remount '/sysroot' read-only: Invalid argument
[ 40.364173] [377]: Unmounting '/sysroot'.
[ 40.792315] [378]: Remounting '/' read-only in with options 'lowerdir=/squash/root,upperdir=/squash/overlay/upper,workdir=/squash/overlay/work/,index=off'.
[ 40.847278] [379]: Unmounting '/squash/root'.
[ 40.859524] [380]: Unmounting '/squash'.
[ 40.869796] systemd-shutdown[1]: All filesystems unmounted.
[ 40.877078] systemd-shutdown[1]: Deactivating swaps.
[ 40.886303] systemd-shutdown[1]: All swaps deactivated.
[ 40.893473] systemd-shutdown[1]: Detaching loop devices.
[ 40.901196] systemd-shutdown[1]: Detaching loopback /dev/loop0.
[ 40.906627] systemd-shutdown[1]: All loop devices detached.
[ 40.929135] reboot: Restarting system
[ 40.931126] reboot: machine restart

[Reproduction rate]:
Always reproduces when running the hugetlb suite. Repeated runs show that it is not tied to one specific test case; the failure occurs after a few cases have run.

[Reproduction environment]:
Machine: offline VM
Kernel:
# uname -r
5.10.134-1210.git.c0d2face0e45.al8.x86_64+debug
# cat /etc/os-release
NAME="Alibaba Cloud Linux"
VERSION="3 (Soaring Falcon)"
ID="alinux"
ID_LIKE="rhel fedora centos anolis"
VERSION_ID="3"
PLATFORM_ID="platform:al8"
PRETTY_NAME="Alibaba Cloud Linux 3 (Soaring Falcon)"
ANSI_COLOR="0;31"
HOME_URL="https://www.aliyun.com/"

CPU information:
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
BIOS Vendor ID: Alibaba Cloud
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
BIOS Model name: pc-i440fx-2.1
Stepping: 7
CPU MHz: 2499.998
BogoMIPS: 4999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_vnni

Memory information:
# free -h
              total        used        free      shared  buff/cache   available
Mem:           13Gi       1.0Gi        10Gi       4.0Mi       1.3Gi        11Gi
Swap:            0B          0B          0B

[Reproduction steps]:
# Download and build the test suite
git clone http://code.alibaba-inc.com/alikernel/ltp.git --branch LTP-20200930 # 5.10
export CFLAGS="-fcommon" # required with gcc 10
cd ltp
make autotools
./configure
make
make install
# Run the tests
/opt/ltp/runltp -f hugetlb

[Expected result]:
The test cases pass.

[Actual result]:
The system hangs while the test cases are running, and a vmcore is produced.
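
The backtrace above enters the kernel through a write handled by hugetlb_sysctl_handler_common and set_max_huge_pages, so a narrower trigger than the full LTP suite might be to grow and shrink the huge page pool repeatedly. The following is only a sketch under that assumption: the function name cycle_hugepages, the page count, and the round count are hypothetical, and it has not been confirmed that this reproduces the BUG.

```shell
#!/bin/sh
# Hypothetical narrowed reproducer (an assumption, not a confirmed trigger).
# It repeatedly resizes the huge page pool through a sysctl file, which is
# the entry path seen in the backtrace (hugetlb_sysctl_handler_common).

cycle_hugepages() {
    sysctl_file=$1   # file to write the pool size to
    count=$2         # huge pages to request in each round
    rounds=$3        # number of grow/shrink rounds
    i=0
    while [ "$i" -lt "$rounds" ]; do
        echo "$count" > "$sysctl_file" || return 1   # grow the pool
        echo 0 > "$sysctl_file" || return 1          # shrink it back
        i=$((i + 1))
    done
    echo "completed $rounds rounds"
}

# Example, as root on a disposable test machine only:
#   cycle_hugepages /proc/sys/vm/nr_hugepages 64 10
```

The function takes the target file as a parameter so that it can be dry-run against an ordinary file before pointing it at /proc/sys/vm/nr_hugepages.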
Suspected to be related to this kernel change: https://gitee.com/anolis/cloud-kernel/pulls/2700
The PR Link: https://gitee.com/anolis/cloud-kernel/pulls/2753
https://gitee.com/anolis/cloud-kernel/pulls/2753 has been merged.