Bug 8171 - [alinux3][x86-64][nightly][debug-kernel] Running ltp:hugetlb hangs the system and generates a vmcore
Status: RESOLVED FIXED
Alias: None
Product: ANCK 5.10 Dev
Classification: ANCK
Component: general/others
Version: 5.10.y-16
Hardware: x86_64 Linux
Importance: P2-High S2-major
Target Milestone: ---
Assignee: maqiao
Reported: 2024-02-02 17:55 UTC by wangpingping
Modified: 2024-02-04 12:29 UTC
Description wangpingping alibaba_cloud_group 2024-02-02 17:55:46 UTC
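[Defect description]:
The nightly task built on the evening of 2024-01-31 (kernel: 5.10.134-1209.git.580b79aed91e.al8.x86_64+debug) ran normally. Starting with the nightly task built on the evening of 2024-02-01 (kernel: 5.10.134-1210.git.c0d2face0e45.al8.x86_64+debug), running ltp:hugetlb hangs the system and a vmcore is generated.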

Excerpt from the vmcore-dmesg log:
[  352.919799] ------------[ cut here ]------------
[  352.923014] kernel BUG at include/linux/bootmem_info.h:39!
[  352.926572] invalid opcode: 0000 [#1] SMP KASAN PTI
[  352.929987] CPU: 1 PID: 12636 Comm: hugeshmat04 Kdump: loaded Tainted: G            E     5.10.134-1210.git.c0d2face0e45.al8.x86_64+debug #1
[  352.936833] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
[  352.938772] RIP: 0010:free_vmemmap_page_list+0x221/0x320
[  352.940493] Code: 3d f9 ff e9 f7 fe ff ff 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 48 c7 c6 80 5e b8 b5 48 89 ef e8 5f cb ef ff <0f> 0b 48 89 ef e8 e5 2a 02 00 eb b6 4c 89 ef e8 5b 2b 02 00 eb 83
[  352.945751] RSP: 0018:ffffc9000a4f78f8 EFLAGS: 00010286
[  352.947632] RAX: 0000000000000000 RBX: ffffea000430f480 RCX: 0000000000000000
[  352.951086] RDX: 1ffffd4001e00c57 RSI: 0000000000000000 RDI: ffffea000f0062b8
[  352.956758] RBP: ffffea000f006280 R08: 000000000000003e R09: ffff8883ee601c07
[  352.958672] R10: ffffed107dcc0380 R11: 0000000000000001 R12: ffffc9000a4f7978
[  352.961017] R13: ffffea000f0062b4 R14: dffffc0000000000 R15: dead000000000100
[  352.963396] FS:  00007f2ca683f740(0000) GS:ffff8883ee400000(0000) knlGS:0000000000000000
[  352.965983] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  352.968096] CR2: 0000000000a83000 CR3: 00000001196ea005 CR4: 00000000003706e0
[  352.970490] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  352.973000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  352.975376] Call Trace:
[  352.976765]  ? vmemmap_remap_free+0xe8/0x170
[  352.978343]  ? lock_release+0x20d/0x2b0
[  352.980060]  vmemmap_remap_free+0xf0/0x170
[  352.981842]  ? vmemmap_restore_pte+0x600/0x600
[  352.983711]  ? __alloc_pages_slowpath+0x1380/0x1380
[  352.985642]  ? start_flush_work+0x860/0x860
[  352.987446]  ? __mutex_lock+0xae5/0x10c0
[  352.989170]  ? lock_acquire+0x21d/0x2d0
[  352.990876]  ? vmemmap_remap_range+0x290/0x290
[  352.992627]  ? rcu_read_lock_sched_held+0x12/0x80
[  352.994338]  ? hugetlb_vmemmap_free+0x2d/0x70
[  352.995928]  ? lock_release+0x20d/0x2b0
[  352.997431]  hugetlb_vmemmap_free+0x40/0x70

Excerpt from the vmcore (crash session):
For help, type "help".
Type "apropos word" to search for commands related to "word"...

      KERNEL: /usr/lib/debug/usr/lib/modules/5.10.134-1210.git.c0d2face0e45.al8.x86_64+debug/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 4
        DATE: Fri Feb  2 15:20:22 CST 2024
      UPTIME: 00:05:49
LOAD AVERAGE: 0.64, 1.37, 0.80
       TASKS: 690
    NODENAME: VM20201111-32
     RELEASE: 5.10.134-1210.git.c0d2face0e45.al8.x86_64+debug
     VERSION: #1 SMP Thu Feb 1 13:14:58 UTC 2024
     MACHINE: x86_64  (2499 Mhz)
      MEMORY: 16 GB
       PANIC: "kernel BUG at include/linux/bootmem_info.h:39!"
         PID: 12636
     COMMAND: "hugeshmat04"
        TASK: ffff8881163c8000  [THREAD_INFO: ffff8881163c8000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 12636    TASK: ffff8881163c8000  CPU: 1    COMMAND: "hugeshmat04"
 #0 [ffffc9000a4f7468] machine_kexec at ffffffffb311e06a
 #1 [ffffc9000a4f7548] __crash_kexec at ffffffffb34ab2a0
 #2 [ffffc9000a4f7680] panic at ffffffffb546ebc7
 #3 [ffffc9000a4f7760] do_trap at ffffffffb306b1cb
 #4 [ffffc9000a4f77c0] do_error_trap at ffffffffb306b81b
 #5 [ffffc9000a4f7810] handle_invalid_op at ffffffffb306b91c
 #6 [ffffc9000a4f7828] exc_invalid_op at ffffffffb558339b
 #7 [ffffc9000a4f7840] asm_exc_invalid_op at ffffffffb5600a92
    [exception RIP: free_vmemmap_page_list+545]
    RIP: ffffffffb39bd3b1  RSP: ffffc9000a4f78f8  RFLAGS: 00010286
    RAX: 0000000000000000  RBX: ffffea000430f480  RCX: 0000000000000000
    RDX: 1ffffd4001e00c57  RSI: 0000000000000000  RDI: ffffea000f0062b8
    RBP: ffffea000f006280   R8: 000000000000003e   R9: ffff8883ee601c07
    R10: ffffed107dcc0380  R11: 0000000000000001  R12: ffffc9000a4f7978
    R13: ffffea000f0062b4  R14: dffffc0000000000  R15: dead000000000100
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffffc9000a4f7940] vmemmap_remap_free at ffffffffb39bf120
 #9 [ffffc9000a4f7a08] hugetlb_vmemmap_free at ffffffffb39afd10
#10 [ffffc9000a4f7a20] __prep_new_huge_page at ffffffffb3999883
#11 [ffffc9000a4f7a38] alloc_fresh_huge_page at ffffffffb399afaa
#12 [ffffc9000a4f7a90] alloc_pool_huge_page at ffffffffb399bed2
#13 [ffffc9000a4f7ac0] set_max_huge_pages at ffffffffb399c1ce
#14 [ffffc9000a4f7ba0] hugetlb_sysctl_handler_common at ffffffffb399cceb
#15 [ffffc9000a4f7ca8] proc_sys_call_handler at ffffffffb3cd7e51
#16 [ffffc9000a4f7d58] new_sync_write at ffffffffb3ad5ea5
#17 [ffffc9000a4f7e80] vfs_write at ffffffffb3ade44b
#18 [ffffc9000a4f7ec8] ksys_write at ffffffffb3adeeb9
#19 [ffffc9000a4f7f40] do_syscall_64 at ffffffffb5582f20
#20 [ffffc9000a4f7f50] entry_SYSCALL_64_after_hwframe at ffffffffb5600099
    RIP: 00007f2ca6934cc7  RSP: 00007ffc0b026068  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000003  RCX: 00007f2ca6934cc7
    RDX: 0000000000000003  RSI: 0000000000a80500  RDI: 0000000000000006
    RBP: 0000000000a80500   R8: 0000000000000000   R9: 00007ffc0b0260e5
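
The dmesg excerpt reports the faulting location as free_vmemmap_page_list+0x221, while crash prints the exception RIP as free_vmemmap_page_list+545. These are the same offset, written in hexadecimal and in decimal. A minimal shell sketch of the conversion follows; the helper name hex_to_dec is made up for illustration and is not part of crash or LTP.

```shell
# hex_to_dec is a hypothetical helper name used only for this sketch.
# It prints the decimal value of a hexadecimal offset such as 0x221.
hex_to_dec() {
    printf '%d\n' "$1"
}
```

Calling `hex_to_dec 0x221` prints 545, which matches the offset shown in the crash backtrace.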

Excerpt from the serial console log captured when the system hung:

[   22.495080] kdump.sh[339]: kdump: done
[   22.594187] kdump[347]: saving to /sysroot//var/crash/127.0.0.1-2024-02-03-00:34:13/
[   22.768810] EXT4-fs (vda1): re-mounted. Opts: (null)
[   22.706052] kdump[351]: saving vmcore-dmesg.txt to /sysroot//var/crash/127.0.0.1-2024-02-03-00:34:13/
[   23.216024] kdump[357]: saving vmcore-dmesg.txt complete
[   23.260653] kdump[359]: saving vmcore
[   24.209155] rngd[205]: [jitter]: Enabling JITTER rng support
[   24.211781] rngd[205]: [jitter]: Initialized
Copying data                                      : [100.0 %] \           eta: 0s
[   37.529594] kdump.sh[360]: The dumpfile is saved to /sysroot//var/crash/127.0.0.1-2024-02-03-00:34:13/vmcore-incomplete.
[   37.532014] kdump.sh[360]: makedumpfile Completed.
[   37.994273] kdump[364]: saving vmcore complete
[   38.037479] kdump[366]: saving the /run/initramfs/kexec-dmesg.log to /sysroot//var/crash/127.0.0.1-2024-02-03-00:34:13/
[   38.574906] kdump[372]: Executing final action systemctl reboot -f
[   38.670792] systemd[1]: Shutting down.
[   38.906710] printk: systemd-shutdow: 17 output lines suppressed due to ratelimiting
[   40.070700] systemd-shutdown[1]: Syncing filesystems and block devices.
[   40.081640] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[   40.312906] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[   40.338174] systemd-shutdown[1]: Unmounting file systems.
[   40.346768] [376]: Remounting '/sysroot' read-only in with options 'x-systemd.before=initrd-fs.target'.
[   40.350095] EXT4-fs (vda1): Unrecognized mount option "x-systemd.before=initrd-fs.target" or missing value
[   40.353141] [376]: Failed to remount '/sysroot' read-only: Invalid argument
[   40.364173] [377]: Unmounting '/sysroot'.
[   40.792315] [378]: Remounting '/' read-only in with options 'lowerdir=/squash/root,upperdir=/squash/overlay/upper,workdir=/squash/overlay/work/,index=off'.
[   40.847278] [379]: Unmounting '/squash/root'.
[   40.859524] [380]: Unmounting '/squash'.
[   40.869796] systemd-shutdown[1]: All filesystems unmounted.
[   40.877078] systemd-shutdown[1]: Deactivating swaps.
[   40.886303] systemd-shutdown[1]: All swaps deactivated.
[   40.893473] systemd-shutdown[1]: Detaching loop devices.
[   40.901196] systemd-shutdown[1]: Detaching loopback /dev/loop0.
[   40.906627] systemd-shutdown[1]: All loop devices detached.
[   40.929135] reboot: Restarting system
[   40.931126] reboot: machine restart
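
The serial log above shows kdump saving vmcore-dmesg.txt into a per-crash directory under /var/crash before rebooting. When the suite is rerun several times, a small helper can list which saved dumps carry the same BUG signature as this report. This is only a sketch: the function name and the default crash root are assumptions, and the kdump target directory may be configured differently on other machines.

```shell
# find_bootmem_bug_dumps is a hypothetical helper name.
# Argument 1 (optional): crash root directory; /var/crash is assumed here.
# Prints the path of every vmcore-dmesg.txt that contains the BUG signature
# from this report. Returns non-zero when nothing matches.
find_bootmem_bug_dumps() {
    local crash_root="${1:-/var/crash}"
    local signature='kernel BUG at include/linux/bootmem_info.h'
    # -l lists matching file names only; -F treats the signature as a fixed string.
    grep -lF -- "$signature" "$crash_root"/*/vmcore-dmesg.txt 2>/dev/null
}
```

Running `find_bootmem_bug_dumps` with no argument after the machine has rebooted scans the assumed default location; pass another directory if kdump is configured to save elsewhere.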



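[Reproduction rate]:
Always reproduces when running hugetlb. Repeated runs show that it is not caused by one fixed test case, but it reproduces after a few test cases have run.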

[Reproduction environment]:
Machine: offline VM
Kernel:
# uname -r
5.10.134-1210.git.c0d2face0e45.al8.x86_64+debug
# cat /etc/os-release
NAME="Alibaba Cloud Linux"
VERSION="3 (Soaring Falcon)"
ID="alinux"
ID_LIKE="rhel fedora centos anolis"
VERSION_ID="3"
PLATFORM_ID="platform:al8"
PRETTY_NAME="Alibaba Cloud Linux 3 (Soaring Falcon)"
ANSI_COLOR="0;31"
HOME_URL="https://www.aliyun.com/"

CPU info:
# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Alibaba Cloud
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
BIOS Model name:     pc-i440fx-2.1
Stepping:            7
CPU MHz:             2499.998
BogoMIPS:            4999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_vnni


Memory info:
# free -h
              total        used        free      shared  buff/cache   available
Mem:           13Gi       1.0Gi        10Gi       4.0Mi       1.3Gi        11Gi
Swap:            0B          0B          0B

[Reproduction steps]:
# Download and build the test suite
git clone http://code.alibaba-inc.com/alikernel/ltp.git --branch LTP-20200930      # 5.10
export CFLAGS="-fcommon"               # required with gcc 10
cd ltp
make autotools
./configure
make
make install

# Run the tests
/opt/ltp/runltp -f hugetlb

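[Expected result]:
The test cases run successfully.

[Actual result]:
The system hangs while the test cases are running, and a vmcore is generated.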
Comment 1 wangpingping alibaba_cloud_group 2024-02-02 17:56:38 UTC
Suspected to be related to this kernel commit: https://gitee.com/anolis/cloud-kernel/pulls/2700
Comment 2 小龙 admin 2024-02-04 11:31:18 UTC
The PR Link: https://gitee.com/anolis/cloud-kernel/pulls/2753
Comment 3 xuyu alibaba_cloud_group 2024-02-04 12:29:51 UTC
https://gitee.com/anolis/cloud-kernel/pulls/2753

Merged.