[Defect description]:
After running the LTP stress test for about 5 hours, the kernel panicked with "Unable to handle kernel paging request at virtual address fffffc100200001c". Note that manually triggering a crash on this machine does not produce a vmcore, but the panic during the stress test did produce one.

vmcore analysis log:
[root@16f5Lab15 127.0.0.1-2025-02-28-08:29:49]# crash /usr/lib/debug/lib/modules/6.6.71-3_rc1.an23.aarch64/vmlinux vmcore

crash 8.0.4-3.an23
GNU gdb (GDB) 10.2
This GDB was configured as "aarch64-unknown-linux-gnu".
      KERNEL: /usr/lib/debug/lib/modules/6.6.71-3_rc1.an23.aarch64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 128
        DATE: Thu Feb 27 20:36:31 CST 2025
      UPTIME: 06:17:34
LOAD AVERAGE: 80.17, 80.30, 81.24
       TASKS: 1888
    NODENAME: 16f5Lab15
     RELEASE: 6.6.71-3_rc1.an23.aarch64
     VERSION: #1 SMP PREEMPT_DYNAMIC Fri Feb 21 11:39:27 CST 2025
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 128 GB
       PANIC: "Unable to handle kernel paging request at virtual address fffffc100200001c"
         PID: 669
     COMMAND: "kcompactd1"
        TASK: ffff000806500000  [THREAD_INFO: ffff000806500000]
         CPU: 97
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 669  TASK: ffff000806500000  CPU: 97  COMMAND: "kcompactd1"
 #0 [ffff800082ad31a0] machine_kexec at ffff800080033de8
 #1 [ffff800082ad31d0] __crash_kexec at ffff80008015ac8c
 #2 [ffff800082ad3350] crash_kexec at ffff80008015be64
 #3 [ffff800082ad3370] die at ffff8000800201ec
 #4 [ffff800082ad3420] die_kernel_fault at ffff800080037108
 #5 [ffff800082ad3460] __do_kernel_fault at ffff8000800376d0
 #6 [ffff800082ad3490] do_bad_area at ffff80008003775c
 #7 [ffff800082ad34b0] do_translation_fault at ffff800080d41f78
 #8 [ffff800082ad34c0] do_mem_abort at ffff800080037214
 #9 [ffff800082ad34f0] el1_abort at ffff800080d2e72c
#10 [ffff800082ad3520] el1h_64_sync_handler at ffff800080d30aa4
#11 [ffff800082ad3660] el1h_64_sync at ffff800080011304
#12 [ffff800082ad3680] folio_remove_rmap_ptes at ffff80008032b9f4
#13 [ffff800082ad36b0] try_to_migrate_one at ffff80008032cc14
#14 [ffff800082ad37b0] rmap_walk_anon at ffff800080329408
#15 [ffff800082ad3810] try_to_migrate at ffff80008032dd04
#16 [ffff800082ad3860] migrate_folio_unmap at ffff800080388360
#17 [ffff800082ad38d0] migrate_pages_batch at ffff800080389b08
#18 [ffff800082ad3a10] migrate_pages_sync at ffff80008038a5f0
#19 [ffff800082ad3ad0] migrate_pages at ffff80008038ae9c
#20 [ffff800082ad3bc0] compact_zone at ffff8000802fc9e0
#21 [ffff800082ad3c50] kcompactd_do_work at ffff8000802fd370
#22 [ffff800082ad3de0] kcompactd at ffff8000802fd900
#23 [ffff800082ad3e70] kthread at ffff800080085abc
crash> q

[Reproduction rate]
Seen only once so far.

[Reproduction environment]
Kernel: 6.6.71-3_rc1.an23.aarch64

# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.6.71-3_rc1.an23.aarch64 root=UUID=bedec06f-d570-431d-bce1-749030567aeb ro rhgb selinux=0 console=tty0 cgroup.memory=nokmem iommu.passthrough=1 iommu.strict=0 nospectre_bhb ssbd=force-off no_hash_pointers crashkernel=0M-2G:0M,2G-64G:256M,64G-:512M

# cat /etc/os-release
NAME="Anolis OS"
VERSION="23"
ID="anolis"
VERSION_ID="23"
PLATFORM_ID="platform:an23"
PRETTY_NAME="Anolis OS 23"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
BUG_REPORT_URL="https://bugzilla.openanolis.cn/"

Memory information:
# free -h
        total   used    free    shared  buff/cache  available
Mem:    7.3Gi   290Mi   7.0Gi   716Ki   231Mi       7.0Gi
Swap:   0B      0B      0B

CPU information:
# lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                ARM
  BIOS Vendor ID:         Alibaba Cloud
  Model name:             Neoverse-N2
    BIOS Model name:      virt-rhel7.6.0 CPU @ 3.0GHz
    BIOS CPU family:      1
    Model:                0
    Thread(s) per core:   1
    Core(s) per socket:   2
    Socket(s):            1
    Stepping:             r0p0
    Frequency boost:      disabled
    CPU(s) scaling MHz:   100%
    CPU max MHz:          3000.0000
    CPU min MHz:          3000.0000
    BogoMIPS:             100.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh
Caches (sum of all):
  L1d:                    128 KiB (2 instances)
  L1i:                    128 KiB (2 instances)
  L2:                     2 MiB (2 instances)
  L3:                     64 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0,1
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Vulnerable
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Mitigation; CSV2, but not BHB
  Srbds:                  Not affected
  Tsx async abort:        Not affected

[Reproduction steps]:
1. Install the test kernel and reboot.
2. Download and build the test suite, set up the environment, and run the test:

# Download and build the test suite
git clone http://code.alibaba-inc.com/alikernel/ltp.git
export CFLAGS="-fcommon" # required for gcc 10
make autotools
./configure
make
make install

Environment setup:
echo 1 > /proc/sys/kernel/panic
echo 1 > /proc/sys/kernel/hardlockup_panic
echo 1 > /proc/sys/kernel/softlockup_panic
echo 60 > /proc/sys/kernel/watchdog_thresh
echo 150 > /proc/sys/kernel/watchdog_thresh
echo 1200 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo '0 4 0 7' > /proc/sys/kernel/printk
echo 1 > /proc/sys/kernel/sched_group_balancer

# Prepare the test script
cat <<-EOF > /opt/ltp/load.sh
#!/bin/bash
nr_cpu=$(nproc)
mem_kb=$(grep ^MemTotal /proc/meminfo | awk '{print $2}')
./runltp \
    -c $((nr_cpu / 2)) \
    -m $((nr_cpu / 4)),4,$((mem_kb / nr_cpu / 2 * 1024)),1 \
    -D $((nr_cpu / 10)),1,0,1 \
    -i 2 \
    -B ext4 \
    -R -p -q \
    -t 72h \
    -d /disk1/tmpdir/ltp \
    -b /dev/vdb1 -B ext4 -z /dev/vdb2 -Z ext4
EOF
chmod a+x /opt/ltp/load.sh

# Run the test
nohup ./load.sh &> ltp-stress.log &

[Expected result]:
The LTP stress test runs to completion without errors.

[Actual result]:
The kernel crashed while the LTP stress test was running. Part of vmcore-dmesg.txt follows:

[21675.385154] preadv203 (518047): drop_caches: 3
[21675.486344] preadv203 (518047): drop_caches: 3
[21677.017371] BTRFS info (device loop0): last unmount of filesystem a50115a2-f2bd-40ac-9ec7-7d733dbd1af8
[21678.530450] audit: type=1130 audit(1740658814.429:631): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[21678.549316] audit: type=1131 audit(1740658814.429:632): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[21679.497851] cgroup: Unknown subsys name 'debug'
[21871.940851] memcg_stress_te (528017): drop_caches: 3
[21990.935182] audit: type=1130 audit(1740659126.833:633): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=pmlogger_farm_check comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[21990.956731] audit: type=1130 audit(1740659126.857:634): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=pmlogger_check comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[21991.775704] audit: type=1131 audit(1740659127.673:635): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=pmlogger_farm_check comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[21996.405505] audit: type=1131 audit(1740659132.305:636): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=pmlogger_check comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[22170.712688] audit: type=1130 audit(1740659306.613:637): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=pmie_farm_check comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[22170.801281] audit: type=1130 audit(1740659306.701:638): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=pmie_check comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[22172.467022] audit: type=1131 audit(1740659308.365:639): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=pmie_check comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[22172.511122] audit: type=1131 audit(1740659308.409:640): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=pmie_farm_check comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[22290.831310] audit: type=1130 audit(1740659426.729:641): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[22589.851051] audit: type=1130 audit(1740659725.749:643): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dnf-makecache comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[22655.730431] Unable to handle kernel paging request at virtual address fffffc100200001c
[22655.738361] Mem abort info:
[22655.741148]   ESR = 0x0000000096000006
[22655.744889]   EC = 0x25: DABT (current EL), IL = 32 bits
[22655.750189]   SET = 0, FnV = 0
[22655.753231]   EA = 0, S1PTW = 0
[22655.756360]   FSC = 0x06: level 2 translation fault
[22655.761226] Data abort info:
[22655.764093]   ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[22655.769565]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[22655.774604]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[22655.779904] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000f3957000
[22655.786592] [fffffc100200001c] pgd=10000017fffef003, p4d=10000017fffef003, pud=10000417fffc4003, pmd=0000000000000000
[22655.797194] Internal error: Oops: 0000000096000006 [#1] SMP
[22655.802756] Modules linked in: tls(E) vfio_iommu_type1(E) vfio(E) vhost_vsock(E) vhost_net(E) vmw_vsock_virtio_transport_common(E) vhost(E) vhost_iotlb(E) tap(E) vsock(E) brd(E) poly1305_generic(E) libpoly1305(E) poly1305_neon(E) chacha_generic(E) chacha_neon(E) libchacha(E) chacha20poly1305(E) nf_tables(E) nfsv3(E) nfs_acl(E) nfs(E) lockd(E) grace(E) fscache(E) netfs(E) uinput(E) tun(E) n_hdlc(E) slip(E) slhc(E) squashfs(E) binfmt_misc(E) authenc(E) pcrypt(E) veth(E) overlay(E) btrfs(E) blake2b_generic(E) xor(E) xor_neon(E) raid6_pq(E) zstd_compress(E) crypto_user(E) mlx5_ib(E) ib_uverbs(E) ib_core(E) rfkill(E) ipmi_ssif(E) sunrpc(E) vfat(E) fat(E) acpi_ipmi(E) ipmi_si(E) mlx5_core(E) ipmi_devintf(E) mlxfw(E) psample(E) pci_hyperv_intf(E) ipmi_msghandler(E) alibaba_uncore_drw_pmu(E) arm_spe_pmu(E) loop(E) fuse(E) nfnetlink(E) xfs(E) libcrc32c(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sd_mod(E) sm4_ce_cipher(E) sm4(E) ast(E) sm3_ce(E) i2c_algo_bit(E) drm_shmem_helper(E) sha2_ce(E) ahci(E)
[22655.802824]  nvme(E) sha256_arm64(E) drm_kms_helper(E) libahci(E) sha1_ce(E) nvme_core(E) sbsa_gwdt(E) t10_pi(E) drm(E) libata(E) dm_multipath(E) dm_mod(E) [last unloaded: ltp_tpci(OE)]
[22655.909173] CPU: 97 PID: 669 Comm: kcompactd1 Kdump: loaded Tainted: G W OE 6.6.71-3_rc1.an23.aarch64 #1
[22655.919768] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-1UC1P-5B/AS1111MG1, BIOS 1.2.M1.AL.P.139.00 02/14/2023
[22655.930273] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[22655.937221] pc : folio_remove_rmap_ptes+0x68/0x140
[22655.942004] lr : try_to_migrate_one+0x3b8/0xd60
[22655.946521] sp : ffff800082ad3680
[22655.949822] x29: ffff800082ad3680 x28: 000030400fffffd0 x27: ffff040061250b18
[22655.956943] x26: fffffc1001ffffc0 x25: 0000000000000000 x24: ffff8000811ec000
[22655.964065] x23: 00000000f0000800 x22: 0000ffffad763000 x21: ffff00080bd42440
[22655.971186] x20: ffff00080bd42440 x19: fffffc1001ffffc0 x18: 0000000000000000
[22655.978308] x17: 3f5b6b352a2d294a x16: 2e3b5e266c327b66 x15: 5c677537442c4635
[22655.985429] x14: 48397a3e407c782d x13: ffff040b6901f000 x12: 514e2c2955347047
[22655.992550] x11: 0000000000000002 x10: ffff000806500000 x9 : ffff80008032cc18
[22655.999671] x8 : 00000000007fffff x7 : fffffc100200001c x6 : 0000000000000001
[22656.006792] x5 : 00000000ffffffff x4 : 00000000ffffffff x3 : 0000000000000012
[22656.013914] x2 : 00000000ffffffff x1 : ffff040004c32b61 x0 : fffffc1001ffffc0
[22656.021036] Call trace:
[22656.023471]  folio_remove_rmap_ptes+0x68/0x140
[22656.027900]  try_to_migrate_one+0x3b8/0xd60
[22656.032069]  rmap_walk_anon+0xdc/0x220
[22656.035805]  try_to_migrate+0x120/0x148
[22656.039626]  migrate_folio_unmap+0x394/0x438
[22656.043886]  migrate_pages_batch+0x18c/0xbf0
[22656.048142]  migrate_pages_sync+0x84/0x248
[22656.052225]  migrate_pages+0x6e8/0x870
[22656.055961]  compact_zone+0x3ac/0x738
[22656.059611]  kcompactd_do_work+0x174/0x498
[22656.063693]  kcompactd+0x26c/0x408
[22656.067081]  kthread+0xf8/0x110
[22656.070210]  ret_from_fork+0x10/0x20
[22656.073774] Code: 52800243 4b0603e2 aa1303e0 f9400e61 (b9405e75)
[22656.079857] SMP: stopping secondary CPUs
[22656.084361] Starting crashdump kernel...
[22656.093266] Bye!
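As a cross-check of the oops above, the ESR value and the faulting instruction word from the "Code:" line can be decoded by hand. The following is a minimal sketch, not part of the original report: the helper names are hypothetical, the ESR bit positions follow the Arm architecture layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0), and the instruction decoder only handles the 32-bit "LDR Wt, [Xn, #imm]" unsigned-offset encoding that this particular word happens to use.

```python
# Values copied from the oops in this report.
ESR = 0x96000006
FAULT_INSN = 0xB9405E75          # the word in parentheses on the "Code:" line
X19 = 0xFFFFFC1001FFFFC0         # x19 from the register dump
FAULT_ADDR = 0xFFFFFC100200001C  # address in the panic message


def decode_esr(esr):
    """Split an ESR_ELx value into the fields printed under 'Mem abort info'."""
    iss = esr & 0x1FFFFFF           # ISS is bits 24:0
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class, bits 31:26
        "il": (esr >> 25) & 0x1,    # 1 means a 32-bit instruction
        "iss": iss,
        "fsc": iss & 0x3F,          # fault status code, ISS bits 5:0
        "wnr": (iss >> 6) & 0x1,    # 0 means the access was a read
    }


def decode_ldr_w_uimm(insn):
    """Decode 'LDR Wt, [Xn, #imm]' (unsigned offset); reject anything else."""
    if insn & 0xFFC00000 != 0xB9400000:
        raise ValueError("not a 32-bit LDR with unsigned immediate offset")
    imm12 = (insn >> 10) & 0xFFF
    return {
        "rt": insn & 0x1F,          # destination register number
        "rn": (insn >> 5) & 0x1F,   # base register number
        "offset": imm12 * 4,        # imm12 is scaled by the 4-byte access size
    }


esr = decode_esr(ESR)
ldr = decode_ldr_w_uimm(FAULT_INSN)
print(f"EC={esr['ec']:#x} IL={esr['il']} FSC={esr['fsc']:#x} WnR={esr['wnr']}")
print(f"ldr w{ldr['rt']}, [x{ldr['rn']}, #{ldr['offset']:#x}]")
print(f"x19 + offset = {X19 + ldr['offset']:#x}")
```

Under these assumptions the faulting access is a 32-bit read at offset 0x5c from x19, and x19 plus that offset equals the address in the panic message. x19 holds the same value as x0 and x26, which suggests (but does not prove) that it is the folio pointer passed to folio_remove_rmap_ptes.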
Please provide the vmcore and vmlinux files.
(In reply to baolinwang from comment #1)
> Please provide the vmcore and vmlinux files.

Please message me privately for the machine details, the vmcore, and the vmlinux.
On a Yitian machine with the 6.6.71-3_rc1.al8.aarch64 kernel installed and mTHP enabled, a similar crash occurred after running the stress-ng tests serially for about 3 hours.

#uname -r
6.6.71-3_rc1.al8.aarch64

#cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-6.6.71-3_rc1.al8.aarch64 root=UUID=3de20859-85e2-4621-bd1a-173af0f6467f ro biosdevname=0 rd.driver.pre=ahci console=ttyS0,115200 fsck.repair=yes iommu.passthrough=1 iommu.strict=0 ssbd=force-off systemd.unified_cgroup_hierarchy=0 cgroup.memory=nokmem crashkernel=1024M nospectre_bhb no_hash_pointers transparent_hugepage_tmpfs=always thp_shmem=64K:always thp_anon=64K:always thp_file=2M:always+exec

#crash /usr/lib/debug/usr/lib/modules/6.6.71-3_rc1.al8.aarch64/vmlinux vmcore
      KERNEL: /usr/lib/debug/usr/lib/modules/6.6.71-3_rc1.al8.aarch64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 124
        DATE: Wed Mar 5 17:29:08 CST 2025
      UPTIME: 03:39:47
LOAD AVERAGE: 6.97, 6.11, 4.54
       TASKS: 1412
    NODENAME: v43c07452.sqa.na131
     RELEASE: 6.6.71-3_rc1.al8.aarch64
     VERSION: #1 SMP PREEMPT_DYNAMIC Fri Feb 21 11:47:20 CST 2025
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 256 GB
       PANIC: "Unable to handle kernel paging request at virtual address fffffc100200001c"
         PID: 687862
     COMMAND: "stress-ng-dev-s"
        TASK: ffff040beab19400  [THREAD_INFO: ffff040beab19400]
         CPU: 80
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 687862  TASK: ffff040beab19400  CPU: 80  COMMAND: "stress-ng-dev-s"
 #0 [ffff8000a73ab380] machine_kexec at ffff800080033de8
 #1 [ffff8000a73ab3b0] __crash_kexec at ffff80008015ac8c
 #2 [ffff8000a73ab530] crash_kexec at ffff80008015be64
 #3 [ffff8000a73ab550] die at ffff8000800201ec
 #4 [ffff8000a73ab600] die_kernel_fault at ffff800080037108
 #5 [ffff8000a73ab640] __do_kernel_fault at ffff8000800376d0
 #6 [ffff8000a73ab670] do_bad_area at ffff80008003775c
 #7 [ffff8000a73ab690] do_translation_fault at ffff800080d41f78
 #8 [ffff8000a73ab6a0] do_mem_abort at ffff800080037214
 #9 [ffff8000a73ab6d0] el1_abort at ffff800080d2e72c
#10 [ffff8000a73ab700] el1h_64_sync_handler at ffff800080d30aa4
#11 [ffff8000a73ab840] el1h_64_sync at ffff800080011304
#12 [ffff8000a73ab860] folio_remove_rmap_ptes at ffff80008032b9f4
#13 [ffff8000a73ab890] zap_present_ptes at ffff80008030e8d4
#14 [ffff8000a73ab910] zap_pte_range at ffff80008030efd8
#15 [ffff8000a73ab9c0] zap_pmd_range at ffff80008031580c
#16 [ffff8000a73aba20] unmap_page_range at ffff800080315a00
#17 [ffff8000a73aba90] unmap_single_vma.constprop.0 at ffff800080315ba8
#18 [ffff8000a73abad0] unmap_vmas at ffff800080317358
#19 [ffff8000a73abb60] unmap_region.constprop.0 at ffff80008031b85c
#20 [ffff8000a73abc60] do_vmi_align_munmap at ffff80008031e5f0
#21 [ffff8000a73abd40] do_vmi_munmap at ffff80008031e884
#22 [ffff8000a73abd80] __vm_munmap at ffff80008031e998
#23 [ffff8000a73abe30] __arm64_sys_munmap at ffff80008031eac0
#24 [ffff8000a73abe40] el0_svc_common.constprop.0 at ffff80008002881c
#25 [ffff8000a73abe70] do_el0_svc at ffff800080028914
#26 [ffff8000a73abe80] el0_svc at ffff800080d303f8
#27 [ffff8000a73abea0] el0t_64_sync_handler at ffff800080d30c4c
#28 [ffff8000a73abfe0] el0t_64_sync at ffff800080011608
     PC: 0000ffffbdb78a0c   LR: 0000000000461200   SP: 0000ffffc8f95d90
    X29: 0000ffffc8f95d90  X28: 00000000805da15b  X27: 0000ffffb5801000
    X26: 0000000000001000  X25: 0000000000000001  X24: 0000000000000005
    X23: 0000000000642760  X22: 0000ffefb6800000  X21: 0000000fff001000
    X20: 0000ffffbd94a928  X19: 0000000000000000  X18: 0000000000000000
    X17: 0000000000640df8  X16: 0000ffffbdb78a00  X15: 0000ffffc8f95bdf
    X14: 0000ffffbdc7b220  X13: 0000000000000000  X12: 0000ffffbdb57320
    X11: 0000009a77f58622  X10: 0000000000000000   X9: 0000ffffc8f95d90
     X8: 00000000000000d7   X7: 0000ffffc8f95d70   X6: 00000000fffffff8
     X5: 000000000063fd80   X4: 0000000035dda15b   X3: 0000000000000000
     X2: 0000000000000002   X1: 0000000fff001000   X0: 0000ffefb6800000
    ORIG_X0: 0000ffefb6800000  SYSCALLNO: d7  PSTATE: 80001000
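Both crashes report the same fault address, fffffc100200001c, with pmd=0000000000000000 in the page table walk of the first oops. A rough way to see why that address may be unmapped is to check where it falls relative to the struct page in x19 (value taken from the first report's register dump). This is only a sketch under two assumptions that the report does not confirm: sizeof(struct page) is 64 bytes, and the vmemmap is mapped in 2 MiB PMD-sized blocks. The constant and function names are hypothetical.

```python
STRUCT_PAGE_SIZE = 64        # assumption: sizeof(struct page) on this kernel
PMD_BLOCK_SIZE = 2 << 20     # assumption: vmemmap mapped in 2 MiB blocks

FOLIO = 0xFFFFFC1001FFFFC0       # x19 in the first oops
FAULT_ADDR = 0xFFFFFC100200001C  # fault address reported by both crashes


def pmd_block(addr):
    """Round an address down to the start of its 2 MiB block."""
    return addr & ~(PMD_BLOCK_SIZE - 1)


second_page = FOLIO + STRUCT_PAGE_SIZE    # struct page that follows x19
offset_in_second = FAULT_ADDR - second_page

print(f"second struct page    : {second_page:#x}")
print(f"offset inside it      : {offset_in_second:#x}")
print(f"block of first page   : {pmd_block(FOLIO):#x}")
print(f"block of fault address: {pmd_block(FAULT_ADDR):#x}")
print("crosses a 2 MiB block :", pmd_block(FOLIO) != pmd_block(FAULT_ADDR))
```

If the assumptions hold, x19 is the last struct page in one 2 MiB vmemmap block and the faulting read targets offset 0x1c of the next struct page, which starts a different block. That is consistent with the level 2 translation fault and the empty pmd entry, and with the code reading large-folio metadata from the page after the head page. Whether the folio really is large at that point is not something the logs here can establish.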
https://gitee.com/anolis/cloud-kernel/pulls/4773
Fixed in rc2.