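Bug 19415 - [ANCK-6.6-3][aarch64][Yitian] Environment fails to boot normally after installing the 64k kernel; Call Trace on the serial console, folio_remove_rmap_ptes+0x68/0x140
Summary: [ANCK-6.6-3][aarch64][Yitian] Environment fails to boot normally after installing the 64k kernel; Call Trace on the serial console, folio_remove_rmap_...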
Status: RESOLVED FIXED
Alias: None
Product: Antest
Classification: Infrastructures
Component: Test cases
Version: unspecified
Hardware: All Linux
Importance: P2-High S2-major
Target Milestone: ---
Assignee: shuancue
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-03-10 11:55 UTC by Janos
Modified: 2025-03-17 20:55 UTC (History)
CC List: 4 users

See Also:


Attachments

Description Janos alibaba_cloud_group 2025-03-10 11:55:36 UTC
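[Defect description]:
After installing the 64k kernel on a Yitian machine, the environment fails to boot normally. The serial console shows a Call Trace at folio_remove_rmap_ptes+0x68/0x140.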


[Machine information]:
Environment: physical machine
Machine model: Yitian


Original kernel version:
# uname -r
6.6.71-3_rc2.al8.aarch64

Memory information:
# free -h
              total        used        free      shared  buff/cache   available
Mem:          503Gi       3.0Gi       476Gi        12Mi        23Gi       497Gi
Swap:         2.0Gi          0B       2.0Gi

CPU information:
# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  1
Core(s) per socket:  128
Socket(s):           1
NUMA node(s):        2
Vendor ID:           ARM
BIOS Vendor ID:      T-HEAD
Model:               0
Model name:          Neoverse-N2
BIOS Model name:     Yitian710-128
Stepping:            r0p0
CPU MHz:             2750.001
BogoMIPS:            100.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            1024K
L3 cache:            65536K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh


CMDLINE:
# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-6.6.71-3_rc2.al8.aarch64 root=UUID=1e1d9fc1-be93-4b6b-bb50-9f86448f8a4d ro biosdevname=0 rd.driver.pre=ahci console=ttyS0,115200 fsck.repair=yes cgroup.memory=nokmem crashkernel=0M-2G:0M,2G-64G:256M,64G-:384M iommu.passthrough=1 iommu.strict=0 ssbd=force-off nospectre_bhb no_hash_pointers transparent_hugepage_tmpfs=always thp_shmem=64K:always thp_anon=64K:always thp_file=2M:always+exec

[Steps to reproduce]:
rpm -ivh --force http://koji.alibaba-inc.com/kojifiles/work/tasks/1956/731956/kernel-6.6.71-3.64k_rc1.al8.aarch64.rpm
rpm -ivh --force https://koji.alibaba-inc.com/kojifiles/work/tasks/1956/731956/kernel-devel-6.6.71-3.64k_rc1.al8.aarch64.rpm
rpm -ivh --force https://koji.alibaba-inc.com/kojifiles/work/tasks/1956/731956/kernel-headers-6.6.71-3.64k_rc1.al8.aarch64.rpm

reboot
[Expected result]:
The machine boots normally and the serial console shows no Call Trace.

[Actual result]:
The machine fails to boot normally and the serial console shows a Call Trace. The serial log is as follows:

[   17.450903] swapper pgtable: 64k pages, 48-bit VAs, pgdp=00000000f1bb0000
[   17.539629] [ffffffc10020001c] pgd=00000000f2960003, p4d=00000000f2960003, pud=00000000f2960003, pmd=10000447fff50003, pte=0000000000000000
[   17.539638] Internal error: Oops: 0000000096000007 [#1] SMP
[   17.557702] Modules linked in: libcrc32c(E) nfnetlink(E) ipmi_ssif(E) coresight_catu(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) mlx5_ib(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) ib_uverbs(E) sha1_ce(E) ib_core(E) sbsa_gwdt(E) acpi_ipmi(E) arm_spe_pmu(E) arm_smmuv3_pmu(E) ipmi_si(E) alibaba_uncore_drw_pmu(E) coresight_stm(E) stm_core(E) coresight_etm4x(E) coresight_funnel(E) coresight_tmc(E) coresight(E) vfat(E) fat(E) ast(E) i2c_algo_bit(E) drm_shmem_helper(E) mlx5_core(E) sha2_ce(E) drm_kms_helper(E) nvme(E) sha256_arm64(E) mlxfw(E) drm(E) pci_hyperv_intf(E) nvme_core(E) psample(E) sd_mod(E) t10_pi(E) sg(E) ahci(E) libahci(E) libata(E) ipmi_devintf(E) ipmi_msghandler(E)
         Starting Hardware RNG Entropy Gatherer Wake threshold service...
[   17.618942] CPU: 113 PID: 2133 Comm: (chronyd) Tainted: G            E      6.6.71-3.64k_rc1.al8.aarch64 #1
[   17.618945] Hardware name: AlibabaCloud AS1212MG1/AS03MB07, BIOS 1.2.M1.AL.P.160.01 12/21/2023
[   17.618947] pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
[   17.618950] pc : folio_remove_rmap_ptes+0x68/0x140
         Starting System Logger Daemon...
[   17.655500] lr : zap_present_ptes+0x210/0x618
[  OK  ] Reached target sshd-keygen.target.
[   17.668522] sp : ffff8000a798f720
[   17.668523] x29: ffff8000a798f720 x28: ffff8000a798f9e0 x27: 0000000000000015
[   17.668526] x26: ffff040016b09840 x25: 0000000000000001 x24: ffffffc1001fffc0
[   17.668528] x23: ffff8000a798f868 x22: 0000aaaabeee0000 x21: ffff04080fa2f770
[   17.668531] x20: ffff040016b09840 x19: ffffffc1001fffc0 x18: ffff8000a798fa78
         Starting IBM Power Raid dump daemon...
[   17.668534] x17: 00000000ffffffff x16: ffff800080ec65c8 x15: 0000ffffa2d2ffff
[   17.704568] x14: 0000000000000000 x13: 1fffe080030748c1 x12: ffff8000a798fa78
[   17.704570] x11: 0000000000000001 x10: ffff0400183a460c x9 : ffff800080304248
[   17.704572] x8 : 00000000007fffff x7 : ffffffc10020001c x6 : 0000000000000001
[  OK  ] Started Self Monitoring and Reporting Technology (SMART) Daemon.
[   17.704575] x5 : 00000000ffffffff x4 : 00000000ffffffff x3 : 0000000000000012
[   17.733056] x2 : 00000000ffffffff x1 : ffff04004ba88421 x0 : ffffffc1001fffc0
[FAILED] Failed to start Configure CPU power related settings.
[   17.733058] Call trace:
[   17.733060]  folio_remove_rmap_ptes+0x68/0x140
[   17.733063]  zap_present_ptes+0x210/0x618
[   17.733066]  zap_pte_range+0x2fc/0x670
[   17.776118]  zap_pmd_range+0xe8/0x1c8
[   17.776121]  unmap_page_range+0xd8/0x190
See 'systemctl status cpupower.service' for details.
[   17.788456]  unmap_single_vma.constprop.0+0x8c/0x108
[   17.788459]  unmap_vmas+0x84/0x3d8
[   17.788461]  exit_mmap+0xbc/0x3d0
[   17.788462]  __mmput+0x40/0x180
[   17.788466]  mmput+0x6c/0x80
[FAILED] Failed to start TCG Core Services Daemon.
[   17.788468]  exec_mmap+0x148/0x268
[   17.815128]  begin_new_exec+0x10c/0x370
[   17.815130]  load_elf_binary+0x304/0xbc8
[   17.822864]  search_binary_handler+0xd4/0x260
See 'systemctl status tcsd.service' for details.
[   17.827211]  exec_binprm+0x5c/0x1e0
[   17.827212]  bprm_execve.part.0+0x190/0x228
[   17.827214]  bprm_execve+0x60/0x98
[   17.827214]  do_execveat_common+0x184/0x220
[   17.827216]  __arm64_sys_execve+0x3c/0x58
[  OK  ] Started Restore /run/initramfs on shutdown.
[   17.827217]  do_el0_svc+0x70/0xf8
[   17.827222]  el0_svc+0x50/0x218
[   17.862812]  el0t_64_sync_handler+0xf8/0x128
[   17.862814]  el0t_64_sync+0x17c/0x180
[   17.862817] Code: 52800243 4b0603e2 aa1303e0 f9400e61 (b9405e75) 
[   17.862821] ---[ end trace 0000000000000000 ]---
[   17.862822] Kernel panic - not syncing: Oops: Fatal exception
[   17.862825] SMP: stopping secondary CPUs
[   17.903848] Kernel Offset: disabled
[   17.903851] CPU features: 0x2,00380001,e022cd43,1047fe0b
[   17.903852] Memory Limit: none
smc_fid: 84000009 INFO:    mpidr:181310000, stop s-wtd.
INFO:    PSCI Power Domain Map:
INFO:      Domain Node : Level 1, parent_node 4294967295, State ON (0x0)
INFO:      Domain Node : Level 1, parent_node 4294967295, State ON (0x0)
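
As a cross-check on the Oops in the serial log, the faulting instruction word on the `Code:` line can be decoded and compared with the reported fault address. The following is only a minimal sketch, assuming that the parenthesised word b9405e75 is an A64 load/store (immediate, unsigned offset) encoding and that x19 still holds the value printed in the register dump. The helper name `decode_ldr_uimm` is hypothetical and not part of any tool used in this report.

```python
PAGE_SIZE = 64 * 1024  # "64k pages" per the swapper pgtable line


def decode_ldr_uimm(word):
    """Decode an A64 integer load/store (immediate, unsigned offset) word.

    Returns (access_bytes, is_load, rn, rt, byte_offset), or None when the
    word does not belong to that encoding class.
    """
    if (word >> 24) & 0x3F != 0b111001:  # bits 29:24, includes V == 0
        return None
    size = (word >> 30) & 0x3            # log2 of the access size in bytes
    opc = (word >> 22) & 0x3             # 0b01 is a plain load (LDR)
    imm12 = (word >> 10) & 0xFFF         # offset, scaled by the access size
    rn = (word >> 5) & 0x1F              # base register number
    rt = word & 0x1F                     # destination register number
    return (1 << size, opc == 0b01, rn, rt, imm12 << size)


# Instruction word taken from the "Code:" line of the Oops.
access, is_load, rn, rt, offset = decode_ldr_uimm(0xB9405E75)

# Assumption: x19 is unchanged since the fault (value from the register dump).
x19 = 0xFFFFFFC1001FFFC0
fault = x19 + offset
crosses_page = (x19 // PAGE_SIZE) != (fault // PAGE_SIZE)

# "w" is used because the decoded access size is 4 bytes.
print(f"ldr w{rt}, [x{rn}, #{offset:#x}] -> {fault:#018x}, "
      f"crosses 64K page: {crosses_page}")
```

Under these assumptions the computed address equals ffffffc10020001c, the address shown in the page-table walk line, and the access lands in the 64 KiB page after the one holding x19, for which the walk reports pte=0000000000000000.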
Comment 1 Janos alibaba_cloud_group 2025-03-10 13:16:41 UTC
The machine got stuck in a reboot loop, but later booted successfully and the system became accessible.
Comment 2 chenzhuo alibaba_cloud_group 2025-03-10 14:29:19 UTC
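Switched to another Yitian machine; it crashed once immediately after the kernel was installed. The symptom is similar. The vmcore analysis is as follows: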
WARNING: active task ffff040043e28000 on cpu 112 not found in PID hash

      KERNEL: /usr/lib/debug/usr/lib/modules/6.6.71-3.64k_rc1.al8.aarch64/vmlinux  [TAINTED]
    DUMPFILE: /var/crash/127.0.0.1-2025-03-10-13:55:53/vmcore  [PARTIAL DUMP]
        CPUS: 124
        DATE: Mon Mar 10 13:54:44 CST 2025
      UPTIME: 00:00:35
LOAD AVERAGE: 2.91, 0.77, 0.26
       TASKS: 1323
    NODENAME: v43c07454.sqa.na131
     RELEASE: 6.6.71-3.64k_rc1.al8.aarch64
     VERSION: #1 SMP PREEMPT_DYNAMIC Fri Feb 28 10:42:23 CST 2025
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 256 GB
       PANIC: "Unable to handle kernel paging request at virtual address ffffffc10020001c"
         PID: 11601
     COMMAND: "user_account.sh"
        TASK: ffff040043e28000  [THREAD_INFO: ffff040043e28000]
         CPU: 112
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 11601    TASK: ffff040043e28000  CPU: 112  COMMAND: "user_account.sh"
 #0 [ffff8000dd66f240] machine_kexec at ffff80008002ffa0
 #1 [ffff8000dd66f270] __crash_kexec at ffff800080152adc
 #2 [ffff8000dd66f3f0] crash_kexec at ffff800080153cb4
 #3 [ffff8000dd66f410] die at ffff80008001f07c
 #4 [ffff8000dd66f4c0] die_kernel_fault at ffff800080033420
 #5 [ffff8000dd66f500] __do_kernel_fault at ffff8000800336e4
 #6 [ffff8000dd66f530] do_bad_area at ffff80008003377c
 #7 [ffff8000dd66f550] do_translation_fault at ffff800080d27cbc
 #8 [ffff8000dd66f560] do_mem_abort at ffff80008003352c
 #9 [ffff8000dd66f590] el1_abort at ffff800080d1487c
#10 [ffff8000dd66f5c0] el1h_64_sync_handler at ffff800080d1696c
#11 [ffff8000dd66f700] el1h_64_sync at ffff800080011304
#12 [ffff8000dd66f720] folio_remove_rmap_ptes at ffff80008032077c
#13 [ffff8000dd66f750] zap_present_ptes at ffff800080304244
#14 [ffff8000dd66f7d0] zap_pte_range at ffff800080304948
#15 [ffff8000dd66f880] zap_pmd_range at ffff80008030ab94
#16 [ffff8000dd66f8f0] unmap_page_range at ffff80008030ad4c
#17 [ffff8000dd66f950] unmap_single_vma.constprop.0 at ffff80008030ae90
#18 [ffff8000dd66f990] unmap_vmas at ffff80008030c5f0
#19 [ffff8000dd66fa20] exit_mmap at ffff800080313db0
#20 [ffff8000dd66fb40] __mmput at ffff80008004b05c
#21 [ffff8000dd66fb70] mmput at ffff80008004b208
#22 [ffff8000dd66fb90] exec_mmap at ffff8000803d15cc
#23 [ffff8000dd66fbd0] begin_new_exec at ffff8000803d3000
#24 [ffff8000dd66fc00] load_elf_binary at ffff800080449440
#25 [ffff8000dd66fce0] search_binary_handler at ffff8000803d0488
#26 [ffff8000dd66fd30] exec_binprm at ffff8000803d0b78
#27 [ffff8000dd66fd70] bprm_execve at ffff8000803d103c
#28 [ffff8000dd66fdb0] bprm_execve at ffff8000803d1134
#29 [ffff8000dd66fdf0] do_execveat_common at ffff8000803d2600
#30 [ffff8000dd66fe40] __arm64_sys_execve at ffff8000803d26d8
#31 [ffff8000dd66fe60] do_el0_svc at ffff800080026e5c
#32 [ffff8000dd66fe80] el0_svc at ffff800080d1660c
#33 [ffff8000dd66fea0] el0t_64_sync_handler at ffff800080d16b14
#34 [ffff8000dd66ffe0] el0t_64_sync at ffff800080011608
     PC: 0000ffffb89c8e4c   LR: 0000aaaacf176dec   SP: 0000ffffdf132cf0
    X29: 0000ffffdf132cf0  X28: 0000aaab00765410  X27: 0000000000000000
    X26: 00000000ffffffff  X25: 0000aaab00751c30  X24: 0000aaab007b2a70
    X23: 0000aaaacf26f944  X22: 0000000000000000  X21: 0000aaab007af6f0
    X20: 0000aaaacf25f000  X19: 0000aaab007af790  X18: 0000aaab00740018
    X17: 0000ffffb89c8e40  X16: 0000aaaacf25ed60  X15: 0000000000000030
    X14: 0000000000000002  X13: 0000000000000001  X12: 0000000000000000
    X11: 0000000000000000  X10: 0000000000000000   X9: 0000aaab007b2a60
     X8: 00000000000000dd   X7: 0000000000002791   X6: 0000000000000031
     X5: 0000aaab007b2a90   X4: 0000000000000000   X3: 0000ffffb8baf5d8
     X2: 0000aaab00751c30   X1: 0000aaab007b2a70   X0: 0000aaab007af790
    ORIG_X0: 0000aaab007af790  SYSCALLNO: dd  PSTATE: 60001000
Comment 3 chenzhuo alibaba_cloud_group 2025-03-10 15:54:21 UTC
This issue is resolved in the rc2 kernel.