Created attachment 143
vmcore-dmesg

Description of problem:
On the 5.10 arm kernel, ftrace_stress_test.sh from the LTP test suite causes a crash: Unable to handle kernel paging request at virtual address 60ffff0040200210
Ten re-runs did not reproduce it. It has not been seen on x86 so far.

Version-Release number of selected component (if applicable):
# uname -a
Linux l57h15219.sqa.nu8 5.10.84-148.git.088d92b1e.an8.aarch64 #1 SMP Wed Jan 26 13:17:23 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
# cat /etc/os-release
NAME="Anolis OS"
VERSION="8.2"
ID="anolis"
ID_LIKE="rhel fedora centos"
VERSION_ID="8.2"
PLATFORM_ID="platform:an8"
PRETTY_NAME="Anolis OS 8.2"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.org/"

How reproducible:

Steps to Reproduce:
1. git clone https://github.com/linux-test-project/ltp.git
   cd ltp
   make autotools
   ./configure
   make
   make install
2. ./runltp -f tracing -s ftrace-stress-test

Actual results:
On arm the vmcore cannot be parsed because of https://bugs.openanolis.cn/view.php?id=613. Part of the log follows; the full vmcore-dmesg file is attached.
[51291.525182] LTP: starting ftrace-stress-test (ftrace_stress_test.sh 90)
[51304.145870] Unable to handle kernel paging request at virtual address 60ffff0040200210
[51304.154749] Mem abort info:
[51304.158170]   ESR = 0x96000004
[51304.161821]   EC = 0x25: DABT (current EL), IL = 32 bits
[51304.167758]   SET = 0, FnV = 0
[51304.171434]   EA = 0, S1PTW = 0
[51304.175184] Data abort info:
[51304.178668]   ISV = 0, ISS = 0x00000004
[51304.183104]   CM = 0, WnR = 0
[51304.186671] [60ffff0040200210] address between user and kernel address ranges
[51304.194414] Internal error: Oops: 96000004 [#1] SMP
[51304.199909] Modules linked in: nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv6(E) ip6_tables(E) ip_tables(E) nf_log_ipv4(E) nf_log_common(E) nft_log(E) nft_reject_ipv4(E) nf_reject_ipv4(E) nft_reject(E) nft_limit(E) xt_limit(E) xt_multiport(E) xt_LOG(E) tcp_diag(E) nfsv3(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) nfs_ssc(E) fscache(E) nft_ct(E) sit(E) tcp_dctcp(E) ipvlan(E) ah4(E) macvtap(E) tap(E) macvlan(E)
sch_sfq(E) sch_sfb(E) sch_pie(E) sch_hhf(E) sch_hfsc(E) sch_codel(E) tcp_bbr(E) sctp(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) isofs(E) cdrom(E) tun(E) fuse(E) n_gsm(E) pps_ldisc(E) ppp_synctty(E) ppp_async(E) ppp_generic(E) slcan(E) slip(E) slhc(E) n_hdlc(E) pcrypt(E) crypto_user(E) vmac(E) salsa20_generic(E) sha3_generic(E) msdos(E) xfs(E) sch_red(E) sch_prio(E) act_vlan(E) act_skbmod(E) act_csum(E) act_gact(E) act_pedit(E) nfnetlink_queue(E) mptcp_diag(E) inet_diag(E) xfrm_interface(E) xfrm6_tunnel(E) tunnel4(E)
[51304.200093] des_generic(E) libdes(E) ifb(E) sch_netem(E) cls_matchall(E) sch_ingress(E) mpls_iptunnel(E) mpls_router(E) sch_fq(E) geneve(E) act_mirred(E) cls_basic(E) esp6(E) authenc(E) echainiv(E) nft_counter(E) xt_policy(E) esp4_offload(E) seqiv(E) esp4(E) macsec(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) vrf(E) 8021q(E) garp(E) mrp(E) bridge(E) stp(E) llc(E) ip6_gre(E) ip6_tunnel(E) tunnel6(E) ip_gre(E) ip_tunnel(E) gre(E) cls_u32(E) sch_htb(E) dummy(E) ccm(E) poly1305_generic(E) libpoly1305(E) chacha_generic(E) chacha_neon(E) libchacha(E) chacha20poly1305(E) tls(E) raid0(E) dm_mod(E) veth(E) overlay(E) squashfs(E) loop(E) binfmt_misc(E) bonding(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) rfkill(E) nft_compat(E) ip_set(E) nf_tables(E) libcrc32c(E) nfnetlink(E) vfat(E) fat(E) rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E) target_core_mod(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_umad(E) libiscsi(E) ib_ipoib(E)
[51304.293014] scsi_transport_iscsi(E) ib_cm(E) ipmi_ssif(E) crct10dif_ce(E) ghash_ce(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) mlx5_ib(E) ib_uverbs(E) ipmi_si(E) ib_core(E) ipmi_devintf(E) ipmi_msghandler(E) spi_dw_mmio(E) spi_dw(E) ext4(E) mbcache(E) jbd2(E) hibmc_drm(E) drm_vram_helper(E) drm_kms_helper(E) syscopyarea(E) realtek(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) cec(E) sha2_ce(E) drm_ttm_helper(E) hisi_sas_v3_hw(E) hns3(E) nvme(E) hisi_sas_main(E) ttm(E) sha256_arm64(E) i2c_designware_platform(E) nvme_core(E) mlx5_core(E) drm(E) nfit(E) libsas(E) i2c_designware_core(E) hclge(E) gpio_dwapb(E) sg(E) hnae3(E) scsi_transport_sas(E) mlxfw(E) i2c_core(E) gpio_generic(E) libnvdimm(E) [last unloaded: nf_reject_ipv4]
[51304.455211] CPU: 88 PID: 3009112 Comm: hwlatd Kdump: loaded Tainted: G OE 5.10.84-147.git.1378144e5.an8.aarch64 #1
[51304.468859] Hardware name: H3C R4960 G3/BC82AMDDA, BIOS 1.70 01/07/2021
[51304.476573] pstate: 00c00089 (nzcv daIf +PAN +UAO -TCO BTYPE=--)
[51304.483673] pc : ring_buffer_lock_reserve+0x28/0x450
[51304.489736] lr : trace_buffer_lock_reserve+0x24/0x58
[51304.495798] sp : ffff80004f4d3ce0
[51304.500214] x29: ffff80004f4d3ce0 x28: ffff800011f02660
[51304.506627] x27: 000000000007a121 x26: 00002ea92d63aef2
[51304.513032] x25: 0000000000000001 x24: 0000000000000000
[51304.519442] x23: 20c49ba5e353f7cf x22: 0000000000000000
[51304.525848] x21: 0000000000000080 x20: ffff80001108fcd8
[51304.532247] x19: 60ffff0040200208 x18: 0000000000000000
[51304.538653] x17: 0000000000000000 x16: 0000000000000000
[51304.545058] x15: 0000aaab068aabd0 x14: 0000000000000000
[51304.551459] x13: 0000000000000000 x12: ffff80005709bd80
[51304.557855] x11: ffff80005709bcf5 x10: ffff80004f4d3d60
[51304.564245] x9 : ffff800010a58a30 x8 : ffff00402b49c460
[51304.570636] x7 : ffff8000118bcfe8 x6 : 00000469790c1416
[51304.577027] x5 : 0000000000000040 x4 : 0000000000000000
[51304.583397] x3 : 0000000000000080 x2 : 0000000000000040
[51304.589793] x1 : 0000000000000000 x0 : 60ffff0040200208
[51304.596194] Call trace:
[51304.599739]  ring_buffer_lock_reserve+0x28/0x450
[51304.605436]  trace_buffer_lock_reserve+0x24/0x58
[51304.611135]  kthread_fn+0x20c/0x440
[51304.615685]  kthread+0x114/0x118
[51304.619960] Code: f9400281 f9003fe1 d2800001 a9025bf5 (b9400815)
[51304.627046] ---[ end trace f1a79247722caef4 ]---
[51304.632660] Kernel panic - not syncing: Oops: Fatal exception
[51304.639397] SMP: stopping secondary CPUs
[51305.073458] Kernel Offset: 0x1c0000 from 0xffff800010000000
[51305.079999] PHYS_OFFSET: 0x0
[51305.083819] CPU features: 0x8000002,22a08a38
[51305.088995] Memory Limit: none
[51305.095214] Starting crashdump kernel...
[51305.100032] Bye!

Expected results:

Additional info:
# free -mh
              total        used        free      shared  buff/cache   available
Mem:          753Gi       4.6Gi       746Gi        17Mi       1.9Gi       745Gi
Swap:         2.0Gi       153Mi       1.8Gi
# lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0,1,7-9,65-95
Off-line CPU(s) list:  2-6,10-64
Thread(s) per core:    1
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          1
Vendor ID:             HiSilicon
BIOS Vendor ID:        HiSilicon
Model:                 0
Model name:            Kunpeng-920
BIOS Model name:       HUAWEI Kunpeng 920 5250
Stepping:              0x1
CPU max MHz:           2600.0000
CPU min MHz:           200.0000
BogoMIPS:              200.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              24576K
NUMA node0 CPU(s):     0,1,7-9,65-95
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
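
Note on the "Mem abort info" lines in the oops: the ESR value can be decoded by hand as a cross-check of what the kernel printed. The sketch below is only an illustration. The bit positions follow the Armv8-A ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for data aborts WnR in ISS bit 6 and DFSC in ISS bits 5:0). The helper names decode_esr and is_canonical are made up for this note, and the 48-bit virtual address width is an assumption inferred from the ffff8000... kernel addresses in the log, not something the report states.

```python
# Decode the ESR value from the oops (0x96000004) and check whether the
# faulting address is a canonical virtual address.

def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into the fields the oops prints."""
    iss = esr & 0x1FFFFFF          # ISS: bits [24:0]
    return {
        "EC": (esr >> 26) & 0x3F,  # exception class: bits [31:26]
        "IL": (esr >> 25) & 0x1,   # 1 means a 32-bit instruction
        "ISS": iss,
        "WnR": (iss >> 6) & 0x1,   # 0 means the access was a read
        "DFSC": iss & 0x3F,        # data fault status code
    }

def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if the bits above va_bits are all zeros or all ones.

    ASSUMPTION: va_bits=48 is a guess based on the kernel addresses in
    the log; the report does not state the configured VA size.
    """
    top = addr >> va_bits
    return top == 0 or top == (1 << (64 - va_bits)) - 1

fields = decode_esr(0x96000004)
print({name: hex(value) for name, value in fields.items()})
# EC 0x25 is a data abort taken from the current EL; DFSC 0x4 is a
# level 0 translation fault, matching do_translation_fault in the backtrace.

print(is_canonical(0x60FFFF0040200210))  # False: top 16 bits are 0x60ff
print(is_canonical(0xFFFF800011C5BC60))  # True: an ordinary kernel address
```

The non-canonical result corresponds to the "address between user and kernel address ranges" line: the faulting value is not a valid pointer in either half of the address space.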
The crash problem on ARM is caused by the crash version being too old; a newer version of crash can parse the vmcore.
Please upload the vmcore.
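(In reply to Shiloong from comment #1)
> The crash problem on ARM is caused by the crash version being too old; a
> newer version of crash can parse the vmcore.
> Please upload the vmcore.

The file is too large to upload.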
The vmcore analysis output is as follows:
# crash /usr/lib/debug/usr/lib/modules/5.10.84-148.git.088d92b1e.an8.aarch64/vmlinux vmcore

crash 7.3.1-1.el8
Copyright (C) 2002-2021 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2021 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-unknown-linux-gnu"...
WARNING: kernel relocated [1MB]: patching 106835 gdb minimal_symbol values
WARNING: kernel version inconsistency between vmlinux and dumpfile
WARNING: cpu 32: cannot find NT_PRSTATUS note
WARNING: cpu 33: cannot find NT_PRSTATUS note
WARNING: cpu 34: cannot find NT_PRSTATUS note
WARNING: cpu 35: cannot find NT_PRSTATUS note
WARNING: cpu 36: cannot find NT_PRSTATUS note
WARNING: cpu 37: cannot find NT_PRSTATUS note
WARNING: cpu 38: cannot find NT_PRSTATUS note
WARNING: cpu 39: cannot find NT_PRSTATUS note
WARNING: cpu 40: cannot find NT_PRSTATUS note
WARNING: cpu 41: cannot find NT_PRSTATUS note
WARNING: cpu 42: cannot find NT_PRSTATUS note
WARNING: cpu 43: cannot find NT_PRSTATUS note
WARNING: cpu 44: cannot find NT_PRSTATUS note
WARNING: cpu 45: cannot find NT_PRSTATUS note
WARNING: cpu 46: cannot find NT_PRSTATUS note
WARNING: cpu 47: cannot find NT_PRSTATUS note
WARNING: cpu 48: cannot find NT_PRSTATUS note
WARNING: cpu 49: cannot find NT_PRSTATUS note
WARNING: cpu 50: cannot find NT_PRSTATUS note
WARNING: cpu 51: cannot find NT_PRSTATUS note
WARNING: cpu 52: cannot find NT_PRSTATUS note
WARNING: cpu 53: cannot find NT_PRSTATUS note
WARNING: cpu 54: cannot find NT_PRSTATUS note
WARNING: cpu 55: cannot find NT_PRSTATUS note
WARNING: cpu 56: cannot find NT_PRSTATUS note
WARNING: cpu 57: cannot find NT_PRSTATUS note
WARNING: cpu 58: cannot find NT_PRSTATUS note
WARNING: cpu 59: cannot find NT_PRSTATUS note
WARNING: cpu 60: cannot find NT_PRSTATUS note
WARNING: cpu 61: cannot find NT_PRSTATUS note
WARNING: cpu 62: cannot find NT_PRSTATUS note
WARNING: cpu 63: cannot find NT_PRSTATUS note
WARNING: cpu 64: cannot find NT_PRSTATUS note
WARNING: cpu 65: cannot find NT_PRSTATUS note
WARNING: cpu 66: cannot find NT_PRSTATUS note
WARNING: cpu 67: cannot find NT_PRSTATUS note
WARNING: cpu 68: cannot find NT_PRSTATUS note

      KERNEL: /usr/lib/debug/usr/lib/modules/5.10.84-148.git.088d92b1e.an8.aarch64/vmlinux [TAINTED]
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 96 [OFFLINE: 64]
        DATE: Thu Jan 27 00:07:55 CST 2022
      UPTIME: 14:15:04
LOAD AVERAGE: 3.31, 5.40, 6.08
       TASKS: 1113
    NODENAME: l57h15219.sqa.nu8
     RELEASE: 5.10.84-147.git.1378144e5.an8.aarch64
     VERSION: #1 SMP Tue Jan 25 13:14:55 UTC 2022
     MACHINE: aarch64 (unknown Mhz)
      MEMORY: 768 GB
       PANIC: "Unable to handle kernel paging request at virtual address 60ffff0040200210"
         PID: 3009112
     COMMAND: "hwlatd"
        TASK: ffff00402b49b600 [THREAD_INFO: ffff00402b49b600]
         CPU: 88
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 3009112 TASK: ffff00402b49b600 CPU: 88 COMMAND: "hwlatd"
 #0 [ffff80004f4d3710] machine_kexec at ffff8000101f0cd0
 #1 [ffff80004f4d3760] __crash_kexec at ffff800010339c64
 #2 [ffff80004f4d3900] panic at ffff800010cae9c8
 #3 [ffff80004f4d39e0] die at ffff8000101dd3d8
 #4 [ffff80004f4d3a90] die_kernel_fault at ffff8000101fc6fc
 #5 [ffff80004f4d3ac0] __do_kernel_fault at ffff8000101fc7cc
 #6 [ffff80004f4d3af0] do_bad_area at ffff8000101fc8d0
 #7 [ffff80004f4d3b10] do_translation_fault at ffff800010ccb3cc
 #8 [ffff80004f4d3b20] do_mem_abort at ffff8000101fc638
 #9 [ffff80004f4d3b50] el1_abort at ffff800010cbb8c0
#10 [ffff80004f4d3b80] el1_sync_handler at ffff800010cbbcb8
#11 [ffff80004f4d3cc0] el1_sync at ffff8000101d1a00
#12 [ffff80004f4d3ce0] ring_buffer_lock_reserve at ffff80001038d1ac
#13 [ffff80004f4d3d60] trace_buffer_lock_reserve at ffff800010398608
#14 [ffff80004f4d3d90] kthread_fn at ffff8000103a5d00
#15 [ffff80004f4d3e50] kthread at ffff80001027f928
*** Bug 427 has been marked as a duplicate of this bug. ***
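Debugging suggests that this may be a hardware problem rather than a kernel problem.

crash> p &global_trace
$2 = (struct trace_array *) 0xffff800011c5bc60
crash> p hwlat_trace
hwlat_trace = $3 = (struct trace_array *) 0xffff800011c5bc61

Normally the two pointer values should be equal, but in this vmcore the lowest bit of the hwlat_trace pointer has flipped to 1, so accesses through the wrong pointer address return wrong data.
This is not an out-of-bounds overwrite and does not look like a kernel software problem either; it looks more like a hardware problem.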
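
Note: the single-bit difference and its effect on the faulting address can be checked with a little arithmetic. The sketch below is a consistency check only, not a root-cause analysis. The two trace_array pointer values and the x0 value come from the crash output and register dump above. The "real" buffer pointer and the byte that follows it in memory are hypothetical values chosen to show that a load starting one byte late would produce a value of the shape seen in x0; the report does not give the actual memory contents or the struct layout.

```python
# Check 1: how many bits differ between the two pointers from crash?
# Check 2: what does an 8-byte little-endian load return if it starts one
#          byte late (because the base pointer is off by one)?

def flipped_bits(a: int, b: int) -> list:
    """Return the bit positions in which two 64-bit values differ."""
    diff = a ^ b
    return [bit for bit in range(64) if (diff >> bit) & 1]

def shifted_read(pointer: int, next_byte: int) -> int:
    """Value of an 8-byte little-endian load starting one byte into
    `pointer`, where `next_byte` is the byte stored right after it."""
    raw = pointer.to_bytes(8, "little") + bytes([next_byte])
    return int.from_bytes(raw[1:9], "little")

global_trace = 0xFFFF800011C5BC60   # from: p &global_trace
hwlat_trace = 0xFFFF800011C5BC61    # from: p hwlat_trace
print(flipped_bits(global_trace, hwlat_trace))   # [0]

# HYPOTHETICAL values: the low byte of the real pointer is unknown (0x00 is
# a placeholder) and 0x60 is simply the byte needed to match the dump.
assumed_buffer = 0xFFFF004020020800
assumed_next_byte = 0x60

x0 = shifted_read(assumed_buffer, assumed_next_byte)
print(hex(x0))       # 0x60ffff0040200208, the x0/x19 value in the oops
print(hex(x0 + 8))   # 0x60ffff0040200210, the faulting address
```

Under these assumptions, the non-canonical value in x0 looks like an ordinary ffff0040... kernel pointer shifted down by one byte, which is what reading a pointer field through a base address that is off by one would give on a little-endian machine. It shows that the bad pointer explains the oops, but says nothing about what changed the pointer.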
Not a software problem.
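Trying to reproduce on a different machine; not reproduced so far.
1. The problem has been reproduced again on the same machine. The vmcore shows that it happened on a different CPU, and p &global_trace and p hwlat_trace give the same value. If the same problem occurs on different CPUs, can it really be a hardware problem?
2. Bug 427, which was marked as a duplicate of this bug, reproduces every time on other machines as well.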
Created attachment 155
cpu94 vmcore
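(In reply to kangjiangbo from comment #7)
> Trying to reproduce on a different machine; not reproduced so far.
> 1. The problem has been reproduced again on the same machine. The vmcore
> shows that it happened on a different CPU, and p &global_trace and p
> hwlat_trace give the same value. If the same problem occurs on different
> CPUs, can it really be a hardware problem?
>
> 2. Bug 427, which was marked as a duplicate of this bug, reproduces every
> time on other machines as well.

The other machine has not reproduced this problem so far.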
The ftrace_stress_test.sh failure is a known issue. Closing.