Bug 408 - [aarch64] LTP test suite ftrace_stress_test.sh causes crash: Unable to handle kernel paging request at virtual address 60ffff0040200210
Summary: [aarch64] LTP test suite ftrace_stress_test.sh causes crash: Unable to handle kernel paging re...
Status: CONFIRMED
Alias: None
Product: ANCK 5.10 Dev
Classification: ANCK
Component: general/others
Version: unspecified
Hardware: aarch64 Linux
Importance: P2-High S2-major
Target Milestone: ---
Assignee: fghui_kernel
QA Contact: shuming
URL:
Whiteboard:
Keywords: Bugfix
Duplicates: 427
Depends on:
Blocks:
 
Reported: 2022-01-27 11:44 UTC by kangjiangbo
Modified: 2022-04-15 11:29 UTC (History)

See Also:


Attachments
vmcore-dmesg (1.06 MB, text/plain)
2022-01-27 11:44 UTC, kangjiangbo
cpu94 vmcore (126.74 KB, image/png)
2022-02-21 15:46 UTC, kangjiangbo

Description kangjiangbo 2022-01-27 11:44:20 UTC
Created attachment 143 [details]
vmcore-dmesg

Description of problem:
On the 5.10 ARM kernel, the LTP test suite script ftrace_stress_test.sh causes a crash: Unable to handle kernel paging request at virtual address 60ffff0040200210
Retested ten times without reproducing it; not yet observed on x86.

Version-Release number of selected component (if applicable):
# uname -a
Linux l57h15219.sqa.nu8 5.10.84-148.git.088d92b1e.an8.aarch64 #1 SMP Wed Jan 26 13:17:23 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
# cat /etc/os-release
NAME="Anolis OS"
VERSION="8.2"
ID="anolis"
ID_LIKE="rhel fedora centos"
VERSION_ID="8.2"
PLATFORM_ID="platform:an8"
PRETTY_NAME="Anolis OS 8.2"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.org/"

How reproducible:


Steps to Reproduce:
1. git clone https://github.com/linux-test-project/ltp.git
   cd ltp
   make autotools
   ./configure
   make
   make install
2. ./runltp -f tracing -s ftrace-stress-test

Actual results:
On ARM the vmcore cannot be parsed because of https://bugs.openanolis.cn/view.php?id=613. A partial log follows; see the attached vmcore-dmesg file for the full log.

[51291.525182] LTP: starting ftrace-stress-test (ftrace_stress_test.sh 90)
[51304.145870] Unable to handle kernel paging request at virtual address 60ffff0040200210
[51304.154749] Mem abort info:
[51304.158170]   ESR = 0x96000004
[51304.161821]   EC = 0x25: DABT (current EL), IL = 32 bits
[51304.167758]   SET = 0, FnV = 0
[51304.171434]   EA = 0, S1PTW = 0
[51304.175184] Data abort info:
[51304.178668]   ISV = 0, ISS = 0x00000004
[51304.183104]   CM = 0, WnR = 0
[51304.186671] [60ffff0040200210] address between user and kernel address ranges
[51304.194414] Internal error: Oops: 96000004 [#1] SMP
[51304.199909] Modules linked in: nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv6(E) ip6_tables(E) ip_tables(E) nf_log_ipv4(E) nf_log_common(E) nft_log(E) nft_reject_ipv4(E) nf_reject_ipv4(E) nft_reject(E) nft_limit(E) xt_limit(E) xt_multiport(E) xt_LOG(E) tcp_diag(E) nfsv3(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) nfs_ssc(E) fscache(E) nft_ct(E) sit(E) tcp_dctcp(E) ipvlan(E) ah4(E) macvtap(E) tap(E) macvlan(E) sch_sfq(E) sch_sfb(E) sch_pie(E) sch_hhf(E) sch_hfsc(E) sch_codel(E) tcp_bbr(E) sctp(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) isofs(E) cdrom(E) tun(E) fuse(E) n_gsm(E) pps_ldisc(E) ppp_synctty(E) ppp_async(E) ppp_generic(E) slcan(E) slip(E) slhc(E) n_hdlc(E) pcrypt(E) crypto_user(E) vmac(E) salsa20_generic(E) sha3_generic(E) msdos(E) xfs(E) sch_red(E) sch_prio(E) act_vlan(E) act_skbmod(E) act_csum(E) act_gact(E) act_pedit(E) nfnetlink_queue(E) mptcp_diag(E) inet_diag(E) xfrm_interface(E) xfrm6_tunnel(E) tunnel4(E)
[51304.200093]  des_generic(E) libdes(E) ifb(E) sch_netem(E) cls_matchall(E) sch_ingress(E) mpls_iptunnel(E) mpls_router(E) sch_fq(E) geneve(E) act_mirred(E) cls_basic(E) esp6(E) authenc(E) echainiv(E) nft_counter(E) xt_policy(E) esp4_offload(E) seqiv(E) esp4(E) macsec(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) vrf(E) 8021q(E) garp(E) mrp(E) bridge(E) stp(E) llc(E) ip6_gre(E) ip6_tunnel(E) tunnel6(E) ip_gre(E) ip_tunnel(E) gre(E) cls_u32(E) sch_htb(E) dummy(E) ccm(E) poly1305_generic(E) libpoly1305(E) chacha_generic(E) chacha_neon(E) libchacha(E) chacha20poly1305(E) tls(E) raid0(E) dm_mod(E) veth(E) overlay(E) squashfs(E) loop(E) binfmt_misc(E) bonding(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) rfkill(E) nft_compat(E) ip_set(E) nf_tables(E) libcrc32c(E) nfnetlink(E) vfat(E) fat(E) rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_srpt(E) ib_isert(E) iscsi_target_mod(E) target_core_mod(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_umad(E) libiscsi(E) ib_ipoib(E)
[51304.293014]  scsi_transport_iscsi(E) ib_cm(E) ipmi_ssif(E) crct10dif_ce(E) ghash_ce(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) mlx5_ib(E) ib_uverbs(E) ipmi_si(E) ib_core(E) ipmi_devintf(E) ipmi_msghandler(E) spi_dw_mmio(E) spi_dw(E) ext4(E) mbcache(E) jbd2(E) hibmc_drm(E) drm_vram_helper(E) drm_kms_helper(E) syscopyarea(E) realtek(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) cec(E) sha2_ce(E) drm_ttm_helper(E) hisi_sas_v3_hw(E) hns3(E) nvme(E) hisi_sas_main(E) ttm(E) sha256_arm64(E) i2c_designware_platform(E) nvme_core(E) mlx5_core(E) drm(E) nfit(E) libsas(E) i2c_designware_core(E) hclge(E) gpio_dwapb(E) sg(E) hnae3(E) scsi_transport_sas(E) mlxfw(E) i2c_core(E) gpio_generic(E) libnvdimm(E) [last unloaded: nf_reject_ipv4]
[51304.455211] CPU: 88 PID: 3009112 Comm: hwlatd Kdump: loaded Tainted: G           OE     5.10.84-147.git.1378144e5.an8.aarch64 #1
[51304.468859] Hardware name: H3C R4960 G3/BC82AMDDA, BIOS 1.70 01/07/2021
[51304.476573] pstate: 00c00089 (nzcv daIf +PAN +UAO -TCO BTYPE=--)
[51304.483673] pc : ring_buffer_lock_reserve+0x28/0x450
[51304.489736] lr : trace_buffer_lock_reserve+0x24/0x58
[51304.495798] sp : ffff80004f4d3ce0
[51304.500214] x29: ffff80004f4d3ce0 x28: ffff800011f02660 
[51304.506627] x27: 000000000007a121 x26: 00002ea92d63aef2 
[51304.513032] x25: 0000000000000001 x24: 0000000000000000 
[51304.519442] x23: 20c49ba5e353f7cf x22: 0000000000000000 
[51304.525848] x21: 0000000000000080 x20: ffff80001108fcd8 
[51304.532247] x19: 60ffff0040200208 x18: 0000000000000000 
[51304.538653] x17: 0000000000000000 x16: 0000000000000000 
[51304.545058] x15: 0000aaab068aabd0 x14: 0000000000000000 
[51304.551459] x13: 0000000000000000 x12: ffff80005709bd80 
[51304.557855] x11: ffff80005709bcf5 x10: ffff80004f4d3d60 
[51304.564245] x9 : ffff800010a58a30 x8 : ffff00402b49c460 
[51304.570636] x7 : ffff8000118bcfe8 x6 : 00000469790c1416 
[51304.577027] x5 : 0000000000000040 x4 : 0000000000000000 
[51304.583397] x3 : 0000000000000080 x2 : 0000000000000040 
[51304.589793] x1 : 0000000000000000 x0 : 60ffff0040200208 
[51304.596194] Call trace:
[51304.599739]  ring_buffer_lock_reserve+0x28/0x450
[51304.605436]  trace_buffer_lock_reserve+0x24/0x58
[51304.611135]  kthread_fn+0x20c/0x440
[51304.615685]  kthread+0x114/0x118
[51304.619960] Code: f9400281 f9003fe1 d2800001 a9025bf5 (b9400815) 
[51304.627046] ---[ end trace f1a79247722caef4 ]---
[51304.632660] Kernel panic - not syncing: Oops: Fatal exception
[51304.639397] SMP: stopping secondary CPUs
[51305.073458] Kernel Offset: 0x1c0000 from 0xffff800010000000
[51305.079999] PHYS_OFFSET: 0x0
[51305.083819] CPU features: 0x8000002,22a08a38
[51305.088995] Memory Limit: none
[51305.095214] Starting crashdump kernel...
[51305.100032] Bye!
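
A minimal sketch of how the fault report above can be cross-checked by hand. The constants are copied from the log (ESR, the word in parentheses on the Code: line, and x0); the field layouts are the standard AArch64 ESR_ELx and LDR (immediate, unsigned offset) encodings. The 48-bit virtual address width used for the last check is an assumption inferred from the 0xffff8000... kernel addresses in the log, not something the report states.

```python
# Values copied from the oops log above.
ESR = 0x96000004          # "ESR = 0x96000004"
INSN = 0xB9400815         # faulting instruction, the word in parentheses on the "Code:" line
X0 = 0x60FFFF0040200208   # x0 when the fault was taken

# ESR_ELx layout: EC = bits[31:26], IL = bit 25, ISS = bits[24:0].
ec = (ESR >> 26) & 0x3F
il = (ESR >> 25) & 0x1
iss = ESR & 0x1FFFFFF
wnr = (iss >> 6) & 0x1    # 0 = read, 1 = write
dfsc = iss & 0x3F         # data fault status code

# LDR (immediate, unsigned offset), 32-bit variant:
# bits[31:22] == 0b1011100101, imm12 = bits[21:10], Rn = bits[9:5], Rt = bits[4:0].
is_ldr_w = (INSN >> 22) == 0b1011100101
imm12 = (INSN >> 10) & 0xFFF
rn = (INSN >> 5) & 0x1F
rt = INSN & 0x1F
offset = imm12 * 4        # imm12 is scaled by the 4-byte access size

# The base register is x0 (rn == 0), so the accessed address is x0 + offset.
fault_addr = X0 + offset

# Assumption: 48-bit VAs, so a valid address has top 16 bits 0x0000 (user) or 0xffff (kernel).
top16 = fault_addr >> 48
is_canonical = top16 in (0x0000, 0xFFFF)

print(f"EC={ec:#x} IL={il} WnR={wnr} DFSC={dfsc:#x}")
print(f"ldr w{rt}, [x{rn}, #{offset}] decoded: {is_ldr_w}")
print(f"fault address {fault_addr:#018x}, canonical: {is_canonical}")
```

Under these encodings the instruction is a 32-bit load from x0 + 8, which gives the reported address 60ffff0040200210, and DFSC 0x04 is a level 0 translation fault. The load itself is therefore unremarkable; the bad value is the pointer already held in x0.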


Expected results:


Additional info:
# free -mh
              total        used        free      shared  buff/cache   available
Mem:          753Gi       4.6Gi       746Gi        17Mi       1.9Gi       745Gi
Swap:         2.0Gi       153Mi       1.8Gi
# lscpu
Architecture:         aarch64
Byte Order:           Little Endian
CPU(s):               96
On-line CPU(s) list:  0,1,7-9,65-95
Off-line CPU(s) list: 2-6,10-64
Thread(s) per core:   1
Core(s) per socket:   18
Socket(s):            2
NUMA node(s):         1
Vendor ID:            HiSilicon
BIOS Vendor ID:       HiSilicon
Model:                0
Model name:           Kunpeng-920
BIOS Model name:      HUAWEI Kunpeng 920 5250
Stepping:             0x1
CPU max MHz:          2600.0000
CPU min MHz:          200.0000
BogoMIPS:             200.00
L1d cache:            64K
L1i cache:            64K
L2 cache:             512K
L3 cache:             24576K
NUMA node0 CPU(s):    0,1,7-9,65-95
Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Comment 1 Shiloong admin 2022-01-28 15:40:31 UTC
The problem with crash on ARM is that the installed version is too old; a newer version of crash can parse the vmcore.
Please upload the vmcore.
Comment 2 kangjiangbo 2022-01-28 16:20:12 UTC
(In reply to Shiloong from comment #1)
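> The problem with crash on ARM is that the installed version is too old; a newer version of crash can parse the vmcore.
> Please upload the vmcore.

The file is too large to upload.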
Comment 3 kangjiangbo 2022-02-08 14:00:59 UTC
The vmcore analysis output is as follows:
# crash /usr/lib/debug/usr/lib/modules/5.10.84-148.git.088d92b1e.an8.aarch64/vmlinux vmcore

crash 7.3.1-1.el8
Copyright (C) 2002-2021  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2021  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-unknown-linux-gnu"...

WARNING: kernel relocated [1MB]: patching 106835 gdb minimal_symbol values

WARNING: kernel version inconsistency between vmlinux and dumpfile

WARNING: cpu 32: cannot find NT_PRSTATUS note
WARNING: cpu 33: cannot find NT_PRSTATUS note
WARNING: cpu 34: cannot find NT_PRSTATUS note
WARNING: cpu 35: cannot find NT_PRSTATUS note
WARNING: cpu 36: cannot find NT_PRSTATUS note
WARNING: cpu 37: cannot find NT_PRSTATUS note
WARNING: cpu 38: cannot find NT_PRSTATUS note
WARNING: cpu 39: cannot find NT_PRSTATUS note
WARNING: cpu 40: cannot find NT_PRSTATUS note
WARNING: cpu 41: cannot find NT_PRSTATUS note
WARNING: cpu 42: cannot find NT_PRSTATUS note
WARNING: cpu 43: cannot find NT_PRSTATUS note
WARNING: cpu 44: cannot find NT_PRSTATUS note
WARNING: cpu 45: cannot find NT_PRSTATUS note
WARNING: cpu 46: cannot find NT_PRSTATUS note
WARNING: cpu 47: cannot find NT_PRSTATUS note
WARNING: cpu 48: cannot find NT_PRSTATUS note
WARNING: cpu 49: cannot find NT_PRSTATUS note
WARNING: cpu 50: cannot find NT_PRSTATUS note
WARNING: cpu 51: cannot find NT_PRSTATUS note
WARNING: cpu 52: cannot find NT_PRSTATUS note
WARNING: cpu 53: cannot find NT_PRSTATUS note
WARNING: cpu 54: cannot find NT_PRSTATUS note
WARNING: cpu 55: cannot find NT_PRSTATUS note
WARNING: cpu 56: cannot find NT_PRSTATUS note
WARNING: cpu 57: cannot find NT_PRSTATUS note
WARNING: cpu 58: cannot find NT_PRSTATUS note
WARNING: cpu 59: cannot find NT_PRSTATUS note
WARNING: cpu 60: cannot find NT_PRSTATUS note
WARNING: cpu 61: cannot find NT_PRSTATUS note
WARNING: cpu 62: cannot find NT_PRSTATUS note
WARNING: cpu 63: cannot find NT_PRSTATUS note
WARNING: cpu 64: cannot find NT_PRSTATUS note
WARNING: cpu 65: cannot find NT_PRSTATUS note
WARNING: cpu 66: cannot find NT_PRSTATUS note
WARNING: cpu 67: cannot find NT_PRSTATUS note
WARNING: cpu 68: cannot find NT_PRSTATUS note
      KERNEL: /usr/lib/debug/usr/lib/modules/5.10.84-148.git.088d92b1e.an8.aarch64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 96 [OFFLINE: 64]
        DATE: Thu Jan 27 00:07:55 CST 2022
      UPTIME: 14:15:04
LOAD AVERAGE: 3.31, 5.40, 6.08
       TASKS: 1113
    NODENAME: l57h15219.sqa.nu8
     RELEASE: 5.10.84-147.git.1378144e5.an8.aarch64
     VERSION: #1 SMP Tue Jan 25 13:14:55 UTC 2022
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 768 GB
       PANIC: "Unable to handle kernel paging request at virtual address 60ffff0040200210"
         PID: 3009112
     COMMAND: "hwlatd"
        TASK: ffff00402b49b600  [THREAD_INFO: ffff00402b49b600]
         CPU: 88
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 3009112  TASK: ffff00402b49b600  CPU: 88  COMMAND: "hwlatd"
 #0 [ffff80004f4d3710] machine_kexec at ffff8000101f0cd0
 #1 [ffff80004f4d3760] __crash_kexec at ffff800010339c64
 #2 [ffff80004f4d3900] panic at ffff800010cae9c8
 #3 [ffff80004f4d39e0] die at ffff8000101dd3d8
 #4 [ffff80004f4d3a90] die_kernel_fault at ffff8000101fc6fc
 #5 [ffff80004f4d3ac0] __do_kernel_fault at ffff8000101fc7cc
 #6 [ffff80004f4d3af0] do_bad_area at ffff8000101fc8d0
 #7 [ffff80004f4d3b10] do_translation_fault at ffff800010ccb3cc
 #8 [ffff80004f4d3b20] do_mem_abort at ffff8000101fc638
 #9 [ffff80004f4d3b50] el1_abort at ffff800010cbb8c0
#10 [ffff80004f4d3b80] el1_sync_handler at ffff800010cbbcb8
#11 [ffff80004f4d3cc0] el1_sync at ffff8000101d1a00
#12 [ffff80004f4d3ce0] ring_buffer_lock_reserve at ffff80001038d1ac
#13 [ffff80004f4d3d60] trace_buffer_lock_reserve at ffff800010398608
#14 [ffff80004f4d3d90] kthread_fn at ffff8000103a5d00
#15 [ffff80004f4d3e50] kthread at ffff80001027f928
Comment 4 Shiloong admin 2022-02-09 10:29:17 UTC
*** Bug 427 has been marked as a duplicate of this bug. ***
Comment 5 fghui_kernel 2022-02-17 11:45:36 UTC
Debugging suggests this may be a hardware problem rather than a kernel problem.
crash> p &global_trace
$2 = (struct trace_array *) 0xffff800011c5bc60
crash> p hwlat_trace
hwlat_trace = $3 = (struct trace_array *) 0xffff800011c5bc61

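Normally the two pointer values should be equal, but in the vmcore the lowest bit of the hwlat_trace pointer has flipped to 1, so accesses through the wrong pointer address return wrong data.
This is not an out-of-bounds overwrite and does not look like a kernel software problem; it looks more like a hardware problem.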
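
A minimal sketch of the comparison described above. The two addresses are copied from the crash output; everything in the second half (the buffer pointer value, the neighbouring byte, and the idea that the field is read one byte late) is a hypothetical illustration of how an off-by-one pointer can produce a value like the x0 seen in the oops, not a finding from the vmcore.

```python
GLOBAL_TRACE_ADDR = 0xFFFF800011C5BC60  # crash> p &global_trace
HWLAT_TRACE_PTR = 0xFFFF800011C5BC61    # crash> p hwlat_trace

# XOR exposes exactly which bits differ between the two pointers.
diff = GLOBAL_TRACE_ADDR ^ HWLAT_TRACE_PTR
flipped_bits = [bit for bit in range(64) if (diff >> bit) & 1]
print("differing bit positions:", flipped_bits)


def read_u64_le(memory: bytes, offset: int) -> int:
    """Read a little-endian 64-bit value at an arbitrary byte offset."""
    return int.from_bytes(memory[offset:offset + 8], "little")


# Hypothetical memory: an 8-byte pointer field followed by another 8-byte field.
# Both values are invented for illustration only.
hypothetical_buffer_ptr = 0xFFFF004020020800
hypothetical_next_field = 0x0000000000000060
memory = (hypothetical_buffer_ptr.to_bytes(8, "little")
          + hypothetical_next_field.to_bytes(8, "little"))

aligned = read_u64_le(memory, 0)  # read through the correct base pointer
skewed = read_u64_le(memory, 1)   # read through a base pointer that is one byte too high

print(f"aligned read: {aligned:#018x}")
print(f"skewed read:  {skewed:#018x}")
```

With a base pointer that is one byte too high, every field is read one byte late: the low seven bytes of the result come from the intended field and the top byte comes from whatever follows it, so the result is no longer a valid kernel address.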
Comment 6 fghui_kernel 2022-02-17 11:57:24 UTC
Not a software problem.
Comment 7 kangjiangbo 2022-02-21 15:43:47 UTC
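Trying to reproduce on a different machine; not reproduced there so far.
1. The problem was reproduced again on the same machine. The vmcore shows that it occurred on a different CPU, and p &global_trace and p hwlat_trace give the same value. If the same problem occurs on different CPUs, can it really be a hardware problem?

2. Bug 427, which was marked as a duplicate of this bug, also reproduces every time on other machines.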
Comment 8 kangjiangbo 2022-02-21 15:46:09 UTC
Created attachment 155 [details]
cpu94 vmcore
Comment 9 kangjiangbo 2022-02-28 09:53:53 UTC
(In reply to kangjiangbo from comment #7)
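> Trying to reproduce on a different machine; not reproduced there so far.
> 1. The problem was reproduced again on the same machine. The vmcore shows that
> it occurred on a different CPU, and p &global_trace and p hwlat_trace give the
> same value. If the same problem occurs on different CPUs, can it really be a
> hardware problem?
>
> 2. Bug 427, which was marked as a duplicate of this bug, also reproduces every time on other machines.

So far the problem has not been reproduced on the other machine.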
Comment 10 kangjiangbo 2022-03-23 14:02:36 UTC
ftrace_stress_test.sh is a known issue; please close this bug.