Bug 1835 - [ANCK-5.10 2208][x86_64] On the debug kernel, after changing crashkernel, a manually triggered crash does not generate a vmcore
Status: RESOLVED FIXED
Alias: None
Product: ANCK 5.10 Dev
Classification: ANCK
Component: general/others
Version: 5.10.y-12
Hardware: x86_64 Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: xiangzao
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-08-06 16:30 UTC by zhixin01
Modified: 2022-08-18 10:38 UTC
CC List: 8 users

See Also:


Attachments
Serial console output after manually triggering the crash (50.12 KB, text/plain)
2022-08-06 16:30 UTC, zhixin01

Description zhixin01 2022-08-06 16:30:06 UTC
Created attachment 352
Serial console output after manually triggering the crash

Description of problem:
In an offline VM environment running the debug kernel, after setting crashkernel=0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M, manually triggering a crash does not generate a vmcore.
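
To illustrate how the range syntax in this setting resolves, here is a minimal Python sketch of the documented `crashkernel=<range>:<size>,...` selection rule, in which the first range that contains the total RAM decides the reservation. The helper names `parse_size` and `pick_crashkernel` are hypothetical, the optional `@offset` suffix is ignored, and the kernel's own parser may differ in detail (for example in how it rounds total RAM).

```python
# Sketch only: mimics the documented crashkernel range syntax, not the kernel source.
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}

def parse_size(text):
    """Convert '256M' or '2G' to bytes; a bare number is taken as bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return int(text[:-1]) * UNITS[text[-1].upper()]
    return int(text)

def pick_crashkernel(spec, total_ram):
    """Return the reservation (bytes) for the first range containing total_ram."""
    for entry in spec.split(","):
        rng, size = entry.split(":")
        start, _, end = rng.partition("-")
        lo = parse_size(start)
        hi = parse_size(end) if end else float("inf")  # open-ended last range
        if lo <= total_ram < hi:
            return parse_size(size)
    return 0  # no range matched: nothing is reserved

spec = "0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M"
reserved = pick_crashkernel(spec, 12 * UNITS["G"])
print(reserved // UNITS["M"], "MiB")  # prints: 256 MiB
```

Under this reading, a VM with 12G of memory (the size reported later in this bug) falls into the 2G-256G range, so the setting above reserves only 256M for the crash kernel.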

Version-Release number of selected component (if applicable):

[root@VM20210305-16 crash]# cat /etc/os-release
NAME="Anolis OS"
VERSION="8.4"
ID="anolis"
ID_LIKE="rhel fedora centos"
VERSION_ID="8.4"
PLATFORM_ID="platform:an8"
PRETTY_NAME="Anolis OS 8.4"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"

[root@VM20210305-16 ~]# uname -r
5.10.134-12_rc1.an8.x86_64+debug

[root@VM20210305-16 ~]# rpm -qa |grep kernel-debug-debuginfo
kernel-debug-debuginfo-5.10.134-12_rc1.an8.x86_64

[root@VM20210305-16 ~]# rpm -qa |egrep 'kexec|crash'
kexec-tools-2.0.21-1.3.an8.x86_64
crash-7.3.1-5.an8.x86_64

[root@VM20210305-16 ~]# systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Sat 2022-08-06 23:52:46 CST; 7h left
  Process: 828 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 828 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 42476)
   Memory: 0B
   CGroup: /system.slice/kdump.service

Aug 06 23:52:33 VM20210305-16 systemd[1]: Starting Crash recovery kernel arming...
Aug 06 23:52:46 VM20210305-16 kdumpctl[828]: kdump: kexec: loaded kdump kernel
Aug 06 23:52:46 VM20210305-16 kdumpctl[828]: kdump: Starting kdump: [OK]
Aug 06 23:52:46 VM20210305-16 systemd[1]: Started Crash recovery kernel arming.

[root@VM20210305-16 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-5.10.134-12_rc1.an8.x86_64+debug root=UUID=169a0746-c62d-49a2-bd6b-0eaec098d42c ro crashkernel=0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M rhgb console=tty0 console=ttyS0,115200 console=ttyAMA0,115200n8

How reproducible:
After installing the 5.10.134-12_rc1.an8.x86_64+debug kernel, trigger a crash manually by running:
echo c > /proc/sysrq-trigger

Steps to Reproduce:
1. Install the 5.10.134-12_rc1.an8.x86_64+debug kernel.
2. Trigger a crash manually by running:
echo c > /proc/sysrq-trigger

Actual results:
The system boots normally, but no vmcore file is generated.

Expected results:
The system boots normally and a vmcore file is generated.

Additional info:
See the attachment for the serial console log.
Comment 1 shanxifanshi alibaba_cloud_group 2022-08-08 09:55:28 UTC
The debug kernel needs a larger memory reservation. Could you try increasing the crashkernel value?
Comment 2 zhixin01 2022-08-08 15:30:30 UTC
1. On the offline Anolis 8 x86 VM running the debug kernel, which has only 12G of memory, crashkernel was changed to 512M, but still no vmcore was generated.

[root@VM20210305-16 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           12Gi       739Mi        11Gi        17Mi       354Mi        11Gi
Swap:            0B          0B          0B

2. After changing crashkernel to 1024M, running grub2-mkconfig -o /boot/grub2/grub.cfg fails with:
[root@VM20210305-16 crash]# grub2-mkconfig -o /boot/grub2/grub.cfg
[  206.426529] device-mapper: uevent: version 1.0.3
[  206.429530] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@redhat.com
[  208.203019] page:000000000ac9243a refcount:2 mapcount:0 mapping:0000000017e01080 index:0x805fe pfn:0x3c01a0
[  208.205247] aops:def_blk_aops ino:0
[  208.205877] flags: 0x17ffffd0001003(locked|referenced|reserved|kfence)
[  208.207104] raw: 0017ffffd0001003 ffffea000f006808 ffffea000f006808 ffff88811f425150
[  208.208551] raw: 00000000000805fe 0000000000000000 00000002ffffffff ffff88811ecccaa1
[  208.209992] page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup)
[  208.211158] page->mem_cgroup:ffff88811ecccaa1
[  208.212104] ------------[ cut here ]------------
[  208.213004] kernel BUG at mm/memcontrol.c:3178!
[  208.213907] invalid opcode: 0000 [#1] SMP KASAN PTI
[  208.214739] CPU: 0 PID: 825 Comm: staragentd Kdump: loaded Tainted: G            E     5.10.134-12_rc1.an8.x86_64+debug #1
[  208.216547] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS e623647 04/01/2014
[  208.217816] RIP: 0010:commit_charge.part.61+0x11/0x20
[  208.218679] Code: 00 00 00 e9 a1 fd ff ff 90 0f 1f 44 00 00 31 d2 e9 94 fd ff ff 0f 1f 40 00 0f 1f 44 00 00 48 c7 c6 00 3f 97 aa e8 7f 56 ea ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 48 ba
[  208.221672] RSP: 0018:ffff888114f8f4e0 EFLAGS: 00010282
[  208.222526] RAX: 0000000000000000 RBX: ffffea000f006800 RCX: 0000000000000000
[  208.223665] RDX: 0000000000000021 RSI: 0000000000000008 RDI: ffffed10229f1e30
[  208.224805] RBP: ffff888114f8f538 R08: ffffed107b8c0305 R09: ffffed107b8c0305
[  208.225978] R10: ffff8883dc601827 R11: ffffed107b8c0304 R12: 0000000000000000
[  208.227143] R13: ffff8881015e9e00 R14: 0000000000000001 R15: ffff888118624000
[  208.228312] FS:  00007fbe753fe700(0000) GS:ffff8883dc400000(0000) knlGS:0000000000000000
[  208.229596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  208.230403] CR2: 00007fbe5800a5e8 CR3: 0000000121444001 CR4: 00000000003706f0
[  208.231392] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  208.232380] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  208.233365] Call Trace:
[  208.233724]  mem_cgroup_charge+0x500/0x660
[  208.234315]  ? lock_downgrade+0x6e0/0x6e0
[  208.234880]  __add_to_page_cache_locked+0x7f3/0xac0
[  208.235568]  ? find_get_pages_contig+0x880/0x880
[  208.236226]  ? scan_shadow_nodes+0xc0/0xc0
[  208.236809]  ? __alloc_pages_nodemask+0x5d0/0x7d0
[  208.237465]  add_to_page_cache_lru+0xbd/0x1e0
[  208.238085]  ? add_to_page_cache_locked+0x10/0x10
[  208.238738]  ? pagecache_get_page.part.60+0x55/0x840
[  208.239431]  ? alloc_pages_current+0xc3/0x1b0
[  208.240054]  pagecache_get_page.part.60+0x20e/0x840
[  208.240740]  __getblk_gfp+0x209/0x750
[  208.241278]  ext4_sb_breadahead_unmovable+0x58/0xb0
[  208.241976]  ? __brelse+0x6a/0x80

3. After changing crashkernel to 320M and manually triggering a crash, the system did not come back up. The serial console log is as follows:
[root@VM20210305-16 crash]# echo c >/proc/sysrq-trigger
[  258.571902] sysrq: Trigger a crash
[  258.573035] Kernel panic - not syncing: sysrq triggered crash
[  258.573849] CPU: 1 PID: 2917 Comm: bash Kdump: loaded Tainted: G            E     5.10.134-12_rc1.an8.x86_64+debug #1
[  258.575264] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS e623647 04/01/2014
[  258.576245] Call Trace:
[  258.576602]  dump_stack+0x99/0xcf
[  258.577108]  panic+0x271/0x512
[  258.577545]  ? print_oops_end_marker.cold.11+0x10/0x10
[  258.578220]  ? printk+0x96/0xb4
[  258.578662]  ? lock_downgrade+0x6e0/0x6e0
[  258.579218]  sysrq_handle_crash+0x1b/0x20
[  258.579763]  __handle_sysrq.cold.19+0x1b4/0x369
[  258.580401]  write_sysrq_trigger+0x4c/0x50
[  258.580957]  proc_reg_write+0x1a3/0x250
[  258.581486]  vfs_write+0x17c/0x910
[  258.581960]  ksys_write+0xe7/0x1b0
[  258.582424]  ? __ia32_sys_read+0xb0/0xb0
[  258.582961]  do_syscall_64+0x30/0x40
[  258.583449]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[  258.584108] RIP: 0033:0x7f9b025205a8
[  258.584588] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 f5 3f 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  258.586945] RSP: 002b:00007ffc71a524f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  258.587991] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f9b025205a8
[  258.588938] RDX: 0000000000000002 RSI: 000055edbc374e90 RDI: 0000000000000001
[  258.589885] RBP: 000055edbc374e90 R08: 000000000000000a R09: 00007f9b02580800
[  258.590821] R10: 000000000000000a R11: 0000000000000246 R12: 00007f9b027c06e0
[  258.591771] R13: 0000000000000002 R14: 00007f9b027bb860 R15: 0000000000000002
[  258.593792] Kernel Offset: 0xb000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Comment 3 yixingrui alibaba_cloud_group 2022-08-11 16:06:47 UTC
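Because the debug kernel enables many more config options, it needs a larger reserved memory region.
In testing, setting the crash kernel size to 640M lets the dump process run normally, but the crash kernel then hangs during reboot, which matches the symptom reported in bug 1830 (https://bugzilla.openanolis.cn/show_bug.cgi?id=1830).

Both this hang and bug 1830 (https://bugzilla.openanolis.cn/show_bug.cgi?id=1830) occur on the same host (11.158.226.222). The current suspicion is that its libvirt version is too old; we recommend upgrading libvirt and qemu-kvm on that host and then retesting.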
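
When retesting with a different crashkernel value, it can help to confirm what the running kernel actually reserved before triggering a crash. The following is a minimal Python sketch that reads the standard sysfs files `/sys/kernel/kexec_crash_size` (reserved bytes) and `/sys/kernel/kexec_crash_loaded` (1 if a crash kernel is loaded). The helper name `read_kdump_state` and its `sysfs_root` parameter are hypothetical and were not used in this report.

```python
from pathlib import Path

def read_kdump_state(sysfs_root="/sys/kernel"):
    """Return (reserved_bytes, loaded); each item is None if its file is missing."""
    root = Path(sysfs_root)

    def read_int(name):
        try:
            return int((root / name).read_text().strip())
        except (OSError, ValueError):
            return None  # file absent or unreadable on this system

    size = read_int("kexec_crash_size")
    loaded = read_int("kexec_crash_loaded")
    return size, (None if loaded is None else bool(loaded))

if __name__ == "__main__":
    size, loaded = read_kdump_state()
    if size is None:
        print("kexec_crash_size not available")
    else:
        print(f"reserved: {size // (1 << 20)} MiB, crash kernel loaded: {loaded}")
```

A reserved size of 0 would indicate that the crashkernel parameter did not take effect for this amount of RAM, independent of what `systemctl status kdump` reports.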
Comment 4 xiangzao alibaba_cloud_group 2022-08-18 10:38:40 UTC
The conclusion is as stated above. Please upgrade libvirt and qemu-kvm and then retest.