1835 – [ANCK-5.10 2208][x86_64] Debug kernel: after changing crashkernel, a manually triggered crash does not generate a vmcore

Status:	RESOLVED FIXED

Alias:	None

Product:	ANCK 5.10 Dev
Classification:	ANCK
Component:	general/others
Sub Component:
Version:	5.10.y-12
Hardware:	x86_64 Linux

Importance:	P3-Medium S3-normal
Target Milestone:	---
Assignee:	xiangzao
QA Contact:	shuming

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2022-08-06 16:30 UTC by zhixin01
Modified:	2022-08-18 10:38 UTC
CC List:	8 users

See Also:

Attachments
Serial console output after manually triggering the crash (50.12 KB, text/plain) 2022-08-06 16:30 UTC, zhixin01

Description zhixin01 2022-08-06 16:30:06 UTC

Created attachment 352
Serial console output after manually triggering the crash

Description of problem:
In an offline VM environment running the debug kernel, after setting crashkernel=0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M, a manually triggered crash does not generate a vmcore.
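For reference, the range-style crashkernel value above can be evaluated by hand. The following is a minimal sketch, assuming the usual kernel rule that the first entry with start <= system RAM < end is selected and that sizes use binary units. The helper names (`parse_size`, `crashkernel_reservation`) are made up for this illustration, and the optional `@offset` suffix is not handled.

```python
import re

# Size suffixes accepted in this sketch (binary units).
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text):
    """Convert a string such as '256M' or '2G' to bytes."""
    match = re.fullmatch(r"(\d+)([KMGT]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def crashkernel_reservation(spec, system_ram):
    """Return the bytes selected by a range-style crashkernel= value.

    Each comma-separated entry is 'start-[end]:size'. The first entry with
    start <= system_ram < end wins; an empty end means no upper bound.
    Returns 0 when no range matches.
    """
    for entry in spec.split(","):
        bounds, size = entry.split(":")
        start, _, end = bounds.partition("-")
        low = parse_size(start)
        high = parse_size(end) if end else float("inf")
        if low <= system_ram < high:
            return parse_size(size)
    return 0


# The value from this report, evaluated for a 12 GiB guest
# (the memory size mentioned later in the thread).
SPEC = "0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M"
reserved = crashkernel_reservation(SPEC, 12 * (1 << 30))
print(reserved // (1 << 20), "MiB")
```

Under these assumptions a 12 GiB guest falls into the 2G-256G range, so this value asks for a 256M reservation. The RAM figure the kernel actually uses may differ slightly from the nominal guest size.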

Version-Release number of selected component (if applicable):

[root@VM20210305-16 crash]# cat /etc/os-release
NAME="Anolis OS"
VERSION="8.4"
ID="anolis"
ID_LIKE="rhel fedora centos"
VERSION_ID="8.4"
PLATFORM_ID="platform:an8"
PRETTY_NAME="Anolis OS 8.4"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"

[root@VM20210305-16 ~]# uname -r
5.10.134-12_rc1.an8.x86_64+debug

[root@VM20210305-16 ~]# rpm -qa |grep kernel-debug-debuginfo
kernel-debug-debuginfo-5.10.134-12_rc1.an8.x86_64

[root@VM20210305-16 ~]# rpm -qa |egrep 'kexec|crash'
kexec-tools-2.0.21-1.3.an8.x86_64
crash-7.3.1-5.an8.x86_64

[root@VM20210305-16 ~]# systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Sat 2022-08-06 23:52:46 CST; 7h left
  Process: 828 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 828 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 42476)
   Memory: 0B
   CGroup: /system.slice/kdump.service

Aug 06 23:52:33 VM20210305-16 systemd[1]: Starting Crash recovery kernel arming...
Aug 06 23:52:46 VM20210305-16 kdumpctl[828]: kdump: kexec: loaded kdump kernel
Aug 06 23:52:46 VM20210305-16 kdumpctl[828]: kdump: Starting kdump: [OK]
Aug 06 23:52:46 VM20210305-16 systemd[1]: Started Crash recovery kernel arming.

[root@VM20210305-16 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-5.10.134-12_rc1.an8.x86_64+debug root=UUID=169a0746-c62d-49a2-bd6b-0eaec098d42c ro crashkernel=0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M rhgb console=tty0 console=ttyS0,115200 console=ttyAMA0,115200n8

How reproducible:
After installing kernel 5.10.134-12_rc1.an8.x86_64+debug, trigger a crash manually by running:
echo c > /proc/sysrq-trigger

Steps to Reproduce:
1. Install kernel 5.10.134-12_rc1.an8.x86_64+debug.
2. Trigger a crash manually by running:
echo c > /proc/sysrq-trigger

Actual results:
The system comes back up normally, but no vmcore file is generated.

Expected results:
The system comes back up normally and a vmcore file is generated.
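Whether a dump was written can be checked after the reboot by looking for a file named vmcore under the kdump target directory. The sketch below assumes the common default target /var/crash; the actual location depends on the kdump configuration of the system under test. `find_vmcores` is a hypothetical helper, and the demo directory layout is illustrative only.

```python
import tempfile
from pathlib import Path


def find_vmcores(crash_dir="/var/crash"):
    """Return paths of regular files named 'vmcore' below crash_dir, sorted."""
    root = Path(crash_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("vmcore") if p.is_file())


# Demo against a throwaway tree so the sketch runs anywhere.
demo_root = Path(tempfile.mkdtemp())
dump_dir = demo_root / "127.0.0.1-2022-08-06"  # illustrative directory name
dump_dir.mkdir()
(dump_dir / "vmcore").write_bytes(b"\0")
(dump_dir / "vmcore-dmesg.txt").write_text("log")

print(find_vmcores(demo_root))              # one entry: the vmcore file
print(find_vmcores(demo_root / "missing"))  # empty list for a missing directory
```

On the affected VM, calling `find_vmcores()` with no arguments would be expected to return an empty list, which matches the actual result described above.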

Additional info:
See the attachment for the serial console log.

Comment 1 shanxifanshi alibaba_cloud_group 2022-08-08 09:55:28 UTC

The debug kernel needs a larger memory reservation. Could you try increasing the crashkernel value?

Comment 2 zhixin01 2022-08-08 15:30:30 UTC

1. On the offline VM (Anolis 8, x86, debug kernel), which has only 12G of memory, I changed crashkernel to 512M; still no vmcore was generated.

[root@VM20210305-16 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           12Gi       739Mi        11Gi        17Mi       354Mi        11Gi
Swap:            0B          0B          0B

2. After changing crashkernel to 1024M, running grub2-mkconfig -o /boot/grub2/grub.cfg hits a kernel BUG:
[root@VM20210305-16 crash]# grub2-mkconfig -o /boot/grub2/grub.cfg
[  206.426529] device-mapper: uevent: version 1.0.3
[  206.429530] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@redhat.com
[  208.203019] page:000000000ac9243a refcount:2 mapcount:0 mapping:0000000017e01080 index:0x805fe pfn:0x3c01a0
[  208.205247] aops:def_blk_aops ino:0
[  208.205877] flags: 0x17ffffd0001003(locked|referenced|reserved|kfence)
[  208.207104] raw: 0017ffffd0001003 ffffea000f006808 ffffea000f006808 ffff88811f425150
[  208.208551] raw: 00000000000805fe 0000000000000000 00000002ffffffff ffff88811ecccaa1
[  208.209992] page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup)
[  208.211158] page->mem_cgroup:ffff88811ecccaa1
[  208.212104] ------------[ cut here ]------------
[  208.213004] kernel BUG at mm/memcontrol.c:3178!
[  208.213907] invalid opcode: 0000 [#1] SMP KASAN PTI
[  208.214739] CPU: 0 PID: 825 Comm: staragentd Kdump: loaded Tainted: G            E     5.10.134-12_rc1.an8.x86_64+debug #1
[  208.216547] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS e623647 04/01/2014
[  208.217816] RIP: 0010:commit_charge.part.61+0x11/0x20
[  208.218679] Code: 00 00 00 e9 a1 fd ff ff 90 0f 1f 44 00 00 31 d2 e9 94 fd ff ff 0f 1f 40 00 0f 1f 44 00 00 48 c7 c6 00 3f 97 aa e8 7f 56 ea ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 48 ba
[  208.221672] RSP: 0018:ffff888114f8f4e0 EFLAGS: 00010282
[  208.222526] RAX: 0000000000000000 RBX: ffffea000f006800 RCX: 0000000000000000
[  208.223665] RDX: 0000000000000021 RSI: 0000000000000008 RDI: ffffed10229f1e30
[  208.224805] RBP: ffff888114f8f538 R08: ffffed107b8c0305 R09: ffffed107b8c0305
[  208.225978] R10: ffff8883dc601827 R11: ffffed107b8c0304 R12: 0000000000000000
[  208.227143] R13: ffff8881015e9e00 R14: 0000000000000001 R15: ffff888118624000
[  208.228312] FS:  00007fbe753fe700(0000) GS:ffff8883dc400000(0000) knlGS:0000000000000000
[  208.229596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  208.230403] CR2: 00007fbe5800a5e8 CR3: 0000000121444001 CR4: 00000000003706f0
[  208.231392] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  208.232380] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  208.233365] Call Trace:
[  208.233724]  mem_cgroup_charge+0x500/0x660
[  208.234315]  ? lock_downgrade+0x6e0/0x6e0
[  208.234880]  __add_to_page_cache_locked+0x7f3/0xac0
[  208.235568]  ? find_get_pages_contig+0x880/0x880
[  208.236226]  ? scan_shadow_nodes+0xc0/0xc0
[  208.236809]  ? __alloc_pages_nodemask+0x5d0/0x7d0
[  208.237465]  add_to_page_cache_lru+0xbd/0x1e0
[  208.238085]  ? add_to_page_cache_locked+0x10/0x10
[  208.238738]  ? pagecache_get_page.part.60+0x55/0x840
[  208.239431]  ? alloc_pages_current+0xc3/0x1b0
[  208.240054]  pagecache_get_page.part.60+0x20e/0x840
[  208.240740]  __getblk_gfp+0x209/0x750
[  208.241278]  ext4_sb_breadahead_unmovable+0x58/0xb0
[  208.241976]  ? __brelse+0x6a/0x80

3. After changing crashkernel to 320M and manually triggering a crash, the system did not come back up. The serial console log is as follows:
[root@VM20210305-16 crash]# echo c >/proc/sysrq-trigger
[  258.571902] sysrq: Trigger a crash
[  258.573035] Kernel panic - not syncing: sysrq triggered crash
[  258.573849] CPU: 1 PID: 2917 Comm: bash Kdump: loaded Tainted: G            E     5.10.134-12_rc1.an8.x86_64+debug #1
[  258.575264] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS e623647 04/01/2014
[  258.576245] Call Trace:
[  258.576602]  dump_stack+0x99/0xcf
[  258.577108]  panic+0x271/0x512
[  258.577545]  ? print_oops_end_marker.cold.11+0x10/0x10
[  258.578220]  ? printk+0x96/0xb4
[  258.578662]  ? lock_downgrade+0x6e0/0x6e0
[  258.579218]  sysrq_handle_crash+0x1b/0x20
[  258.579763]  __handle_sysrq.cold.19+0x1b4/0x369
[  258.580401]  write_sysrq_trigger+0x4c/0x50
[  258.580957]  proc_reg_write+0x1a3/0x250
[  258.581486]  vfs_write+0x17c/0x910
[  258.581960]  ksys_write+0xe7/0x1b0
[  258.582424]  ? __ia32_sys_read+0xb0/0xb0
[  258.582961]  do_syscall_64+0x30/0x40
[  258.583449]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[  258.584108] RIP: 0033:0x7f9b025205a8
[  258.584588] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 f5 3f 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  258.586945] RSP: 002b:00007ffc71a524f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  258.587991] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f9b025205a8
[  258.588938] RDX: 0000000000000002 RSI: 000055edbc374e90 RDI: 0000000000000001
[  258.589885] RBP: 000055edbc374e90 R08: 000000000000000a R09: 00007f9b02580800
[  258.590821] R10: 000000000000000a R11: 0000000000000246 R12: 00007f9b027c06e0
[  258.591771] R13: 0000000000000002 R14: 00007f9b027bb860 R15: 0000000000000002
[  258.593792] Kernel Offset: 0xb000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
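Since the three attempts above each use a different reservation (512M, 1024M, 320M), it can help to confirm after every reboot that the reservation took effect and that the crash kernel is loaded before triggering the panic. A minimal sketch follows, reading the sysfs files kexec_crash_size and kexec_crash_loaded under /sys/kernel. The helper names are hypothetical, and the demo uses stand-in files rather than values captured from this VM.

```python
import tempfile
from pathlib import Path


def read_kdump_state(sysfs_dir="/sys/kernel"):
    """Return (reserved_bytes, loaded) from kexec_crash_size and kexec_crash_loaded."""
    base = Path(sysfs_dir)
    reserved = int((base / "kexec_crash_size").read_text().strip())
    loaded = (base / "kexec_crash_loaded").read_text().strip() == "1"
    return reserved, loaded


def describe(reserved, loaded):
    """Format the state as a one-line summary."""
    return f"reserved {reserved // (1 << 20)} MiB, crash kernel loaded: {loaded}"


# Stand-in files so the sketch runs anywhere. The numbers are made up to
# mirror the 512M attempt; on a real system call read_kdump_state() as is.
fake = Path(tempfile.mkdtemp())
(fake / "kexec_crash_size").write_text(f"{512 * (1 << 20)}\n")
(fake / "kexec_crash_loaded").write_text("1\n")
print(describe(*read_kdump_state(fake)))
```

A reservation of 0 bytes, or a crash kernel that is not loaded, would mean the panic path never reaches kdump at all, which is a different failure from the capture kernel running out of memory.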

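Comment 3 yixingrui alibaba_cloud_group 2022-08-11 16:06:47 UTC

Because the debug kernel enables more config options, it needs a larger reserved memory region.
In testing, setting the crash kernel size to 640M lets the dump complete normally, but the crash kernel then hangs during reboot, which matches the behavior in https://bugzilla.openanolis.cn/show_bug.cgi?id=1830.

Both this hang and https://bugzilla.openanolis.cn/show_bug.cgi?id=1830 occur on the same host (11.158.226.222). The current suspicion is that the libvirt version on that host is too old; we recommend upgrading libvirt and qemu-kvm and then retesting.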
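To retest with the 640M value mentioned above, the crashkernel= parameter on the kernel command line has to be replaced with a fixed size. The sketch below only performs the string rewrite; `set_crashkernel` is a hypothetical helper, and it splits on whitespace, so it assumes no quoted parameters containing spaces. Writing the result into the bootloader configuration is a separate, distribution-specific step.

```python
def set_crashkernel(cmdline, value):
    """Return cmdline with its crashkernel= parameter set to value.

    An existing crashkernel= token is replaced in place; if none is present,
    one is appended. All other parameters keep their order.
    """
    new_token = f"crashkernel={value}"
    tokens = cmdline.split()
    replaced = False
    for i, token in enumerate(tokens):
        if token.startswith("crashkernel="):
            tokens[i] = new_token
            replaced = True
    if not replaced:
        tokens.append(new_token)
    return " ".join(tokens)


# Shortened form of the command line from this report.
ORIGINAL = ("ro crashkernel=0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M "
            "rhgb console=tty0 console=ttyS0,115200")
print(set_crashkernel(ORIGINAL, "640M"))
```

After a reboot, /proc/cmdline should show crashkernel=640M if the change was applied.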

Comment 4 xiangzao alibaba_cloud_group 2022-08-18 10:38:40 UTC

The conclusion is as stated above. Please upgrade libvirt and qemu-kvm and then retest.