Created attachment 352
Serial console output after manually triggering a crash

Description of problem:
In an offline VM environment running the debug kernel, after setting crashkernel=0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M, manually triggering a crash does not generate a vmcore.

Version-Release number of selected component (if applicable):
[root@VM20210305-16 crash]# cat /etc/os-release
NAME="Anolis OS"
VERSION="8.4"
ID="anolis"
ID_LIKE="rhel fedora centos"
VERSION_ID="8.4"
PLATFORM_ID="platform:an8"
PRETTY_NAME="Anolis OS 8.4"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
[root@VM20210305-16 ~]# uname -r
5.10.134-12_rc1.an8.x86_64+debug
[root@VM20210305-16 ~]# rpm -qa |grep kernel-debug-debuginfo
kernel-debug-debuginfo-5.10.134-12_rc1.an8.x86_64
[root@VM20210305-16 ~]# rpm -qa |egrep 'kexec|crash'
kexec-tools-2.0.21-1.3.an8.x86_64
crash-7.3.1-5.an8.x86_64
[root@VM20210305-16 ~]# systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Sat 2022-08-06 23:52:46 CST; 7h left
  Process: 828 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 828 (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 42476)
   Memory: 0B
   CGroup: /system.slice/kdump.service

Aug 06 23:52:33 VM20210305-16 systemd[1]: Starting Crash recovery kernel arming...
Aug 06 23:52:46 VM20210305-16 kdumpctl[828]: kdump: kexec: loaded kdump kernel
Aug 06 23:52:46 VM20210305-16 kdumpctl[828]: kdump: Starting kdump: [OK]
Aug 06 23:52:46 VM20210305-16 systemd[1]: Started Crash recovery kernel arming.
[root@VM20210305-16 ~]# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-5.10.134-12_rc1.an8.x86_64+debug root=UUID=169a0746-c62d-49a2-bd6b-0eaec098d42c ro crashkernel=0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M rhgb console=tty0 console=ttyS0,115200 console=ttyAMA0,115200n8

How reproducible:
After installing the 5.10.134-12_rc1.an8.x86_64+debug kernel, manually trigger a crash by running echo c > /proc/sysrq-trigger

Steps to Reproduce:
1. Install the 5.10.134-12_rc1.an8.x86_64+debug kernel.
2. Manually trigger a crash by running echo c > /proc/sysrq-trigger

Actual results:
The system boots normally, but no vmcore file is generated.

Expected results:
The system boots normally and a vmcore file is generated.

Additional info:
See the attached serial console log.
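For reference, here is a rough sketch of how a range-style crashkernel= value such as the one above resolves to a single reservation size. It follows the documented rule that the first range containing the system RAM size wins and that an open-ended range has no upper bound. The helper names (parse_size, resolve_crashkernel) are made up for illustration, the optional @offset suffix is not handled, and the RAM figure the kernel compares against may differ slightly from what free reports, so treat the result as an approximation rather than what the kernel will necessarily reserve.

```python
import re

# Binary size suffixes accepted in this sketch (the kernel accepts more).
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text):
    """Convert a size such as '256M' or '2G' to bytes."""
    match = re.fullmatch(r"(\d+)([KMGT]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def resolve_crashkernel(spec, system_ram):
    """Return the reservation in bytes for a range-style spec, or 0 if no range matches."""
    for entry in spec.split(","):
        ram_range, _, size = entry.partition(":")
        start, _, end = ram_range.partition("-")
        lower = parse_size(start)
        # An empty end ("1024G-") means the range is open-ended.
        upper = parse_size(end) if end else float("inf")
        if lower <= system_ram < upper:
            return parse_size(size)
    return 0


SPEC = "0M-2G:0M,2G-256G:256M,256G-1024G:320M,1024G-:384M"
# Example: a guest with 12 GiB of RAM (an assumed figure for illustration).
print(resolve_crashkernel(SPEC, 12 << 30) >> 20, "MiB")
```

Under these assumptions, any guest with between 2 GiB and 256 GiB of RAM gets a 256M reservation from this setting.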
The debug kernel needs a larger memory reservation. Could you try increasing the crashkernel value?
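After changing the value and rebooting, it can help to confirm that the new reservation actually took effect before triggering a crash. The sketch below reads the two sysfs entries /sys/kernel/kexec_crash_size (reserved bytes) and /sys/kernel/kexec_crash_loaded (1 when a crash kernel is loaded). The function name crash_reservation and its sysfs_dir parameter are hypothetical and exist only so the code can be pointed at a different directory.

```python
from pathlib import Path


def crash_reservation(sysfs_dir="/sys/kernel"):
    """Return (reserved MiB, crash kernel loaded?) or None if the sysfs files are unavailable."""
    base = Path(sysfs_dir)
    try:
        size_bytes = int((base / "kexec_crash_size").read_text().strip())
        loaded = (base / "kexec_crash_loaded").read_text().strip() == "1"
    except OSError:
        # Not Linux, or the kernel was built without kexec/kdump support.
        return None
    return size_bytes >> 20, loaded


# Prints e.g. a (size, loaded) tuple on a kdump-enabled host, or None elsewhere.
print(crash_reservation())
```

A size of 0, or a loaded flag of False, would suggest the crashkernel= change was not picked up or that kdump failed to load the capture kernel.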
1. The offline VM runs Anolis 8 x86 with the debug kernel and has only 12G of memory. After changing crashkernel to 512M, a vmcore is still not generated.
[root@VM20210305-16 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           12Gi       739Mi        11Gi        17Mi       354Mi        11Gi
Swap:            0B          0B          0B

2. After changing crashkernel to 1024M, running grub2-mkconfig -o /boot/grub2/grub.cfg fails with a kernel BUG:
[root@VM20210305-16 crash]# grub2-mkconfig -o /boot/grub2/grub.cfg
[ 206.426529] device-mapper: uevent: version 1.0.3
[ 206.429530] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@redhat.com
[ 208.203019] page:000000000ac9243a refcount:2 mapcount:0 mapping:0000000017e01080 index:0x805fe pfn:0x3c01a0
[ 208.205247] aops:def_blk_aops ino:0
[ 208.205877] flags: 0x17ffffd0001003(locked|referenced|reserved|kfence)
[ 208.207104] raw: 0017ffffd0001003 ffffea000f006808 ffffea000f006808 ffff88811f425150
[ 208.208551] raw: 00000000000805fe 0000000000000000 00000002ffffffff ffff88811ecccaa1
[ 208.209992] page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup)
[ 208.211158] page->mem_cgroup:ffff88811ecccaa1
[ 208.212104] ------------[ cut here ]------------
[ 208.213004] kernel BUG at mm/memcontrol.c:3178!
[ 208.213907] invalid opcode: 0000 [#1] SMP KASAN PTI
[ 208.214739] CPU: 0 PID: 825 Comm: staragentd Kdump: loaded Tainted: G E 5.10.134-12_rc1.an8.x86_64+debug #1
[ 208.216547] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS e623647 04/01/2014
[ 208.217816] RIP: 0010:commit_charge.part.61+0x11/0x20
[ 208.218679] Code: 00 00 00 e9 a1 fd ff ff 90 0f 1f 44 00 00 31 d2 e9 94 fd ff ff 0f 1f 40 00 0f 1f 44 00 00 48 c7 c6 00 3f 97 aa e8 7f 56 ea ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 48 ba
[ 208.221672] RSP: 0018:ffff888114f8f4e0 EFLAGS: 00010282
[ 208.222526] RAX: 0000000000000000 RBX: ffffea000f006800 RCX: 0000000000000000
[ 208.223665] RDX: 0000000000000021 RSI: 0000000000000008 RDI: ffffed10229f1e30
[ 208.224805] RBP: ffff888114f8f538 R08: ffffed107b8c0305 R09: ffffed107b8c0305
[ 208.225978] R10: ffff8883dc601827 R11: ffffed107b8c0304 R12: 0000000000000000
[ 208.227143] R13: ffff8881015e9e00 R14: 0000000000000001 R15: ffff888118624000
[ 208.228312] FS: 00007fbe753fe700(0000) GS:ffff8883dc400000(0000) knlGS:0000000000000000
[ 208.229596] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 208.230403] CR2: 00007fbe5800a5e8 CR3: 0000000121444001 CR4: 00000000003706f0
[ 208.231392] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 208.232380] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 208.233365] Call Trace:
[ 208.233724] mem_cgroup_charge+0x500/0x660
[ 208.234315] ? lock_downgrade+0x6e0/0x6e0
[ 208.234880] __add_to_page_cache_locked+0x7f3/0xac0
[ 208.235568] ? find_get_pages_contig+0x880/0x880
[ 208.236226] ? scan_shadow_nodes+0xc0/0xc0
[ 208.236809] ? __alloc_pages_nodemask+0x5d0/0x7d0
[ 208.237465] add_to_page_cache_lru+0xbd/0x1e0
[ 208.238085] ? add_to_page_cache_locked+0x10/0x10
[ 208.238738] ? pagecache_get_page.part.60+0x55/0x840
[ 208.239431] ? alloc_pages_current+0xc3/0x1b0
[ 208.240054] pagecache_get_page.part.60+0x20e/0x840
[ 208.240740] __getblk_gfp+0x209/0x750
[ 208.241278] ext4_sb_breadahead_unmovable+0x58/0xb0
[ 208.241976] ? __brelse+0x6a/0x80

3. After changing crashkernel to 320M and manually triggering a crash, the system does not come back up. The serial console log is as follows:
[root@VM20210305-16 crash]# echo c >/proc/sysrq-trigger
[ 258.571902] sysrq: Trigger a crash
[ 258.573035] Kernel panic - not syncing: sysrq triggered crash
[ 258.573849] CPU: 1 PID: 2917 Comm: bash Kdump: loaded Tainted: G E 5.10.134-12_rc1.an8.x86_64+debug #1
[ 258.575264] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS e623647 04/01/2014
[ 258.576245] Call Trace:
[ 258.576602] dump_stack+0x99/0xcf
[ 258.577108] panic+0x271/0x512
[ 258.577545] ? print_oops_end_marker.cold.11+0x10/0x10
[ 258.578220] ? printk+0x96/0xb4
[ 258.578662] ? lock_downgrade+0x6e0/0x6e0
[ 258.579218] sysrq_handle_crash+0x1b/0x20
[ 258.579763] __handle_sysrq.cold.19+0x1b4/0x369
[ 258.580401] write_sysrq_trigger+0x4c/0x50
[ 258.580957] proc_reg_write+0x1a3/0x250
[ 258.581486] vfs_write+0x17c/0x910
[ 258.581960] ksys_write+0xe7/0x1b0
[ 258.582424] ? __ia32_sys_read+0xb0/0xb0
[ 258.582961] do_syscall_64+0x30/0x40
[ 258.583449] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 258.584108] RIP: 0033:0x7f9b025205a8
[ 258.584588] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 f5 3f 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[ 258.586945] RSP: 002b:00007ffc71a524f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 258.587991] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f9b025205a8
[ 258.588938] RDX: 0000000000000002 RSI: 000055edbc374e90 RDI: 0000000000000001
[ 258.589885] RBP: 000055edbc374e90 R08: 000000000000000a R09: 00007f9b02580800
[ 258.590821] R10: 000000000000000a R11: 0000000000000246 R12: 00007f9b027c06e0
[ 258.591771] R13: 0000000000000002 R14: 00007f9b027bb860 R15: 0000000000000002
[ 258.593792] Kernel Offset: 0xb000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
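Because the debug kernel enables more config options, it needs a larger reserved memory region.
In testing, setting the crash kernel size to 640M lets the dump run normally, but the crash kernel then hangs during the reboot, which matches the behavior reported in https://bugzilla.openanolis.cn/show_bug.cgi?id=1830.
Both this hang and https://bugzilla.openanolis.cn/show_bug.cgi?id=1830 were observed on the same host (11.158.226.222). The current suspicion is that the libvirt version on that host is too old; we recommend upgrading libvirt and qemu-kvm and then retesting.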
The conclusion is as stated above. Please upgrade libvirt and qemu-kvm and then retest.