Description of problem:
On an AMD CPU, after installing tensorflow2, running import tensorflow crashes with a core dump.

Version-Release number of selected component (if applicable):
[root@localhost test-results]# uname -a
Linux localhost.localdomain 5.10.134-14.1.an23.x86_64 #1 SMP Thu May 25 19:57:17 CST 2023 x86_64 GNU/Linux
[root@localhost test-results]# cat /etc/os-release
NAME="Anolis OS"
VERSION="23"
ID="anolis"
VERSION_ID="23"
PLATFORM_ID="platform:an23"
PRETTY_NAME="Anolis OS 23"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
BUG_REPORT_URL="https://bugzilla.openanolis.cn/"
[root@localhost test-results]# lscpu
Architecture: x86_64
  CPU op-mode(s): 32-bit, 64-bit
  Address sizes: 48 bits physical, 48 bits virtual
  Byte Order: Little Endian
CPU(s): 4
  On-line CPU(s) list: 0-3
Vendor ID: AuthenticAMD
  BIOS Vendor ID: Alibaba Cloud
  Model name: AMD EPYC 7T83 64-Core Processor
    BIOS Model name: pc-i440fx-2.1 CPU @ 0.0GHz
    BIOS CPU family: 1
    CPU family: 25
    Model: 1
    Thread(s) per core: 2
    Core(s) per socket: 2
    Socket(s): 1
    Stepping: 1
    BogoMIPS: 5090.43
    Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru wbnoinvd arat vaes vpclmulqdq rdpid fsrm
Virtualization features:
  Hypervisor vendor: KVM
  Virtualization type: full
Caches (sum of all):
  L1d: 64 KiB (2 instances)
  L1i: 64 KiB (2 instances)
  L2: 1 MiB (2 instances)
  L3: 32 MiB (1 instance)
NUMA:
  NUMA node(s): 1
  NUMA node0 CPU(s): 0-3
Vulnerabilities:
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Mmio stale data: Not affected
  Retbleed: Not affected
  Spec store bypass: Vulnerable
  Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds: Not affected
  Tsx async abort: Not affected

How reproducible:
Install tensorflow2:
[root@localhost test-results]# rpm -qi tensorflow2
Name        : tensorflow2
Version     : 2.12.0
Release     : 4.an23
Architecture: x86_64
Install Date: Mon 07 Aug 2023 03:26:55 PM CST
Group       : Development/Languages/Python
Size        : 1098009814
License     : Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause AND FSFUL AND MIT AND MPL-2.0 AND OpenSSL AND Python-2.0
Signature   : RSA/SHA256, Thu 27 Jul 2023 02:07:18 PM CST, Key ID 619140084873f7c5
Source RPM  : tensorflow2-2.12.0-4.an23.src.rpm
Build Date  : Fri 21 Jul 2023 09:03:14 PM CST
Build Host  : iZ2ze8vdmdyl66lfybi1hzZ
Packager    : OpenAnolis Community
Vendor      : OpenAnolis Community
URL         : https://www.tensorflow.org/
Summary     : A framework used for deep learning
Description :
This open source software library for numerical computation is used for data
flow graphs. The graph nodes represent mathematical operations, while the graph
edges represent the multidimensional data arrays (tensors) that flow between
them. This flexible architecture enables you to deploy computation to one or
more CPUs in a desktop, server, or mobile device without rewriting code.

The core dump happens on import:
[root@localhost test-results]# python --version
Python 3.10.12
[root@localhost test-results]# python
Python 3.10.12 (main, Jun 7 2023, 00:00:00) [GCC 12.2.1 20221121 (Anolis OS 12.2.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Illegal instruction (core dumped)

Steps to Reproduce:
As above.

Actual results:
import tensorflow core dumps.

Expected results:
import tensorflow succeeds.

Additional info:
Core dump information:
[root@localhost test-results]# coredumpctl info 3297584
           PID: 3297584 (python)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 4 (ILL)
     Timestamp: Mon 2023-08-07 15:53:11 CST (1min 27s ago)
  Command Line: python
    Executable: /usr/bin/python3.10
 Control Group: /user.slice/user-0.slice/session-30.scope
          Unit: session-30.scope
         Slice: user-0.slice
       Session: 30
     Owner UID: 0 (root)
       Boot ID: 208cdfbbf99f450c959b6ad71d54f5da
    Machine ID: 69185a58f8174a67b57acdb31605288b
      Hostname: localhost.localdomain
       Storage: /var/lib/systemd/coredump/core.python.0.208cdfbbf99f450c959b6ad71d54f5da.3297584.1691394791000000.zst (prese>
  Size on Disk: 6.1M
       Message: Process 3297584 (python) of user 0 dumped core.

                Stack trace of thread 3297584:
                #0  0x00007fc8d7e017c0 _ZN6google8protobuf8internal13OnShutdownRunEPFvPKvES3_ (libtensorflow_framework.so.2 >
                #1  0x00007fc8d7df9321 _ZN6google8protobuf8internal24InitProtobufDefaultsSlowEv (libtensorflow_framework.so.>
                #2  0x00007fc8da003d9e call_init (ld-linux-x86-64.so.2 + 0x4d9e)
                #3  0x00007fc8da003e8c _dl_init (ld-linux-x86-64.so.2 + 0x4e8c)
                #4  0x00007fc8d9c176b4 _dl_catch_exception (libc.so.6 + 0x14a6b4)
                #5  0x00007fc8da00a7f6 dl_open_worker (ld-linux-x86-64.so.2 + 0xb7f6)
                #6  0x00007fc8d9c1765e _dl_catch_exception (libc.so.6 + 0x14a65e)
                #7  0x00007fc8da00ab8c _dl_open (ld-linux-x86-64.so.2 + 0xbb8c)
                #8  0x00007fc8d9b500bc dlopen_doit (libc.so.6 + 0x830bc)
                #9  0x00007fc8d9c1765e _dl_catch_exception (libc.so.6 + 0x14a65e)
                #10 0x00007fc8d9c17713 _dl_catch_error (libc.so.6 + 0x14a713)
                #11 0x00007fc8d9b4fb8f _dlerror_run (libc.so.6 + 0x82b8f)
                #12 0x00007fc8d9b50171 dlopen@GLIBC_2.2.5 (libc.so.6 + 0x83171)
                #13 0x00007fc8d9e65401 _PyImport_FindSharedFuncptr (libpython3.10.so.1.0 + 0x1c0401)
                #14 0x00007fc8d9e630f9 _imp_create_dynamic (libpython3.10.so.1.0 + 0x1be0f9)
                #15 0x00007fc8d9dc2c8f cfunction_vectorcall_FASTCALL (libpython3.10.so.1.0 + 0x11dc8f)
                #16 0x00007fc8d9db887b _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11387b)
                #17 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #18 0x00007fc8d9dba94c _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11594c)
                #19 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #20 0x00007fc8d9db5d1d _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x110d1d)
                #21 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #22 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #23 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #24 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #25 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #26 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #27 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #28 0x00007fc8d9dc231f object_vacall (libpython3.10.so.1.0 + 0x11d31f)
                #29 0x00007fc8d9dcbc2d _PyObject_CallMethodIdObjArgs (libpython3.10.so.1.0 + 0x126c2d)
                #30 0x00007fc8d9dcb924 PyImport_ImportModuleLevelObject (libpython3.10.so.1.0 + 0x126924)
                #31 0x00007fc8d9dd316c builtin___import__ (libpython3.10.so.1.0 + 0x12e16c)
                #32 0x00007fc8d9dc2631 cfunction_call (libpython3.10.so.1.0 + 0x11d631)
                #33 0x00007fc8d9dcad5c _PyObject_Call (libpython3.10.so.1.0 + 0x125d5c)
                #34 0x00007fc8d9db887b _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11387b)
                #35 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #36 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #37 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #38 0x00007fc8d9dc231f object_vacall (libpython3.10.so.1.0 + 0x11d31f)
                #39 0x00007fc8d9dcbc2d _PyObject_CallMethodIdObjArgs (libpython3.10.so.1.0 + 0x126c2d)
                #40 0x00007fc8d9dcba91 PyImport_ImportModuleLevelObject (libpython3.10.so.1.0 + 0x126a91)
                #41 0x00007fc8d9db95fd _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x1145fd)
                #42 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #43 0x00007fc8d9dba94c _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11594c)
                #44 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #45 0x00007fc8d9e2fe9e PyEval_EvalCode (libpython3.10.so.1.0 + 0x18ae9e)
                #46 0x00007fc8d9e36ecb builtin_exec (libpython3.10.so.1.0 + 0x191ecb)
                #47 0x00007fc8d9dc2c8f cfunction_vectorcall_FASTCALL (libpython3.10.so.1.0 + 0x11dc8f)
                #48 0x00007fc8d9db887b _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11387b)
                #49 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #50 0x00007fc8d9dba94c _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11594c)
                #51 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #52 0x00007fc8d9db5d1d _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x110d1d)
                #53 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #54 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #55 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #56 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #57 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #58 0x00007fc8d9dc231f object_vacall (libpython3.10.so.1.0 + 0x11d31f)
                #59 0x00007fc8d9dcbc2d _PyObject_CallMethodIdObjArgs (libpython3.10.so.1.0 + 0x126c2d)
                #60 0x00007fc8d9dcb924 PyImport_ImportModuleLevelObject (libpython3.10.so.1.0 + 0x126924)
                #61 0x00007fc8d9dd316c builtin___import__ (libpython3.10.so.1.0 + 0x12e16c)
                #62 0x00007fc8d9dc2631 cfunction_call (libpython3.10.so.1.0 + 0x11d631)
                #63 0x00007fc8d9dcad5c _PyObject_Call (libpython3.10.so.1.0 + 0x125d5c)
                ELF object binary architecture: AMD x86-64
Only instances with AMD CPUs are affected. This looks like a strongly CPU-dependent problem.
Does tensorflow2-2.12.0-3 also core dump?
Debugging with gdb shows the instruction that triggers the core dump:

   0x00007fa644e49d88 <+184>: vmovdqu %xmm0,(%rax)
   0x00007fa644e49d8c <+188>: vpxor %xmm0,%xmm0,%xmm0
=> 0x00007fa644e49d90 <+192>: vmovdqu8 %ymm0,0x18(%rax)
   0x00007fa644e49d9a <+202>: vzeroupper

vmovdqu8 is an AVX512 instruction.

The tensorflow build enables the AVX512-related options by default:

%ifarch x86_64
%define bz_copts \\\
    %{?copts} \\\
    --copt=-mavx \\\
    --copt=-mavx2 \\\
    --copt=-mfma \\\
    --copt=-mavx512f \\\
    --copt=-mavx512pf \\\
    --copt=-mavx512cd \\\
    --copt=-mavx512bw \\\
    --copt=-mavx512dq \\\
    --copt=-mavx512er \\\
    --config=mkl \\\
    --copt=-DENABLE_INTEL_MKL_BFLOAT16
%endif

The AMD CPU on which the core dump occurs does not support AVX512: the lscpu flags in the report list avx and avx2 but no avx512* entries.
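The mismatch described above can be checked on any x86 Linux host before installing the package. The following is a minimal sketch, not part of the package: the helper names are hypothetical, and the flag list assumes that each --copt=-m<name> option in the spec corresponds to the /proc/cpuinfo flag of the same name.

```python
# Sketch: report which ISA extensions assumed by the build are not
# advertised by the CPU. Helper names are hypothetical.

# Assumption: each --copt=-m<name> in the spec maps to the cpuinfo flag <name>.
BUILD_ISA_FLAGS = [
    "avx", "avx2", "fma",
    "avx512f", "avx512pf", "avx512cd",
    "avx512bw", "avx512dq", "avx512er",
]


def parse_cpu_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. not an x86 cpuinfo)


def missing_isa_flags(cpu_flags, required=BUILD_ISA_FLAGS):
    """Return the required flags absent from cpu_flags, in list order."""
    return [flag for flag in required if flag not in cpu_flags]


def check_host(path="/proc/cpuinfo"):
    """Read the host's cpuinfo and return the missing build flags."""
    with open(path) as fh:
        return missing_isa_flags(parse_cpu_flags(fh.read()))
```

Calling check_host() on the affected instance would be expected to return the avx512* entries, since none of them appear in the lscpu output; an empty list means the CPU advertises every extension in the list.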
tensorflow2-2.12.0-6.an23 no longer enables AVX512 by default.
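To verify an updated package without crashing the interactive session, the import can be run in a child process and its exit status inspected. This is only a sketch with hypothetical helper names; it relies on the documented behaviour of Python's subprocess module on POSIX, where a child killed by a signal reports a negative return code.

```python
import signal
import subprocess
import sys


def fatal_signal(argv):
    """Run argv and return the signal that killed it, or None if it exited."""
    proc = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    if proc.returncode < 0:
        # On POSIX, a return code of -N means the child died from signal N.
        return signal.Signals(-proc.returncode)
    return None


def import_hits_sigill(module, python=sys.executable):
    """Return True if importing module in a child process dies with SIGILL."""
    return fatal_signal([python, "-c", f"import {module}"]) == signal.SIGILL
```

On the affected AMD instance, import_hits_sigill("tensorflow") would be expected to return True with tensorflow2-2.12.0-4.an23 and False once the package without the AVX512 options is installed.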