Bug 6115 - [anolis23][x86_64][amd][ai][nightly] On AMD CPUs, import tensorflow core dumps after installing tensorflow2
Status: RESOLVED FIXED
Alias: None
Product: Anolis OS 23
Classification: Anolis OS
Component: BaseOS Packages
Version: 23.0
Hardware: All Linux
Importance: P2-High S2-major
Target Milestone: ---
Assignee: xuchunmei
QA Contact: bolong_tbl
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-07 17:03 UTC by yunmeng365524
Modified: 2023-08-25 14:23 UTC

Description yunmeng365524 2023-08-07 17:03:32 UTC
Description of problem:
On an AMD CPU, after installing tensorflow2, running import tensorflow causes a core dump.

Version-Release number of selected component (if applicable):
[root@localhost test-results]# uname -a
Linux localhost.localdomain 5.10.134-14.1.an23.x86_64 #1 SMP Thu May 25 19:57:17 CST 2023 x86_64 GNU/Linux
[root@localhost test-results]# cat /etc/os-release
NAME="Anolis OS"
VERSION="23"
ID="anolis"
VERSION_ID="23"
PLATFORM_ID="platform:an23"
PRETTY_NAME="Anolis OS 23"
ANSI_COLOR="0;31"
HOME_URL="https://openanolis.cn/"
BUG_REPORT_URL="https://bugzilla.openanolis.cn/"

[root@localhost test-results]# lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               AuthenticAMD
  BIOS Vendor ID:        Alibaba Cloud
  Model name:            AMD EPYC 7T83 64-Core Processor
    BIOS Model name:     pc-i440fx-2.1  CPU @ 0.0GHz
    BIOS CPU family:     1
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  2
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            5090.43
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht
                         syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid
                         tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave
                         avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw
                         topoext invpcid_single vmmcall tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt
                         clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru wbnoinvd arat vaes vpclmulqdq
                         rdpid fsrm
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   64 KiB (2 instances)
  L1i:                   64 KiB (2 instances)
  L2:                    1 MiB (2 instances)
  L3:                    32 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-3
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

How reproducible:
Install tensorflow2:
[root@localhost test-results]# rpm -qi tensorflow2
Name        : tensorflow2
Version     : 2.12.0
Release     : 4.an23
Architecture: x86_64
Install Date: Mon 07 Aug 2023 03:26:55 PM CST
Group       : Development/Languages/Python
Size        : 1098009814
License     : Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause AND FSFUL AND MIT AND MPL-2.0 AND OpenSSL AND Python-2.0
Signature   : RSA/SHA256, Thu 27 Jul 2023 02:07:18 PM CST, Key ID 619140084873f7c5
Source RPM  : tensorflow2-2.12.0-4.an23.src.rpm
Build Date  : Fri 21 Jul 2023 09:03:14 PM CST
Build Host  : iZ2ze8vdmdyl66lfybi1hzZ
Packager    : OpenAnolis Community
Vendor      : OpenAnolis Community
URL         : https://www.tensorflow.org/
Summary     : A framework used for deep learning
Description :
This open source software library for numerical computation is used for data
flow graphs. The graph nodes represent mathematical operations, while the graph
edges represent the multidimensional data arrays (tensors) that flow between
them. This flexible architecture enables you to deploy computation to one or
more CPUs in a desktop, server, or mobile device without rewriting code.

The import then core dumps:
[root@localhost test-results]# python --version
Python 3.10.12
[root@localhost test-results]# python
Python 3.10.12 (main, Jun  7 2023, 00:00:00) [GCC 12.2.1 20221121 (Anolis OS 12.2.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Illegal instruction (core dumped)
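
For anyone reproducing this from a script rather than an interactive shell, the crash can be detected without taking down the calling process. The following is only a sketch: the helper names classify_exit and try_import are hypothetical and not part of the report, and it assumes a POSIX system, where subprocess reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess
import sys


def classify_exit(returncode: int) -> str:
    """Map a child's return code to a coarse outcome label."""
    if returncode == 0:
        return "ok"
    # On POSIX, subprocess reports death-by-signal as -<signal number>.
    if returncode == -signal.SIGILL:
        return "illegal-instruction"
    return "failed"


def try_import(module: str) -> str:
    """Import `module` in a child interpreter so a crash cannot kill the caller."""
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return classify_exit(proc.returncode)
```

Calling try_import("tensorflow") on an affected machine would be expected to return "illegal-instruction", matching signal 4 (ILL) in the coredumpctl output.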

Steps to Reproduce:
As above.

Actual results:
import tensorflow core dumps.

Expected results:
The import succeeds.

Additional info:
Core dump information:
[root@localhost test-results]# coredumpctl info 3297584
           PID: 3297584 (python)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 4 (ILL)
     Timestamp: Mon 2023-08-07 15:53:11 CST (1min 27s ago)
  Command Line: python
    Executable: /usr/bin/python3.10
 Control Group: /user.slice/user-0.slice/session-30.scope
          Unit: session-30.scope
         Slice: user-0.slice
       Session: 30
     Owner UID: 0 (root)
       Boot ID: 208cdfbbf99f450c959b6ad71d54f5da
    Machine ID: 69185a58f8174a67b57acdb31605288b
      Hostname: localhost.localdomain
       Storage: /var/lib/systemd/coredump/core.python.0.208cdfbbf99f450c959b6ad71d54f5da.3297584.1691394791000000.zst (prese>
  Size on Disk: 6.1M
       Message: Process 3297584 (python) of user 0 dumped core.

                Stack trace of thread 3297584:
                #0  0x00007fc8d7e017c0 _ZN6google8protobuf8internal13OnShutdownRunEPFvPKvES3_ (libtensorflow_framework.so.2 >
                #1  0x00007fc8d7df9321 _ZN6google8protobuf8internal24InitProtobufDefaultsSlowEv (libtensorflow_framework.so.>
                #2  0x00007fc8da003d9e call_init (ld-linux-x86-64.so.2 + 0x4d9e)
                #3  0x00007fc8da003e8c _dl_init (ld-linux-x86-64.so.2 + 0x4e8c)
                #4  0x00007fc8d9c176b4 _dl_catch_exception (libc.so.6 + 0x14a6b4)
                #5  0x00007fc8da00a7f6 dl_open_worker (ld-linux-x86-64.so.2 + 0xb7f6)
                #6  0x00007fc8d9c1765e _dl_catch_exception (libc.so.6 + 0x14a65e)
                #7  0x00007fc8da00ab8c _dl_open (ld-linux-x86-64.so.2 + 0xbb8c)
                #8  0x00007fc8d9b500bc dlopen_doit (libc.so.6 + 0x830bc)
                #9  0x00007fc8d9c1765e _dl_catch_exception (libc.so.6 + 0x14a65e)
                #10 0x00007fc8d9c17713 _dl_catch_error (libc.so.6 + 0x14a713)
                #11 0x00007fc8d9b4fb8f _dlerror_run (libc.so.6 + 0x82b8f)
                #12 0x00007fc8d9b50171 dlopen@GLIBC_2.2.5 (libc.so.6 + 0x83171)
                #13 0x00007fc8d9e65401 _PyImport_FindSharedFuncptr (libpython3.10.so.1.0 + 0x1c0401)
                #14 0x00007fc8d9e630f9 _imp_create_dynamic (libpython3.10.so.1.0 + 0x1be0f9)
                #15 0x00007fc8d9dc2c8f cfunction_vectorcall_FASTCALL (libpython3.10.so.1.0 + 0x11dc8f)
                #16 0x00007fc8d9db887b _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11387b)
                #17 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #18 0x00007fc8d9dba94c _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11594c)
                #19 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #20 0x00007fc8d9db5d1d _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x110d1d)
                #21 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #22 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #23 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #24 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #25 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #26 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #27 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #28 0x00007fc8d9dc231f object_vacall (libpython3.10.so.1.0 + 0x11d31f)
                #29 0x00007fc8d9dcbc2d _PyObject_CallMethodIdObjArgs (libpython3.10.so.1.0 + 0x126c2d)
                #30 0x00007fc8d9dcb924 PyImport_ImportModuleLevelObject (libpython3.10.so.1.0 + 0x126924)
                #31 0x00007fc8d9dd316c builtin___import__ (libpython3.10.so.1.0 + 0x12e16c)
                #32 0x00007fc8d9dc2631 cfunction_call (libpython3.10.so.1.0 + 0x11d631)
                #33 0x00007fc8d9dcad5c _PyObject_Call (libpython3.10.so.1.0 + 0x125d5c)
                #34 0x00007fc8d9db887b _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11387b)
                #35 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #36 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #37 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #38 0x00007fc8d9dc231f object_vacall (libpython3.10.so.1.0 + 0x11d31f)
                #39 0x00007fc8d9dcbc2d _PyObject_CallMethodIdObjArgs (libpython3.10.so.1.0 + 0x126c2d)
                #40 0x00007fc8d9dcba91 PyImport_ImportModuleLevelObject (libpython3.10.so.1.0 + 0x126a91)
                #41 0x00007fc8d9db95fd _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x1145fd)
                #42 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #43 0x00007fc8d9dba94c _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11594c)
                #44 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #45 0x00007fc8d9e2fe9e PyEval_EvalCode (libpython3.10.so.1.0 + 0x18ae9e)
                #46 0x00007fc8d9e36ecb builtin_exec (libpython3.10.so.1.0 + 0x191ecb)
                #47 0x00007fc8d9dc2c8f cfunction_vectorcall_FASTCALL (libpython3.10.so.1.0 + 0x11dc8f)
                #48 0x00007fc8d9db887b _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11387b)
                #49 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #50 0x00007fc8d9dba94c _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11594c)
                #51 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #52 0x00007fc8d9db5d1d _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x110d1d)
                #53 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #54 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #55 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #56 0x00007fc8d9db595e _PyEval_EvalFrameDefault (libpython3.10.so.1.0 + 0x11095e)
                #57 0x00007fc8d9db4743 _PyEval_Vector (libpython3.10.so.1.0 + 0x10f743)
                #58 0x00007fc8d9dc231f object_vacall (libpython3.10.so.1.0 + 0x11d31f)
                #59 0x00007fc8d9dcbc2d _PyObject_CallMethodIdObjArgs (libpython3.10.so.1.0 + 0x126c2d)
                #60 0x00007fc8d9dcb924 PyImport_ImportModuleLevelObject (libpython3.10.so.1.0 + 0x126924)
                #61 0x00007fc8d9dd316c builtin___import__ (libpython3.10.so.1.0 + 0x12e16c)
                #62 0x00007fc8d9dc2631 cfunction_call (libpython3.10.so.1.0 + 0x11d631)
                #63 0x00007fc8d9dcad5c _PyObject_Call (libpython3.10.so.1.0 + 0x125d5c)
                ELF object binary architecture: AMD x86-64
Comment 1 yunmeng365524 2023-08-07 17:05:38 UTC
Only instances with AMD CPUs are affected. This looks strongly CPU-related.
Comment 2 xuchunmei alibaba_cloud_group 2023-08-08 13:42:04 UTC
Does the core dump also occur with tensorflow2-2.12.0-3?
Comment 3 xuchunmei alibaba_cloud_group 2023-08-08 15:41:54 UTC
Debugging with gdb shows the instruction that triggers the core dump:
   0x00007fa644e49d88 <+184>:	vmovdqu %xmm0,(%rax)
   0x00007fa644e49d8c <+188>:	vpxor  %xmm0,%xmm0,%xmm0
=> 0x00007fa644e49d90 <+192>:	vmovdqu8 %ymm0,0x18(%rax)
   0x00007fa644e49d9a <+202>:	vzeroupper

This is an AVX-512 instruction (vmovdqu8 belongs to the AVX512BW extension).

The tensorflow build enables AVX-512 related options by default:

%ifarch x86_64
%define bz_copts \\\
  %{?copts} \\\
  --copt=-mavx \\\
  --copt=-mavx2 \\\
  --copt=-mfma \\\
  --copt=-mavx512f \\\
  --copt=-mavx512pf \\\
  --copt=-mavx512cd \\\
  --copt=-mavx512bw \\\
  --copt=-mavx512dq \\\
  --copt=-mavx512er \\\
  --config=mkl \\\
  --copt=-DENABLE_INTEL_MKL_BFLOAT16
%endif


The AMD CPU on which the core dump occurs does not support AVX-512.
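
The mismatch between the build options and the CPU can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, and it assumes that each --copt=-m<name> option corresponds to a /proc/cpuinfo flag of the same name. That holds for the options listed in this spec fragment, but not for every compiler option in general (for example -msse4.1 appears as sse4_1).

```python
# Options copied from the spec fragment quoted above.
SPEC_COPTS = [
    "--copt=-mavx",
    "--copt=-mavx2",
    "--copt=-mfma",
    "--copt=-mavx512f",
    "--copt=-mavx512pf",
    "--copt=-mavx512cd",
    "--copt=-mavx512bw",
    "--copt=-mavx512dq",
    "--copt=-mavx512er",
    "--config=mkl",
    "--copt=-DENABLE_INTEL_MKL_BFLOAT16",
]


def required_flags(copts):
    """Extract instruction-set names from --copt=-m<name> options."""
    prefix = "--copt=-m"
    return [opt[len(prefix):] for opt in copts if opt.startswith(prefix)]


def missing_flags(cpu_flags, copts=SPEC_COPTS):
    """Return the required instruction sets that the CPU flag string lacks."""
    present = set(cpu_flags.split())
    return [flag for flag in required_flags(copts) if flag not in present]


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the flag string of the first CPU listed (Linux x86 only)."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""
```

Passing the Flags value from the lscpu output in the description to missing_flags would report all six avx512* options as missing, since that CPU advertises avx, avx2 and fma but no avx512 flags.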
Comment 4 xuchunmei alibaba_cloud_group 2023-08-25 14:23:48 UTC
tensorflow2-2.12.0-6.an23 no longer enables AVX-512 by default.