Bug 6793 - [Anolis23][docker image tests][x86] The tensorflow image cannot be used out of the box in a GPU environment
Summary: [Anolis23][docker image tests][x86] The tensorflow image cannot be used out of the box in a GPU environment
Status: RESOLVED FIXED
Alias: None
Product: Anolis OS 23
Classification: Anolis OS
Component: BaseOS Packages
Version: 23.0
Hardware: All
OS: Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: xuchunmei
QA Contact: bolong_tbl
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-12 20:09 UTC by yunmeng365524
Modified: 2023-10-16 17:54 UTC

See Also:



Description yunmeng365524 2023-10-12 20:09:26 UTC
Description of problem:
The tensorflow image cannot be used out of the box in a GPU environment.

Version-Release number of selected component (if applicable):
Image address:
registry.openanolis.cn/openanolis/tensorflow:2.12.0-23

Host information:
[root@localhost ~]# uname -a
Linux localhost.localdomain 5.10.134-14.1.an23.x86_64 #1 SMP Thu May 25 19:57:17 CST 2023 x86_64 GNU/Linux

[root@localhost ~]# nvidia-smi
Thu Oct 12 20:03:37 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB            Off| 00000000:00:08.0 Off |                    0 |
| N/A   30C    P0               25W / 250W|      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB            Off| 00000000:00:09.0 Off |                    0 |
| N/A   31C    P0               27W / 250W|      0MiB / 16384MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

RPM information inside the image:
[root@e72499a522c1 tensorflow2]# rpm -qa | grep tensorflow
libtensorflow_framework2-2.12.0-6.an23.x86_64
python3-tensorflow-io-gcs-filesystem-0.32.0-2.an23.x86_64
python3-tensorflow-estimator-2.12.0-1.an23.noarch
tensorflow2-2.12.0-6.an23.x86_64
[root@e72499a522c1 tensorflow2]# ls
tc_ai_tensorflow2_auto_graph.py  tc_ai_tensorflow2_eager_execution.py  tc_ai_tensorflow2_sample.py
tc_ai_tensorflow2_distribute.py  tc_ai_tensorflow2_keras.py
[root@e72499a522c1 tensorflow2]# rpm -qa | grep cuda
cuda-toolkit-config-common-12.1.105-4.an23.noarch
cuda-toolkit-12-config-common-12.1.105-4.an23.noarch
cuda-toolkit-12-1-config-common-12.1.105-4.an23.noarch
cuda-cudart-12-1-12.1.105-4.an23.x86_64
cuda-opencl-12-1-12.1.105-4.an23.x86_64
libnccl-2.18.3-2.cuda12.1.an23.x86_64
cuda-nvrtc-12-1-12.1.105-4.an23.x86_64
cuda-libraries-12-1-12.1.1-4.an23.x86_64
libcudnn-8.9.3.28-1.cuda12.1.an23.x86_64


How reproducible:
1. Pull the image on the host
2. Start the image with GPU support
[root@localhost ~]# docker images
REPOSITORY                                     TAG                 IMAGE ID       CREATED       SIZE
registry.openanolis.cn/openanolis/cuda         12.1.1-23-devel     d713bee7bd5b   2 weeks ago   6.47GB
registry.openanolis.cn/openanolis/cuda         12.1.1-23-runtime   b8c3adb5af23   4 weeks ago   2.06GB
registry.openanolis.cn/openanolis/pytorch      2.0.1-23            cf5996f70ced   4 weeks ago   4.73GB
registry.openanolis.cn/openanolis/tensorflow   2.12.0-23           c40f63cf9d9c   5 weeks ago   4.95GB
[root@localhost ~]# docker run --gpus all -it -v /tmp:/tmp c40f63cf9d9c

3. Run the tensorflow test scripts
[root@e72499a522c1 tensorflow2]# avocado run --nrunner-max-parallel-tasks 1 *.py
JOB ID     : da0fd9e4526c7fa279c9d2ab4cbb1693704eb8fe
JOB LOG    : /root/avocado/job-results/job-2023-10-12T09.43-da0fd9e/job.log
 (1/5) tc_ai_tensorflow2_auto_graph.py:Test.test: STARTED
 (1/5) tc_ai_tensorflow2_auto_graph.py:Test.test: PASS (6.99 s)
 (2/5) tc_ai_tensorflow2_distribute.py:Test.test: STARTED
 (2/5) tc_ai_tensorflow2_distribute.py:Test.test: ERROR: Command 'python3 ../../../res/ai/tensorflow/tensorflow2_distribute.py' failed.\nstdout: b'Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz\n\r    8192/11490434 [..............................] - ETA: 0s\x08\x08\x0... (16.97 s)
 (3/5) tc_ai_tensorflow2_eager_execution.py:Test.test: STARTED
 (3/5) tc_ai_tensorflow2_eager_execution.py:Test.test: ERROR: Command 'python3 ../../../res/ai/tensorflow/tensorflow2_eager_execution.py' failed.\nstdout: b''\nstderr: b"2023-10-12 09:44:32.096672: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can't find libdevice directory ${CUDA_DIR}... (25.17 s)
 (4/5) tc_ai_tensorflow2_keras.py:Test.test: STARTED
 (4/5) tc_ai_tensorflow2_keras.py:Test.test: ERROR: Command 'python3 ../../../res/ai/tensorflow/tensorflow2_keras.py' failed.\nstdout: b'Epoch 1/5\n'\nstderr: b'2023-10-12 09:44:59.394691: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can\'t find libdevice directory ${CUDA_DI... (10.15 s)
 (5/5) tc_ai_tensorflow2_sample.py:Test.test: STARTED
 (5/5) tc_ai_tensorflow2_sample.py:Test.test: ERROR: Command 'python3 ../../../res/ai/tensorflow/tensorflow2_sample.py' failed.\nstdout: b'Epoch 1/1000\n'\nstderr: b"2023-10-12 09:45:09.103549: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can't find libdevice directory ${CUDA... (24.71 s)
RESULTS    : PASS 1 | ERROR 4 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB TIME   : 87.71 s

Looking at one of the logs, the error is as follows:
[stdlog] 2023-10-12 09:44:25,905 avocado.utils.process INFO | Running 'python3 ../../../res/ai/tensorflow/tensorflow2_eager_execution.py'
[stdlog] 2023-10-12 09:44:32,096 avocado.utils.process DEBUG| [stderr] 2023-10-12 09:44:32.096672: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr] Searched for CUDA in the following directories:
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr]   ./cuda_sdk_lib
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr]   /usr/local/cuda-12.1
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr]   /usr/local/cuda
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr]   .
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr] You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
[stdlog] 2023-10-12 09:44:32,102 avocado.utils.process DEBUG| [stderr] 2023-10-12 09:44:32.102458: W tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:109] Couldn't get ptxas version : FAILED_PRECONDITION: Couldn't get ptxas/nvlink version string: INTERNAL: Couldn't invoke ptxas --version
[stdlog] 2023-10-12 09:44:32,103 avocado.utils.process DEBUG| [stderr] 2023-10-12 09:44:32.103043: F tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:504] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
[stdlog] 2023-10-12 09:44:50,665 avocado.utils.process INFO | Command 'python3 ../../../res/ai/tensorflow/tensorflow2_eager_execution.py' finished with -6 after 24.759210355s

After manually installing the cuda packages via yum, the tests pass.
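The two warnings above point at pieces shipped with the CUDA compiler: the ${CUDA_DIR}/nvvm/libdevice directory and the ptxas binary, neither of which is provided by the cuda packages listed in the image. Before running the test suite, a minimal sanity check along these lines could be run inside the container (a sketch only; the file name is made up, and it assumes the toolkit root is /usr/local/cuda-12.1 as in the search paths above):

# cat check_cuda_prereqs.py   (hypothetical helper, not part of the image or the test suite)
import os
import shutil

# XLA needs ptxas (shipped with cuda-nvcc) on PATH to compile PTX to SASS.
print("ptxas:", shutil.which("ptxas") or "NOT FOUND")

# XLA also looks for libdevice under <cuda root>/nvvm/libdevice.
for root in ("/usr/local/cuda-12.1", "/usr/local/cuda"):
    libdevice = os.path.join(root, "nvvm", "libdevice")
    print(libdevice, "->", "exists" if os.path.isdir(libdevice) else "missing")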


Steps to Reproduce:
As above.

Actual results:
cuda has to be installed manually before the tensorflow image can be used

Expected results:
the tensorflow image works out of the box

Additional info:
Comment 1 xuchunmei alibaba_cloud_group 2023-10-12 20:43:25 UTC
Does the test script need cuda to build anything?
Comment 2 yunmeng365524 2023-10-13 10:37:21 UTC
Test case code:
# cat tensorflow2_eager_execution.py
import tensorflow as tf

# Data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

# Model parameters
W = tf.Variable([0.3], dtype=tf.float32)
b = tf.Variable([-0.3], dtype=tf.float32)

# Loss function
def loss_fn(x, y):
    linear_model = W * x + b
    return tf.reduce_sum(tf.square(linear_model - y))

# Optimizer
optimizer = tf.keras.optimizers.SGD(0.01)

# Training
for i in range(1000):
    with tf.GradientTape() as tape:
        loss = loss_fn(x_train, y_train)
    grads = tape.gradient(loss, [W, b])
    optimizer.apply_gradients(zip(grads, [W, b]))

# Print the model parameters
print("W = %s, b = %s" % (W.numpy(), b.numpy()))
Comment 3 xuchunmei alibaba_cloud_group 2023-10-13 11:00:05 UTC
(In reply to yunmeng365524 from comment #2)

From the error message:
W tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:109] Couldn't get ptxas version : FAILED_PRECONDITION: Couldn't get ptxas/nvlink version string: INTERNAL: Couldn't invoke ptxas --version

ptxas is missing; it is provided by cuda-nvcc-12-1.
After installing it with dnf install cuda-nvcc-12-1 -y, the tests run successfully.
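After the install, a quick check from inside the container (illustrative only) should show that ptxas is now invokable:

# hypothetical post-install check: ptxas --version should now succeed
import subprocess
print(subprocess.run(["ptxas", "--version"], capture_output=True, text=True).stdout)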
Comment 4 xuchunmei alibaba_cloud_group 2023-10-16 17:54:50 UTC
Fixed in the latest image.