Description of problem:
The TensorFlow image does not work out of the box in a GPU environment.

Version-Release number of selected component (if applicable):
Image: registry.openanolis.cn/openanolis/tensorflow:2.12.0-23

Host information:
[root@localhost ~]# uname -a
Linux localhost.localdomain 5.10.134-14.1.an23.x86_64 #1 SMP Thu May 25 19:57:17 CST 2023 x86_64 GNU/Linux
[root@localhost ~]# nvidia-smi
Thu Oct 12 20:03:37 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB            Off| 00000000:00:08.0 Off |                    0 |
| N/A   30C    P0               25W / 250W|      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB            Off| 00000000:00:09.0 Off |                    0 |
| N/A   31C    P0               27W / 250W|      0MiB / 16384MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

RPM packages inside the image:
[root@e72499a522c1 tensorflow2]# rpm -qa | grep tensorflow
libtensorflow_framework2-2.12.0-6.an23.x86_64
python3-tensorflow-io-gcs-filesystem-0.32.0-2.an23.x86_64
python3-tensorflow-estimator-2.12.0-1.an23.noarch
tensorflow2-2.12.0-6.an23.x86_64
[root@e72499a522c1 tensorflow2]# ls
tc_ai_tensorflow2_auto_graph.py
tc_ai_tensorflow2_eager_execution.py
tc_ai_tensorflow2_sample.py
tc_ai_tensorflow2_distribute.py
tc_ai_tensorflow2_keras.py
[root@e72499a522c1 tensorflow2]# rpm -qa | greo cuda
bash: greo: command not found
[root@e72499a522c1 tensorflow2]# rpm -qa | grep cuda
cuda-toolkit-config-common-12.1.105-4.an23.noarch
cuda-toolkit-12-config-common-12.1.105-4.an23.noarch
cuda-toolkit-12-1-config-common-12.1.105-4.an23.noarch
cuda-cudart-12-1-12.1.105-4.an23.x86_64
cuda-opencl-12-1-12.1.105-4.an23.x86_64
libnccl-2.18.3-2.cuda12.1.an23.x86_64
cuda-nvrtc-12-1-12.1.105-4.an23.x86_64
cuda-libraries-12-1-12.1.1-4.an23.x86_64
libcudnn-8.9.3.28-1.cuda12.1.an23.x86_64
[root@e72499a522c1 tensorflow2]# rpm -qa | grep tensorflow
libtensorflow_framework2-2.12.0-6.an23.x86_64
python3-tensorflow-io-gcs-filesystem-0.32.0-2.an23.x86_64
python3-tensorflow-estimator-2.12.0-1.an23.noarch
tensorflow2-2.12.0-6.an23.x86_64

How reproducible:
1. Pull the image on the host.
2. Start the image with GPU access.
[root@localhost ~]# docker images
REPOSITORY                                     TAG                 IMAGE ID       CREATED       SIZE
registry.openanolis.cn/openanolis/cuda         12.1.1-23-devel     d713bee7bd5b   2 weeks ago   6.47GB
registry.openanolis.cn/openanolis/cuda         12.1.1-23-runtime   b8c3adb5af23   4 weeks ago   2.06GB
registry.openanolis.cn/openanolis/pytorch      2.0.1-23            cf5996f70ced   4 weeks ago   4.73GB
registry.openanolis.cn/openanolis/tensorflow   2.12.0-23           c40f63cf9d9c   5 weeks ago   4.95GB
[root@localhost ~]# docker run --gpus all -it -v /tmp:/tmp c40f63cf9d9c
3. Run the TensorFlow test scripts.
[root@e72499a522c1 tensorflow2]# avocado run --nrunner-max-parallel-tasks 1 *.py
JOB ID     : da0fd9e4526c7fa279c9d2ab4cbb1693704eb8fe
JOB LOG    : /root/avocado/job-results/job-2023-10-12T09.43-da0fd9e/job.log
 (1/5) tc_ai_tensorflow2_auto_graph.py:Test.test: STARTED
 (1/5) tc_ai_tensorflow2_auto_graph.py:Test.test: PASS (6.99 s)
 (2/5) tc_ai_tensorflow2_distribute.py:Test.test: STARTED
 (2/5) tc_ai_tensorflow2_distribute.py:Test.test: ERROR: Command 'python3 ../../../res/ai/tensorflow/tensorflow2_distribute.py' failed.\nstdout: b'Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz\n\r 8192/11490434 [..............................] - ETA: 0s\x08\x08\x0... (16.97 s)
 (3/5) tc_ai_tensorflow2_eager_execution.py:Test.test: STARTED
 (3/5) tc_ai_tensorflow2_eager_execution.py:Test.test: ERROR: Command 'python3 ../../../res/ai/tensorflow/tensorflow2_eager_execution.py' failed.\nstdout: b''\nstderr: b"2023-10-12 09:44:32.096672: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can't find libdevice directory ${CUDA_DIR}... (25.17 s)
 (4/5) tc_ai_tensorflow2_keras.py:Test.test: STARTED
 (4/5) tc_ai_tensorflow2_keras.py:Test.test: ERROR: Command 'python3 ../../../res/ai/tensorflow/tensorflow2_keras.py' failed.\nstdout: b'Epoch 1/5\n'\nstderr: b'2023-10-12 09:44:59.394691: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can\'t find libdevice directory ${CUDA_DI... (10.15 s)
 (5/5) tc_ai_tensorflow2_sample.py:Test.test: STARTED
 (5/5) tc_ai_tensorflow2_sample.py:Test.test: ERROR: Command 'python3 ../../../res/ai/tensorflow/tensorflow2_sample.py' failed.\nstdout: b'Epoch 1/1000\n'\nstderr: b"2023-10-12 09:45:09.103549: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can't find libdevice directory ${CUDA... (24.71 s)
RESULTS    : PASS 1 | ERROR 4 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB TIME   : 87.71 s

Inspecting one of the job logs shows the following error:
[stdlog] 2023-10-12 09:44:25,905 avocado.utils.process INFO | Running 'python3 ../../../res/ai/tensorflow/tensorflow2_eager_execution.py'
[stdlog] 2023-10-12 09:44:32,096 avocado.utils.process DEBUG| [stderr] 2023-10-12 09:44:32.096672: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr] Searched for CUDA in the following directories:
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr]   ./cuda_sdk_lib
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr]   /usr/local/cuda-12.1
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr]   /usr/local/cuda
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr]   .
[stdlog] 2023-10-12 09:44:32,097 avocado.utils.process DEBUG| [stderr] You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
[stdlog] 2023-10-12 09:44:32,102 avocado.utils.process DEBUG| [stderr] 2023-10-12 09:44:32.102458: W tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:109] Couldn't get ptxas version : FAILED_PRECONDITION: Couldn't get ptxas/nvlink version string: INTERNAL: Couldn't invoke ptxas --version
[stdlog] 2023-10-12 09:44:32,103 avocado.utils.process DEBUG| [stderr] 2023-10-12 09:44:32.103043: F tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:504] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
[stdlog] 2023-10-12 09:44:50,665 avocado.utils.process INFO | Command 'python3 ../../../res/ai/tensorflow/tensorflow2_eager_execution.py' finished with -6 after 24.759210355s

After manually installing the CUDA packages with yum, the tests pass.

Steps to Reproduce:
As above.

Actual results:
The CUDA packages have to be installed manually before the image can be used.

Expected results:
The TensorFlow image works out of the box.

Additional info:
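For anyone triaging the libdevice warning in the log above, here is a rough sketch of a check that can be run inside the container. It looks for nvvm/libdevice under the directories listed in the "Searched for CUDA" lines and builds the XLA_FLAGS value that the warning itself suggests. The function names find_libdevice and xla_flags_for are made up for illustration, and this only addresses the libdevice lookup; it is not a fix for a missing ptxas binary.

```python
import os

# Directories copied from the "Searched for CUDA in the following directories" log lines.
SEARCHED_DIRS = ["./cuda_sdk_lib", "/usr/local/cuda-12.1", "/usr/local/cuda", "."]


def find_libdevice(candidates=SEARCHED_DIRS):
    """Return the first candidate directory that contains nvvm/libdevice, or None."""
    for cuda_dir in candidates:
        if os.path.isdir(os.path.join(cuda_dir, "nvvm", "libdevice")):
            return cuda_dir
    return None


def xla_flags_for(cuda_dir, existing=""):
    """Append --xla_gpu_cuda_data_dir=<cuda_dir> to an existing XLA_FLAGS string."""
    flag = f"--xla_gpu_cuda_data_dir={cuda_dir}"
    parts = existing.split()
    if flag not in parts:
        parts.append(flag)
    return " ".join(parts)


if __name__ == "__main__":
    found = find_libdevice()
    if found is None:
        print("nvvm/libdevice not found in any of the searched directories")
    else:
        # Assumption: the variable should be exported before TensorFlow starts.
        print("XLA_FLAGS=" + xla_flags_for(found, os.environ.get("XLA_FLAGS", "")))
```

If the script reports that nothing was found, the CUDA data files are simply not present in the image, which matches the rpm list in the report.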
Does the test script need CUDA to build anything?
Test case code:
# cat tensorflow2_eager_execution.py
import tensorflow as tf

# Data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

# Model parameters
W = tf.Variable([0.3], dtype=tf.float32)
b = tf.Variable([-0.3], dtype=tf.float32)

# Loss function
def loss_fn(x, y):
    linear_model = W * x + b
    return tf.reduce_sum(tf.square(linear_model - y))

# Optimizer
optimizer = tf.keras.optimizers.SGD(0.01)

# Training
for i in range(1000):
    with tf.GradientTape() as tape:
        loss = loss_fn(x_train, y_train)
    grads = tape.gradient(loss, [W, b])
    optimizer.apply_gradients(zip(grads, [W, b]))

# Print the model parameters
print("W = %s, b = %s" % (W.numpy(), b.numpy()))
(In reply to yunmeng365524 from comment #2)
> Test case code:
> # cat tensorflow2_eager_execution.py
> import tensorflow as tf
>
> # Data
> x_train = [1, 2, 3, 4]
> y_train = [0, -1, -2, -3]
>
> # Model parameters
> W = tf.Variable([0.3], dtype=tf.float32)
> b = tf.Variable([-0.3], dtype=tf.float32)
>
> # Loss function
> def loss_fn(x, y):
>     linear_model = W * x + b
>     return tf.reduce_sum(tf.square(linear_model - y))
>
> # Optimizer
> optimizer = tf.keras.optimizers.SGD(0.01)
>
> # Training
> for i in range(1000):
>     with tf.GradientTape() as tape:
>         loss = loss_fn(x_train, y_train)
>     grads = tape.gradient(loss, [W, b])
>     optimizer.apply_gradients(zip(grads, [W, b]))
>
> # Print the model parameters
> print("W = %s, b = %s" % (W.numpy(), b.numpy()))

From the error message:
W tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:109] Couldn't get ptxas version : FAILED_PRECONDITION: Couldn't get ptxas/nvlink version string: INTERNAL: Couldn't invoke ptxas --version

ptxas is missing from the image. It is provided by the cuda-nvcc-12-1 package. After installing it with "dnf install cuda-nvcc-12-1 -y", the test case runs successfully.
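To confirm this diagnosis in a given image before running the test suite, a small preflight check along these lines might help. It is only a sketch: ptxas_version is a hypothetical helper name, and it does nothing more than look up ptxas on PATH and run the same "ptxas --version" command that the error message says could not be invoked.

```python
import shutil
import subprocess


def ptxas_version(executable="ptxas"):
    """Return the output of `<executable> --version`, or None if it cannot be run."""
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = ptxas_version()
    if version is None:
        # In this report, installing cuda-nvcc-12-1 made ptxas available.
        print("ptxas is not runnable from PATH")
    else:
        print(version)
```

A None result inside the container would be consistent with the "Couldn't invoke ptxas --version" warning quoted above.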
This has been fixed in the latest image.