Bug 6183 - The nccl package provided by the yum repository is unusable; neither PyTorch nor TensorFlow can run: localhost:4286:4286 [0] misc/strongstream.cc:60 NCCL WARN Cuda failure 'CUDA driver is a stub library'. The NCCL RDMA SHARP plugin library libnccl-net.so is also missing
Summary: The nccl package provided by the yum repository is unusable; neither PyTorch nor TensorFlow can run: localhost:4286:4286 [0] misc/strongs...
Status: RESOLVED FIXED
Alias: None
Product: Anolis OS 23
Classification: Anolis OS
Component: BaseOS Packages
Version: 23.0
Hardware: x86_64 Linux
Importance: P2-High S1-blocker
Target Milestone: ---
Assignee: xuchunmei
QA Contact: bolong_tbl
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-18 11:56 UTC by feitian200603
Modified: 2023-08-23 16:21 UTC (History)
0 users

See Also:


Description feitian200603 alibaba_cloud_group 2023-08-18 11:56:41 UTC
Error output:
localhost:4286:4286 [0] NCCL INFO cudaDriverVersion 12010
localhost:4286:4286 [0] NCCL INFO Bootstrap : Using eth0:172.16.0.178<0>
localhost:4286:4286 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
localhost:4286:4286 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
NCCL version 2.18.3+cuda12.1
localhost:4286:4761 [0] NCCL INFO NET/IB : No device found.
localhost:4286:4761 [0] NCCL INFO NET/Socket : Using [0]eth0:172.16.0.178<0>
localhost:4286:4761 [0] NCCL INFO Using network Socket
localhost:4286:4762 [1] NCCL INFO Using network Socket
localhost:4286:4762 [1] NCCL INFO comm 0x56503de1c960 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0xd6215a784943b62 - Init START
localhost:4286:4761 [0] NCCL INFO comm 0x56503de19db0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0xd6215a784943b62 - Init START
localhost:4286:4762 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
localhost:4286:4762 [1] NCCL INFO P2P Chunksize set to 131072
localhost:4286:4761 [0] NCCL INFO Channel 00/02 :    0   1
localhost:4286:4761 [0] NCCL INFO Channel 01/02 :    0   1
localhost:4286:4761 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
localhost:4286:4761 [0] NCCL INFO P2P Chunksize set to 131072
localhost:4286:4761 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
localhost:4286:4762 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
localhost:4286:4761 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
localhost:4286:4762 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
localhost:4286:4762 [1] NCCL INFO Connected all rings
localhost:4286:4762 [1] NCCL INFO Connected all trees
localhost:4286:4761 [0] NCCL INFO Connected all rings
localhost:4286:4762 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
localhost:4286:4762 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
localhost:4286:4761 [0] NCCL INFO Connected all trees
localhost:4286:4761 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
localhost:4286:4761 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
localhost:4286:4762 [1] NCCL INFO comm 0x56503de1c960 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0xd6215a784943b62 - Init COMPLETE
localhost:4286:4761 [0] NCCL INFO comm 0x56503de19db0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0xd6215a784943b62 - Init COMPLETE

localhost:4286:4286 [0] misc/strongstream.cc:60 NCCL WARN Cuda failure 'CUDA driver is a stub library'
localhost:4286:4286 [0] NCCL INFO enqueue.cc:1548 -> 1
localhost:4286:4286 [0] NCCL INFO enqueue.cc:1589 -> 1
localhost:4286:4286 [0] NCCL INFO group.cc:96 -> 1
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
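
When triaging output like the above, it can help to separate the fatal WARN line from the informational plugin-load message (the log itself shows NCCL falling back to its internal implementation when libnccl-net.so is absent). The following is a minimal sketch, not part of NCCL: the function name `summarize_nccl_log` and both regular expressions are hypothetical and were written only against the line shapes visible in this report.

```python
import re

# Hypothetical patterns, derived only from the log lines in this report.
# WARN lines look like:
#   localhost:4286:4286 [0] misc/strongstream.cc:60 NCCL WARN Cuda failure '...'
WARN_RE = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] "
    r"(?P<src>\S+):(?P<line>\d+) NCCL WARN (?P<msg>.*)$"
)
# Plugin load failures look like:
#   ... NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : ...
PLUGIN_RE = re.compile(
    r"NET/Plugin : Plugin load \((?P<lib>[^)]+)\) returned (?P<code>\d+)"
)


def summarize_nccl_log(text):
    """Return (warnings, unloaded_plugins) found in NCCL_DEBUG=INFO output.

    warnings is a list of (source_file, line_number, message) tuples;
    unloaded_plugins is a list of library names that failed to load.
    """
    warnings, plugins = [], []
    for raw in text.splitlines():
        line = raw.strip()
        warn = WARN_RE.match(line)
        if warn:
            warnings.append((warn["src"], int(warn["line"]), warn["msg"]))
            continue
        plugin = PLUGIN_RE.search(line)
        if plugin:
            plugins.append(plugin["lib"])
    return warnings, plugins
```

Feeding it the output above would report one warning from misc/strongstream.cc and one unloaded plugin, libnccl-net.so.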
Comment 1 xuchunmei alibaba_cloud_group 2023-08-23 16:21:37 UTC
This was caused by building nccl with LTO enabled.
Fixed in libnccl-2.18.3-2.cuda12.1.an23.
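
To check whether a given machine already has the fixed build, the installed package string can be compared against libnccl-2.18.3-2.cuda12.1.an23. The sketch below is an illustration only: the helper names `parse_nvr` and `has_lto_fix` are hypothetical, the package name `libnccl` is assumed from the string in this comment, and the comparison looks only at the numeric version and the leading integer of the release rather than implementing full RPM version ordering.

```python
import re

# Values taken from the fixed package named in this comment.
FIXED_VERSION = (2, 18, 3)
FIXED_RELEASE = 2  # leading integer of the release "2.cuda12.1.an23"


def parse_nvr(nvr):
    """Split e.g. 'libnccl-2.18.3-2.cuda12.1.an23' into ((2, 18, 3), 2).

    Assumes the package is named 'libnccl' and the release starts with an
    integer followed by a dot (both assumptions based on this report).
    """
    m = re.match(r"^libnccl-(\d+(?:\.\d+)*)-(\d+)\.", nvr)
    if m is None:
        raise ValueError(f"unrecognized package string: {nvr!r}")
    version = tuple(int(part) for part in m.group(1).split("."))
    return version, int(m.group(2))


def has_lto_fix(nvr):
    """True if the package is at or after the fixed version-release."""
    return parse_nvr(nvr) >= (FIXED_VERSION, FIXED_RELEASE)
```

The input string would typically come from the output of `rpm -q libnccl`; a trailing architecture suffix such as `.x86_64` does not affect the parse.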