Error message:
localhost:4286:4286 [0] NCCL INFO cudaDriverVersion 12010
localhost:4286:4286 [0] NCCL INFO Bootstrap : Using eth0:172.16.0.178<0>
localhost:4286:4286 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
localhost:4286:4286 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
NCCL version 2.18.3+cuda12.1
localhost:4286:4761 [0] NCCL INFO NET/IB : No device found.
localhost:4286:4761 [0] NCCL INFO NET/Socket : Using [0]eth0:172.16.0.178<0>
localhost:4286:4761 [0] NCCL INFO Using network Socket
localhost:4286:4762 [1] NCCL INFO Using network Socket
localhost:4286:4762 [1] NCCL INFO comm 0x56503de1c960 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0xd6215a784943b62 - Init START
localhost:4286:4761 [0] NCCL INFO comm 0x56503de19db0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0xd6215a784943b62 - Init START
localhost:4286:4762 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
localhost:4286:4762 [1] NCCL INFO P2P Chunksize set to 131072
localhost:4286:4761 [0] NCCL INFO Channel 00/02 : 0 1
localhost:4286:4761 [0] NCCL INFO Channel 01/02 : 0 1
localhost:4286:4761 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
localhost:4286:4761 [0] NCCL INFO P2P Chunksize set to 131072
localhost:4286:4761 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
localhost:4286:4762 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
localhost:4286:4761 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
localhost:4286:4762 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
localhost:4286:4762 [1] NCCL INFO Connected all rings
localhost:4286:4762 [1] NCCL INFO Connected all trees
localhost:4286:4761 [0] NCCL INFO Connected all rings
localhost:4286:4762 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
localhost:4286:4762 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
localhost:4286:4761 [0] NCCL INFO Connected all trees
localhost:4286:4761 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
localhost:4286:4761 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
localhost:4286:4762 [1] NCCL INFO comm 0x56503de1c960 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 90 commId 0xd6215a784943b62 - Init COMPLETE
localhost:4286:4761 [0] NCCL INFO comm 0x56503de19db0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 80 commId 0xd6215a784943b62 - Init COMPLETE
localhost:4286:4286 [0] misc/strongstream.cc:60 NCCL WARN Cuda failure 'CUDA driver is a stub library'
localhost:4286:4286 [0] NCCL INFO enqueue.cc:1548 -> 1
localhost:4286:4286 [0] NCCL INFO enqueue.cc:1589 -> 1
localhost:4286:4286 [0] NCCL INFO group.cc:96 -> 1
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Cause: the NCCL build had LTO (link-time optimization) enabled.
Fixed in libnccl-2.18.3-2.cuda12.1.an23.
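
When triaging NCCL_DEBUG=INFO output like this, the decisive line is the single NCCL WARN entry (here, 'CUDA driver is a stub library'); the surrounding INFO lines are mostly normal initialization. The following is a minimal sketch for pulling WARN lines out of such output. The names `WARN_RE` and `find_nccl_warnings` are hypothetical helpers introduced for illustration, not part of NCCL, and the pattern assumes the `host:pid:tid [dev] file:line NCCL WARN message` layout seen in this report.

```python
import re

# Assumed layout (taken from the log in this report, not from an NCCL spec):
#   <host>:<pid>:<tid> [<dev>] <source file:line> NCCL WARN <message>
WARN_RE = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) "
    r"\[(?P<dev>\d+)\] (?P<loc>\S+) NCCL WARN (?P<msg>.*)$"
)


def find_nccl_warnings(log_text):
    """Return (source_location, message) pairs for every NCCL WARN line."""
    warnings = []
    for line in log_text.splitlines():
        match = WARN_RE.match(line.strip())
        if match:
            warnings.append((match.group("loc"), match.group("msg")))
    return warnings


# A few lines copied from the report above, one log entry per line.
sample = """\
localhost:4286:4761 [0] NCCL INFO Connected all trees
localhost:4286:4286 [0] misc/strongstream.cc:60 NCCL WARN Cuda failure 'CUDA driver is a stub library'
localhost:4286:4286 [0] NCCL INFO enqueue.cc:1548 -> 1
"""

for location, message in find_nccl_warnings(sample):
    print(location, message)
```

The helper expects one log entry per line, so output that has been flattened into a single line needs to be split first.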