NCCL WARN Failed to open libibverbs.so[.1] #12219
I just received two A6000 GPUs, and these are not compatible

So I upgraded my Docker to

I also made changes to my code for the Lightning breaking change from

to

When I try to train, it just stops. So I set the env variable NCCL_DEBUG=WARN

The same happens when I try

My old setup was 2x RTX Titan with NVLink, while the new setup is 2x A6000 without NVLink. The NVIDIA docs say that PCI is used, but it is unclear whether I need to do anything to use it. The distributed communication docs say "NCCL backends are built and included in PyTorch distributed (NCCL only when building with CUDA)". I suspect I am missing something about the breaking changes from pl 1.0 to 1.5. I would appreciate hints as to what to look for.
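For anyone debugging a similar stall, here is a minimal diagnostic sketch, not a confirmed fix for this issue. It assumes that raising `NCCL_DEBUG` from `WARN` to `INFO` and optionally setting `NCCL_IB_DISABLE` or `NCCL_P2P_DISABLE` (NCCL environment variables described in NVIDIA's documentation) helps narrow down which transport is involved. The helper names `nccl_debug_env` and `describe_torch_setup` are hypothetical and exist only for illustration.

```python
import os

# Debug levels accepted by NCCL_DEBUG according to the NCCL documentation.
_NCCL_DEBUG_LEVELS = {"VERSION", "WARN", "INFO", "TRACE"}


def nccl_debug_env(base_env=None, level="INFO", disable_ib=False, disable_p2p=False):
    """Return a copy of base_env with NCCL diagnostic variables set.

    Hypothetical helper: it only builds a dict and does not touch os.environ.
    """
    if level not in _NCCL_DEBUG_LEVELS:
        raise ValueError(f"unknown NCCL_DEBUG level: {level!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    if disable_ib:
        # Ask NCCL not to use the InfiniBand/verbs transport.
        env["NCCL_IB_DISABLE"] = "1"
    if disable_p2p:
        # Ask NCCL not to use direct GPU-to-GPU peer-to-peer transfers.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def describe_torch_setup():
    """Report what the local PyTorch build can see; works without torch installed."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
        "nccl_available": dist.is_available() and dist.is_nccl_available(),
    }


if __name__ == "__main__":
    print(describe_torch_setup())
    # Print shell exports that could be placed before the training command.
    for key, value in sorted(nccl_debug_env({}, disable_ib=True).items()):
        print(f"export {key}={value}")
```

Running the training script with these variables in its environment should make NCCL log more detail about the transport it selects, which may show whether the stall is related to the missing NVLink or to something else.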
Replies: 2 comments 1 reply
I reduced the delta to only staying on
Duplicate of #12235.