Why does distributed training get stuck here and not move? #15266
-
I have requested two GPUs on a SLURM cluster for distributed training, but the program does not make any progress. When I use only one GPU, the model trains normally.
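For context, here is a minimal sketch (not the poster's actual script, which isn't shown in the thread) of a two-process PyTorch DDP launch under SLURM. A hang like the one described usually means `init_process_group()` is reached by fewer processes than `world_size`, so every rank blocks waiting for the rendezvous:

```python
# Hypothetical two-GPU DDP setup; the SLURM-to-rank mapping below is an
# assumption, not taken from the thread.
import os
import torch
import torch.distributed as dist

def main():
    # Under SLURM, the rank and world size usually come from SLURM_PROCID / SLURM_NTASKS.
    rank = int(os.environ.get("SLURM_PROCID", os.environ.get("RANK", "0")))
    world_size = int(os.environ.get("SLURM_NTASKS", os.environ.get("WORLD_SIZE", "1")))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # This call blocks until all world_size processes have joined the group;
    # if only one process is actually launched, this is exactly where training
    # "does not move".
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank % torch.cuda.device_count())

    print(f"rank {rank}/{world_size} initialized")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```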
Replies: 4 comments 5 replies
-
I guess it deadlocked while creating the missing folder?
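A hedged sketch of the kind of folder-creation deadlock guessed at above, assuming plain torch.distributed (the thread doesn't show the actual logging code): if only some ranks reach a collective call while others are still creating or waiting on a directory, everyone can wait forever. A common pattern is to create the folder on rank 0 only and then synchronize:

```python
import os
import torch.distributed as dist

def prepare_output_dir(path: str) -> None:
    # Only rank 0 creates the folder; exist_ok avoids failing if it already exists.
    if dist.get_rank() == 0:
        os.makedirs(path, exist_ok=True)
    # Every rank must reach this barrier, otherwise the barrier itself hangs.
    dist.barrier()
```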
-
When I switched the communication backend from 'nccl' to 'gloo', it worked. I don't know what the underlying problem is, but I hope this helps.
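For reference, a sketch of the workaround described above: selecting "gloo" instead of "nccl" as the torch.distributed backend. In plain PyTorch this is the `backend` argument to `init_process_group`; in PyTorch Lightning (assuming that is the framework in use here) the equivalent knob is `DDPStrategy(process_group_backend="gloo")`:

```python
import torch.distributed as dist

# Workaround from the comment above: use the gloo backend instead of nccl.
# Expects RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT to be set by the launcher
# (e.g. torchrun). gloo runs collectives over TCP, so it can sidestep
# NCCL-level hangs, but it is generally slower for GPU training.
dist.init_process_group(backend="gloo")  # instead of backend="nccl"
```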
-
I am having the same issue. Did you find out what the problem was?
-
Similar issue, but a little different: my training starts distribution successfully and then gets stuck.