Multi-GPU: training stuck at the beginning

On the MUST computing center (HTCondor job management), in a multi-GPU context, the experiment preparation code is executed twice, and the training phase then hangs at the very beginning. This might be related to one of these PyTorch Lightning issues:

Software settings:

PyTorch Lightning version: 1.6.3 / 1.6.5 / 1.7.2
strategy = 'ddp'

(glearn_dev_mik) bash-4.2$ gammalearn gammalearn-data/experiments/2022_08/mikael/R_1000_mae.py --logfile
[INFO] - load settings from gammalearn-data/experiments/2022_08/mikael/R_1000_mae.py
[INFO] - prepare folders
[INFO] - The experiment R_1000_mae already exists !
[INFO] - Experiment directory: /uds_data/glearn/Data/experiments/R_1000_mae/ 
[INFO] - gammalearn 0.10.dev44+g1709516
[INFO] - save configuration file
[INFO] - Tensorboard run directory: /uds_data/glearn/Data/experiments/runs/R_1000_mae 
[INFO] - Global seed set to 1978
[INFO] - Start creating datasets
[INFO] - look for data files
[INFO] - length of data file list : 3499
Load data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 3499/3499 [01:57<00:00, 29.76it/s]
[INFO] - training set length : 1498885
[INFO] - validating set length : 374722
[INFO] - mp start method: fork
[INFO] - Global seed set to 1978
[INFO] - Save net definition file
[INFO] - network parameters number : 328501262
[INFO] - GPU available: True (cuda), used: True
[INFO] - TPU available: False, using: 0 TPU cores
[INFO] - IPU available: False, using: 0 IPUs
[INFO] - HPU available: False, using: 0 HPUs
[INFO] - Train model
[INFO] - training loader length : 5855 batches
[INFO] - validating loader length : 1463 batches
[INFO] - Global seed set to 1978
[INFO] - Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[INFO] - load settings from gammalearn-data/experiments/2022_08/mikael/R_1000_mae.py
[INFO] - prepare folders
[INFO] - The experiment R_1000_mae already exists !
[INFO] - Experiment directory: /uds_data/glearn/Data/experiments/R_1000_mae/ 
[INFO] - gammalearn 0.10.dev44+g1709516
[INFO] - save configuration file
[INFO] - Tensorboard run directory: /uds_data/glearn/Data/experiments/runs/R_1000_mae 
[INFO] - Global seed set to 1978
[INFO] - Start creating datasets
[INFO] - look for data files
[INFO] - length of data file list : 3499
Load data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 3499/3499 [01:58<00:00, 29.54it/s]
[INFO] - training set length : 1498885
[INFO] - validating set length : 374722
[INFO] - mp start method: fork
[INFO] - Global seed set to 1978
[INFO] - Save net definition file
[INFO] - network parameters number : 328501262
[INFO] - Train model
[INFO] - training loader length : 5855 batches
[INFO] - validating loader length : 1463 batches
[INFO] - Global seed set to 1978
[INFO] - Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[INFO] - ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
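Note on the duplicated preparation: in the log above, the second pass ends with `GLOBAL_RANK: 1, MEMBER: 2/2`, which looks consistent with the `'ddp'` strategy re-running the whole script once per additional GPU rather than with a bug in the preparation code itself. If that is the case, the preparation steps that write to disk could be guarded by rank. The sketch below is only an illustration of that idea, not gammalearn code: `is_main_process` and `prepare_experiment` are hypothetical names, and it assumes the launcher exports a `LOCAL_RANK` environment variable to the processes it spawns (unset in the process started by hand).

```python
import os


def is_main_process(environ=os.environ) -> bool:
    """Return True for the process that should do the one-off preparation."""
    # Assumption: spawned processes receive LOCAL_RANK; when it is unset,
    # we treat the process as rank 0 (the one launched from the shell).
    return int(environ.get("LOCAL_RANK", "0")) == 0


def prepare_experiment(environ=os.environ) -> str:
    """Hypothetical stand-in for the folder/config preparation step."""
    rank = int(environ.get("LOCAL_RANK", "0"))
    if is_main_process(environ):
        # Only rank 0 creates folders and saves the configuration file.
        return f"rank {rank}: prepare folders and save configuration"
    # Other ranks reuse what rank 0 already wrote.
    return f"rank {rank}: skip preparation, reuse experiment directory"


if __name__ == "__main__":
    print(prepare_experiment())
```

This would only avoid the duplicated writes (and the "already exists" message on the second pass). It does not by itself explain why training hangs after "All distributed processes registered".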

Edited Aug 09, 2022 by Mikael Jacquemont