GammaLearn / gammalearn · Issues · #94 · Closed
Issue created Aug 09, 2022 by Mikael Jacquemont (@jacquemont), Maintainer

Multi-GPU : training stuck at beginning

On the MUST computing center (HTCondor job management), in a multi-GPU context the experiment preparation code is executed twice, and then the training phase gets stuck at the beginning. This might be related to one of these PyTorch Lightning issues:

  • https://forums.pytorchlightning.ai/t/ddp-training-stuck-while-gpu-utilization-is-100/1574
  • https://github.com/Lightning-AI/lightning/issues/11910
  • https://github.com/Lightning-AI/lightning/issues/5604#issuecomment-785314359

Software settings:

  • PyTorch Lightning version: 1.6.3 / 1.6.5 / 1.7.2
  • strategy = 'ddp'
(glearn_dev_mik) bash-4.2$ gammalearn gammalearn-data/experiments/2022_08/mikael/R_1000_mae.py --logfile
[INFO] - load settings from gammalearn-data/experiments/2022_08/mikael/R_1000_mae.py
[INFO] - prepare folders
[INFO] - The experiment R_1000_mae already exists !
[INFO] - Experiment directory: /uds_data/glearn/Data/experiments/R_1000_mae/ 
[INFO] - gammalearn 0.10.dev44+g1709516
[INFO] - save configuration file
[INFO] - Tensorboard run directory: /uds_data/glearn/Data/experiments/runs/R_1000_mae 
[INFO] - Global seed set to 1978
[INFO] - Start creating datasets
[INFO] - look for data files
[INFO] - length of data file list : 3499
Load data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 3499/3499 [01:57<00:00, 29.76it/s]
[INFO] - training set length : 1498885
[INFO] - validating set length : 374722
[INFO] - mp start method: fork
[INFO] - Global seed set to 1978
[INFO] - Save net definition file
[INFO] - network parameters number : 328501262
[INFO] - GPU available: True (cuda), used: True
[INFO] - TPU available: False, using: 0 TPU cores
[INFO] - IPU available: False, using: 0 IPUs
[INFO] - HPU available: False, using: 0 HPUs
[INFO] - Train model
[INFO] - training loader length : 5855 batches
[INFO] - validating loader length : 1463 batches
[INFO] - Global seed set to 1978
[INFO] - Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[INFO] - load settings from gammalearn-data/experiments/2022_08/mikael/R_1000_mae.py
[INFO] - prepare folders
[INFO] - The experiment R_1000_mae already exists !
[INFO] - Experiment directory: /uds_data/glearn/Data/experiments/R_1000_mae/ 
[INFO] - gammalearn 0.10.dev44+g1709516
[INFO] - save configuration file
[INFO] - Tensorboard run directory: /uds_data/glearn/Data/experiments/runs/R_1000_mae 
[INFO] - Global seed set to 1978
[INFO] - Start creating datasets
[INFO] - look for data files
[INFO] - length of data file list : 3499
Load data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 3499/3499 [01:58<00:00, 29.54it/s]
[INFO] - training set length : 1498885
[INFO] - validating set length : 374722
[INFO] - mp start method: fork
[INFO] - Global seed set to 1978
[INFO] - Save net definition file
[INFO] - network parameters number : 328501262
[INFO] - Train model
[INFO] - training loader length : 5855 batches
[INFO] - validating loader length : 1463 batches
[INFO] - Global seed set to 1978
[INFO] - Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[INFO] - ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
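The log shows the whole script (folder preparation, config save, dataset indexing) running once per DDP process: with `strategy='ddp'`, Lightning re-launches the full script for each GPU, which explains the duplicated "prepare folders" / "Start creating datasets" blocks. A common workaround is to guard one-time preparation so it only runs in the main process. Below is a minimal sketch of that guard, assuming the `LOCAL_RANK`/`NODE_RANK` environment variables Lightning sets for its spawned processes; `prepare_experiment` is a hypothetical placeholder, not GammaLearn's actual function.

```python
import os


def is_rank_zero() -> bool:
    """Return True only in the main (rank-zero) process.

    Under Lightning's 'ddp' strategy, each GPU runs the full script in a
    separate process, and LOCAL_RANK / NODE_RANK are exported for the
    re-launched (non-zero-rank) processes. A value of 0 (or the variable
    being absent) identifies the main process.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    node_rank = int(os.environ.get("NODE_RANK", 0))
    return local_rank == 0 and node_rank == 0


def prepare_experiment() -> None:
    # Hypothetical placeholder for the one-time preparation steps that
    # appear twice in the log above: creating the experiment folders,
    # saving the configuration file, indexing the data files.
    print("preparing experiment folders and datasets")


if is_rank_zero():
    # Only the main process performs the expensive one-time setup;
    # the DDP-spawned processes skip straight to model training.
    prepare_experiment()
```

Note this only deduplicates the preparation work; if the hang itself comes from an NCCL rendezvous problem (as in the linked Lightning issues), it would need to be debugged separately, e.g. with `NCCL_DEBUG=INFO`.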
Edited Aug 09, 2022 by Mikael Jacquemont