Interactive Submission

  • Connection on the front end
  • Obtaining a terminal on a GPU compute node
  • Interactive execution on the GPU partition
  • Reserving reusable resources for more than one interactive execution

Connection on the front end

Access to the front end is done via an ssh connection:

$ ssh login@liger.ec-nantes.fr

The resources of this interactive node are shared among all the connected users.

As a reminder, interactive work on the front end is reserved exclusively for compilation and script development. The Liger front-end nodes are not equipped with GPUs and therefore cannot be used for executions requiring one or more GPUs.

To carry out interactive executions of your GPU codes on compute nodes equipped with GPUs, such as turing01 and the visualization nodes viz[01-14], you must use one of the following two commands:

  • The srun command:
    • to obtain a terminal on a GPU compute node within which you can execute your code,
    • or to directly execute your code on the GPU partition.
  • The salloc command to reserve GPU resources, which allows you to run more than one execution consecutively.

However, if the computations require a large amount of GPU resources (in number of cores, memory, or elapsed time), it is necessary to submit a batch job.
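For reference, such a batch job is described in a Slurm submission script and submitted with sbatch. The following is a minimal sketch reusing the options shown on this page; the script name, time limit and executable are illustrative, and the batch submission documentation remains authoritative:

$ cat gpu_job.slurm
#!/bin/bash
#SBATCH --job-name=gpu_job          # job name (illustrative)
#SBATCH --partition=gpus            # GPU partition used elsewhere on this page
#SBATCH --ntasks=1                  # number of tasks
#SBATCH --cpus-per-task=12          # 1/4 of the cores of turing01 per GPU
#SBATCH --gres=gpu:1                # number of GPUs
#SBATCH --hint=nomultithread        # physical cores only (no hyperthreading)
#SBATCH --time=01:00:00             # maximum duration (illustrative)
./my_executable_file
$ sbatch gpu_job.slurm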

Obtaining a terminal on a GPU compute node

It is possible to open a terminal directly on an accelerated compute node on which the resources have been reserved for you (here, 1 GPU on the default gpu partition) by using the following command:

$ srun --pty --ntasks=1 --cpus-per-task=12 --gres=gpu:1 --hint=nomultithread [--other-options] bash

Comments

  • An interactive terminal is obtained with the --pty option.
  • The reservation of physical cores is ensured with the --hint=nomultithread option (no hyperthreading).
  • The memory allocated for the job is proportional to the number of requested CPU cores. For example, if you request 1/4 of the cores of a node, you will have access to 1/4 of its memory. On the default gpu partition on Turing01, the --cpus-per-task=12 option allows reserving 1/4 of the node memory per GPU. You may consult our documentation on this subject: Memory allocation on GPU partitions
  • --other-options contains the usual Slurm options for job configuration (--time=, etc.): See the documentation on batch submission scripts
  • For multi-project users and those having both CPU and GPU hours, it is necessary to specify the project account on which to count the computing hours of the job (see the example after this list).
  • We strongly recommend that you consult our documentation detailing computing hours management on Liger to ensure that the hours consumed by your jobs are deducted from the correct allocation.
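For example, the project account can be passed with the standard Slurm --account option; my_project below is a placeholder for your own project account:

$ srun --pty -p gpus --account=my_project --ntasks=1 --cpus-per-task=12 --gres=gpu:1 --hint=nomultithread bash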

The terminal is operational after the resources have been granted:

[randria@login02 ~]$ srun --pty -p gpus --ntasks=1 --cpus-per-task=12 --gres=gpu:1 --hint=nomultithread bash
[randria@turing01 ~]$ hostname
turing01
[randria@turing01 ~]$ printenv | grep CUDA
CUDA_VISIBLE_DEVICES=0  <-- GPU 0
[randria@turing01 ~]$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-5a80af23-787c-cbcb-92de-c80574883c5d) <-- allocated
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-233f07d9-5e4c-9309-bf20-3ae74f0495b4)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-a1a1cbc1-8747-d8cd-9028-3e2db40deb04)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-8d5f775d-70d9-62b2-b46c-97d30eea732f)
[randria@turing01 ~]$ squeue -j $SLURM_JOB_ID
    JOBID PARTITION         USER ST         TIME NODES  CPUS QOS          PRIORITY NODELIST(REASON)     NAME
  1730514      gpus      randria  R         3:03     1    12 normal         396309 turing01             bash
[randria@turing01 ~]$ scontrol show job $SLURM_JOB_ID
JobId=1730514 JobName=bash
   Priority=396309 Nice=0 Account=ici QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:03:33 TimeLimit=01:00:00 TimeMin=N/A
   Partition=gpus AllocNode:Sid=login02:22331
   NodeList=turing01
   BatchHost=turing01
   NumNodes=1 NumCPUs=12 CPUs/Task=12 ReqB:S:C:T=0:0:*:1
   MinCPUsNode=12 MinMemoryCPU=8G MinTmpDiskNode=0
   Command=bash

Comments

  • CUDA_VISIBLE_DEVICES=0 means that only 1 GPU has been allocated here, GPU 0 (if 2 GPUs had been requested, it would be 0,1); see the sketch after this list. You can also use the GPU_DEVICE_ORDINAL variable.
  • In the scontrol show job output, JobState=RUNNING means that your session is active and running.
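For illustration, a hypothetical session requesting 2 GPUs (keeping the proportion of 12 cores per GPU described above) would show both devices:

$ srun --pty -p gpus --ntasks=1 --cpus-per-task=24 --gres=gpu:2 --hint=nomultithread bash
[randria@turing01 ~]$ printenv | grep CUDA
CUDA_VISIBLE_DEVICES=0,1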

To leave the interactive mode, use the exit command:

[randria@turing01 ~]$ exit

Caution: If you do not leave the interactive mode yourself, the maximum allocation duration (by default, or as specified with the --time option) is applied, and the corresponding hours are counted for the project you have specified.
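To bound the time that can be charged, you can set an explicit limit when requesting the terminal; the 30-minute value below is illustrative:

$ srun --pty -p gpus --ntasks=1 --cpus-per-task=12 --gres=gpu:1 --hint=nomultithread --time=00:30:00 bash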

Interactive execution on the GPU partition

If you don't need to open a terminal on a compute node, it is also possible to start the interactive execution of a code on the compute nodes directly from the front end by using the following command (here, with 2 GPUs on the default gpu partition):

$ srun -p gpus --ntasks=2 --cpus-per-task=12 --gres=gpu:2 --hint=nomultithread [--other-options] ./my_executable_file

Reserving reusable resources for more than one interactive execution

Each interactive execution started as described in the preceding section constitutes a separate job. Like all jobs, it may be placed in a wait queue for some time if the computing resources are not available.

If you wish to run several interactive executions in a row, it may be worthwhile to reserve all the resources in advance so that they can be reused for the consecutive executions. You then wait only once, when the reservation is made, for all the resources to be available, instead of waiting before each execution.

Reserving resources (here, for 2 GPUs on the default gpu partition) is done via the following command. The reservation becomes usable once the resources have been granted:

$ salloc -p gpus --ntasks=2 --cpus-per-task=12 --gres=gpu:2 --hint=nomultithread [--other-options]
salloc: Granted job allocation 1730516

You can verify that your reservation is active by using the squeue command. Complete information about the status of the job can be obtained by using the scontrol show job <job identifier> command.
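For example, with the allocation granted above:

$ squeue -j 1730516
$ scontrol show job 1730516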

You can then start the interactive executions by using the srun command:

$ srun [--other-options] ./code

Comments: If you do not specify any options for the srun command, the options given to salloc (for example, the number of tasks) will be used by default, as shown in the example below.
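For example, within the allocation of 2 tasks and 2 GPUs granted above:

$ srun ./code              <-- runs with the salloc options: 2 tasks, 2 GPUs
$ srun --ntasks=1 ./code   <-- overrides the number of tasks for this execution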

Important

  • After reserving resources with salloc, you are still connected on the front end (you can verify this with the hostname command). It is imperative to use the srun command so that your executions use the reserved resources.
  • If you forget to cancel the reservation, the maximum allocation duration (by default, or as specified with the --time option) is applied, and the corresponding hours are counted for the project you have specified. Therefore, to cancel the reservation, you must manually enter:
$ exit
exit
salloc: Relinquishing job allocation 1730516
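Alternatively, the reservation can be cancelled with scancel, using the job identifier displayed when the allocation was granted:

$ scancel 1730516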