- Connection on the front end
- Obtaining a terminal on a GPU compute node
- Interactive execution on the GPU partition
- Reserving reusable resources for more than one interactive execution
Connection on the front end
Access to the front end is done via an ssh connection:
$ ssh firstname.lastname@example.org
The resources of this interactive node are shared among all the connected users.
As a reminder, interactive work on the front end is reserved exclusively for compilation and script development. The Liger front-end nodes are not equipped with GPUs; therefore, they cannot be used for executions requiring one or more GPUs.
To carry out interactive executions of your GPU codes on compute nodes with GPUs, such as turing01 and the viz nodes viz[01-14], you must use one of the following two commands:
- the srun command, to execute the code directly on the compute nodes;
- the salloc command, to reserve GPU resources, which allows you to do more than one execution consecutively.
However, if the computations require a large amount of GPU resources (in number of cores, memory, or elapsed time), it is necessary to submit a batch job.
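The rule of thumb above can be sketched as a tiny shell helper (hypothetical, for illustration only — pick_slurm_command is not part of Slurm):

```shell
# Hypothetical helper: choose the Slurm command for an interactive session.
# srun is enough for a single execution; salloc lets you reuse one reservation
# for several consecutive executions.
pick_slurm_command() {
  local n_runs="$1"
  if [ "$n_runs" -gt 1 ]; then
    echo "salloc"   # reserve once, then launch each run with srun
  else
    echo "srun"     # one-shot interactive execution
  fi
}
```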
Obtaining a terminal on a GPU compute node
It is possible to open a terminal directly on an accelerated compute node on which the resources have been reserved for you (here, 1 GPU on the default gpu partition) by using the following command:
$ srun --pty --ntasks=1 --cpus-per-task=12 --gres=gpu:1 --hint=nomultithread [--other-options] bash
- An interactive terminal is obtained with the --pty option.
- The reservation of physical cores is ensured with the --hint=nomultithread option (no hyperthreading).
- The memory allocated for the job is proportional to the number of requested CPU cores. For example, if you request 1/4 of the cores of a node, you will have access to 1/4 of its memory. On the default gpu partition on turing01, the --cpus-per-task=12 option reserves 1/4 of the node memory per GPU. You may consult our documentation on this subject: Memory allocation on GPU partitions.
- --other-options contains the usual Slurm options for job configuration (--time=, etc.): see the documentation on batch submission scripts.
- For multi-project users and those having both CPU and GPU hours, it is necessary to specify the project account on which the computing hours of the job are to be counted.
- We strongly recommend that you consult our documentation detailing computing hours management on Liger to ensure that the hours consumed by your jobs are deducted from the correct allocation.
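As a sanity check of the proportional memory rule, here is a small shell sketch. It assumes turing01 has 48 cores (so --cpus-per-task=12 is 1/4 of the node) and uses the 8G-per-CPU value that scontrol reports in the example below (MinMemoryCPU=8G):

```shell
# Sketch: memory obtained per GPU share on turing01, assuming 48 cores
# per node and 8G of memory per reserved CPU core (MinMemoryCPU=8G).
cpus_per_gpu=12      # --cpus-per-task=12 reserves 1/4 of the node's cores
mem_per_cpu_gb=8     # value reported by scontrol show job
echo "memory per GPU share: $((cpus_per_gpu * mem_per_cpu_gb))G"
# prints: memory per GPU share: 96G
```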
The terminal is operational after the resources have been granted:
[randria@login02 ~]$ srun --pty -p gpus --ntasks=1 --cpus-per-task=12 --gres=gpu:1 --hint=nomultithread bash
[randria@turing01 ~]$ hostname
turing01
[randria@turing01 ~]$ printenv | grep CUDA
CUDA_VISIBLE_DEVICES=0                                                        <-- GPU 0
[randria@turing01 ~]$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-5a80af23-787c-cbcb-92de-c80574883c5d)  <-- allocated
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-233f07d9-5e4c-9309-bf20-3ae74f0495b4)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-a1a1cbc1-8747-d8cd-9028-3e2db40deb04)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-8d5f775d-70d9-62b2-b46c-97d30eea732f)
[randria@turing01 ~]$ squeue -j $SLURM_JOB_ID
  JOBID PARTITION    USER ST  TIME NODES CPUS    QOS PRIORITY NODELIST(REASON) NAME
1730514      gpus randria  R  3:03     1   12 normal   396309         turing01 bash
[randria@turing01 ~]$ scontrol show job $SLURM_JOB_ID
JobId=1730514 JobName=bash
   Priority=396309 Nice=0 Account=ici QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:03:33 TimeLimit=01:00:00 TimeMin=N/A
   Partition=gpus AllocNode:Sid=login02:22331
   NodeList=turing01 BatchHost=turing01
   NumNodes=1 NumCPUs=12 CPUs/Task=12 ReqB:S:C:T=0:0:*:1
   MinCPUsNode=12 MinMemoryCPU=8G MinTmpDiskNode=0
   Command=bash
CUDA_VISIBLE_DEVICES=0 means that only 1 GPU has been allocated here, GPU 0 (if 2 GPUs had been requested, the value would be 0,1). You can also use the $SLURM_JOB_ID variable, as above, to refer to your job in the squeue and scontrol commands. In the scontrol show job output, JobState=RUNNING means that your session is active and running.
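You can check how many GPUs the job sees by counting the comma-separated entries of CUDA_VISIBLE_DEVICES. A minimal sketch (the count_visible_gpus helper is hypothetical):

```shell
# Count the GPUs visible to the job by splitting CUDA_VISIBLE_DEVICES on commas.
# With 1 GPU allocated the variable is "0"; with 2 it would be "0,1", etc.
count_visible_gpus() {
  local ids="${1-}"
  if [ -z "$ids" ]; then
    echo 0                                 # no GPU visible
  else
    echo "$ids" | tr ',' '\n' | wc -l | tr -d ' '
  fi
}
```

Inside the interactive session, calling `count_visible_gpus "$CUDA_VISIBLE_DEVICES"` would report the number of allocated GPUs.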
To leave the interactive mode, use the exit command:
[randria@turing01 ~]$ exit
Caution: If you do not leave the interactive mode yourself, the maximum allocation duration (by default, or as specified with the --time option) is applied, and this amount of hours is then counted against the project you have specified.
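To see what an unused session can cost, you can convert the Slurm time limit to minutes. A minimal sketch (timelimit_minutes is a hypothetical helper; the actual hours charged also depend on the number of reserved cores):

```shell
# Convert a Slurm HH:MM:SS time limit into minutes. If you never exit the
# interactive session, the full limit is charged to your project.
timelimit_minutes() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  # 10# forces base-10 so values like "08" are not parsed as octal
  echo $((10#$h * 60 + 10#$m))
}
```

For example, the default TimeLimit=01:00:00 seen above corresponds to 60 minutes of charged walltime.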
Interactive execution on the GPU partition
If you don't need to open a terminal on a compute node, it is also possible to start the interactive execution of a code on the compute nodes directly from the front end by using the following command (here, with 2 GPUs on the default gpu partition):
$ srun -p gpus --ntasks=2 --cpus-per-task=12 --gres=gpu:2 --hint=nomultithread [--other-options] ./my_executable_file
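If you want to inspect the command line before launching anything, a dry-run helper can assemble it (build_srun_cmd is hypothetical; it only prints the srun line and does not submit anything):

```shell
# Hypothetical dry-run helper: assemble the srun line for N GPUs on the
# gpus partition, keeping one task and 12 CPU cores per GPU.
build_srun_cmd() {
  local ngpu="$1" exe="$2"
  echo "srun -p gpus --ntasks=$ngpu --cpus-per-task=12 --gres=gpu:$ngpu --hint=nomultithread $exe"
}
```

For example, `build_srun_cmd 2 ./my_executable_file` prints the 2-GPU command shown above.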
Reserving reusable resources for more than one interactive execution
Each interactive execution started as described in the preceding section is a separate job. Like all jobs, it may be placed in a wait queue for some time if the computing resources are not available.
If you wish to do more than one interactive execution in a row, it may be worthwhile to reserve all the resources in advance so that they can be reused for the consecutive executions. You then wait only once, at the moment of the reservation, for all the resources to be available, rather than waiting separately for each execution.
Reserving resources (here, for 2 GPUs on the default gpu partition) is done via the following command:
$ salloc -p gpus --ntasks=2 --cpus-per-task=12 --gres=gpu:2 --hint=nomultithread [--other-options]
The reservation becomes usable after the resources have been granted:
salloc: Granted job allocation 1730516
You can verify that your reservation is active by using the squeue command. Complete information about the status of the job can be obtained by using the scontrol show job <job identifier> command.
You can then start the interactive executions by using the srun command:
$ srun [--other-options] ./code
Comments: If you do not specify any options for the srun command, the options used for salloc (for example, the number of tasks) will be used by default.
- After reserving resources with salloc, you are still connected on the front end (you can verify this with the hostname command). It is imperative to use the srun command so that your executions use the reserved resources.
- If you forget to cancel the reservation (with scancel, or by exiting the allocation), the maximum allocation duration (by default, or as specified with the --time option) is applied, and this amount of hours is then counted against the project you have specified. Therefore, in order to cancel the reservation, you must manually enter:
$ exit
exit
salloc: Relinquishing job allocation 1730516