The Slurm memory options such as --mem-per-gpu do not currently allow you to configure the memory allocation of your job on Turing properly. Instead, the memory allocation is determined automatically by the number of reserved CPUs.
To adjust the amount of memory allocated to your job, you must adjust the number of CPUs reserved per task (or GPU) by specifying the following option in your batch scripts, or when using salloc in interactive mode:
--cpus-per-task=... # --cpus-per-task=1 by default
The default value is --cpus-per-task=1. If you do not change it as explained below, you will have access to less memory per GPU than is available, which can quickly lead to memory overflow.
On the default gpu partition
Each Turing node of the default gpu partition offers 384 GB of usable memory. The memory allocation is computed automatically on the basis of:
- 8 GB per reserved CPU core if hyperthreading is deactivated (Slurm option --hint=nomultithread)
Each node of the default gpu partition has 4 GPUs and 48 CPU cores. You can therefore reserve, for instance, 1/4 of the node memory per GPU by reserving 12 CPU cores (i.e. 1/4 of the 48 CPU cores) per GPU:
--cpus-per-task=12 # reserves 1/4 of the node memory per GPU (default gpu partition)
In this way, you have access to 96 GB of memory per GPU if hyperthreading is deactivated (if not, half of that memory).
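As a sketch of how these options fit together, the batch script header below reserves 1 GPU with 12 physical cores on the default gpu partition and prints the memory this should give. The figure of 8 GB per physical core comes from the text above. The variable name EXPECTED_MEM_GB is only illustrative, and the fallback value of 12 is an assumption used when the script runs outside Slurm, where SLURM_CPUS_PER_TASK is not set.

```shell
#!/bin/bash
#SBATCH --ntasks=1               # one task
#SBATCH --gres=gpu:1             # one GPU for that task
#SBATCH --cpus-per-task=12       # 1/4 of the 48 cores, hence 1/4 of the node memory
#SBATCH --hint=nomultithread     # reserve physical cores (hyperthreading deactivated)

# 8 GB per reserved physical core, as stated in the text.
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given;
# the fallback of 12 is only for running this sketch outside a job.
EXPECTED_MEM_GB=$(( ${SLURM_CPUS_PER_TASK:-12} * 8 ))
echo "Expected memory for this task: ${EXPECTED_MEM_GB} GB"
```

With the values above, the script reports 96 GB, which matches 1/4 of the 384 GB available on the node.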
You can ask for more memory per GPU by increasing the value of --cpus-per-task, as long as the request does not exceed the total amount of memory available on the node (here 384 GB). Be careful: computing hours are charged proportionately. For example, if you ask for 1 GPU on the default gpu partition by specifying --ntasks=1 --gres=gpu:1 --cpus-per-task=24, the invoice will be the same as for a job running on 2 GPUs, because 24 CPU cores correspond to the share of 2 GPUs (2 x 12 cores).
If you reserve a node in exclusive mode, you have access to the entire memory capacity of the node, regardless of the value of
--cpus-per-task. The invoice will be the same as for a job running on an entire node.
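The accounting described above can be approximated with a small helper. This is only a sketch under stated assumptions: the function names billed_gpus and job_memory_gb are hypothetical, and the rule "charge the larger of the GPUs requested and the CPU cores divided by 12, rounded up" is an interpretation of the example in the text, not a documented formula. The node figures (48 cores, 12 cores per GPU, 8 GB per physical core) are taken from the text.

```shell
#!/bin/bash
CORES_PER_NODE=48   # CPU cores per node of the default gpu partition
CORES_PER_GPU=12    # 48 cores / 4 GPUs
GB_PER_CORE=8       # per physical core, hyperthreading deactivated

# Print the number of GPUs a single-task job would be charged for.
# ASSUMPTION: the charge is the larger of the GPUs requested and the
# reserved cores divided by 12, rounded up to a whole GPU.
# Usage: billed_gpus <gpus> <cpus_per_task>
billed_gpus() {
    bg_by_cpu=$(( ($2 + CORES_PER_GPU - 1) / CORES_PER_GPU ))
    if [ "$bg_by_cpu" -gt "$1" ]; then
        echo "$bg_by_cpu"
    else
        echo "$1"
    fi
}

# Print the memory (in GB) obtained for a given --cpus-per-task value,
# or fail if the request exceeds the cores of one node.
# Usage: job_memory_gb <cpus_per_task>
job_memory_gb() {
    if [ "$1" -gt "$CORES_PER_NODE" ]; then
        echo "error: $1 cores exceed the $CORES_PER_NODE cores of a node" >&2
        return 1
    fi
    echo $(( $1 * GB_PER_CORE ))
}

billed_gpus 1 24     # the example from the text: 1 GPU with 24 cores
job_memory_gb 24     # memory obtained with 24 physical cores
```

For the example from the text (1 GPU with --cpus-per-task=24), the sketch prints 2 billed GPUs and 192 GB of memory.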
The amount of memory allocated to your job can be seen by running the command:
$ scontrol show job $JOBID # look for the value of the "mem" variable
Important: While the job is waiting in the queue (PENDING), Slurm estimates the memory allocated to a job based on logical cores. Therefore, if you have reserved physical cores (with --hint=nomultithread), the value shown can be half the expected value. It is updated and becomes correct once the job starts.
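To avoid reading the full scontrol output by eye, the "mem" value can be filtered out with grep. The sketch below assumes the value appears as a mem= entry followed by a number and an optional unit letter. The helper name extract_mem is hypothetical, and the sample line is made up for illustration; it is not real output from Turing.

```shell
#!/bin/bash
# Read "scontrol show job" output on stdin and print the first "mem=" entry.
extract_mem() {
    grep -oE 'mem=[0-9.]+[KMGT]?' | head -n 1
}

# Inside a running job, Slurm sets SLURM_JOB_ID, so one could run:
#   scontrol show job "$SLURM_JOB_ID" | extract_mem

# Illustrative input line (made up for this sketch):
sample='   TRES=cpu=12,mem=96G,node=1,gres/gpu=1'
printf '%s\n' "$sample" | extract_mem
```

Because of the behaviour described in the note above, the filtered value is only reliable once the job is running.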
To reserve resources on the prepost partition, you may refer to: Memory allocation with Slurm on CPU partitions. The GPU available on each node of the prepost partition is automatically allocated to you without needing to request it explicitly.