When you submit a job with Slurm on Liger, you must specify:
- A partition, which defines the type of compute nodes you wish to reserve.
- A QoS (Quality of Service), which calibrates your resource needs (number of nodes, execution time, ...).
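As a minimal sketch, a submission could look like this; the QoS name `myqos` and the script name `job.sh` are placeholders, not values from this page:

```bash
# Submit a batch script to the gpus partition with an explicit QoS.
# Replace "myqos" with a QoS your account is allowed to use.
sbatch --partition=gpus --qos=myqos job.sh
```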
There is one partition on Liger for Turing's resources, essentially the GPUs, called
gpus. As a reminder, the
turing01 node has 48 shareable CPU cores.
The Slurm partition is defined as follows:
PartitionName=gpus AllowGroups=ALL AllowAccounts=gpu-coquake,gpu-milcom,gpu-others AllowQos=ALL AllocNodes=ALL Default=NO DefaultTime=01:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO MaxNodes=4 MaxTime=1-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=turing01 Priority=1 RootOnly=NO ReqResv=NO Shared=YES:4 PreemptMode=OFF State=UP TotalCPUs=48 TotalNodes=1 SelectTypeParameters=N/A DefMemPerCPU=8192 MaxMemPerNode=368640
This means that on turing01 we have:
- 8192 MB of RAM per core (DefMemPerCPU=8192)
- 12 cores per GPU (48 cores shared across 4 GPU cards)
- a total of 368 GB of RAM per node (MaxMemPerNode=368640 MB)
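A sketch of how these defaults play out: a job that requests cores but no explicit memory gets 8192 MB per core. The application name `my_app` is a placeholder:

```bash
#!/bin/bash
#SBATCH --partition=gpus
#SBATCH --nodes=1
#SBATCH --cpus-per-task=12   # 12 cores -> 12 x 8192 MB = 96 GB RAM by default
#SBATCH --time=01:00:00      # DefaultTime is 01:00:00; MaxTime is 1-00:00:00

srun ./my_app                # placeholder application
```

You can override the default with `--mem` or `--mem-per-cpu`, up to the MaxMemPerNode limit above.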
To request GPU nodes, for example:
- 1 node with 1 core and 1 GPU card
- 1 node with 2 cores and 2 GPU cards
- 1 node with 3 cores and 3 GPU cards, specifically Tesla V100 cards
Note that it is always best to request at least as many CPU cores as GPUs; a sketch of the corresponding batch directives follows this list.
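The sketch below shows the third case. The GRES is assumed to be named `gpu`; the exact name or type string (e.g. `gpu:v100`) depends on the site's gres.conf:

```bash
#!/bin/bash
#SBATCH --partition=gpus
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3    # at least as many CPU cores as GPUs
#SBATCH --gres=gpu:3         # 3 GPU cards; GRES name "gpu" is an assumption

srun ./my_gpu_app            # placeholder application
```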
The available GPU node configuration is shown in the partition definition above.
When you request GPUs, the system will set two environment variables (we strongly recommend that you do not change or unset them):
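Typically these are CUDA_VISIBLE_DEVICES and GPU_DEVICE_ORDINAL. A quick way to inspect them from inside a job:

```bash
# Print the GPU-related variables Slurm set for this job step.
srun --partition=gpus --gres=gpu:2 --cpus-per-task=2 \
     bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; \
              echo "GPU_DEVICE_ORDINAL=$GPU_DEVICE_ORDINAL"'
```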
To your application, it will look like you have GPUs 0, 1, ... (up to as many GPUs as you requested). So, for example, suppose there are two jobs from different users, the first requesting 1 GPU card and the second 3 GPU cards, and they happen to land on the same node, gpu-08:
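An illustrative sketch of that scenario; the device numbers follow Slurm's per-job renumbering and are assumptions, not output captured from gpu-08:

```bash
# Job A (1 GPU) and job B (3 GPUs) share node gpu-08.
# Each job sees only its own devices, renumbered from 0:
#   Job A: CUDA_VISIBLE_DEVICES=0      -> one card, seen as GPU 0
#   Job B: CUDA_VISIBLE_DEVICES=0,1,2  -> three cards, seen as GPUs 0, 1, 2
# Physically these are four distinct cards on gpu-08.
```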