Running more than 2 tasks fails
When more than 2 tasks are run, the third fails. Users running container applications reported these failures, but I reproduced them with simple tasks unrelated to GPUs. This looks like an issue in the Slurm configuration.
TEST:
- run
ping localhost
in three different shell tabs -> the third fails with the following error:
[drovelli@login02 ~]$ srun -p gpus ping login01
slurmstepd: write(/dev/cpuset/slurm1732361/slurm1732361.0_0/cpuset.cpus): Invalid argument
slurmstepd: read(/dev/cpuset/slurm1732361/slurm1732361.0_0/tasks): Invalid argument
slurmstepd: Failed task affinity setup
srun: error: turing01: task 0: Exited with exit code 1
srun: Terminating job step 1732361.0
- run
srun -p gpus -n $T hostname
where 2 < T < 46. About half of the tasks fail systematically (in the run below, the even-numbered tasks from 2 to 40; the pattern may be unrelated):
[drovelli@login02 ~]$ srun -p gpus -n 41 hostname
slurmstepd: write(/dev/cpuset/slurm1732342/slurm1732342.0_4/cpuset.cpus): Invalid argument
slurmstepd: read(/dev/cpuset/slurm1732342/slurm1732342.0_4/tasks): Invalid argument
slurmstepd: Failed task affinity setup
...
srun: error: turing01: tasks 2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40: Exited with exit code 1
srun: Terminating job step 1732342.0
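Both tests above can be scripted so the failure pattern is easier to compare across runs. The following is only a minimal sketch: the function names `run_concurrently`, `failed_tasks` and `sweep_task_counts` are hypothetical helpers, not part of Slurm, and the parsing assumes srun reports failures in the same line format as the logs above.

```shell
# Hypothetical helper: start N copies of a command at the same time
# and print how many of them exited with a non-zero status.
run_concurrently() {
    n=$1; shift
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
    failures=0
    for pid in $pids; do
        wait "$pid" || failures=$((failures + 1))
    done
    echo "$failures"
}

# Hypothetical helper: read srun error output on stdin and print the
# failed task IDs, one per line. Assumes lines shaped like:
#   srun: error: turing01: tasks 2,4,6: Exited with exit code 1
# A range such as "0-3", if srun prints one, is passed through unexpanded.
failed_tasks() {
    sed -n 's/^srun: error: [^:]*: tasks\{0,1\} \([0-9,-]*\): Exited with exit code.*/\1/p' \
        | tr ',' '\n'
}

# Hypothetical helper: run the hostname test for each task count given
# as an argument (partition "gpus" as in the report) and list the
# failed task IDs per count.
sweep_task_counts() {
    for t in "$@"; do
        printf 'n=%s failed tasks: ' "$t"
        srun -p gpus -n "$t" hostname 2>&1 >/dev/null | failed_tasks | tr '\n' ' '
        printf '\n'
    done
}
```

For example, `run_concurrently 3 srun -p gpus ping -c 5 login01` would mimic the three shell tabs, and `sweep_task_counts 2 3 8 41` would show whether the failing task IDs stay even as T changes.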
Edited by Davide Rovelli