Irregular storage fault
As reported by @Mickael.Tardy, the storage seem to fail randomly throwing a "File not found" error and interrupting the NN training (unless handled):
- When this occurs, and we try keep reading the file, the file is found again in a random time, always less than 2h.
- This error happens very randomly.
This could be tied to the folders bound to the containers with Singularity that, somehow, refresh or fail to attach randomly.
More investigation is needed.