# liger-ai-tools issues
https://gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools/-/issues

## [Benchmark container + GPU/CPU on different cards](https://gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools/-/issues/40)
*2021-03-09 · Davide Rovelli*

To give an estimate of our server capabilities, with the possible inclusion of viz nodes as GPU AI resources, we can test our different cards to provide information on the resources available to our users. I tested MNIST performance as follows:

- TensorFlow NGC container vs. Docker Hub container
- K80 vs. V100 GPU
- K80 vs. V100 multi-CPU
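To make the per-card numbers comparable, the measurements can be collected with a small timing harness along these lines (a sketch; `measure_throughput` and `dummy_step` are illustrative names, and in practice `train_step` would be the real TensorFlow MNIST training step):

```python
import time

def measure_throughput(train_step, batch_size, n_warmup=3, n_iters=20):
    """Time a training-step callable and report images/second.

    train_step: callable running one batch (here a stand-in; the real
    benchmark would call the TensorFlow MNIST step).
    """
    for _ in range(n_warmup):
        train_step()  # warm-up: exclude one-off setup/compilation cost
    start = time.perf_counter()
    for _ in range(n_iters):
        train_step()
    elapsed = time.perf_counter() - start
    return n_iters * batch_size / elapsed  # images per second

# Stand-in step so the harness itself can be sanity-checked without a GPU.
def dummy_step():
    sum(i * i for i in range(10_000))

if __name__ == "__main__":
    print(f"{measure_throughput(dummy_step, batch_size=128):.0f} img/s")
```

Running the same harness inside the NGC container and the Docker Hub container, on each card, would give directly comparable images/second figures.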
We can add the same tests with a "more expensive" benchmark; this one would do :arrow_right: https://github.com/stanford-futuredata/dawn-bench-entries

## [Enable Singularity build on Liger](https://gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools/-/issues/39)
*2021-03-15 · Davide Rovelli*

Right now it is not possible for users to build containers on Liger, as building requires root permissions.
It could be useful to exploit the `--fakeroot` option, which allows unprivileged users to build containers safely. Right now this option gives the following error:
```bash
FATAL: could not use fakeroot: no mapping entry found in /etc/subuid for drovelli
```
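The error points at a missing subordinate-UID mapping for the user. On systems where `--fakeroot` works, `/etc/subuid` and `/etc/subgid` contain one entry per allowed user along these lines (the range values here are illustrative and should be checked against the Singularity admin guide):

```
# /etc/subuid
drovelli:100000:65536
# /etc/subgid
drovelli:100000:65536
```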
It would be good to configure this mapping.

## [Irregular storage fault](https://gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools/-/issues/37)
*2021-01-08 · Davide Rovelli*

As reported by @Mickael.Tardy, the storage seems to fail randomly, throwing a "File not found" error and interrupting the NN training (unless handled):
- When this occurs and we keep trying to read the file, it is found again after a random delay, always less than 2 h.
- The error itself occurs at seemingly random times.
This could be tied to the folders bind-mounted into the containers by Singularity, which may somehow refresh or fail to attach at random.
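Until the root cause is found, trainings could guard their file reads with a retry wrapper along these lines (a sketch; the function name, retry count, and delay are illustrative):

```python
import time

def read_with_retry(path, reader=open, retries=5, delay=1.0):
    """Retry a read when the storage transiently reports 'File not found'.

    Retries up to `retries` times, sleeping `delay` seconds in between,
    since the file reappears after a random (but bounded) interval.
    """
    for attempt in range(retries):
        try:
            with reader(path) as f:
                return f.read()
        except FileNotFoundError:
            if attempt == retries - 1:
                raise  # give up: re-raise the last error
            time.sleep(delay)
```

In a real training loop the delay would likely need to be much longer (the issue reports reappearance within up to 2 h), possibly with exponential backoff.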
More investigation is needed.

## [check new reinstallation turing](https://gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools/-/issues/36)
*2020-12-15 · RANDRIATOAMANANA Richard (richard.randriatoamanana@cnrs.fr)*

It seems that the Slurm configuration on CentOS 7 differs from the one on CentOS 6.

### path of investigation
- cgroup
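For the cgroup side, the behavior is driven by `cgroup.conf` on the CentOS 7 install; a minimal fragment to compare against the old CentOS 6 setup might look like this (parameter values are illustrative and should be checked against the Slurm cgroup documentation):

```
# /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
```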
## [local storage disk with quota](https://gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools/-/issues/31)
*2020-12-15 · RANDRIATOAMANANA Richard (richard.randriatoamanana@cnrs.fr)*

- configure a quota policy on /local-scratch
## [mpi cuda-aware code](https://gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools/-/issues/29)
*2020-12-01 · RANDRIATOAMANANA Richard (richard.randriatoamanana@cnrs.fr)*

A CUDA-aware MPI implementation must handle buffers differently depending on whether they reside in host or device memory. An MPI implementation could offer different APIs for host and device buffers, or it could add an extra argument indicating where the passed buffer lives. Fortunately, neither approach is necessary, thanks to the Unified Virtual Addressing (UVA) feature introduced in CUDA 4.0 (on GPUs of Compute Capability 2.0 and later). With UVA, the host memory and the memory of all GPUs in a system (a single node) are combined into one large (virtual) address space.
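In user code this means the explicit staging copies disappear; the contrast can be sketched in pseudocode (not runnable; buffer names and elided arguments are illustrative):

```
# Without CUDA-aware MPI: stage device data through a host buffer.
cudaMemcpy(host_buf, device_buf, size, DeviceToHost)
MPI_Send(host_buf, count, ..., dest, ...)

# With CUDA-aware MPI + UVA: pass the device pointer directly;
# the library detects that the address lives in GPU memory.
MPI_Send(device_buf, count, ..., dest, ...)
```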
### Refs
- https://developer.nvidia.com/blog/introduction-cuda-aware-mpi
- http://www.idris.fr/eng/jean-zay/gpu/jean-zay-gpu-exec_multi_mpi_cuda_aware_gpudirect_batch-eng.html

## [Enable GPU in Container Runtime](https://gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools/-/issues/26)
*2020-12-07 · RANDRIATOAMANANA Richard (richard.randriatoamanana@cnrs.fr)*

- https://nvidia.github.io/nvidia-container-runtime
- https://developer.nvidia.com/blog/gpu-containers-runtime
## [Conflicts between ENV vars and container environment](https://gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools/-/issues/25)
*2021-01-20 · Davide Rovelli*

I found out that loading modules like Python on the server compromises the environment inside the container:
- `module load python` makes python in the container crash
- After uninstalling the CUDA libraries, the NVIDIA cards could be seen from inside the container.
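The underlying mechanism is host environment variables leaking into the container. A quick way to see the difference between an inherited and a cleaned environment, using plain `subprocess` (no Singularity involved; Singularity's own `--cleanenv` flag plays the analogous role at container launch):

```python
import os
import subprocess
import sys

# Simulate a host module export that would leak into a child process.
env = dict(os.environ, PYTHONHOME="/fake/module/python")

code = "import os; print(os.environ.get('PYTHONHOME', 'unset'))"

# Inherited environment: the bogus variable is visible to the child.
# (-E makes the child interpreter ignore PYTHON* vars at startup, so
# the fake PYTHONHOME cannot break it; it still appears in os.environ.)
leaked = subprocess.run([sys.executable, "-E", "-c", code],
                        env=env, capture_output=True, text=True).stdout.strip()

# Cleaned environment: start the child with a minimal, explicit env.
clean = subprocess.run([sys.executable, "-E", "-c", code],
                       env={"PATH": os.environ.get("PATH", "")},
                       capture_output=True, text=True).stdout.strip()

print(leaked, clean)
```

The first child sees the leaked variable while the second does not, which is the same isolation a cleaned container environment provides.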
For now, it is sufficient to load only Singularity in the server environment. Perhaps a bit more investigation, or a note in the **Getting Started** guide, is required.

## [update bios firmware](https://gitlab.in2p3.fr/ecn-collaborations/liger-ai-tools/-/issues/23)
*2020-12-07 · RANDRIATOAMANANA Richard (richard.randriatoamanana@cnrs.fr)*

- [ ] update to Dell EMC Server PowerEdge [BIOS C4140 Version 2.9.3](https://www.dell.com/support/home/fr-fr/product-support/servicetag/0-RHl2R3lwQ0xHV3ZYdWxRdWtVckViUT090/drivers) -> this will need **a hard reboot after**

Milestone: 1 · Assignee: Pierre-Emmanuel Guerin