Quick start guide
Prerequisites
- Have a user account on Liger
- General knowledge of Linux shell commands is recommended
- Familiarity with Slurm is recommended
MNIST: handwritten digit classification
This repository contains sample Python code that implements a classifier for the MNIST dataset: a classic AI/computer-vision task that consists in recognising handwritten digits. The MNIST dataset is relatively simple and is therefore often used as a benchmark to test deep learning models and environments.
The programs implementing the MNIST task use TensorFlow and Keras for the neural network operations and PyLab (NumPy, Matplotlib) for data manipulation and visualisation. The relevant files can be found in the dl-examples directory:
- mnist.npz: the MNIST dataset, containing 70,000 28x28-pixel images of handwritten digits.
- mnist_train.py: data loading and processing, model creation and training.
- mnist_predict.py: uses the generated model to predict the digits on unseen data.
These programs and the related processes can be used as a reference to implement your DL algorithms in the Liger environment.
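For orientation, here is a minimal sketch of what such a training script can look like. It is illustrative only and not the exact contents of mnist_train.py: the layer sizes, the number of epochs and the saved-model file name (mnist_model.h5) are assumptions.

import numpy as np
import tensorflow as tf

# Load the dataset shipped with the repository
# (keys follow the standard Keras mnist.npz layout).
with np.load("mnist.npz") as data:
    x_train, y_train = data["x_train"], data["y_train"]
    x_test, y_test = data["x_test"], data["y_test"]

# Scale pixel values to [0, 1].
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully-connected classifier (illustrative architecture).
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))

# Save the trained model so the prediction script can reload it later
# (illustrative file name).
model.save("mnist_model.h5")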
The TensorFlow + PyLab environment is provided by a pre-built container already present on Liger at /softs/singularity/containers/ai/tensorflow-ngc-plot.simg.
Training the classifier
SSH into Liger with visualisation enabled:
localhost:~$ ssh -X <LIGER-UID>@liger
Clone this repository using your credentials:
login02:~$ git clone https://<GIT-USERNAME>@gitlab.in2p3.fr/ecn-collaborations/gpu-ai-liger.git
Password:
Move into the repository:
login02:~$ cd gpu-ai-liger
Run the training via the submission script specifying your account:
login02:~$ sbatch --account=<project-id> --qos=qos_gpu exec.sl
If turing01 has any available GPU, the job will be submitted to that node. TensorFlow binds to one of the GPUs, which will perform the model training on the MNIST dataset. At the end of the training, the model is saved in the dl-examples folder for later use.
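If you want to confirm from Python that TensorFlow actually sees a GPU inside the container (for instance in the interactive session described later in this guide), a quick check is the following; note that the exact call depends on the TensorFlow version shipped in the container:

import tensorflow as tf

print(tf.__version__)
# On TF 2.1+; older versions expose the same call as
# tf.config.experimental.list_physical_devices.
# An empty list means the computation would fall back to CPU.
print(tf.config.list_physical_devices("GPU"))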
Follow the script execution with the following command:
login02:~$ tail -f sjob.txt
Alternatively, you can run the NVIDIA monitoring tool shortly after the job submission to see the GPU utilisation:
login02:~$ srun -p gpus -w turing01 nvidia-smi -l
#output
Tue Dec 1 16:26:24 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:18:00.0 Off | 0 |
| N/A 38C P0 56W / 300W | 354MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:3B:00.0 Off | 0 |
| N/A 34C P0 52W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:86:00.0 Off | 0 |
| N/A 35C P0 55W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:AF:00.0 Off | 0 |
| N/A 38C P0 56W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 53079 C python3 351MiB |
+-----------------------------------------------------------------------------+
Making predictions
Once the classifier is trained, the mnist_predict.py program can use the final output model to make a guess on a new handwritten digit (a sample from the test set that was not used for training).
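As an indication, the core of such a prediction step can look like the sketch below. It is not the exact contents of mnist_predict.py; the model file name (mnist_model.h5) and the sample index are assumptions.

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Reload the test images and the model saved by the training step.
with np.load("mnist.npz") as data:
    x_test, y_test = data["x_test"], data["y_test"]
model = tf.keras.models.load_model("mnist_model.h5")  # illustrative file name

sample = 0  # index of the test image to classify
probabilities = model.predict(x_test[sample:sample + 1] / 255.0)
prediction = int(np.argmax(probabilities))

# Show the image with the predicted digit in the title.
plt.imshow(x_test[sample], cmap="gray")
plt.title(f"Predicted: {prediction} (true label: {y_test[sample]})")
plt.show()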
This time the program will be executed through an interactive session, as opposed to the sbatch job submission that was used for the training. The following steps go through the execution of commands in an IPython shell session inside the TensorFlow container available on Liger.
Reserve the GPU server (turing01):
login02:~$ salloc -p gpus -w turing01 --account=<project-id>
SSH into the GPU server with visualisation enabled:
login02:~$ ssh -X turing01
Move into the repository:
turing01:~$ cd gpu-ai-liger
Start the TensorFlow plot-enabled container with Singularity:
turing01:~$ module load singularity
turing01:~$ singularity run --nv -B ./:/app /softs/singularity/containers/ai/tensorflow-ngc-plot.simg
This command can be used to start a shell session in any kind of container. Keep in mind that, inside a container, all the tools that were built into it at creation time (programs, data, etc.) are available even if they are not present in the host environment. In this case the container was pre-built by us, but any container can be further customised by the user to suit their needs.
Furthermore, host files and folders can be bind-mounted into the container at runtime with the -B Singularity option (as done above with -B ./:/app). Use singularity help for more information.
Move into the examples folder:
Singularity> cd dl-examples
Start an IPython session:
Singularity> ipython3
#output
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
From this environment you can run Python code interactively. Run the following to make a prediction using the previously created model:
In [1]: run mnist_predict.py
After a few seconds, you should see the original picture and the corresponding prediction of the algorithm (in the title!). You can re-run the program with different sample numbers to check how it handles other digits.
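Note that IPython's run magic updates the interactive namespace with the variables defined by the script, so you can keep experimenting after it finishes. The names used below (model, x_test) are assumptions about what mnist_predict.py defines; adapt them to the actual code:

In [2]: import numpy as np
In [3]: sample = 123                                   # pick another test image (hypothetical index)
In [4]: probs = model.predict(x_test[sample:sample + 1])
In [5]: print(int(np.argmax(probs)))                   # the predicted digit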
Make sure to log out (Ctrl-D) from the container and from turing01, and clean up the node allocation with scancel.
Troubleshooting
Refer to the Troubleshooting page.