# Testing the machines at our disposal #

## Test based on a Monte Carlo integration to estimate Pi ##

A basic piece of code meant to highlight the basic principles of using these machines.

### Where to find the code? ###

https://gitlab.in2p3.fr/lpnhe/HPC/tree/master/gpu/pi-test

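For reference, a minimal sketch of the kind of loop pi_onecore.c implements: draw random points in the unit square and count how many fall inside the quarter circle. The actual file in the repository may differ; the sample count and seed below are arbitrary choices.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 100000000L;  /* number of random points (arbitrary choice) */
    long hits = 0;

    srand(42);
    for (long i = 0; i < n; i++) {
        double x = (double)rand() / RAND_MAX;  /* point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)              /* inside the quarter circle? */
            hits++;
    }
    /* The hit ratio estimates pi/4, the area of the quarter circle */
    printf("pi ~ %.8f\n", 4.0 * (double)hits / (double)n);
    return 0;
}
```
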
### Single core

* With GCC

```bash
gcc -O0 -lm pi_onecore.c  # No optimization (the gcc default)
time ./a.out              # typ. 38 seconds
```

```bash
gcc -O2 -lm pi_onecore.c  # Standard good optimization (-O3 is not better)
time ./a.out              # typ. 31 seconds
```

* With Intel (see the [LPNHE site](http://lpnhe.in2p3.fr/spip.php?article1116) to get access to it)

```bash
icc -O0 pi_onecore.c  # No optimization
time ./a.out          # typ. 38 seconds
```

```bash
icc -O2 pi_onecore.c  # Standard good optimization
time ./a.out          # typ. 31 seconds
```

```bash
icc -O2 -parallel pi_onecore.c  # Optimization plus automatic parallelization by the compiler
time ./a.out                    # typ. 31 seconds: the automatic parallelization has no effect here!
```

```bash
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 pi_onecore.c
time ./a.out  # typ. 30.7 seconds
```

```bash
icc -m64 -O3 -Wall -fPIC -ipo -xavx pi_onecore.c  # Works on lpnp110, not on lpnws5232
time ./a.out                                      # typ. 29 seconds
```

* With Python

```bash
python pi_onecore.py  # typ. 976 seconds (yes, Python is slow!)
```

With Numba optimization (automatic generation of optimized machine code using LLVM):

```bash
python pi_onecore_numba.py  # typ. 35.4 seconds
```

### OpenMP

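pi_omp.c parallelizes the sampling loop of the serial version with OpenMP. A minimal sketch of the pattern is given below; the per-thread seeding with rand_r is an illustrative assumption, the real file may organize this differently.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long n = 100000000L;  /* total number of samples (arbitrary choice) */
    long hits = 0;

    #pragma omp parallel
    {
        /* Independent random stream per thread */
        unsigned int seed = 1234u + omp_get_thread_num();

        /* The reduction clause gives each thread a private counter
           and sums them at the end of the loop */
        #pragma omp for reduction(+:hits)
        for (long i = 0; i < n; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
    }
    printf("pi ~ %.8f\n", 4.0 * (double)hits / (double)n);
    return 0;
}
```
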
* With GCC

```bash
gcc -O2 -lm -fopenmp pi_omp.c  # OpenMP implementation
export OMP_NUM_THREADS=4       # number of CPU cores on the GPU machine; set to 16 on p110
time ./a.out                   # typ. 7.7 seconds
```

* With Intel

```bash
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 -restrict -fargument-noalias-global -qopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=4
time ./a.out  # typ. 9.3 seconds (surprisingly, slower than GCC!)
```

* On the Phi machine (host side)

```bash
gcc -O3 -lm -fopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=16
time ./a.out  # typ. 1.298 seconds, the best result so far on a local LPNHE machine
```

* On the Phi machine (Phi device)

```bash
# cf. the Phi compilation and execution instructions in the README of the phi folder of the HPC project
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=239
time ./a.out  # typ. 3.96 seconds (no changes needed to the code, it works out of the box)
```

### MPI

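pi_mpi.c distributes the samples over the MPI ranks and combines the partial counts with MPI_Reduce. A minimal sketch of that pattern follows; the sample count and the seeding scheme are illustrative assumptions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const long n = 100000000L;  /* total number of samples (arbitrary choice) */
    long local_hits = 0, total_hits = 0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    unsigned int seed = 1234u + rank;        /* independent stream per rank */
    for (long i = rank; i < n; i += size) {  /* each rank handles ~n/size points */
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0)
            local_hits++;
    }

    /* Sum the partial counts on rank 0 */
    MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~ %.8f\n", 4.0 * (double)total_hits / (double)n);

    MPI_Finalize();
    return 0;
}
```
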
* With GCC

```bash
mpicc -O2 pi_mpi.c         # MPI with gcc (see mpicc -showme for the underlying flags)
time mpirun -np 4 ./a.out  # typ. 9.65 seconds
```

* With Intel (to override the wrapper's compiler and options, see [here](https://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0))

```bash
export OMPI_CC=icc
mpicc -m64 -O3 -Wall -fPIC -msse4.2 -restrict -fargument-noalias-global pi_mpi.c
time mpirun -np 4 ./a.out  # typ. 9.95 seconds
```

### GPU Implementation

* Version 1: a single device (i.e. a single GPU card), with the reduction done on the host (the CPU). A sketch of the kernel pattern is given after the two variants below.

- C++ version

```bash
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu.cu  # One may prefer the Makefile from the CUDA examples (very slightly faster)
time ./a.out        # typ. 6.32 seconds with a 1D grid, 1 K2200 GPU card (out of the 2 available), asking for 10 000 blocks
```

- Python version

```bash
python pi_gpu_cuda.py  # typ. 0.15 seconds with a 1D grid, 1 K2200 GPU card (out of the 2 available), asking for 1024 blocks, but it gives a wrong result...
```

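A minimal sketch of the version-1 pattern: every GPU thread accumulates its own hit count and writes it to global memory, and the host does the whole reduction. The grid size, samples per thread and use of cuRAND are illustrative assumptions; the actual pi_gpu.cu is in the repository.

```c++
#include <cstdio>
#include <cstdlib>
#include <curand_kernel.h>

#define SAMPLES_PER_THREAD 1000

// Each thread draws SAMPLES_PER_THREAD points and stores its own hit count;
// nothing is reduced on the device in this version.
__global__ void pi_kernel(long long *hits, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);  // independent random stream per thread

    long long count = 0;
    for (int i = 0; i < SAMPLES_PER_THREAD; ++i) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f) count++;
    }
    hits[tid] = count;
}

int main(void)
{
    const int blocks = 1024, threads = 256;  // arbitrary grid (the run above asks for 10 000 blocks)
    const int n_threads = blocks * threads;

    long long *d_hits;
    long long *h_hits = (long long *)malloc(n_threads * sizeof(long long));
    cudaMalloc((void **)&d_hits, n_threads * sizeof(long long));

    pi_kernel<<<blocks, threads>>>(d_hits, 1234ULL);
    cudaMemcpy(h_hits, d_hits, n_threads * sizeof(long long), cudaMemcpyDeviceToHost);

    long long total = 0;
    for (int i = 0; i < n_threads; ++i) total += h_hits[i];  // reduction on the host

    double n_samples = (double)n_threads * SAMPLES_PER_THREAD;
    printf("pi ~ %.8f\n", 4.0 * total / n_samples);

    cudaFree(d_hits);
    free(h_hits);
    return 0;
}
```
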
* Version 2: a single device, with a per-block reduction (several threads per block) followed by a final reduction on the host. A sketch of this kernel is given after the commands below.

```bash
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu_v2.cu
time ./a.out  # typ. 0.231 seconds: 1D block grid (N=32), 1024 threads per block. The time is now mostly spent in initialization!
```

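A minimal sketch of the version-2 pattern: each block reduces its threads' counts in shared memory, so the host only has to add one partial sum per block. The grid size and the cuRAND-based sampling are illustrative assumptions.

```c++
#include <cstdio>
#include <curand_kernel.h>

#define SAMPLES_PER_THREAD 1000

// Per-thread counts are reduced inside each block in shared memory;
// only one partial sum per block is left for the host to add up.
__global__ void pi_kernel_v2(long long *block_hits, unsigned long long seed)
{
    extern __shared__ long long sdata[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    curandState state;
    curand_init(seed, gid, 0, &state);

    long long count = 0;
    for (int i = 0; i < SAMPLES_PER_THREAD; ++i) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f) count++;
    }
    sdata[tid] = count;
    __syncthreads();

    // Classic tree reduction in shared memory (blockDim.x must be a power of 2)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_hits[blockIdx.x] = sdata[0];
}

int main(void)
{
    const int blocks = 32, threads = 1024;  // matches the N=32 blocks of 1024 threads quoted above
    long long h_hits[32], *d_hits;
    cudaMalloc((void **)&d_hits, blocks * sizeof(long long));

    pi_kernel_v2<<<blocks, threads, threads * sizeof(long long)>>>(d_hits, 1234ULL);
    cudaMemcpy(h_hits, d_hits, blocks * sizeof(long long), cudaMemcpyDeviceToHost);

    long long total = 0;
    for (int i = 0; i < blocks; ++i) total += h_hits[i];  // tiny final reduction on the host
    printf("pi ~ %.8f\n", 4.0 * total / ((double)blocks * threads * SAMPLES_PER_THREAD));

    cudaFree(d_hits);
    return 0;
}
```
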
* Version 3: a single device, but using Thrust. A sketch of the Thrust approach is given after the commands below.

```bash
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_thrust.exe pi_gpu_thrust.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=1  # Use GPU #1, just for a change...
time ./pi_gpu_thrust.exe       # Now ERR=1e-5 (the run should therefore be about 100 times longer...)
                               # typ. 5.5 seconds
```

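A minimal sketch of the Thrust approach: thrust::transform_reduce over a counting iterator maps every trial index to 0 or 1 on the device and sums the results in a single call. The per-index RNG seeding below is an illustrative assumption.

```c++
#include <cstdio>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/random.h>

// One trial per index: jump the RNG to this trial's subsequence, draw a point,
// return 1 if it falls inside the quarter circle.
struct inside_circle
{
    __host__ __device__ long operator()(long i) const
    {
        thrust::default_random_engine rng(1234);
        thrust::uniform_real_distribution<float> u(0.0f, 1.0f);
        rng.discard(2 * i);
        float x = u(rng), y = u(rng);
        return (x * x + y * y <= 1.0f) ? 1L : 0L;
    }
};

int main(void)
{
    const long n = 100000000L;  // number of trials (arbitrary choice)
    // transform_reduce maps every index through the functor and sums the results on the device
    long hits = thrust::transform_reduce(thrust::counting_iterator<long>(0),
                                         thrust::counting_iterator<long>(n),
                                         inside_circle(), 0L, thrust::plus<long>());
    printf("pi ~ %.8f\n", 4.0 * (double)hits / (double)n);
    return 0;
}
```
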
* Version 4: all the GPU devices (reduction done by hand). A sketch of the multi-GPU pattern is given after the commands below.

```bash
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_multiGPU.exe pi_gpu_multiGPU.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=0,1
time ./pi_gpu_multiGPU.exe  # typ. 4.4 seconds, still with ERR=1e-5
nvvp ./pi_gpu_multiGPU.exe  # NVIDIA profiling UI: check that the CUDA calls are asynchronous and that the devices run in parallel (they do!)
```

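A minimal sketch of the multi-GPU pattern: one kernel launch per device (kernel launches are asynchronous with respect to the host, so the devices work in parallel), then the per-device totals are summed by hand. The grid size and the atomicAdd-based kernel are illustrative assumptions.

```c++
#include <cstdio>
#include <curand_kernel.h>

#define SAMPLES_PER_THREAD 1000
#define MAX_DEVICES 16

// Each thread adds its hit count to a single per-device counter.
__global__ void pi_kernel(unsigned long long *hits, unsigned long long seed)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, gid, 0, &state);

    unsigned long long count = 0;
    for (int i = 0; i < SAMPLES_PER_THREAD; ++i) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f) count++;
    }
    atomicAdd(hits, count);
}

int main(void)
{
    int n_dev = 0;
    cudaGetDeviceCount(&n_dev);              // e.g. 2 with CUDA_VISIBLE_DEVICES=0,1
    if (n_dev > MAX_DEVICES) n_dev = MAX_DEVICES;

    const int blocks = 1024, threads = 256;  // arbitrary grid per device
    unsigned long long *d_hits[MAX_DEVICES];

    // Issue all the launches first: they return immediately, so the devices overlap
    for (int d = 0; d < n_dev; ++d) {
        cudaSetDevice(d);
        cudaMalloc((void **)&d_hits[d], sizeof(unsigned long long));
        cudaMemset(d_hits[d], 0, sizeof(unsigned long long));
        pi_kernel<<<blocks, threads>>>(d_hits[d], 1234ULL + d);
    }

    // Manual reduction: wait for each device and add its partial count
    unsigned long long total = 0;
    for (int d = 0; d < n_dev; ++d) {
        unsigned long long partial = 0;
        cudaSetDevice(d);
        cudaMemcpy(&partial, d_hits[d], sizeof partial, cudaMemcpyDeviceToHost);  // implicit sync
        total += partial;
        cudaFree(d_hits[d]);
    }

    double n_samples = (double)n_dev * blocks * threads * SAMPLES_PER_THREAD;
    printf("pi ~ %.8f\n", 4.0 * total / n_samples);
    return 0;
}
```
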