# Testing available machines

*The layout of this page needs some clarification...*
## Test based on a Monte Carlo integration to estimate Pi

A basic program that highlights the basic principles of using these machines.

### Where to find the code?
https://gitlab.in2p3.fr/lpnhe/HPC/tree/master/gpu/pi-test

### One core

* With GCC
```bash
gcc -O0 -lm pi_onecore.c            # Default optimization (none)
time ./a.out                        # typ. 38 seconds
gcc -O2 -lm pi_onecore.c            # Standard good optimization (-O3 is not better)
time ./a.out                        # typ. 31 seconds
```

* With Intel (see the [lpnhe website](http://lpnhe.in2p3.fr/spip.php?article1116) for access)
```bash
icc -O0 pi_onecore.c                                # No optimization
time ./a.out                                        # typ. 38 seconds
icc -m64 -O3 -Wall -fPIC -ipo -xavx pi_onecore.c    # Works on lpnp110, not on lpnws
time ./a.out                                        # typ. 29 seconds
```

* With Python
```bash
python pi_onecore.py                # typ. 976 seconds (yes, Python is slow!)
```

With Numba optimization (automatic generation of optimized machine code using LLVM):
```bash
python pi_onecore_numba.py          # typ. 35.4 seconds
```
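
As an illustration of what Numba does (this sketch is not the actual `pi_onecore_numba.py` from the repository): decorating the hot loop with `@njit` makes Numba compile it to machine code through LLVM on the first call, with the Python source otherwise unchanged. The `ImportError` fallback is only there so the sketch also runs where Numba is not installed.

```python
import random

try:
    from numba import njit          # JIT-compile the decorated function via LLVM
except ImportError:                 # fallback: run as plain (slow) Python
    def njit(func):
        return func

@njit
def pi_estimate(n):
    inside = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:    # point falls inside the quarter disc
            inside += 1
    return 4.0 * inside / n

print("pi ~", pi_estimate(1_000_000))
```

The first call pays the compilation cost; subsequent calls run at near-C speed, which is consistent with the ~28x speedup reported above.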

### OpenMP

* With GCC
```bash
gcc -O2 -lm -fopenmp pi_omp.c       # OpenMP implementation
export OMP_NUM_THREADS=4            # number of CPU cores on the gpu machine; set it to 16 on p110
time ./a.out                        # typ. 7.7 seconds
```

* With Intel
```bash
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 -restrict -fargument-noalias-global -qopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=4
time ./a.out                        # typ. 9.3 seconds ( !!!! )
```

* On the Phi machine (host side)
```bash
gcc -O3 -lm -fopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=16
time ./a.out                        # typ. 1.298 seconds, best result so far on a local LPNHE machine
```

* On the Phi machine (Phi device)
```bash
# cf. compilation and execution on the Phi: README.md in the phi folder of the HPC project
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=239
time ./a.out                        # typ. 3.96 seconds (works out of the box, no code changes needed)
```

### MPI

* With GCC
```bash
mpicc -O2 pi_mpi.c                  # MPI with gcc (see mpicc -showme for the underlying flags)
time mpirun -np 4 ./a.out           # typ. 9.65 seconds
```

* With Intel (to override the wrapper compiler options, see [here](https://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0))
```bash
export OMPI_CC=icc                  # make the Open MPI wrapper use icc
mpicc -O2 pi_mpi.c
time mpirun -np 4 ./a.out           # typ. 9.95 seconds
```

### GPU Implementation

* Version 1: a single device (i.e. a single GPU card), with the reduction done on the host (the CPU).

- C++ version
```bash
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu.cu                  # One may prefer to use the Makefile from CUDA
time ./a.out                        # typ. 6.32 seconds: 1D grid, 1 K2200 GPU card (out of 2 available), asking for 10 000 blocks
```

- Python version
```bash
python pi_gpu_cuda.py               # typ. 0.15 seconds: 1D grid, 1 K2200 GPU card (out of 2 available), asking for 1024 blocks, but giving a wrong result...
```

* Version 2: a single device; per-block reduction with many threads per block, then a final reduction on the host.
```bash
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu_v2.cu
time ./a.out                        # typ. 0.231 seconds: 1D block grid (N=32), 1024 threads per block. Time is now mostly spent in initialization!
```

* Version 3: a single device, using Thrust.
```bash
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_thrust.exe pi_gpu_thrust.cu --ptxas-options -v
time ./pi_gpu_thrust.exe            # Now ERR=1e-5 (should be 100 times longer...)
                                    # typ. 5.5 seconds
```

* Version 4: all GPU devices (reduction done by hand).
```bash
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_multiGPU.exe pi_gpu_multiGPU.cu --ptxas-options -v
```