Testing the machines at our disposal
Tests based on a Monte Carlo integration to estimate Pi.
Deliberately simple code, written to highlight the basic principles of using each machine.
Where to find the code?
https://gitlab.in2p3.fr/lpnhe/HPC/tree/master/gpu/pi-test
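For reference, the Monte Carlo estimator used by every variant below: draw N random points uniformly in the unit square and count how many fall inside the quarter disk of radius 1; the ratio of hits tends to Pi/4. The actual pi_onecore.c is in the repository above; the following is only a minimal sketch of the same idea (the sample count N and the use of rand() are illustrative assumptions).

/* Minimal single-core Monte Carlo sketch (illustrative, not the repository code). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long N = 100000000L;                  /* number of random points (assumption) */
    long inside = 0;

    srand(42);
    for (long i = 0; i < N; ++i) {
        double x = (double)rand() / RAND_MAX;   /* random point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)               /* inside the quarter disk? */
            ++inside;
    }
    printf("pi ~ %.6f\n", 4.0 * (double)inside / (double)N);
    return 0;
}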
Single core
- With GCC
gcc -O0 -lm pi_onecore.c # No optimization (gcc default)
time ./a.out # typ. 38 seconds
gcc -O2 -lm pi_onecore.c # Standard good optimization (-O3 is not better)
time ./a.out # typ. 31 seconds
- With Intel (see the LPNHE website for access)
icc -O0 pi_onecore.c # No optimization
time ./a.out # typ. 38 seconds
icc -O2 pi_onecore.c # Standard good optimization
time ./a.out # typ. 31 seconds
icc -O2 -parallel pi_onecore.c # Optimization plus automatic parallelization by the compiler
time ./a.out # typ. 31 seconds: auto-parallelization brings no speedup here!
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 pi_onecore.c
time ./a.out # typ. 30.7 seconds
icc -m64 -O3 -Wall -fPIC -ipo -xavx pi_onecore.c # Works on lpnp110 but not on lpnws5232
time ./a.out # typ. 29 seconds
- With Python
python pi_onecore.py # typ. 976 seconds (yes, pure Python is slow!)
With Numba optimization (automatic generation of optimized machine code using LLVM):
python pi_onecore_numba.py # typ. 35.4 seconds
OpenMP
- With GCC
gcc -O2 -lm -fopenmp pi_omp.c # OpenMP implementation (the scheme is sketched after this section)
export OMP_NUM_THREADS=4 # number of CPU cores on the GPU machine; set to 16 on lpnp110
time ./a.out # typ. 7.7 seconds
- With Intel
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 -restrict -fargument-noalias-global -qopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=4
time ./a.out # typ. 9.3 seconds (surprisingly slower than GCC here!)
- On the Phi machine (host side)
gcc -O3 -lm -fopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=16
time ./a.out # typ. 1.298 seconds BEST RESULT EVER on a local LPNHE machine
- On the Phi machine (Phi device)
# cf. Phi compilation and execution, README.md in the phi folder of the HPC project
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=239
time ./a.out # typ. 3.96 seconds (no code changes needed, it works out of the box)
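The pi_omp.c runs above only require the sampling loop to be annotated; a minimal sketch of the standard OpenMP reduction pattern, assuming the same structure as the single-core sketch (the actual pi_omp.c is in the repository):

/* Minimal OpenMP Monte Carlo sketch (illustrative, not the repository code).
   Build with e.g.: gcc -O2 -fopenmp sketch_omp.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long N = 100000000L;                  /* number of random points (assumption) */
    long inside = 0;

    /* Each thread handles a chunk of the loop; the per-thread counts are
       combined by the reduction clause. rand_r keeps the RNG thread-local. */
    #pragma omp parallel reduction(+:inside)
    {
        unsigned int seed = 42u + (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < N; ++i) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                ++inside;
        }
    }
    printf("pi ~ %.6f\n", 4.0 * (double)inside / (double)N);
    return 0;
}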
MPI
- With GCC
mpicc -O2 pi_mpi.c # MPI build with gcc (see mpicc -showme for the underlying args); the scheme is sketched after this section
time mpirun -np 4 ./a.out # typ. 9.65 seconds
- With Intel (the OMPI_CC variable below selects the compiler used by the mpicc wrapper)
export OMPI_CC=icc
mpicc -m64 -O3 -Wall -fPIC -msse4.2 -restrict -fargument-noalias-global pi_mpi.c
time mpirun -np 4 ./a.out # typ. 9.95 seconds
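Same idea with MPI: each rank draws its own share of the points and MPI_Reduce sums the partial counts onto rank 0. A minimal sketch under the same assumptions as above (the actual pi_mpi.c is in the repository):

/* Minimal MPI Monte Carlo sketch (illustrative, not the repository code). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long N = 100000000L;                  /* total number of points (assumption) */
    const long n_local = N / size;              /* each rank draws its share */
    long inside = 0, total = 0;

    unsigned int seed = 42u + (unsigned int)rank;   /* independent stream per rank */
    for (long i = 0; i < n_local; ++i) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0)
            ++inside;
    }

    /* Sum the per-rank counts onto rank 0. */
    MPI_Reduce(&inside, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~ %.6f\n", 4.0 * (double)total / (double)(n_local * size));

    MPI_Finalize();
    return 0;
}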
GPU Implementation
- Version 1: a single device (i.e. one GPU card), reduction on the host (the CPU); sketched below.
- C++ version
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu.cu # One may prefer the Makefile from the CUDA examples (very slightly faster)
time ./a.out # typ. 6.32 seconds: 1D grid, 1 K2200 GPU card (out of 2 available), asking for 10 000 blocks
- Python version
python pi_gpu_cuda.py # typ. 0.15 seconds: 1D grid, 1 K2200 GPU card (out of 2 available), asking for 1024 blocks, but giving a wrong result...
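In version 1 the device only produces raw hit counts: each CUDA thread counts its own hits and writes them to global memory, and the whole reduction happens on the CPU after the copy back. A minimal sketch of that scheme (illustrative; the grid shape, the workload per thread and the use of cuRAND are assumptions, the actual code is pi_gpu.cu):

/* Minimal CUDA sketch of version 1: per-thread counts, reduction on the host. */
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void count_hits(unsigned long long *counts, long points_per_thread,
                           unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);          /* independent RNG stream per thread */

    unsigned long long inside = 0;
    for (long i = 0; i < points_per_thread; ++i) {
        double x = curand_uniform_double(&state);
        double y = curand_uniform_double(&state);
        if (x * x + y * y <= 1.0)
            ++inside;
    }
    counts[tid] = inside;                       /* no reduction on the device */
}

int main()
{
    const int blocks = 10000, threads = 128;    /* 1D grid shape (assumption) */
    const long points_per_thread = 1000;        /* workload per thread (assumption) */
    const int n = blocks * threads;

    unsigned long long *d_counts, *h_counts = new unsigned long long[n];
    cudaMalloc(&d_counts, n * sizeof(unsigned long long));

    count_hits<<<blocks, threads>>>(d_counts, points_per_thread, 42ULL);
    cudaMemcpy(h_counts, d_counts, n * sizeof(unsigned long long), cudaMemcpyDeviceToHost);

    unsigned long long inside = 0;              /* final reduction on the CPU */
    for (int i = 0; i < n; ++i)
        inside += h_counts[i];

    printf("pi ~ %.6f\n", 4.0 * (double)inside / ((double)n * (double)points_per_thread));

    cudaFree(d_counts);
    delete[] h_counts;
    return 0;
}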
- Version 2: a single device, per-block reduction with several threads per block, then final reduction on the host (sketched below).
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu_v2.cu
time ./a.out # typ. 0.231 seconds: 1D block grid (N=32), 1024 threads per block. Time is now mostly spent in initialization!
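Version 2 moves the first reduction stage onto the device: the threads of a block accumulate their hit counts in shared memory and reduce them to a single value per block, so the host only has to sum one number per block. A sketch of such a per-block reduction kernel, reusing the includes and RNG of the version-1 sketch (illustrative; the actual kernel is in pi_gpu_v2.cu):

/* Sketch of a version-2 style kernel: per-block reduction in shared memory,
   one partial count per block, final (short) sum on the host. */
__global__ void count_hits_v2(unsigned long long *block_counts, long points_per_thread,
                              unsigned long long seed)
{
    extern __shared__ unsigned long long sdata[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    curandState state;
    curand_init(seed, tid, 0, &state);

    unsigned long long inside = 0;
    for (long i = 0; i < points_per_thread; ++i) {
        double x = curand_uniform_double(&state);
        double y = curand_uniform_double(&state);
        if (x * x + y * y <= 1.0)
            ++inside;
    }

    sdata[threadIdx.x] = inside;
    __syncthreads();

    /* Tree reduction within the block: blockDim.x must be a power of two (e.g. 1024). */
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_counts[blockIdx.x] = sdata[0];    /* one value per block for the host */
}
/* Launched as e.g.: count_hits_v2<<<32, 1024, 1024 * sizeof(unsigned long long)>>>(...); */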
- Version 3: a single device, but using Thrust (sketched below)
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_thrust.exe pi_gpu_thrust.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=1 # Use GPU #1, just for a change...
time ./pi_gpu_thrust.exe # Now with ERR=1e-5 (should be about 100 times longer...)
# typ. 5.5 seconds
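With Thrust the sampling and the reduction are expressed as a single transform_reduce over a counting iterator, and Thrust generates the kernel and the device-side reduction itself. A minimal sketch of that pattern (illustrative; the sample count is an assumption, the actual code is pi_gpu_thrust.cu):

/* Minimal Thrust sketch: transform_reduce does sampling and reduction on the device. */
#include <cstdio>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/random.h>

struct hit_counter
{
    __host__ __device__ unsigned long long operator()(unsigned long long i) const
    {
        thrust::default_random_engine rng(42);
        thrust::uniform_real_distribution<double> uni(0.0, 1.0);
        rng.discard(2 * i);                     /* jump to this sample's sub-sequence */
        double x = uni(rng), y = uni(rng);
        return (x * x + y * y <= 1.0) ? 1ULL : 0ULL;
    }
};

int main()
{
    const unsigned long long N = 100000000ULL;  /* number of samples (assumption) */
    unsigned long long inside = thrust::transform_reduce(
        thrust::device,
        thrust::counting_iterator<unsigned long long>(0),
        thrust::counting_iterator<unsigned long long>(N),
        hit_counter(), 0ULL, thrust::plus<unsigned long long>());

    printf("pi ~ %.6f\n", 4.0 * (double)inside / (double)N);
    return 0;
}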
- Version 4: all GPU devices (manual reduction); sketched below
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_multiGPU.exe pi_gpu_multiGPU.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=0,1
time ./pi_gpu_multiGPU.exe # typ. 4.4 seconds, still with ERR=1e-5
nvvp ./pi_gpu_multiGPU.exe # NVIDIA profiling UI: one can check that the CUDA calls are asynchronous and that the CUDA devices run in parallel (they do!)
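Version 4 simply loops over the devices: the work is split, one kernel is launched per device after cudaSetDevice, and the per-device partial counts are reduced by hand on the host. Because kernel launches are asynchronous, the GPUs compute in parallel, which is what the nvvp check above confirms. A sketch of that driver, reusing the count_hits_v2 kernel from the version-2 sketch (illustrative; grid shape and workload are assumptions, the actual code is pi_gpu_multiGPU.cu):

/* Sketch of the multi-GPU driver: one asynchronous launch per device,
   manual reduction of the per-device partial counts on the host.
   Assumes the count_hits_v2 kernel from the version-2 sketch above. */
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const int blocks = 32, threads = 1024;      /* grid shape (assumption) */
    const long points_per_thread = 100000;      /* workload per thread (assumption) */

    int n_dev = 0;
    cudaGetDeviceCount(&n_dev);                 /* honours CUDA_VISIBLE_DEVICES */

    std::vector<unsigned long long*> d_counts(n_dev);
    for (int d = 0; d < n_dev; ++d) {
        cudaSetDevice(d);                       /* subsequent calls target device d */
        cudaMalloc(&d_counts[d], blocks * sizeof(unsigned long long));
        /* Kernel launches are asynchronous: device d starts working while the
           loop moves on to device d+1, so the GPUs run in parallel. */
        count_hits_v2<<<blocks, threads, threads * sizeof(unsigned long long)>>>(
            d_counts[d], points_per_thread, 42ULL + d);
    }

    unsigned long long inside = 0;
    std::vector<unsigned long long> h_counts(blocks);
    for (int d = 0; d < n_dev; ++d) {
        cudaSetDevice(d);
        /* cudaMemcpy waits for the kernel on this device, then copies its counts. */
        cudaMemcpy(h_counts.data(), d_counts[d], blocks * sizeof(unsigned long long),
                   cudaMemcpyDeviceToHost);
        for (int b = 0; b < blocks; ++b)
            inside += h_counts[b];
        cudaFree(d_counts[d]);
    }

    double total = (double)n_dev * blocks * threads * (double)points_per_thread;
    printf("pi ~ %.6f\n", 4.0 * (double)inside / total);
    return 0;
}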