Testing available machines
The layout of this page still needs some clean-up...
Test based on Monte Carlo integration for a pi estimate.
Basic code meant to illustrate the basic principles of using the available machines (a minimal serial sketch of the method is given below).
Where to find the code?
https://gitlab.in2p3.fr/lpnhe/HPC/tree/master/gpu/pi-test
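The principle of the benchmark: draw N random points uniformly in the unit square; the fraction of points falling inside the quarter disc x^2 + y^2 <= 1 tends to pi/4, so pi ~ 4*hits/N. Below is a minimal serial sketch of that idea in C; it is only an illustration (the sample count and names are made up), not the repository's pi_onecore.c.

/* Minimal serial Monte Carlo estimate of pi.
 * Illustrative sketch only, not the repository's pi_onecore.c.
 * Build: gcc -O2 pi_sketch.c */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 100000000L;    /* number of random points (illustrative) */
    long hits = 0;

    srand(42);                    /* fixed seed for reproducibility */
    for (long i = 0; i < n; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) /* point falls inside the quarter disc */
            hits++;
    }
    /* area of quarter disc / area of unit square = pi/4 */
    printf("pi ~ %.8f\n", 4.0 * (double)hits / (double)n);
    return 0;
}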
One core
- With GCC
gcc -O0 -lm pi_onecore.c # No optimization (the gcc default)
time ./a.out # typ. 38 seconds
gcc -O2 -lm pi_onecore.c # Standard good optimization (-O3 is not better)
time ./a.out # typ. 31 seconds
- With Intel (see the LPNHE website for how to get access)
icc -O0 pi_onecore.c # No optimization
time ./a.out # typ. 38 seconds
icc -O2 pi_onecore.c # Standard good optimization
time ./a.out # typ. 31 seconds
icc -O2 -parallel pi_onecore.c # Optimization plus automatic parallelization by the compiler
time ./a.out # typ. 31 seconds: the automatic parallelization does not help!
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 pi_onecore.c
time ./a.out # typ. 30.7 seconds
icc -m64 -O3 -Wall -fPIC -ipo -xavx pi_onecore.c # Works on lpnp110 but not on lpnws5232
time ./a.out # typ. 29 seconds
- With Python
python pi_onecore.py # typ. 976 seconds (yes, pure Python is slow!)
With Numba optimization (automatic generation of optimized machine code using LLVM)
python pi_onecore_numba.py # typ. 35.4 seconds
OpenMP
- With GCC
gcc -O2 -lm -fopenmp pi_omp.c # OpenMP implementation (a minimal sketch is given at the end of this section)
export OMP_NUM_THREADS=4 # number of CPU cores on the GPU machine; set to 16 on p110
time ./a.out # typ. 7.7 seconds
- With Intel
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 -restrict -fargument-noalias-global -qopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=4
time ./a.out # typ. 9.3 seconds (surprisingly slower than the GCC build!)
- On the Phi machine (host machine)
gcc -O3 -lm -fopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=16
time ./a.out # typ. 1.298 seconds, best result so far on a local LPNHE machine
- On the Phi (Phi device)
# cf. compilation and execution on the Phi: see README.md in the phi folder of the HPC project
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=239
time ./a.out # typ. 3.96 seconds (no changes needed to the code, it works out of the box)
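For reference, here is a minimal sketch of how such an OpenMP Monte Carlo loop can be written; it is only an illustration (not the repository's pi_omp.c): the sample loop is distributed over the threads and the per-thread hit counters are combined with a reduction clause.

/* Minimal OpenMP Monte Carlo pi sketch.
 * Illustrative only, not the repository's pi_omp.c.
 * Build: gcc -O2 -fopenmp pi_omp_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long n = 100000000L;   /* number of random points (illustrative) */
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        /* per-thread random state so threads do not share one generator */
        unsigned int seed = 1234u + 17u * (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
    }
    printf("pi ~ %.8f\n", 4.0 * (double)hits / (double)n);
    return 0;
}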
MPI
- With GCC
mpicc -O2 pi_mpi.c # MPI with gcc (see mpicc -showme for the underlying arguments); a minimal sketch is given at the end of this section
time mpirun -np 4 ./a.out # typ. 9.65 seconds
- With Intel (for additional options, see here)
export OMPI_CC=icc
mpicc -m64 -O3 -Wall -fPIC -msse4.2 -restrict -fargument-noalias-global pi_mpi.c
time mpirun -np 4 ./a.out # typ. 9.95 seconds
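For reference, here is a minimal sketch of how the MPI version can be organized; it is only an illustration (not the repository's pi_mpi.c): every rank counts hits on its own share of the samples and the partial counts are combined on rank 0 with MPI_Reduce.

/* Minimal MPI Monte Carlo pi sketch.
 * Illustrative only, not the repository's pi_mpi.c.
 * Build: mpicc -O2 pi_mpi_sketch.c ; run: mpirun -np 4 ./a.out */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const long n_total = 100000000L; /* total number of points (illustrative) */
    long local_hits = 0, total_hits = 0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long n_local = n_total / size;   /* each rank handles its own share */
    unsigned int seed = 1234u + 17u * (unsigned int)rank;

    for (long i = 0; i < n_local; i++) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0)
            local_hits++;
    }

    /* combine the per-rank counts on rank 0 */
    MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~ %.8f\n", 4.0 * (double)total_hits / (double)(n_local * size));

    MPI_Finalize();
    return 0;
}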
GPU Implementation
- Version 1: only one device (i.e. a single GPU card), with the reduction done on the host (the CPU).
- C++ version
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu.cu # One may prefer using the Makefile from the CUDA examples (very slightly faster)
time ./a.out # typ. 6.32 seconds using a 1D grid, one K2200 GPU card (out of the 2 available), asking for 10 000 blocks
- Python version
python pi_gpu_cuda.py # typ. 0.15 seconds using a 1D grid, one K2200 GPU card (out of the 2 available), asking for 1024 blocks, but it gives a wrong result...
- Version 2: only one device, block reduction with many threads per block, then a final reduction on the host machine.
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu_v2.cu
time ./a.out # typ. 0.231 seconds: 1D block grid (N=32), 1024 threads per block. The time is now mainly spent in initialization!
- Version 3: only one device, but using Thrust
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_thrust.exe pi_gpu_thrust.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=1 # Use GPU #1, just for a change...
time ./pi_gpu_thrust.exe # Now ERR=1e-5 (should be about 100 times longer...)
# typ. 5.5 seconds
- Version 4: all GPU devices (reduction step done by hand)
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_multiGPU.exe pi_gpu_multiGPU.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=0,1
time ./pi_gpu_multiGPU.exe # typ. 4.4 seconds, still with ERR=1e-5
nvvp ./pi_gpu_multiGPU.exe # NVIDIA profiling UI: one should check that the CUDA calls are asynchronous and that the CUDA devices run in parallel (good!)