Testing available machines
The layout of this page still needs some clean-up...
Test based on Monte Carlo integration for a pi estimate.
Basic code meant to illustrate the basic principles of using the available machines (a minimal serial sketch of the method is given below).
Where to find the code?
https://gitlab.in2p3.fr/lpnhe/HPC/tree/master/gpu/pi-test
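The principle of the benchmark: draw N random points uniformly in the unit square; the fraction of points falling inside the quarter disc x^2 + y^2 <= 1 tends to pi/4, so pi ~ 4*hits/N. Below is a minimal serial sketch of that idea in C; it is only an illustration (the sample count and names are made up), not the repository's pi_onecore.c.

/* Minimal serial Monte Carlo estimate of pi.
 * Illustrative sketch only, not the repository's pi_onecore.c.
 * Build: gcc -O2 pi_sketch.c */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 100000000L;    /* number of random points (illustrative) */
    long hits = 0;

    srand(42);                    /* fixed seed for reproducibility */
    for (long i = 0; i < n; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) /* point falls inside the quarter disc */
            hits++;
    }
    /* area of quarter disc / area of unit square = pi/4 */
    printf("pi ~ %.8f\n", 4.0 * (double)hits / (double)n);
    return 0;
}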
One core
- With GCC
gcc -O0 -lm pi_onecore.c # No optimization (the gcc default)
time ./a.out # typ. 38 seconds
gcc -O2 -lm pi_onecore.c # Standard good optimization (-O3 is not better)
time ./a.out # typ. 31 seconds
- With Intel (see the LPNHE website for how to get access)
icc -O0 pi_onecore.c # No optimization
time ./a.out # typ. 38 seconds
icc -O2 pi_onecore.c # Standard good optimization
time ./a.out # typ. 31 seconds
icc -O2 -parallel pi_onecore.c # Optimization plus automatic parallelization by the compiler
time ./a.out # typ. 31 seconds: the automatic parallelization does not help!
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 pi_onecore.c
time ./a.out # typ. 30.7 seconds
icc -m64 -O3 -Wall -fPIC -ipo -xavx pi_onecore.c # Works on lpnp110 but not on lpnws5232
time ./a.out # typ. 29 seconds
- With Python
python pi_onecore.py # typ. 976 seconds (yes, pure Python is slow!)
With Numba optimization (automatic generation of optimized machine code using LLVM)
python pi_onecore_numba.py # typ. 35.4 seconds
OpenMP
- With GCC
gcc -O2 -lm -fopenmp pi_omp.c # OpenMP implementation (a minimal sketch is given at the end of this section)
export OMP_NUM_THREADS=4 # number of CPU cores on the GPU machine; set to 16 on p110
time ./a.out # typ. 7.7 seconds
- With Intel
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 -restrict -fargument-noalias-global -qopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=4
time ./a.out # typ. 9.3 seconds (surprisingly slower than the GCC build!)
- On the Phi machine (host machine)
gcc -O3 -lm -fopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=16
time ./a.out # typ. 1.298 seconds, best result so far on a local LPNHE machine
- On the Phi (Phi device)
# cf. compilation and execution on the Phi: see README.md in the phi folder of the HPC project
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=239
time ./a.out # typ. 3.96 seconds (no changes needed to the code, it works out of the box)
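For reference, here is a minimal sketch of how such an OpenMP Monte Carlo loop can be written; it is only an illustration (not the repository's pi_omp.c): the sample loop is distributed over the threads and the per-thread hit counters are combined with a reduction clause.

/* Minimal OpenMP Monte Carlo pi sketch.
 * Illustrative only, not the repository's pi_omp.c.
 * Build: gcc -O2 -fopenmp pi_omp_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long n = 100000000L;   /* number of random points (illustrative) */
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        /* per-thread random state so threads do not share one generator */
        unsigned int seed = 1234u + 17u * (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
    }
    printf("pi ~ %.8f\n", 4.0 * (double)hits / (double)n);
    return 0;
}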
MPI
- With GCC
mpicc -O2 pi_mpi.c # MPI with gcc (see mpicc -showme for the underlying arguments); a minimal sketch is given at the end of this section
time mpirun -np 4 ./a.out # typ. 9.65 seconds
- With Intel (for additional options, see here)
export OMPI_CC=icc
mpicc -m64 -O3 -Wall -fPIC -msse4.2 -restrict -fargument-noalias-global pi_mpi.c
time mpirun -np 4 ./a.out # typ. 9.95 seconds
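For reference, here is a minimal sketch of how the MPI version can be organized; it is only an illustration (not the repository's pi_mpi.c): every rank counts hits on its own share of the samples and the partial counts are combined on rank 0 with MPI_Reduce.

/* Minimal MPI Monte Carlo pi sketch.
 * Illustrative only, not the repository's pi_mpi.c.
 * Build: mpicc -O2 pi_mpi_sketch.c ; run: mpirun -np 4 ./a.out */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const long n_total = 100000000L; /* total number of points (illustrative) */
    long local_hits = 0, total_hits = 0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long n_local = n_total / size;   /* each rank handles its own share */
    unsigned int seed = 1234u + 17u * (unsigned int)rank;

    for (long i = 0; i < n_local; i++) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0)
            local_hits++;
    }

    /* combine the per-rank counts on rank 0 */
    MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~ %.8f\n", 4.0 * (double)total_hits / (double)(n_local * size));

    MPI_Finalize();
    return 0;
}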
GPU Implementation
- Version 1: only one device (i.e. a single GPU card), with the reduction done on the host (the CPU).
- C++ version
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu.cu # One may prefer using the Makefile from the CUDA examples (very slightly faster)
time ./a.out # typ. 6.32 seconds using a 1D grid, one K2200 GPU card (out of the 2 available), asking for 10 000 blocks
- Python version
python pi_gpu_cuda.py # typ. 0.15 seconds using a 1D grid, one K2200 GPU card (out of the 2 available), asking for 1024 blocks, but it gives a wrong result...
- Version 2: only one device, block reduction with many threads per block, then a final reduction on the host machine.
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu_v2.cu
time ./a.out # typ. 0.231 seconds: 1D block grid (N=32), 1024 threads per block. The time is now mainly spent in initialization!
- Version 3: only one device, but using Thrust
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_thrust.exe pi_gpu_thrust.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=1 # Use GPU #1, just for a change...
time ./pi_gpu_thrust.exe # Now ERR=1e-5 (should be about 100 times longer...)
# typ. 5.5 seconds
- Version 4: all GPU devices (reduction step done by hand)
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_multiGPU.exe pi_gpu_multiGPU.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=0,1
time ./pi_gpu_multiGPU.exe # typ. 4.4 seconds, still with ERR=1e-5
nvvp ./pi_gpu_multiGPU.exe # NVIDIA profiling UI: one should check that the CUDA calls are asynchronous and that the CUDA devices run in parallel (good!)