Testing the machines at our disposal
Tests based on a Monte Carlo integration to estimate Pi.
Deliberately simple code, written to highlight the basic principles of using each machine.
Where to find the code?
https://gitlab.in2p3.fr/lpnhe/HPC/tree/master/gpu/pi-test
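For reference, the Monte Carlo estimator used by every variant below: draw N random points uniformly in the unit square and count how many fall inside the quarter disk of radius 1; the ratio of hits tends to Pi/4. The actual pi_onecore.c is in the repository above; the following is only a minimal sketch of the same idea (the sample count N and the use of rand() are illustrative assumptions).

/* Minimal single-core Monte Carlo sketch (illustrative, not the repository code). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long N = 100000000L;                  /* number of random points (assumption) */
    long inside = 0;

    srand(42);
    for (long i = 0; i < N; ++i) {
        double x = (double)rand() / RAND_MAX;   /* random point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)               /* inside the quarter disk? */
            ++inside;
    }
    printf("pi ~ %.6f\n", 4.0 * (double)inside / (double)N);
    return 0;
}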
Single core
- With GCC
gcc -O0 -lm pi_onecore.c # No optimization (gcc default)
time ./a.out # typ. 38 seconds
gcc -O2 -lm pi_onecore.c # Standard good optimization (-O3 is not better)
time ./a.out # typ. 31 seconds
- With Intel (see the LPNHE website for access)
icc -O0 pi_onecore.c # No optimization
time ./a.out # typ. 38 seconds
icc -O2 pi_onecore.c # Standard good optimization
time ./a.out # typ. 31 seconds
icc -O2 -parallel pi_onecore.c # Optimization plus automatic parallelization by the compiler
time ./a.out # typ. 31 seconds: auto-parallelization brings no speedup here!
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 pi_onecore.c
time ./a.out # typ. 30.7 seconds
icc -m64 -O3 -Wall -fPIC -ipo -xavx pi_onecore.c # Works on lpnp110 but not on lpnws5232
time ./a.out # typ. 29 seconds
- With Python
python pi_onecore.py # typ. 976 seconds (yes, pure Python is slow!)
With Numba optimization (automatic generation of optimized machine code using LLVM):
python pi_onecore_numba.py # typ. 35.4 seconds
OpenMP
- With GCC
gcc -O2 -lm -fopenmp pi_omp.c # OpenMP implementation (the scheme is sketched after this section)
export OMP_NUM_THREADS=4 # number of CPU cores on the GPU machine; set to 16 on lpnp110
time ./a.out # typ. 7.7 seconds
- With Intel
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 -restrict -fargument-noalias-global -qopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=4
time ./a.out # typ. 9.3 seconds (surprisingly slower than GCC here!)
- On the Phi machine (host side)
gcc -O3 -lm -fopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=16
time ./a.out # typ. 1.298 seconds BEST RESULT EVER on a local LPNHE machine
- On the Phi machine (Phi device)
# cf. Phi compilation and execution, README.md in the phi folder of the HPC project
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=239
time ./a.out # typ. 3.96 seconds (no code changes needed, it works out of the box)
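The pi_omp.c runs above only require the sampling loop to be annotated; a minimal sketch of the standard OpenMP reduction pattern, assuming the same structure as the single-core sketch (the actual pi_omp.c is in the repository):

/* Minimal OpenMP Monte Carlo sketch (illustrative, not the repository code).
   Build with e.g.: gcc -O2 -fopenmp sketch_omp.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long N = 100000000L;                  /* number of random points (assumption) */
    long inside = 0;

    /* Each thread handles a chunk of the loop; the per-thread counts are
       combined by the reduction clause. rand_r keeps the RNG thread-local. */
    #pragma omp parallel reduction(+:inside)
    {
        unsigned int seed = 42u + (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < N; ++i) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                ++inside;
        }
    }
    printf("pi ~ %.6f\n", 4.0 * (double)inside / (double)N);
    return 0;
}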
MPI
- With GCC
mpicc -O2 pi_mpi.c # MPI build with gcc (see mpicc -showme for the underlying args); the scheme is sketched after this section
time mpirun -np 4 ./a.out # typ. 9.65 seconds
- With Intel (the OMPI_CC variable below selects the compiler used by the mpicc wrapper)
export OMPI_CC=icc
mpicc -m64 -O3 -Wall -fPIC -msse4.2 -restrict -fargument-noalias-global pi_mpi.c
time mpirun -np 4 ./a.out # typ. 9.95 seconds
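Same idea with MPI: each rank draws its own share of the points and MPI_Reduce sums the partial counts onto rank 0. A minimal sketch under the same assumptions as above (the actual pi_mpi.c is in the repository):

/* Minimal MPI Monte Carlo sketch (illustrative, not the repository code). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long N = 100000000L;                  /* total number of points (assumption) */
    const long n_local = N / size;              /* each rank draws its share */
    long inside = 0, total = 0;

    unsigned int seed = 42u + (unsigned int)rank;   /* independent stream per rank */
    for (long i = 0; i < n_local; ++i) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0)
            ++inside;
    }

    /* Sum the per-rank counts onto rank 0. */
    MPI_Reduce(&inside, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~ %.6f\n", 4.0 * (double)total / (double)(n_local * size));

    MPI_Finalize();
    return 0;
}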
GPU Implementation
- Version 1: a single device (i.e. one GPU card), reduction on the host (the CPU); sketched below.
- C++ version
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu.cu # One may prefer the Makefile from the CUDA examples (very slightly faster)
time ./a.out # typ. 6.32 seconds: 1D grid, 1 K2200 GPU card (out of 2 available), asking for 10 000 blocks
- Python version
python pi_gpu_cuda.py # typ. 0.15 seconds: 1D grid, 1 K2200 GPU card (out of 2 available), asking for 1024 blocks, but giving a wrong result...
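In version 1 the device only produces raw hit counts: each CUDA thread counts its own hits and writes them to global memory, and the whole reduction happens on the CPU after the copy back. A minimal sketch of that scheme (illustrative; the grid shape, the workload per thread and the use of cuRAND are assumptions, the actual code is pi_gpu.cu):

/* Minimal CUDA sketch of version 1: per-thread counts, reduction on the host. */
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void count_hits(unsigned long long *counts, long points_per_thread,
                           unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);          /* independent RNG stream per thread */

    unsigned long long inside = 0;
    for (long i = 0; i < points_per_thread; ++i) {
        double x = curand_uniform_double(&state);
        double y = curand_uniform_double(&state);
        if (x * x + y * y <= 1.0)
            ++inside;
    }
    counts[tid] = inside;                       /* no reduction on the device */
}

int main()
{
    const int blocks = 10000, threads = 128;    /* 1D grid shape (assumption) */
    const long points_per_thread = 1000;        /* workload per thread (assumption) */
    const int n = blocks * threads;

    unsigned long long *d_counts, *h_counts = new unsigned long long[n];
    cudaMalloc(&d_counts, n * sizeof(unsigned long long));

    count_hits<<<blocks, threads>>>(d_counts, points_per_thread, 42ULL);
    cudaMemcpy(h_counts, d_counts, n * sizeof(unsigned long long), cudaMemcpyDeviceToHost);

    unsigned long long inside = 0;              /* final reduction on the CPU */
    for (int i = 0; i < n; ++i)
        inside += h_counts[i];

    printf("pi ~ %.6f\n", 4.0 * (double)inside / ((double)n * (double)points_per_thread));

    cudaFree(d_counts);
    delete[] h_counts;
    return 0;
}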
- Version 2: a single device, per-block reduction with several threads per block, then final reduction on the host (sketched below).
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu_v2.cu
time ./a.out # typ. 0.231 seconds: 1D block grid (N=32), 1024 threads per block. Time is now mostly spent in initialization!
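Version 2 moves the first reduction stage onto the device: the threads of a block accumulate their hit counts in shared memory and reduce them to a single value per block, so the host only has to sum one number per block. A sketch of such a per-block reduction kernel, reusing the includes and RNG of the version-1 sketch (illustrative; the actual kernel is in pi_gpu_v2.cu):

/* Sketch of a version-2 style kernel: per-block reduction in shared memory,
   one partial count per block, final (short) sum on the host. */
__global__ void count_hits_v2(unsigned long long *block_counts, long points_per_thread,
                              unsigned long long seed)
{
    extern __shared__ unsigned long long sdata[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    curandState state;
    curand_init(seed, tid, 0, &state);

    unsigned long long inside = 0;
    for (long i = 0; i < points_per_thread; ++i) {
        double x = curand_uniform_double(&state);
        double y = curand_uniform_double(&state);
        if (x * x + y * y <= 1.0)
            ++inside;
    }

    sdata[threadIdx.x] = inside;
    __syncthreads();

    /* Tree reduction within the block: blockDim.x must be a power of two (e.g. 1024). */
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_counts[blockIdx.x] = sdata[0];    /* one value per block for the host */
}
/* Launched as e.g.: count_hits_v2<<<32, 1024, 1024 * sizeof(unsigned long long)>>>(...); */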
- Version 3: a single device, but using Thrust (sketched below)
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_thrust.exe pi_gpu_thrust.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=1 # Use GPU #1, just for a change...
time ./pi_gpu_thrust.exe # Now with ERR=1e-5 (should be about 100 times longer...)
# typ. 5.5 seconds
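With Thrust the sampling and the reduction are expressed as a single transform_reduce over a counting iterator, and Thrust generates the kernel and the device-side reduction itself. A minimal sketch of that pattern (illustrative; the sample count is an assumption, the actual code is pi_gpu_thrust.cu):

/* Minimal Thrust sketch: transform_reduce does sampling and reduction on the device. */
#include <cstdio>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/random.h>

struct hit_counter
{
    __host__ __device__ unsigned long long operator()(unsigned long long i) const
    {
        thrust::default_random_engine rng(42);
        thrust::uniform_real_distribution<double> uni(0.0, 1.0);
        rng.discard(2 * i);                     /* jump to this sample's sub-sequence */
        double x = uni(rng), y = uni(rng);
        return (x * x + y * y <= 1.0) ? 1ULL : 0ULL;
    }
};

int main()
{
    const unsigned long long N = 100000000ULL;  /* number of samples (assumption) */
    unsigned long long inside = thrust::transform_reduce(
        thrust::device,
        thrust::counting_iterator<unsigned long long>(0),
        thrust::counting_iterator<unsigned long long>(N),
        hit_counter(), 0ULL, thrust::plus<unsigned long long>());

    printf("pi ~ %.6f\n", 4.0 * (double)inside / (double)N);
    return 0;
}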
- Version 4: all GPU devices (manual reduction); sketched below
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_multiGPU.exe pi_gpu_multiGPU.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=0,1
time ./pi_gpu_multiGPU.exe # typ. 4.4 seconds, still with ERR=1e-5
nvvp ./pi_gpu_multiGPU.exe # NVIDIA profiling UI: one can check that the CUDA calls are asynchronous and that the CUDA devices run in parallel (they do!)
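Version 4 simply loops over the devices: the work is split, one kernel is launched per device after cudaSetDevice, and the per-device partial counts are reduced by hand on the host. Because kernel launches are asynchronous, the GPUs compute in parallel, which is what the nvvp check above confirms. A sketch of that driver, reusing the count_hits_v2 kernel from the version-2 sketch (illustrative; grid shape and workload are assumptions, the actual code is pi_gpu_multiGPU.cu):

/* Sketch of the multi-GPU driver: one asynchronous launch per device,
   manual reduction of the per-device partial counts on the host.
   Assumes the count_hits_v2 kernel from the version-2 sketch above. */
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const int blocks = 32, threads = 1024;      /* grid shape (assumption) */
    const long points_per_thread = 100000;      /* workload per thread (assumption) */

    int n_dev = 0;
    cudaGetDeviceCount(&n_dev);                 /* honours CUDA_VISIBLE_DEVICES */

    std::vector<unsigned long long*> d_counts(n_dev);
    for (int d = 0; d < n_dev; ++d) {
        cudaSetDevice(d);                       /* subsequent calls target device d */
        cudaMalloc(&d_counts[d], blocks * sizeof(unsigned long long));
        /* Kernel launches are asynchronous: device d starts working while the
           loop moves on to device d+1, so the GPUs run in parallel. */
        count_hits_v2<<<blocks, threads, threads * sizeof(unsigned long long)>>>(
            d_counts[d], points_per_thread, 42ULL + d);
    }

    unsigned long long inside = 0;
    std::vector<unsigned long long> h_counts(blocks);
    for (int d = 0; d < n_dev; ++d) {
        cudaSetDevice(d);
        /* cudaMemcpy waits for the kernel on this device, then copies its counts. */
        cudaMemcpy(h_counts.data(), d_counts[d], blocks * sizeof(unsigned long long),
                   cudaMemcpyDeviceToHost);
        for (int b = 0; b < blocks; ++b)
            inside += h_counts[b];
        cudaFree(d_counts[d]);
    }

    double total = (double)n_dev * blocks * threads * (double)points_per_thread;
    printf("pi ~ %.6f\n", 4.0 * (double)inside / total);
    return 0;
}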