# Testing the machines at our disposal #

## Test based on a Monte Carlo integration to estimate Pi ##

A basic piece of code meant to highlight the basic principles of using these machines.

### Where to find the code? ###

https://gitlab.in2p3.fr/lpnhe/HPC/tree/master/gpu/pi-test

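For reference, a minimal sketch of the kind of loop pi_onecore.c implements: draw random points in the unit square and count how many fall inside the quarter circle. The actual file in the repository may differ; the sample count and seed below are arbitrary choices.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 100000000L;  /* number of random points (arbitrary choice) */
    long hits = 0;

    srand(42);
    for (long i = 0; i < n; i++) {
        double x = (double)rand() / RAND_MAX;  /* point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)              /* inside the quarter circle? */
            hits++;
    }
    /* The hit ratio estimates pi/4, the area of the quarter circle */
    printf("pi ~ %.8f\n", 4.0 * (double)hits / (double)n);
    return 0;
}
```
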
### Single core

* With GCC

```bash
gcc -O0 -lm pi_onecore.c  # No optimization (the gcc default)
time ./a.out              # typ. 38 seconds
```

```bash
gcc -O2 -lm pi_onecore.c  # Standard good optimization (-O3 is not better)
time ./a.out              # typ. 31 seconds
```

* With Intel (see the [LPNHE site](http://lpnhe.in2p3.fr/spip.php?article1116) to get access to it)

```bash
icc -O0 pi_onecore.c  # No optimization
time ./a.out          # typ. 38 seconds
```

```bash
icc -O2 pi_onecore.c  # Standard good optimization
time ./a.out          # typ. 31 seconds
```

```bash
icc -O2 -parallel pi_onecore.c  # Optimization plus automatic parallelization by the compiler
time ./a.out                    # typ. 31 seconds: the automatic parallelization has no effect here!
```

```bash
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 pi_onecore.c
time ./a.out  # typ. 30.7 seconds
```

```bash
icc -m64 -O3 -Wall -fPIC -ipo -xavx pi_onecore.c  # Works on lpnp110, not on lpnws5232
time ./a.out                                      # typ. 29 seconds
```

* With Python

```bash
python pi_onecore.py  # typ. 976 seconds (yes, Python is slow!)
```

With Numba optimization (automatic generation of optimized machine code using LLVM):

```bash
python pi_onecore_numba.py  # typ. 35.4 seconds
```

### OpenMP

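pi_omp.c parallelizes the sampling loop of the serial version with OpenMP. A minimal sketch of the pattern is given below; the per-thread seeding with rand_r is an illustrative assumption, the real file may organize this differently.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long n = 100000000L;  /* total number of samples (arbitrary choice) */
    long hits = 0;

    #pragma omp parallel
    {
        /* Independent random stream per thread */
        unsigned int seed = 1234u + omp_get_thread_num();

        /* The reduction clause gives each thread a private counter
           and sums them at the end of the loop */
        #pragma omp for reduction(+:hits)
        for (long i = 0; i < n; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
    }
    printf("pi ~ %.8f\n", 4.0 * (double)hits / (double)n);
    return 0;
}
```
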
* With GCC

```bash
gcc -O2 -lm -fopenmp pi_omp.c  # OpenMP implementation
export OMP_NUM_THREADS=4       # number of CPU cores on the GPU machine; set to 16 on p110
time ./a.out                   # typ. 7.7 seconds
```

* With Intel

```bash
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 -restrict -fargument-noalias-global -qopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=4
time ./a.out  # typ. 9.3 seconds (surprisingly, slower than GCC!)
```

* On the Phi machine (host side)

```bash
gcc -O3 -lm -fopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=16
time ./a.out  # typ. 1.298 seconds, the best result so far on a local LPNHE machine
```

* On the Phi machine (Phi device)

```bash
# cf. the Phi compilation and execution instructions in the README of the phi folder of the HPC project
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=239
time ./a.out  # typ. 3.96 seconds (no changes needed to the code, it works out of the box)
```

### MPI

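pi_mpi.c distributes the samples over the MPI ranks and combines the partial counts with MPI_Reduce. A minimal sketch of that pattern follows; the sample count and the seeding scheme are illustrative assumptions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const long n = 100000000L;  /* total number of samples (arbitrary choice) */
    long local_hits = 0, total_hits = 0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    unsigned int seed = 1234u + rank;        /* independent stream per rank */
    for (long i = rank; i < n; i += size) {  /* each rank handles ~n/size points */
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0)
            local_hits++;
    }

    /* Sum the partial counts on rank 0 */
    MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~ %.8f\n", 4.0 * (double)total_hits / (double)n);

    MPI_Finalize();
    return 0;
}
```
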
* With GCC

```bash
mpicc -O2 pi_mpi.c         # MPI with gcc (see mpicc -showme for the underlying flags)
time mpirun -np 4 ./a.out  # typ. 9.65 seconds
```

* With Intel (to override the wrapper's compiler and options, see [here](https://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0))

```bash
export OMPI_CC=icc
mpicc -m64 -O3 -Wall -fPIC -msse4.2 -restrict -fargument-noalias-global pi_mpi.c
time mpirun -np 4 ./a.out  # typ. 9.95 seconds
```

### GPU Implementation

* Version 1: a single device (i.e. a single GPU card), with the reduction done on the host (the CPU). A sketch of the kernel pattern is given after the two variants below.

- C++ version

```bash
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu.cu  # One may prefer the Makefile from the CUDA examples (very slightly faster)
time ./a.out        # typ. 6.32 seconds with a 1D grid, 1 K2200 GPU card (out of the 2 available), asking for 10 000 blocks
```

- Python version

```bash
python pi_gpu_cuda.py  # typ. 0.15 seconds with a 1D grid, 1 K2200 GPU card (out of the 2 available), asking for 1024 blocks, but it gives a wrong result...
```

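A minimal sketch of the version-1 pattern: every GPU thread accumulates its own hit count and writes it to global memory, and the host does the whole reduction. The grid size, samples per thread and use of cuRAND are illustrative assumptions; the actual pi_gpu.cu is in the repository.

```c++
#include <cstdio>
#include <cstdlib>
#include <curand_kernel.h>

#define SAMPLES_PER_THREAD 1000

// Each thread draws SAMPLES_PER_THREAD points and stores its own hit count;
// nothing is reduced on the device in this version.
__global__ void pi_kernel(long long *hits, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);  // independent random stream per thread

    long long count = 0;
    for (int i = 0; i < SAMPLES_PER_THREAD; ++i) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f) count++;
    }
    hits[tid] = count;
}

int main(void)
{
    const int blocks = 1024, threads = 256;  // arbitrary grid (the run above asks for 10 000 blocks)
    const int n_threads = blocks * threads;

    long long *d_hits;
    long long *h_hits = (long long *)malloc(n_threads * sizeof(long long));
    cudaMalloc((void **)&d_hits, n_threads * sizeof(long long));

    pi_kernel<<<blocks, threads>>>(d_hits, 1234ULL);
    cudaMemcpy(h_hits, d_hits, n_threads * sizeof(long long), cudaMemcpyDeviceToHost);

    long long total = 0;
    for (int i = 0; i < n_threads; ++i) total += h_hits[i];  // reduction on the host

    double n_samples = (double)n_threads * SAMPLES_PER_THREAD;
    printf("pi ~ %.8f\n", 4.0 * total / n_samples);

    cudaFree(d_hits);
    free(h_hits);
    return 0;
}
```
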
* Version 2: a single device, with a per-block reduction (several threads per block) followed by a final reduction on the host. A sketch of this kernel is given after the commands below.

```bash
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu_v2.cu
time ./a.out  # typ. 0.231 seconds: 1D block grid (N=32), 1024 threads per block. The time is now mostly spent in initialization!
```

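A minimal sketch of the version-2 pattern: each block reduces its threads' counts in shared memory, so the host only has to add one partial sum per block. The grid size and the cuRAND-based sampling are illustrative assumptions.

```c++
#include <cstdio>
#include <curand_kernel.h>

#define SAMPLES_PER_THREAD 1000

// Per-thread counts are reduced inside each block in shared memory;
// only one partial sum per block is left for the host to add up.
__global__ void pi_kernel_v2(long long *block_hits, unsigned long long seed)
{
    extern __shared__ long long sdata[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    curandState state;
    curand_init(seed, gid, 0, &state);

    long long count = 0;
    for (int i = 0; i < SAMPLES_PER_THREAD; ++i) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f) count++;
    }
    sdata[tid] = count;
    __syncthreads();

    // Classic tree reduction in shared memory (blockDim.x must be a power of 2)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_hits[blockIdx.x] = sdata[0];
}

int main(void)
{
    const int blocks = 32, threads = 1024;  // matches the N=32 blocks of 1024 threads quoted above
    long long h_hits[32], *d_hits;
    cudaMalloc((void **)&d_hits, blocks * sizeof(long long));

    pi_kernel_v2<<<blocks, threads, threads * sizeof(long long)>>>(d_hits, 1234ULL);
    cudaMemcpy(h_hits, d_hits, blocks * sizeof(long long), cudaMemcpyDeviceToHost);

    long long total = 0;
    for (int i = 0; i < blocks; ++i) total += h_hits[i];  // tiny final reduction on the host
    printf("pi ~ %.8f\n", 4.0 * total / ((double)blocks * threads * SAMPLES_PER_THREAD));

    cudaFree(d_hits);
    return 0;
}
```
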
* Version 3: a single device, but using Thrust. A sketch of the Thrust approach is given after the commands below.

```bash
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_thrust.exe pi_gpu_thrust.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=1  # Use GPU #1, just for a change...
time ./pi_gpu_thrust.exe       # Now ERR=1e-5 (the run should therefore be about 100 times longer...)
                               # typ. 5.5 seconds
```

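A minimal sketch of the Thrust approach: thrust::transform_reduce over a counting iterator maps every trial index to 0 or 1 on the device and sums the results in a single call. The per-index RNG seeding below is an illustrative assumption.

```c++
#include <cstdio>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/random.h>

// One trial per index: jump the RNG to this trial's subsequence, draw a point,
// return 1 if it falls inside the quarter circle.
struct inside_circle
{
    __host__ __device__ long operator()(long i) const
    {
        thrust::default_random_engine rng(1234);
        thrust::uniform_real_distribution<float> u(0.0f, 1.0f);
        rng.discard(2 * i);
        float x = u(rng), y = u(rng);
        return (x * x + y * y <= 1.0f) ? 1L : 0L;
    }
};

int main(void)
{
    const long n = 100000000L;  // number of trials (arbitrary choice)
    // transform_reduce maps every index through the functor and sums the results on the device
    long hits = thrust::transform_reduce(thrust::counting_iterator<long>(0),
                                         thrust::counting_iterator<long>(n),
                                         inside_circle(), 0L, thrust::plus<long>());
    printf("pi ~ %.8f\n", 4.0 * (double)hits / (double)n);
    return 0;
}
```
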
* Version 4: all the GPU devices (reduction done by hand). A sketch of the multi-GPU pattern is given after the commands below.

```bash
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_multiGPU.exe pi_gpu_multiGPU.cu --ptxas-options -v
export CUDA_VISIBLE_DEVICES=0,1
time ./pi_gpu_multiGPU.exe  # typ. 4.4 seconds, still with ERR=1e-5
nvvp ./pi_gpu_multiGPU.exe  # NVIDIA profiling UI: check that the CUDA calls are asynchronous and that the devices run in parallel (they do!)
```

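A minimal sketch of the multi-GPU pattern: one kernel launch per device (kernel launches are asynchronous with respect to the host, so the devices work in parallel), then the per-device totals are summed by hand. The grid size and the atomicAdd-based kernel are illustrative assumptions.

```c++
#include <cstdio>
#include <curand_kernel.h>

#define SAMPLES_PER_THREAD 1000
#define MAX_DEVICES 16

// Each thread adds its hit count to a single per-device counter.
__global__ void pi_kernel(unsigned long long *hits, unsigned long long seed)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, gid, 0, &state);

    unsigned long long count = 0;
    for (int i = 0; i < SAMPLES_PER_THREAD; ++i) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f) count++;
    }
    atomicAdd(hits, count);
}

int main(void)
{
    int n_dev = 0;
    cudaGetDeviceCount(&n_dev);              // e.g. 2 with CUDA_VISIBLE_DEVICES=0,1
    if (n_dev > MAX_DEVICES) n_dev = MAX_DEVICES;

    const int blocks = 1024, threads = 256;  // arbitrary grid per device
    unsigned long long *d_hits[MAX_DEVICES];

    // Issue all the launches first: they return immediately, so the devices overlap
    for (int d = 0; d < n_dev; ++d) {
        cudaSetDevice(d);
        cudaMalloc((void **)&d_hits[d], sizeof(unsigned long long));
        cudaMemset(d_hits[d], 0, sizeof(unsigned long long));
        pi_kernel<<<blocks, threads>>>(d_hits[d], 1234ULL + d);
    }

    // Manual reduction: wait for each device and add its partial count
    unsigned long long total = 0;
    for (int d = 0; d < n_dev; ++d) {
        unsigned long long partial = 0;
        cudaSetDevice(d);
        cudaMemcpy(&partial, d_hits[d], sizeof partial, cudaMemcpyDeviceToHost);  // implicit sync
        total += partial;
        cudaFree(d_hits[d]);
    }

    double n_samples = (double)n_dev * blocks * threads * SAMPLES_PER_THREAD;
    printf("pi ~ %.8f\n", 4.0 * total / n_samples);
    return 0;
}
```
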