# Testing available machines

*The layout of this page needs some clarification...*
## Test based on a Monte Carlo integration to estimate Pi

A basic program that highlights the basic principles of using these machines.

### Where to find the code?
https://gitlab.in2p3.fr/lpnhe/HPC/tree/master/gpu/pi-test

### One core

* With GCC
```bash
gcc -O0 -lm pi_onecore.c            # Default optimization (none)
time ./a.out                        # typ. 38 seconds
gcc -O2 -lm pi_onecore.c            # Standard good optimization (-O3 is not better)
time ./a.out                        # typ. 31 seconds
```

* With Intel (see the [lpnhe website](http://lpnhe.in2p3.fr/spip.php?article1116) for access)
```bash
icc -O0 pi_onecore.c                                # No optimization
time ./a.out                                        # typ. 38 seconds
icc -m64 -O3 -Wall -fPIC -ipo -xavx pi_onecore.c    # Works on lpnp110, not on lpnws
time ./a.out                                        # typ. 29 seconds
```

* With Python
```bash
python pi_onecore.py                # typ. 976 seconds (yes, Python is slow!)
```

With Numba optimization (automatic generation of optimized machine code using LLVM):
```bash
python pi_onecore_numba.py          # typ. 35.4 seconds
```
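
As an illustration of what Numba does (this sketch is not the actual `pi_onecore_numba.py` from the repository): decorating the hot loop with `@njit` makes Numba compile it to machine code through LLVM on the first call, with the Python source otherwise unchanged. The `ImportError` fallback is only there so the sketch also runs where Numba is not installed.

```python
import random

try:
    from numba import njit          # JIT-compile the decorated function via LLVM
except ImportError:                 # fallback: run as plain (slow) Python
    def njit(func):
        return func

@njit
def pi_estimate(n):
    inside = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:    # point falls inside the quarter disc
            inside += 1
    return 4.0 * inside / n

print("pi ~", pi_estimate(1_000_000))
```

The first call pays the compilation cost; subsequent calls run at near-C speed, which is consistent with the ~28x speedup reported above.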

### OpenMP

* With GCC
```bash
gcc -O2 -lm -fopenmp pi_omp.c       # OpenMP implementation
export OMP_NUM_THREADS=4            # number of CPU cores on the gpu machine; set it to 16 on p110
time ./a.out                        # typ. 7.7 seconds
```

* With Intel
```bash
icc -m64 -O3 -Wall -fPIC -ipo -msse4.2 -restrict -fargument-noalias-global -qopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=4
time ./a.out                        # typ. 9.3 seconds ( !!!! )
```

* On the Phi machine (host side)
```bash
gcc -O3 -lm -fopenmp pi_omp.c
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=16
time ./a.out                        # typ. 1.298 seconds, best result so far on a local LPNHE machine
```

* On the Phi machine (Phi device)
```bash
# cf. compilation and execution on the Phi: README.md in the phi folder of the HPC project
export KMP_AFFINITY=physical,0
export OMP_NUM_THREADS=239
time ./a.out                        # typ. 3.96 seconds (works out of the box, no code changes needed)
```

### MPI

* With GCC
```bash
mpicc -O2 pi_mpi.c                  # MPI with gcc (see mpicc -showme for the underlying flags)
time mpirun -np 4 ./a.out           # typ. 9.65 seconds
```

* With Intel (to override the wrapper compiler options, see [here](https://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0))
```bash
export OMPI_CC=icc                  # make the Open MPI wrapper use icc
mpicc -O2 pi_mpi.c
time mpirun -np 4 ./a.out           # typ. 9.95 seconds
```

### GPU Implementation

* Version 1: a single device (i.e. a single GPU card), with the reduction done on the host (the CPU).

- C++ version
```bash
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu.cu                  # One may prefer to use the Makefile from CUDA
time ./a.out                        # typ. 6.32 seconds: 1D grid, 1 K2200 GPU card (out of 2 available), asking for 10 000 blocks
```

- Python version
```bash
python pi_gpu_cuda.py               # typ. 0.15 seconds: 1D grid, 1 K2200 GPU card (out of 2 available), asking for 1024 blocks, but giving a wrong result...
```

* Version 2: a single device; per-block reduction with many threads per block, then a final reduction on the host.
```bash
. /usr/local/bin/cuda-setup.sh
nvcc -O3 pi_gpu_v2.cu
time ./a.out                        # typ. 0.231 seconds: 1D block grid (N=32), 1024 threads per block. Time is now mostly spent in initialization!
```

* Version 3: a single device, using Thrust.
```bash
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_thrust.exe pi_gpu_thrust.cu --ptxas-options -v
time ./pi_gpu_thrust.exe            # Now ERR=1e-5 (should be 100 times longer...)
                                    # typ. 5.5 seconds
```

* Version 4: all GPU devices (reduction done by hand).
```bash
nvcc -O3 -Xcompiler "-O3" -gencode arch=compute_50,code=sm_50 -o pi_gpu_multiGPU.exe pi_gpu_multiGPU.cu --ptxas-options -v
```