mpi cuda-aware code
A CUDA-aware MPI implementation must handle buffers differently depending on whether it resides in host or device memory. An MPI implementation could offer different APIs for host and device buffers, or it could add an additional argument indicating where the passed buffer lives. Fortunately, neither of these approaches is necessary because of the Unified Virtual Addressing (UVA) feature introduced in CUDA 4.0 (on Compute Capability 2.0 and later GPUs). With UVA the host memory and the memory of all GPUs in a system (a single node) are combined into one large (virtual) address space.