Control Your GPUs

The NVIDIA System Management Interface (nvidia-smi) is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.

This utility allows administrators to query GPU device state and, with the appropriate privileges, to modify it. It is targeted at the Tesla™, GRID™, Quadro™ and Titan X products, though limited support is also available on other NVIDIA GPUs.

nvidia-smi ships with the NVIDIA GPU display drivers on Linux. It can report query information as XML or human-readable plain text, to either standard output or a file. For more details, please refer to the nvidia-smi documentation.
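
For illustration, the full report can be printed as XML or written to a file with the -q, -x and -f options (the file name below is arbitrary):

$ nvidia-smi -q -x                   # full report as XML on standard output
$ nvidia-smi -q -f nvidia-smi.log    # full report written to a plain-text file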

  • Example Output
  • Querying GPU Status
    • List all available NVIDIA devices
    • List certain details about each GPU
    • Monitor overall GPU usage with 1-second update intervals
    • Monitor per-process GPU usage with 1-second update intervals
  • Monitoring and Managing GPU Boost
  • Reviewing System/GPU Topology and NVLink with nvidia-smi
  • Printing all GPU Details

Example Output

# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:18:00.0 Off |                    0 |
| N/A   41C    P0    57W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   37C    P0    53W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P0    57W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   42C    P0    57W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Querying GPU Status

The Tesla V100 cards shown above are NVIDIA's high-performance compute GPUs and expose a good deal of health and status information.

List all available NVIDIA devices

$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-5a80af23-787c-cbcb-92de-c80574883c5d)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-233f07d9-5e4c-9309-bf20-3ae74f0495b4)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-a1a1cbc1-8747-d8cd-9028-3e2db40deb04)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-8d5f775d-70d9-62b2-b46c-97d30eea732f)
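
The index, UUID or PCI bus ID reported here can be passed to the -i option to restrict any other query to a single card, for example (the MEMORY section is just one possible choice):

$ nvidia-smi -i 0 -q -d MEMORY                                          # select by index
$ nvidia-smi -i GPU-5a80af23-787c-cbcb-92de-c80574883c5d -q -d MEMORY   # select by UUID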

List certain details about each GPU

$ nvidia-smi --query-gpu=index,name,uuid,serial --format=csv
index, name, uuid, serial
0, Tesla V100-SXM2-32GB, GPU-5a80af23-787c-cbcb-92de-c80574883c5d, 1562720002969
1, Tesla V100-SXM2-32GB, GPU-233f07d9-5e4c-9309-bf20-3ae74f0495b4, 1562520023800
2, Tesla V100-SXM2-32GB, GPU-a1a1cbc1-8747-d8cd-9028-3e2db40deb04, 1562420015554
3, Tesla V100-SXM2-32GB, GPU-8d5f775d-70d9-62b2-b46c-97d30eea732f, 1562520023100
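
The same query interface is convenient for lightweight logging; a sketch, with field names taken from nvidia-smi --help-query-gpu and an arbitrary 5-second refresh interval:

$ nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
             --format=csv -l 5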

Monitor overall GPU usage with 1-second update intervals

$ nvidia-smi dmon
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    57    42    39     0     0     0     0   877  1290
    1    54    38    38     0     0     0     0   877  1290
    2    57    38    38     0     0     0     0   877  1290
    3    57    43    41     0     0     0     0   877  1290
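
dmon accepts a few options to tune the sampling; the values below are only an illustration (nvidia-smi dmon -h lists what your driver supports):

$ nvidia-smi dmon -d 5 -c 12 -o T    # sample every 5 s, stop after 12 samples, prefix each line with the time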

Monitor per-process GPU usage with 1-second update intervals

$ nvidia-smi pmon
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0      14835     C    45    15     0     0   python         
    1      14945     C    64    50     0     0   python    
    2          -     -     -     -     -     -   -
    3          -     -     -     -     -     -   -

In this case, two different python processes are running, one on each of the first two GPUs; only 2 of the 4 GPUs are in use.
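
pmon takes similar options; again only a sketch (see nvidia-smi pmon -h for the exact set):

$ nvidia-smi pmon -d 5 -c 12 -s um   # per-process utilization and memory usage, every 5 s, 12 samples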

Monitoring and Managing GPU Boost

The GPU Boost feature, which NVIDIA has included with more recent GPUs, allows the GPU clocks to vary depending upon load (achieving maximum performance so long as power and thermal headroom are available). However, the amount of available headroom varies by application (and even by input file!), so users should keep an eye on the status of their GPUs. A listing of the available clock speeds can be shown for each V100 GPU on Turing:

$ nvidia-smi -q -d SUPPORTED_CLOCKS
==============NVSMI LOG==============
Timestamp                                 : Mon Nov 23 18:48:39 2020
Driver Version                            : 450.51.06
CUDA Version                              : 11.0

Attached GPUs                             : 4
GPU 00000000:18:00.0
    Supported Clocks
        Memory                            : 877 MHz
            Graphics                      : 1530 MHz
            Graphics                      : 1522 MHz
            Graphics                      : 1515 MHz
            Graphics                      : 1507 MHz
            [...180 additional clock speeds omitted...]
            Graphics                      : 150 MHz
            Graphics                      : 142 MHz
            Graphics                      : 135 MHz

As shown, the Tesla V100 GPU supports 187 different clock speeds (from 135 MHz to 1530 MHz). However, only one memory clock speed is supported (877 MHz). Some GPUs support two different memory clock speeds (one high speed and one power-saving speed). Typically, such GPUs only support a single GPU clock speed when the memory is in the power-saving speed (which is the idle GPU state). On all recent Tesla and Quadro GPUs, GPU Boost automatically manages these speeds and runs the clocks as fast as possible (within the thermal/power limits and any limits set by the administrator).
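
Administrators can pin the application clocks to one of the supported <memory,graphics> pairs listed above, or reset them to the defaults; a minimal sketch (typically requires root, and exact behaviour depends on the driver and GPU model):

$ sudo nvidia-smi -ac 877,1530    # set application clocks to memory 877 MHz, graphics 1530 MHz
$ sudo nvidia-smi -rac            # reset application clocks to the default values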

To review the current GPU clock speed, the default clock speed, and the maximum possible clock speed (output shown for the first GPU only), run:

$ nvidia-smi -q -d CLOCK
==============NVSMI LOG==============
Timestamp                                 : Mon Nov 23 18:56:48 2020
Driver Version                            : 450.51.06
CUDA Version                              : 11.0

Attached GPUs                             : 4
GPU 00000000:18:00.0
    Clocks
        Graphics                          : 1290 MHz
        SM                                : 1290 MHz
        Memory                            : 877 MHz
        Video                             : 1170 MHz
    Applications Clocks
        Graphics                          : 1290 MHz
        Memory                            : 877 MHz
    Default Applications Clocks
        Graphics                          : 1290 MHz
        Memory                            : 877 MHz
    Max Clocks
        Graphics                          : 1530 MHz
        SM                                : 1530 MHz
        Memory                            : 877 MHz
        Video                             : 1372 MHz
    Max Customer Boost Clocks
        Graphics                          : 1530 MHz
    SM Clock Samples
        Duration                          : 0.01 sec
        Number of Samples                 : 4
        Max                               : 1290 MHz
        Min                               : 135 MHz
        Avg                               : 870 MHz
    Memory Clock Samples
        Duration                          : 0.01 sec
        Number of Samples                 : 4
        Max                               : 877 MHz
        Min                               : 877 MHz
        Avg                               : 877 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
...

Ideally, you’d like all clocks to be running at the highest speed all the time. However, this will not be possible for all applications. To review the current state of each GPU and any reasons for clock slowdowns, use the PERFORMANCE flag:

$ nvidia-smi -q -d PERFORMANCE

Attached GPUs                             : 4
GPU 00000000:18:00.0
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
...
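
The same throttle reasons can be polled in CSV form, which is handier for scripting; a sketch assuming the clocks_throttle_reasons.* query fields of this driver generation:

$ nvidia-smi --query-gpu=index,pstate,clocks_throttle_reasons.active,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown \
             --format=csv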

Reviewing System/GPU Topology and NVLink with nvidia-smi

To properly take advantage of more advanced NVIDIA GPU features (such as GPU Direct), it is vital that the system topology be properly configured. The topology refers to how the various system devices (GPUs, InfiniBand HCAs, storage controllers, etc.) connect to each other and to the system’s CPUs. Certain topology types will reduce performance or even cause certain features to be unavailable. To help tackle such questions, nvidia-smi supports system topology and connectivity queries:

$ nvidia-smi topo --matrix
	GPU0	GPU1	GPU2	GPU3	mlx5_0	mlx5_1	CPU Affinity	NUMA Affinity
GPU0	 X 	NV2	NV2	NV2	NODE	NODE	0,2,4,6,8,10	0
GPU1	NV2	 X 	NV2	NV2	NODE	NODE	0,2,4,6,8,10	0
GPU2	NV2	NV2	 X 	NV2	SYS	SYS	1,3,5,7,9,11	1
GPU3	NV2	NV2	NV2	 X 	SYS	SYS	1,3,5,7,9,11	1
mlx5_0	NODE	NODE	SYS	SYS	 X 	PIX
mlx5_1	NODE	NODE	SYS	SYS	PIX	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Reading this matrix takes some getting used to, but it can be very valuable. The configuration above shows 4 Tesla V100 GPUs and 2 Mellanox EDR InfiniBand HCAs (mlx5_0 and mlx5_1): GPUs 0 and 1 sit on NUMA node 0 together with both HCAs, while GPUs 2 and 3 sit on NUMA node 1, and every pair of GPUs is connected by two bonded NVLinks (NV2). The CPU Affinity column therefore recommends pinning jobs that use GPUs 0-1 to cores 0,2,4,6,8,10 and jobs that use GPUs 2-3 to cores 1,3,5,7,9,11 (although the optimal placement will vary by application).
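
One practical consequence: when launching a job on a given GPU, it is worth binding it to the CPU cores and memory of the matching NUMA node. A minimal sketch with numactl, where ./my_gpu_app is only a placeholder for your application:

$ CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 ./my_gpu_app   # GPU 0 lives on NUMA node 0
$ CUDA_VISIBLE_DEVICES=2 numactl --cpunodebind=1 --membind=1 ./my_gpu_app   # GPU 2 lives on NUMA node 1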

The NVLink connections themselves can also be queried to check their status, capabilities, and health. Readers are encouraged to consult the NVIDIA documentation to better understand the specifics.

$ nvidia-smi nvlink --status
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-5a80af23-787c-cbcb-92de-c80574883c5d)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
	 Link 2: 25.781 GB/s
	 Link 3: 25.781 GB/s
	 Link 4: 25.781 GB/s
	 Link 5: 25.781 GB/s
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-233f07d9-5e4c-9309-bf20-3ae74f0495b4)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
	 Link 2: 25.781 GB/s
	 Link 3: 25.781 GB/s
	 Link 4: 25.781 GB/s
	 Link 5: 25.781 GB/s
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-a1a1cbc1-8747-d8cd-9028-3e2db40deb04)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
	 Link 2: 25.781 GB/s
	 Link 3: 25.781 GB/s
	 Link 4: 25.781 GB/s
	 Link 5: 25.781 GB/s
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-8d5f775d-70d9-62b2-b46c-97d30eea732f)
	 Link 0: 25.781 GB/s
	 Link 1: 25.781 GB/s
	 Link 2: 25.781 GB/s
	 Link 3: 25.781 GB/s
	 Link 4: 25.781 GB/s
	 Link 5: 25.781 GB/s

Link capabilities can be listed in the same way:

$ nvidia-smi nvlink --capabilities

Printing all GPU Details

To list all available data on a particular GPU, specify the ID of the card with -i. The full report is very long, so it can also be restricted to specific sections with -d:

$ nvidia-smi -i 0 -q
$ nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE
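
The full query can also be repeated at a fixed interval with -l, for example every 10 seconds (interval chosen arbitrarily):

$ nvidia-smi -i 0 -q -d UTILIZATION -l 10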

source
