nvidia-smi Cheat Sheet


nvidia-smi (also known as NVSMI) monitors and manages NVIDIA’s Tesla, Quadro, GRID, and GeForce devices from the Fermi architecture family onward. Most features are supported on GeForce Titan series devices; only limited information is available for the rest of the GeForce line.

NVSMI is a cross-platform program that works with all NVIDIA-supported Linux distributions as well as 64-bit Windows versions beginning with Windows Server 2008 R2. Metrics can be read directly from stdout, or written to a file in CSV or XML format for use in scripts.

Much of the functionality of NVSMI is provided by the underlying C-based NVML library. The output of NVSMI is not guaranteed to be backwards compatible. However, both NVML and the Python bindings are backwards compatible.

On Linux, you can enable persistence mode on GPUs to keep the NVIDIA driver loaded even if no apps are using them. This is especially beneficial if you have a number of short jobs going at the same time. Persistence mode consumes a few more watts per idle GPU, but it avoids the lengthy pauses that occur when launching a GPU application.

It’s also required if you have set custom clock rates or power limits on the GPUs (those changes are lost when the NVIDIA driver is unloaded). Run nvidia-smi -pm 1 to enable persistence mode on all GPUs.
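
As a minimal sketch of scripting this step, the Python helper below builds the nvidia-smi -pm command line and can optionally restrict it to a single GPU with -i. The function names persistence_cmd and set_persistence are illustrative only and are not part of any NVIDIA tooling; actually running the command generally requires root privileges on Linux.

```python
import subprocess

def persistence_cmd(enable=True, gpu_index=None):
    """Build the nvidia-smi argument list for toggling persistence mode."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]       # limit the change to one GPU
    cmd += ["-pm", "1" if enable else "0"]  # 1 = enable, 0 = disable
    return cmd

def set_persistence(enable=True, gpu_index=None):
    """Run the command; raises CalledProcessError if nvidia-smi fails."""
    result = subprocess.run(persistence_cmd(enable, gpu_index),
                            check=True, capture_output=True, text=True)
    return result.stdout

# Show the command that would be run for all GPUs (nothing is executed here).
print(" ".join(persistence_cmd(True)))
```

Keeping the command construction separate from its execution makes it easy to log or review exactly what will be run before touching the GPUs.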

nvidia-smi is unable to configure persistence mode on Windows. Instead, you should use TCC mode on your computational GPUs. NVIDIA’s graphical GPU device administration panel should be used for this.

NVIDIA’s SMI utility works with nearly every NVIDIA GPU released since 2011. These include Tesla, Quadro, and GeForce devices from the Fermi and later architecture families (Kepler, Maxwell, Pascal, Volta, etc.).

Ampere: A100, RTX A6000, RTX A5000, RTX A4000.

Tesla: V100, S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100.

Quadro: 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series.

GeForce: Varying levels of support.

GPU Initialization & Info

root@server:~# nvidia-smi
Sat Feb 12 19:36:14 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:31:00.0 Off |                    0 |
| N/A   40C    P0    62W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   42C    P0    66W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

GPU Status Query

root@server:~# nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-84ccface-663f-f5fd-8e8e-109d0f78bd2f)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-e396c2fa-f6fd-57db-aea0-5aaf73ee6148)
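
For scripting, the listing above can be turned into structured records. The sketch below assumes each GPU line has the form shown in this output ("GPU N: name (UUID: GPU-...)"); the function name parse_gpu_list is hypothetical, and lines that do not match that shape are simply skipped.

```python
import re

# Matches lines such as: GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-84cc...)
LIST_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[0-9a-fA-F-]+)\)$")

def parse_gpu_list(text):
    """Return one dict per GPU line found in `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = LIST_LINE.match(line.strip())
        if match:
            gpus.append({
                "index": int(match.group(1)),
                "name": match.group(2),
                "uuid": match.group(3),
            })
    return gpus

# Sample copied from the output above.
sample = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-84ccface-663f-f5fd-8e8e-109d0f78bd2f)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-e396c2fa-f6fd-57db-aea0-5aaf73ee6148)
"""
for gpu in parse_gpu_list(sample):
    print(gpu["index"], gpu["name"], gpu["uuid"])
```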

GPU Details

root@server:~# nvidia-smi --query-gpu=index,name,uuid,serial --format=csv
index, name, uuid, serial
0, NVIDIA A100 80GB PCIe, GPU-84ccface-663f-f5fd-8e8e-109d0f78bd2f, 1324021047639
1, NVIDIA A100 80GB PCIe, GPU-e396c2fa-f6fd-57db-aea0-5aaf73ee6148, 1324021046251
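
Because --format=csv emits a header row followed by comma-separated values, the output can be read with Python’s standard csv module. This is only a sketch: parse_query_csv is a made-up helper name, and it assumes fields are separated by a comma and a space, as in the output above.

```python
import csv
import io

def parse_query_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into dicts."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]

# Sample copied from the output above.
sample = """\
index, name, uuid, serial
0, NVIDIA A100 80GB PCIe, GPU-84ccface-663f-f5fd-8e8e-109d0f78bd2f, 1324021047639
1, NVIDIA A100 80GB PCIe, GPU-e396c2fa-f6fd-57db-aea0-5aaf73ee6148, 1324021046251
"""
for row in parse_query_csv(sample):
    print(row["index"], row["name"], row["serial"])
```

If the header row or unit suffixes get in the way, --format=csv,noheader,nounits produces bare values instead; in that case csv.reader with an explicit list of column names would be the natural replacement for DictReader.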

Monitor GPU Usage

root@server:~# nvidia-smi dmon
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    62    40    54     0     0     0     0  1512  1410
    1    66    43    57     6     0     0     0  1512  1410
    0    62    40    55     0     0     0     0  1512  1410
    1    66    42    57     0     0     0     0  1512  1410
    0    62    40    54     0     0     0     0  1512  1410
    1    66    42    58     0     0     0     0  1512  1410
    0    62    40    55     0     0     0     0  1512  1410
    1    66    42    57     0     0     0     0  1512  1410
    0    62    40    54     0     0     0     0  1512  1410
    1    66    42    57     0     0     0     0  1512  1410
    0    62    40    54     0     0     0     0  1512  1410
    1    66    42    57     0     0     0     0  1512  1410
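
The dmon output is a whitespace-separated table whose column names come from the first "#" header line, so it is straightforward to parse. The following sketch (parse_dmon is a hypothetical name) assumes the layout shown above and treats a "-" cell as a missing value.

```python
def _to_number(cell):
    """Convert a dmon cell to int; '-' means the value is unavailable."""
    if cell == "-":
        return None
    try:
        return int(cell)
    except ValueError:
        return cell  # leave anything unexpected as text

def parse_dmon(text):
    """Parse `nvidia-smi dmon` output into a list of per-sample dicts."""
    columns, rows = None, []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            if columns is None:
                columns = line.lstrip("#").split()  # first header = names
            continue                                # skip units/repeat headers
        if columns is None:
            continue
        rows.append(dict(zip(columns, map(_to_number, line.split()))))
    return rows

# Sample copied from the output above.
sample = """\
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    62    40    54     0     0     0     0  1512  1410
    1    66    43    57     6     0     0     0  1512  1410
"""
for row in parse_dmon(sample):
    print(row["gpu"], row["pwr"], row["sm"])
```

By default dmon keeps printing until interrupted; its -c option stops after a given number of samples, which is usually what a script wants.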

Monitor GPU Processes

root@server:~# nvidia-smi pmon
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -              
    1          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    1          -     -     -     -     -     -   -              
    0          -     -     -     -     -     -   -              
    1          -     -     -     -     -     -   -

List of Available Clocks

root@server:~# nvidia-smi -q -d SUPPORTED_CLOCKS

==============NVSMI LOG==============

Timestamp                                 : Sat Feb 12 20:21:18 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 2
GPU 00000000:31:00.0
    Supported Clocks
        Memory                            : 1512 MHz
            Graphics                      : 1410 MHz
            Graphics                      : 1395 MHz
            Graphics                      : 1380 MHz
            Graphics                      : 1365 MHz
            Graphics                      : 1350 MHz
            Graphics                      : 1335 MHz
            Graphics                      : 1320 MHz
            Graphics                      : 1305 MHz
            Graphics                      : 1290 MHz
            Graphics                      : 1275 MHz
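
In this listing each Memory line is followed by the Graphics clocks that can be paired with it. A small sketch of extracting those pairs is shown below; parse_supported_clocks and highest_pair are illustrative names, and the code assumes it is given the output for a single GPU (for example by adding -i 0 to the command).

```python
import re

CLOCK_LINE = re.compile(r"\s*(Memory|Graphics)\s*:\s*(\d+) MHz")

def parse_supported_clocks(text):
    """Map each memory clock (MHz) to the graphics clocks listed under it."""
    clocks, current_mem = {}, None
    for line in text.splitlines():
        match = CLOCK_LINE.match(line)
        if not match:
            continue
        kind, mhz = match.group(1), int(match.group(2))
        if kind == "Memory":
            current_mem = mhz
            clocks.setdefault(mhz, [])
        elif current_mem is not None:
            clocks[current_mem].append(mhz)
    return clocks

def highest_pair(clocks):
    """Return (memory, graphics) for the highest memory clock."""
    mem = max(clocks)
    return mem, max(clocks[mem])

# Sample copied from the start of the listing above.
sample = """\
    Supported Clocks
        Memory                            : 1512 MHz
            Graphics                      : 1410 MHz
            Graphics                      : 1395 MHz
            Graphics                      : 1380 MHz
"""
print(highest_pair(parse_supported_clocks(sample)))
```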

Current GPU Clock Speed

root@server:~# nvidia-smi -q -d CLOCK

==============NVSMI LOG==============

Timestamp                                 : Sat Feb 12 20:23:25 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 2
GPU 00000000:31:00.0
    Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1275 MHz
    Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Default Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    SM Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Memory Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A

GPU Performance

root@server:~# nvidia-smi -q -d PERFORMANCE

==============NVSMI LOG==============

Timestamp                                 : Sat Feb 12 20:27:57 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 2
GPU 00000000:31:00.0
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active

GPU 00000000:CA:00.0
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
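
When checking many machines, it can help to reduce this report to just the throttle reasons that are currently active. The sketch below does that by collecting every key whose value is exactly "Active"; active_throttle_reasons is a hypothetical helper, it assumes the text comes from -q -d PERFORMANCE for a single GPU, and the sample is a case with an idle GPU rather than the output above.

```python
def active_throttle_reasons(text):
    """Return the names of throttle reasons reported as 'Active'."""
    active = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # "Not Active" does not match, so only active reasons are kept.
        if sep and value.strip() == "Active":
            active.append(key.strip())
    return active

# Illustrative sample: an idle GPU in performance state P8.
sample = """\
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
"""
print(active_throttle_reasons(sample))
```

An empty list means no throttling is in effect, as in the two-GPU output above where every reason reads "Not Active".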

GPU Topology

root@server:~# nvidia-smi topo --matrix
	GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	SYS	0-23,48-71	0
GPU1	SYS	 X 	24-47,72-95	1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
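
To act on the topology in a script (for example, to prefer GPU pairs joined by NVLink over pairs that cross NUMA nodes), the matrix can be read into a lookup table. This sketch assumes the matrix is tab-separated as in the output above and only looks at GPU-to-GPU cells; parse_topo_matrix is an illustrative name.

```python
def parse_topo_matrix(text):
    """Return {(row_gpu, col_gpu): link_type} from `nvidia-smi topo --matrix`."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return {}
    header = [cell.strip() for cell in lines[0].split("\t")]
    links = {}
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("\t")]
        row = cells[0]
        if not row.startswith("GPU"):
            continue  # legend text or non-GPU rows
        # header[0] is the empty corner cell, so columns start at index 1.
        for col, value in zip(header[1:], cells[1:]):
            if col.startswith("GPU") and col != row:
                links[(row, col)] = value
    return links

# Sample copied from the output above (tabs written as \t).
sample = (
    "\tGPU0\tGPU1\tCPU Affinity\tNUMA Affinity\n"
    "GPU0\t X \tSYS\t0-23,48-71\t0\n"
    "GPU1\tSYS\t X \t24-47,72-95\t1\n"
)
print(parse_topo_matrix(sample))
```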

NVLink Status

root@server:~# nvidia-smi nvlink --status
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-84ccface-663f-f5fd-8e8e-109d0f78bd2f)
	 Link 0: <inactive>
	 Link 1: <inactive>
	 Link 2: <inactive>
	 Link 3: <inactive>
	 Link 4: <inactive>
	 Link 5: <inactive>
	 Link 6: <inactive>
	 Link 7: <inactive>
	 Link 8: <inactive>
	 Link 9: <inactive>
	 Link 10: <inactive>
	 Link 11: <inactive>
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-e396c2fa-f6fd-57db-aea0-5aaf73ee6148)
	 Link 0: <inactive>
	 Link 1: <inactive>
	 Link 2: <inactive>
	 Link 3: <inactive>
	 Link 4: <inactive>
	 Link 5: <inactive>
	 Link 6: <inactive>
	 Link 7: <inactive>
	 Link 8: <inactive>
	 Link 9: <inactive>
	 Link 10: <inactive>
	 Link 11: <inactive>
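
A quick way to summarize this output is to count the links per GPU that are not reported as <inactive>. The sketch below assumes the layout shown above (a "GPU N:" line followed by indented "Link N:" lines) and treats any other link value as active; active_nvlinks is a hypothetical helper name.

```python
import re

def active_nvlinks(text):
    """Return {gpu_index: number of links not marked <inactive>}."""
    counts, current = {}, None
    for line in text.splitlines():
        gpu = re.match(r"GPU (\d+):", line)
        if gpu:
            current = int(gpu.group(1))
            counts[current] = 0
            continue
        link = re.match(r"\s*Link (\d+): (.+)", line)
        if link and current is not None:
            if link.group(2).strip() != "<inactive>":
                counts[current] += 1
    return counts

# Abbreviated sample based on the output above.
sample = (
    "GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-84ccface-663f-f5fd-8e8e-109d0f78bd2f)\n"
    "\t Link 0: <inactive>\n"
    "\t Link 1: <inactive>\n"
    "GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-e396c2fa-f6fd-57db-aea0-5aaf73ee6148)\n"
    "\t Link 0: <inactive>\n"
    "\t Link 1: <inactive>\n"
)
print(active_nvlinks(sample))
```

A count of zero for every GPU, as here, indicates that no NVLink bridge is in use and GPU-to-GPU traffic goes over PCIe.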

Display GPU Details

root@server:~# nvidia-smi -i 0 -q

==============NVSMI LOG==============

Timestamp                                 : Sat Feb 12 20:41:51 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 2
GPU 00000000:31:00.0
    Product Name                          : NVIDIA A100 80GB PCIe
    Product Brand                         : NVIDIA
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1324021047639
    GPU UUID                              : GPU-84ccface-663f-f5fd-8e8e-109d0f78bd2f
    Minor Number                          : 0
    VBIOS Version                         : 92.00.68.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x3100
    GPU Part Number                       : 900-21001-0020-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : 1001.0230.00.03
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 470.103.01
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x31
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x20B510DE
        Bus Id                            : 00000000:31:00.0
        Sub System Id                     : 0x153310DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 80994 MiB
        Used                              : 0 MiB
        Free                              : 80994 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 640 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 40 C
        GPU Shutdown Temp                 : 92 C
        GPU Slowdown Temp                 : 89 C
        GPU Max Operating Temp            : 85 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 55 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 62.86 W
        Power Limit                       : 300.00 W
        Default Power Limit               : 300.00 W
        Enforced Power Limit              : 300.00 W
        Min Power Limit                   : 150.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1275 MHz
    Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Default Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 893.750 mV
    Processes                             : None

GPU App Details

root@server:~# nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE

==============NVSMI LOG==============

Timestamp                                 : Sat Feb 12 20:44:55 2022
Driver Version                            : 470.103.01
CUDA Version                              : 11.4

Attached GPUs                             : 2
GPU 00000000:31:00.0
    FB Memory Usage
        Total                             : 80994 MiB
        Used                              : 0 MiB
        Free                              : 80994 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 1 MiB
        Free                              : 131071 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    GPU Utilization Samples
        Duration                          : 1562.92 sec
        Number of Samples                 : 42
        Max                               : 5 %
        Min                               : 0 %
        Avg                               : 0 %
    Memory Utilization Samples
        Duration                          : 1562.92 sec
        Number of Samples                 : 42
        Max                               : 0 %
        Min                               : 0 %
        Avg                               : 0 %
    ENC Utilization Samples
        Duration                          : 1562.92 sec
        Number of Samples                 : 42
        Max                               : 0 %
        Min                               : 0 %
        Avg                               : 0 %
    DEC Utilization Samples
        Duration                          : 1562.92 sec
        Number of Samples                 : 42
        Max                               : 0 %
        Min                               : 0 %
        Avg                               : 0 %
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 62.66 W
        Power Limit                       : 300.00 W
        Default Power Limit               : 300.00 W
        Enforced Power Limit              : 300.00 W
        Min Power Limit                   : 150.00 W
        Max Power Limit                   : 300.00 W
    Power Samples
        Duration                          : 2.39 sec
        Number of Samples                 : 119
        Max                               : 66.78 W
        Min                               : 57.45 W
        Avg                               : 60.17 W
    Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1275 MHz
    Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Default Applications Clocks
        Graphics                          : 1410 MHz
        Memory                            : 1512 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    SM Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Memory Clock Samples
        Duration                          : Not Found
        Number of Samples                 : Not Found
        Max                               : Not Found
        Min                               : Not Found
        Avg                               : Not Found
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
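
The -q reports are indented "key : value" listings, where a line without a value opens a nested section. The sketch below turns such text into nested dictionaries using the indentation alone. It is a rough illustration (parse_q_output is a made-up name) that assumes space indentation and " : " as the separator, as in the output above; for anything long-lived, NVML or the XML output is likely a sturdier choice, since this text format is not guaranteed to stay stable.

```python
def parse_q_output(text):
    """Parse indented 'key : value' lines into nested dicts."""
    root = {}
    stack = [(-1, root)]  # (indent, dict) for each currently open section
    for line in text.splitlines():
        if not line.strip() or line.startswith("="):
            continue  # skip blanks and the ====NVSMI LOG==== banner
        indent = len(line) - len(line.lstrip())
        key, sep, value = line.strip().partition(" : ")
        # Close sections that are at the same or a deeper indent level.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if sep:
            parent[key.strip()] = value.strip()
        else:
            section = {}
            parent[key.strip()] = section
            stack.append((indent, section))
    return root

# Sample copied from the start of the output above.
sample = """\
Attached GPUs                             : 2
GPU 00000000:31:00.0
    FB Memory Usage
        Total                             : 80994 MiB
        Used                              : 0 MiB
        Free                              : 80994 MiB
    Compute Mode                          : Default
"""
report = parse_q_output(sample)
print(report["GPU 00000000:31:00.0"]["FB Memory Usage"]["Total"])
```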

Change GPU Performance Level

Use the following command to check the current performance state of your GPU:

nvidia-smi -q -d PERFORMANCE

# nvidia-smi -q -d PERFORMANCE

==============NVSMI LOG==============

Timestamp                                 : Mon Feb 21 16:04:44 2022
Driver Version                            : 511.65
CUDA Version                              : 11.6

Attached GPUs                             : 1
GPU 00000000:65:00.0
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active

Next, check the maximum performance your NVIDIA GPU can deliver when running compute programs such as OpenCL or CUDA. To do this, you need to find out the maximum frequencies the video card supports, which are the ones used for maximum performance in the P0 state. The command to use for this is as follows.

nvidia-smi -q -d SUPPORTED_CLOCKS | more

# nvidia-smi -q -d SUPPORTED_CLOCKS | more

==============NVSMI LOG==============

Timestamp                                 : Mon Feb 21 16:08:23 2022
Driver Version                            : 511.65
CUDA Version                              : 11.6

Attached GPUs                             : 1
GPU 00000000:65:00.0
    Supported Clocks
        Memory                            : 8001 MHz
            Graphics                      : 2100 MHz
            Graphics                      : 2085 MHz
            Graphics                      : 2070 MHz
            Graphics                      : 2055 MHz
            Graphics                      : 2040 MHz
            Graphics                      : 2025 MHz
            Graphics                      : 2010 MHz
            Graphics                      : 1995 MHz
            Graphics                      : 1980 MHz
            Graphics                      : 1965 MHz
            Graphics                      : 1950 MHz
            Graphics                      : 1935 MHz
            Graphics                      : 1920 MHz
            Graphics                      : 1905 MHz
            Graphics                      : 1890 MHz
            Graphics                      : 1875 MHz
            Graphics                      : 1860 MHz
            Graphics                      : 1845 MHz
            Graphics                      : 1830 MHz
            Graphics                      : 1815 MHz
            Graphics                      : 1800 MHz
            Graphics                      : 1785 MHz
            Graphics                      : 1770 MHz
            Graphics                      : 1755 MHz
            Graphics                      : 1740 MHz
            Graphics                      : 1725 MHz
            Graphics                      : 1710 MHz
            Graphics                      : 1695 MHz
            Graphics                      : 1680 MHz
            Graphics                      : 1665 MHz
            Graphics                      : 1650 MHz
            Graphics                      : 1635 MHz
            Graphics                      : 1620 MHz
            Graphics                      : 1605 MHz
            Graphics                      : 1590 MHz
            Graphics                      : 1575 MHz
            Graphics                      : 1560 MHz
            Graphics                      : 1545 MHz
-- More  --

There is no need to examine the entire list, because it includes all of the supported frequencies in the various power states that your video card may use. We only need the memory and graphics frequencies at the top of the list. In this case we are using an RTX A6000 video card, and the values we require are 8001 MHz for the VRAM and 2100 MHz for the GPU. We will need these frequencies for the next step.

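The next step is to make the video card run at its highest-performance operating frequencies, the ones used in the P0 state, by setting its application clocks. The -ac option takes the memory clock first, followed by the graphics clock. We’ll need to run the following command to accomplish this.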

nvidia-smi -ac 8001,2100

Note that the above command will apply the settings to all GPUs in your system; this should not be an issue for most GPU servers because they often include a number of cards of the same model, but there are some exceptions. As a result, you may need to examine each video card’s particular settings and apply the appropriate values for each one independently.

To do so, simply include the card ID on the command line, and the option will be executed only for the video card supplied. This is accomplished by adding the -i argument to the command line, where i is a number starting at 0 for the first GPU and increasing from there.

In the example below the system has two different GPUs, so we need to set their clocks using two separate commands, one for each card.

nvidia-smi -i 0 -ac 8001,2100
nvidia-smi -i 1 -ac 8001,2085
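
With more than a couple of cards, the per-GPU commands can be generated rather than typed by hand. The following sketch builds the same -i/-ac command lines from a table of clock pairs; app_clock_cmd and the targets dictionary are illustrative, and the values are simply the ones from the example above, not recommendations for any particular card.

```python
def app_clock_cmd(mem_mhz, graphics_mhz, gpu_index=None):
    """Build the nvidia-smi argument list for setting application clocks."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]            # target a single GPU
    cmd += ["-ac", f"{mem_mhz},{graphics_mhz}"]  # order: memory,graphics
    return cmd

# GPU index -> (memory MHz, graphics MHz), as read from SUPPORTED_CLOCKS.
targets = {0: (8001, 2100), 1: (8001, 2085)}

for index, (mem, gfx) in targets.items():
    # Print rather than execute, so the commands can be reviewed first.
    print(" ".join(app_clock_cmd(mem, gfx, index)))
```

To undo the change, nvidia-smi -rac resets the application clocks to their defaults.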
