How to Reset GPUs on DGX-1 Servers

NVIDIA provides a tool called nvidia-smi to monitor and manage the system’s GPUs. With this tool, GPUs can be reset individually or as a group. On the DGX-1 and DGX-1V platforms, however, individual GPUs cannot be reset because they are linked via NVLink, so all of the GPUs must be reset at the same time.

Memory pages that report errors two or more times at the same location on a GPU are retired. The GPU must be reset for the retired pages to be blacklisted (and thus no longer available to the user or application). The blacklist takes effect only once the driver has been reloaded and reinitialized, which the reset triggers. All GPU-based applications must be closed before the GPUs can be reset. nvidia-smi can be used to check this:

dgxuser@dgx-1:~$ nvidia-smi -q -d PIDS

==============NVSMI LOG==============

Timestamp : Fri Feb 23 11:56:41 2018
Driver Version : 384.111

Attached GPUs : 8
GPU 00000000:06:00.0
 Processes : None
GPU 00000000:07:00.0
 Processes : None
GPU 00000000:0A:00.0
 Processes : None
GPU 00000000:0B:00.0
 Processes : None
GPU 00000000:85:00.0
 Processes : None
GPU 00000000:86:00.0
 Processes : None
GPU 00000000:89:00.0
 Processes : None
GPU 00000000:8A:00.0
 Processes : None
dgxuser@dgx-1:~$
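
Reading eight "Processes" entries by eye is easy to get wrong, so the check can be scripted. The following is a minimal sketch, not an NVIDIA-provided tool: gpus_idle is a hypothetical helper name, and it assumes the query output keeps the layout shown above (one "GPU <bus id>" line per GPU, followed by "Processes : None" when that GPU is idle).

```shell
#!/bin/sh
# Sketch: decide from `nvidia-smi -q -d PIDS` output whether every GPU is idle.
# gpus_idle is a hypothetical helper, not part of nvidia-smi.

# Reads the query output on stdin. Succeeds (exit status 0) only if at least
# one GPU is listed and every listed GPU reports "Processes : None".
gpus_idle() {
    awk '
        /^[ \t]*GPU [0-9A-Fa-f]+:/       { gpus++ }
        /Processes[ \t]*:[ \t]*None/     { idle++ }
        END { exit !(gpus > 0 && gpus == idle) }
    '
}
```

It could be used as: nvidia-smi -q -d PIDS | gpus_idle && echo "No GPU processes, safe to continue". A GPU that lists any process has no "None" line, so the counts differ and the check fails.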

Once no applications are running on the GPUs, stop the nvidia-persistenced and nvidia-docker services:

dgxuser@dgx-1:~$ sudo systemctl stop nvidia-persistenced
dgxuser@dgx-1:~$ sudo systemctl stop nvidia-docker
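
If this step is scripted, it may help to stop a service only when it is actually running and to abort if a stop fails. The sketch below is one way to do that, under the assumption that both services are ordinary systemd units with the names used above; stop_if_active is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: stop each named systemd unit, but only if it is currently active.
# stop_if_active is a hypothetical helper; the unit names are passed as arguments.

stop_if_active() {
    for unit in "$@"; do
        # `systemctl is-active --quiet` exits 0 only when the unit is active.
        if systemctl is-active --quiet "$unit"; then
            # Give up at the first unit that cannot be stopped.
            sudo systemctl stop "$unit" || return 1
        fi
    done
}
```

Usage would be: stop_if_active nvidia-persistenced nvidia-docker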

To reset the GPUs, run the nvidia-smi command as follows:

dgxuser@dgx-1:~$ sudo nvidia-smi -r
GPU 00000000:06:00.0 was successfully reset.
GPU 00000000:07:00.0 was successfully reset.
GPU 00000000:0A:00.0 was successfully reset.
GPU 00000000:0B:00.0 was successfully reset.
GPU 00000000:85:00.0 was successfully reset.
GPU 00000000:86:00.0 was successfully reset.
GPU 00000000:89:00.0 was successfully reset.
GPU 00000000:8A:00.0 was successfully reset.
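
Because all of the GPUs have to be reset together, it is worth confirming that every one of them reported success. The sketch below counts the success messages and compares the count with the number of GPUs expected (8 on the system shown here). all_gpus_reset is a hypothetical helper name, and the check assumes the message wording shown above.

```shell
#!/bin/sh
# Sketch: confirm that a reset reported success for the expected number of GPUs.
# all_gpus_reset is a hypothetical helper, not part of nvidia-smi.

# Usage: <reset output> | all_gpus_reset EXPECTED_COUNT
# Succeeds only if exactly EXPECTED_COUNT lines say "was successfully reset".
all_gpus_reset() {
    expected=$1
    # grep -c prints 0 and exits non-zero when nothing matches; keep the count.
    actual=$(grep -c 'was successfully reset' || true)
    [ "$actual" -eq "$expected" ]
}
```

For example: sudo nvidia-smi -r 2>&1 | all_gpus_reset 8 || echo "Not all GPUs were reset" >&2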
