Monitoring GPU Memory Usage on Linux: A Comprehensive Guide

Monitoring GPU memory usage on Linux is crucial for gamers, professionals working with graphics-intensive applications, and anyone training machine learning models. This article discusses tools and methods for monitoring GPU performance and usage, primarily for NVIDIA GPUs, along with a few options for Intel and AMD hardware.

NVIDIA System Management Interface (nvidia-smi)

The NVIDIA System Management Interface, known as nvidia-smi, is a command-line utility included with NVIDIA GPU drivers. It provides vital statistics such as current utilisation, memory consumption, GPU temperature, and more. To use this tool, ensure your system has the NVIDIA drivers installed.
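
For example, to print just the memory figures rather than the full status table, you can use the query options that nvidia-smi provides (a minimal example; the exact fields available depend on your driver version):

nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv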

nvtop

nvtop is an interactive monitoring tool similar to htop but focused on NVIDIA GPUs. It offers an in-depth view of processes utilising the GPU, detailed memory usage statistics, and other critical metrics. nvtop can be easily installed via the package manager.

glmark2

glmark2 is an OpenGL 2.0 and ES 2.0 benchmark command-line utility that stress-tests GPU performance. It can be installed and run to test GPU performance.
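
On Debian- and Ubuntu-based systems, a typical install-and-run sequence looks like the following (the package name is assumed to be glmark2 and may differ on other distributions):

sudo apt install glmark2
glmark2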

glxgears

glxgears is a simple Linux GPU performance testing tool that displays a set of rotating gears and prints out the frame rate at regular intervals.
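
glxgears ships with the Mesa demo utilities; on Debian/Ubuntu it is typically provided by the mesa-utils package:

sudo apt install mesa-utils
glxgears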

gpustat

gpustat is a Python-based command-line script for querying and monitoring GPU status, especially useful for ML/AI developers.
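
Because gpustat is distributed on PyPI, it can be installed with pip. The options below (-c for command names, -p for PIDs, --watch for continuous refresh) are standard gpustat flags, but check gpustat --help for your installed version:

pip install gpustat
gpustat -cp          # one-off snapshot with command names and PIDs
gpustat -cp --watch  # refresh continuously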

intel_gpu_top

intel_gpu_top is a top-like summary tool for displaying Intel GPU usage. It gathers data using perf performance counters exposed by i915 and other platform drivers.
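
On Debian/Ubuntu the tool is packaged as intel-gpu-tools, and it generally needs elevated privileges to read the performance counters:

sudo apt install intel-gpu-tools
sudo intel_gpu_top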

radeontop

radeontop is a tool to show AMD GPU utilisation on Linux, working with both open-source AMD drivers and AMD Catalyst closed-source drivers.
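
A typical install-and-run sequence on Debian/Ubuntu looks like this (radeontop may require root access to read the GPU registers on some systems):

sudo apt install radeontop
radeontop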

These tools provide a comprehensive set of options for monitoring GPU memory usage and performance on Linux systems.

Utilise the NVIDIA System Management Interface (nvidia-smi) to monitor GPU usage

The NVIDIA System Management Interface (nvidia-smi) is a command-line utility that can be used to monitor the performance of NVIDIA GPU devices. It is based on the NVIDIA Management Library (NVML) and allows administrators to query and modify GPU device states. While it is targeted at Tesla, GRID, Quadro, and Titan X products, limited support is also available on other NVIDIA GPUs.

To use nvidia-smi for monitoring GPU usage, follow these steps:

Step 1: Check if nvidia-smi is installed

First, check if nvidia-smi is already installed on your system. You can do this by using the following command:

whereis nvidia-smi

If nvidia-smi is not installed, you can install it by following the official instructions provided by NVIDIA.

Step 2: Identify the GPU device

Once you have nvidia-smi installed, you can start monitoring your GPU usage. First, identify the host your code is running on. If your job was submitted through the Slurm scheduler, you can use the squeue command to determine the host and get your job's ID number. For example:

squeue

This will display information about your job, including the host and job ID.
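
To narrow the output to your own jobs (assuming a Slurm cluster; $USER expands to your username), you can run:

squeue -u $USER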

Step 3: SSH into the host

Now that you know the host your code is running on, you can SSH into that host to check your job's GPU usage. Use the following command:

ssh <host_name>

Replace <host_name> with the actual name of the host your code is running on.

Step 4: Use nvidia-smi to monitor GPU usage

Once you are on the host, you can use the nvidia-smi command to monitor GPU usage. The basic command is:

nvidia-smi

This will provide a snapshot of the GPU usage at that moment. If you want to target a specific GPU, you can use the --id option followed by the device number. For example, to target GPU device 1:

nvidia-smi --id=1

This will display information about GPU device 1, including its utilization and memory usage.

Step 5: Monitor GPU usage over time

Note that nvidia-smi only provides a snapshot of the GPU usage at a particular moment. To monitor GPU usage over time, you can use the watch command in combination with nvidia-smi. This will automatically provide updated measures of GPU utilization and memory at regular intervals. For example, to run the nvidia-smi command every two seconds:

watch -n 2 nvidia-smi

You can adjust the interval by changing the value after -n. To exit the watch command, simply press Ctrl+C.

Additionally, there are other tools and methods mentioned in the sources that can be used alongside nvidia-smi to monitor GPU usage, such as nvtop and atop. These tools provide additional features and can help you make informed decisions about which GPU is most suitable for your specific code.

Install and use nvtop for a more interactive monitoring experience

NVTOP, or Neat Videocard TOP, is a GPU task monitor similar to the htop command. It can handle multiple GPUs and provides information about them in a familiar format. NVTOP supports GPUs from various vendors, including AMD, Apple, Huawei, Intel, NVIDIA, and Qualcomm.

Installation:

NVTOP can be installed on Ubuntu Impish (21.10), Debian buster (stable), and more recent distributions using the following command:

sudo apt install nvtop

For other Linux distributions, you can refer to the NVTOP GitHub page for specific installation instructions.

Usage:

To use NVTOP, simply run the following command:

nvtop

You can also specify the delay between updates, expressed in tenths of a second; for example, a value of 25 refreshes roughly every 2.5 seconds:

nvtop -d 25

To disable colour output and use monochrome mode instead, use the following command:

nvtop -C

To display only one bar plot corresponding to the maximum of all GPUs, use this command:

nvtop -p

Additionally, NVTOP provides various keyboard shortcuts to navigate and interact with the interface:

  • Up: Select (highlight) the previous process.
  • Down: Select (highlight) the next process.
  • Left/Right: Scroll in the process row.
  • F2: Enter the setup utility to modify interface options.
  • F12: Save the current interface options to persistent storage.
  • F9: "Kill" process: select a signal to send to the highlighted process.
  • F6: Sort: select the field for sorting. The current sort field is highlighted in the header bar.

NVTOP also allows you to inspect GPU information such as fan speed, PCIe, and power usage. If you installed NVTOP as a snap package, grant it the required permission with the following command:

sudo snap connect nvtop:hardware-observe

NVTOP provides an interactive and user-friendly way to monitor GPU usage and gain insights into your system's performance.

Identify processes consuming GPU RAM

To identify processes consuming GPU RAM on Linux, you can use the following methods:

nvidia-smi

The NVIDIA System Management Interface (nvidia-smi) is a command-line utility that comes with the NVIDIA GPU drivers. It can be used to monitor GPU usage, memory usage, and processes. To install nvidia-smi, you can use the following command:

sudo apt install nvidia-utils # For Ubuntu/Debian

To monitor GPU usage and memory, you can use the following command:

watch -n 2 nvidia-smi

This will refresh the output every 2 seconds. You can also use the --id option to target a specific GPU.

To get more detailed information on GPU processes, you can use the following command:

nvidia-smi pmon -c 1
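
Alternatively, you can list the compute processes and their memory consumption directly in CSV form (these query flags are documented nvidia-smi options; note that graphics-only processes may not appear in this list):

nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv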

nvtop

Nvtop is a Linux task monitor for Nvidia, AMD, Apple, Adreno, Ascend, and Intel GPUs. It provides a nice, easy-to-read graphical display of the state of the GPU devices. You can install nvtop with the following command:

sudo apt install nvtop

Once installed, simply run the following command to view GPU usage:

nvtop

atop

Atop is a powerful UNIX command-line utility for monitoring system resources and performance, including GPU usage. It provides real-time information and logs system activity in 10-minute intervals. To generate a report of GPU statistics, you can use the atopsar command:

ssh <node> atopsar -g -r /var/log/atop/atop_<date> | less

Replace <node> with the node your job is running on and <date> with the date (in YYYYMMDD format) for which you want to generate the report.

lspci

The lspci command displays information about all PCI buses in the system and the devices connected to them. To find the GPU memory size, use the following command:

lspci -v -s <bus_id>

Replace <bus_id> with the PCI address of your display card. For example, for an Intel video card, the command would be:

lspci -v -s 00:02.0

lshw

Lshw is a small tool that extracts detailed information about the hardware configuration of a Linux machine. To identify the onboard GPU and its memory size, use the following command:

sudo lshw -C display

glxinfo

Glxinfo displays information about the GLX implementation on a given X display. To filter out memory information, use the following command:

glxinfo | grep -E -i 'device|memory'

nvidia-settings

Nvidia-settings is another tool that can be used to monitor GPU usage and memory. However, it requires an X server to be running. To monitor GPU memory usage, use the following command:

nvidia-settings -q GPUUtilization -q UsedDedicatedGPUMemory

You can also use watch to refresh the output regularly:

watch -n 0.1 "nvidia-settings -q GPUUtilization -q UsedDedicatedGPUMemory"

Automate GPU monitoring and termination

Monitoring GPU usage is essential for identifying and managing processes that may be wasting GPU resources. This procedure will guide you through automating the monitoring of GPU usage, identifying processes, and terminating those that are wasting GPU RAM on a Linux system (assuming NVIDIA GPUs are in use). The nvidia-smi utility will be used for monitoring, and the kill command will be used for termination.

Step 1: Install NVIDIA System Management Interface (SMI)

First, ensure that you have the NVIDIA GPU drivers installed on your system. You can download and install them from the official NVIDIA website or use your Linux distribution’s package manager. The nvidia-smi command-line utility comes with the NVIDIA GPU drivers.

Step 2: Automate Monitoring

Create a script to automate GPU monitoring and regularly check for wasteful processes. Save the following script to a file (e.g., gpu_monitor.sh):

#!/bin/bash
while true
do
    nvidia-smi | grep -E 'MiB|W '
    sleep 5  # Adjust the interval as needed
done

Step 3: Automate Termination

Create another script to automate the termination of wasteful processes. Save the following script to a file (e.g., terminate_gpu_processes.sh):

#!/bin/bash
# Terminate all processes whose command line matches "nvidia"
sudo pkill -f nvidia
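
Killing everything that matches "nvidia" is blunt and may also hit driver helper processes. A more targeted sketch, assuming you only want to terminate compute processes above a chosen memory threshold (THRESHOLD_MIB is a hypothetical value; test with the kill line commented out first), is:

#!/bin/bash
THRESHOLD_MIB=4096  # hypothetical limit; adjust for your system
nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits |
while IFS=', ' read -r pid used; do
    if [ "$used" -gt "$THRESHOLD_MIB" ]; then
        echo "Terminating PID $pid (using ${used} MiB of GPU memory)"
        sudo kill "$pid"
    fi
done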

Step 4: Schedule Scripts

Use cron or another scheduling tool to run these scripts at regular intervals. For example, to run the monitoring script every 5 minutes:

*/5 * * * * /path/to/gpu_monitor.sh
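
Note that gpu_monitor.sh runs in an endless loop, so an alternative arrangement (a sketch; the paths are placeholders) is to start the monitor once at boot and schedule only the termination script at intervals. Add the entries with crontab -e:

@reboot /path/to/gpu_monitor.sh >> /var/log/gpu_monitor.log 2>&1
*/10 * * * * /path/to/terminate_gpu_processes.sh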

Step 5: Customize and Secure

Customize the monitoring and termination scripts based on your specific requirements. You may want to tailor the script to monitor specific GPU processes or conditions. Ensure that the scripts are executable (chmod +x script.sh) and stored in a secure location. Limit access to these scripts to authorized users.

By following these steps, you can automate the process of monitoring GPU usage, identifying wasteful processes, and terminating them as needed. Customize the scripts to suit your specific requirements, and always exercise caution when terminating processes to avoid unintended consequences.

Learn how to interpret the 'P0' state in nvidia-smi

The P0 state in nvidia-smi refers to the highest performance state of a GPU. It is one of the performance states (P-States) that can be used to monitor and manage the performance and power consumption of NVIDIA GPU devices. These P-States range from P0 to P15, with P0 being the maximum performance state and P15 being the lowest.

When a GPU is idle, nvidia-smi may show it in the P0 state as the tool needs to wake up one of the GPUs to collect information. However, the GPU driver will eventually reduce the performance state to save power if the GPU remains idle or is not heavily utilised.

To force the GPU to always run at P0, you can experiment with persistence mode and application clocks using the nvidia-smi tool. This typically involves raising the application clocks to the maximum available (Max Clocks) and enabling GPU Persistence Mode so that the driver stays loaded even when the GPU is idle, which prevents the application clocks from being reset.
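
The corresponding nvidia-smi commands look roughly like the following (the clock values must be a memory,graphics pair supported by your particular GPU, so the placeholders below need to be replaced with values from the supported-clocks listing):

sudo nvidia-smi -pm 1                # enable persistence mode
nvidia-smi -q -d SUPPORTED_CLOCKS    # list supported memory,graphics clock pairs
sudo nvidia-smi -ac <memory_clock>,<graphics_clock>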

It is important to note that modifying application clocks or enabling modifiable application clocks may require administrative privileges. Additionally, not all GPUs support modifiable application clocks, as indicated by N/A in the nvidia-smi output for some fields.

Frequently asked questions

How can I check whether my GPU is detected by the system?
You can use the command 'lspci | grep NVIDIA' to verify that your NVIDIA GPU is detected by the system.

Can nvidia-smi control the GPU fan speed?
nvidia-smi is primarily used for monitoring performance; for fan control, you may need additional tools or the NVIDIA 'Coolbits' driver option.

Can I log GPU usage over time?
Yes, you can use 'nvidia-smi --query-gpu=utilization.gpu --format=csv --loop-ms=1000 > gpu_usage.log' to log the usage.

Do monitoring tools affect system performance?
Monitoring tools use minimal resources and typically do not significantly affect overall performance.
