
gpustat reports only the first program on nv driver 535 #161

Open
rxqy opened this issue Aug 28, 2023 · 4 comments

@rxqy

rxqy commented Aug 28, 2023

Hi, we recently updated our driver to version 535.54.03 and CUDA 12.2. Since then, gpustat reports only the first program's info, even when we are running multiple programs on the same GPU.

Screenshots or Program Output

Please provide the output of gpustat --debug and nvidia-smi. Or attach screenshots if applicable.

$ gpustat --debug
> An error while retrieving `fan_speed`: Not Supported
Traceback (most recent call last):
  File "/data/root/miniconda3/envs/pt2/lib/python3.8/site-packages/gpustat/core.py", line 468, in get_gpu_info
    fan_speed = N.nvmlDeviceGetFanSpeed(handle)
  File "/data/root/miniconda3/envs/pt2/lib/python3.8/site-packages/pynvml.py", line 2290, in nvmlDeviceGetFanSpeed
    _nvmlCheckReturn(ret)
  File "/data/root/miniconda3/envs/pt2/lib/python3.8/site-packages/pynvml.py", line 848, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported

> An error while retrieving `fan_speed`: Not Supported -> Total 2 occurrences.

ms5   Mon Aug 28 17:54:36 2023  535.54.03
[0] Tesla V100-SXM2-32GB | 54°C, 100 % |  5855 / 32768 MB | root(2806M)
[1] Tesla V100-SXM2-32GB | 34°C,   0 % |     3 / 32768 MB |

$ nvidia-smi

Mon Aug 28 17:55:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:00:09.0 Off |                    0 |
| N/A   54C    P0             293W / 300W |   5855MiB / 32768MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:00:0A.0 Off |                    0 |
| N/A   34C    P0              28W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     58166      C   python                                     3032MiB | <- this one is missing from gpustat
|    0   N/A  N/A     77684      C   python                                     2806MiB |
+---------------------------------------------------------------------------------------+

Environment information:

  • OS: CentOS 7
  • NVIDIA Driver version: 535.54.03
  • The name(s) of GPU card: V100
  • gpustat version: tried 1.1 and 1.1.1
  • pynvml version: nvidia-ml-py 11.525.112


@rxqy rxqy added the bug label Aug 28, 2023
@wookayin
Owner

I see, thanks for the report. I will make an update to support the new nvidia driver. Probably the same issue as #157.

@wookayin wookayin added this to the 1.2 milestone Oct 16, 2023
@wookayin
Owner

wookayin commented Oct 29, 2023

This bug is due to breaking changes in the NVIDIA Driver R535.xx series (affected versions are >= 535.43, < 535.98).

TL;DR:

  • Avoid NVIDIA Drivers between 535.43 and 535.86; these are broken. If you must use one of these driver versions, use pip install nvidia-ml-py==12.535.77 as a workaround.
  • If you use NVIDIA Driver 535.104.05+ and pynvml 12.535.108+, process information will be reported correctly (see the version-check sketch below).
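
If you are not sure which case applies, here is a minimal sketch (not part of gpustat) for printing both versions; it assumes the nvidia-ml-py distribution is installed and importable as pynvml:

from importlib.metadata import version

import pynvml

pynvml.nvmlInit()
try:
    driver = pynvml.nvmlSystemGetDriverVersion()
    if isinstance(driver, bytes):  # older bindings return bytes
        driver = driver.decode()
    print("NVIDIA driver:", driver)                   # e.g. 535.54.03
    print("nvidia-ml-py :", version("nvidia-ml-py"))  # e.g. 12.535.77
finally:
    pynvml.nvmlShutdown()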

NVIDIA Driver Changes:

  • 535.43.xx (NVIDIA/nvidia-settings@39c3e28) added a field usedGpuCcProtectedMemory to nvmlProcessInfo_st, which breaks the process information API; the only compatible pynvml version is 12.535.77 (see the struct sketch after this list).

  • 535.54.xx (still affected)

  • 535.86.xx (still affected)

  • 535.98.xx (NVIDIA/nvidia-settings@0cb3bef) reverts the change, removing the field usedGpuCcProtectedMemory from nvmlProcessInfo_st (v2 API).

  • 535.104.05 (NVIDIA/nvidia-settings@74cae7f): everything is fixed now. Adds nvmlProcessInfo_v2_st again without usedGpuCcProtectedMemory (which is correct). Needs pynvml >= 12.535.108.

Cross-ref: XuehaiPan/nvitop#88 (comment)
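
To see why only the first process shows up, here is a minimal ctypes sketch of the two struct layouts involved (illustrative only; field order simplified, not the actual pynvml definitions):

import ctypes

class ProcessInfoExpected(ctypes.Structure):
    # layout most nvidia-ml-py releases expect for a process-info entry
    _fields_ = [
        ("pid", ctypes.c_uint),
        ("usedGpuMemory", ctypes.c_ulonglong),
        ("gpuInstanceId", ctypes.c_uint),
        ("computeInstanceId", ctypes.c_uint),
    ]

class ProcessInfoBuggy535(ctypes.Structure):
    # layout the 535.43-535.86 drivers actually fill in: one extra field
    _fields_ = ProcessInfoExpected._fields_ + [
        ("usedGpuCcProtectedMemory", ctypes.c_ulonglong),
    ]

# The element sizes differ (24 vs. 32 bytes here), so a binding walking the
# returned array with the smaller stride decodes only the first entry
# correctly; later entries are read at the wrong offsets and get lost.
print(ctypes.sizeof(ProcessInfoExpected), ctypes.sizeof(ProcessInfoBuggy535))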

@rxqy
Author

rxqy commented Oct 30, 2023

Hi, I'll be using nvidia-ml-py 12.535.77 for now. Many thanks for the help.

@rxqy rxqy closed this as completed Oct 30, 2023
@wookayin wookayin reopened this Oct 30, 2023
wookayin added a commit that referenced this issue Oct 30, 2023
NVIDIA 535.43, 535.86 can display process information correctly only
with nvidia-ml-py==12.535.77. Display a warning message when an
incompatible combination is detected.

See #161 for more details.
@wookayin
Owner

wookayin commented Oct 30, 2023

We won't be adding monkey-patching because it is extremely complex to manage all the combinations. The buggy versions of the NVIDIA driver (535.43 and 535.86) and nvidia-ml-py 12.535.77 should be avoided, but there is a working workaround. I've added a warning message that is shown when such an incompatible driver/pynvml combination is detected.
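
For reference, a minimal sketch of this kind of check (illustrative only, not gpustat's actual implementation; the version boundaries are taken from the comment above):

from importlib.metadata import version

import pynvml

def driver_tuple(ver):
    # "535.54.03" -> (535, 54)
    return tuple(int(x) for x in ver.split(".")[:2])

pynvml.nvmlInit()
try:
    drv = pynvml.nvmlSystemGetDriverVersion()
    drv = drv.decode() if isinstance(drv, bytes) else drv
finally:
    pynvml.nvmlShutdown()

pyn = version("nvidia-ml-py")
buggy_driver = (535, 43) <= driver_tuple(drv) < (535, 98)

if buggy_driver and pyn != "12.535.77":
    print("WARNING: driver %s reports per-process info correctly only with "
          "nvidia-ml-py==12.535.77 (found %s)." % (drv, pyn))
elif not buggy_driver and pyn == "12.535.77":
    print("WARNING: nvidia-ml-py 12.535.77 matches only the broken "
          "535.43-535.86 drivers; use 12.535.108+ with driver %s." % drv)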

@wookayin wookayin pinned this issue Oct 30, 2023
wookayin added a commit that referenced this issue Nov 1, 2023
nvidia-ml-py==12.535.77 is a buggy version that breaks the struct for
process information, and should not be used (unless the NVIDIA driver is
*also* buggy: 535.43, 535.54, or 535.86). The latest version
nvidia-ml-py==12.535.108 fixes the problem and is still compatible with
our supported drivers (R450+).

To ensure that users who install gpustat 1.2.0 have a correct version
of nvidia-ml-py installed, we bump up the requirement.

See #160 and #161 for more details.
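
For reference, a minimal sketch of what such a requirement bump looks like in a setuptools-based project (hypothetical package, not gpustat's actual setup.py):

from setuptools import setup

setup(
    name="example-gpu-tool",  # hypothetical package name
    version="1.2.0",
    install_requires=[
        # require the fixed bindings; still compatible with R450+ drivers
        "nvidia-ml-py>=12.535.108",
    ],
)
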
wookayin referenced this issue in wookayin/nvidia-ml-py Nov 2, 2023