
gpustat reports only the first program on nv driver 535 #161

Open
rxqy opened this issue Aug 28, 2023 · 4 comments

@rxqy

rxqy commented Aug 28, 2023

Hi, we recently updated our driver to version 535.54.03 and CUDA 12.2. Since then, gpustat reports only the first program's info, even when we are running multiple programs on the same GPU.

Screenshots or Program Output

Please provide the output of gpustat --debug and nvidia-smi. Or attach screenshots if applicable.

$ gpustat --debug
> An error while retrieving `fan_speed`: Not Supported
Traceback (most recent call last):
  File "/data/root/miniconda3/envs/pt2/lib/python3.8/site-packages/gpustat/core.py", line 468, in get_gpu_info
    fan_speed = N.nvmlDeviceGetFanSpeed(handle)
  File "/data/root/miniconda3/envs/pt2/lib/python3.8/site-packages/pynvml.py", line 2290, in nvmlDeviceGetFanSpeed
    _nvmlCheckReturn(ret)
  File "/data/root/miniconda3/envs/pt2/lib/python3.8/site-packages/pynvml.py", line 848, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported

> An error while retrieving `fan_speed`: Not Supported -> Total 2 occurrences.

ms5   Mon Aug 28 17:54:36 2023  535.54.03
[0] Tesla V100-SXM2-32GB | 54°C, 100 % |  5855 / 32768 MB | root(2806M)
[1] Tesla V100-SXM2-32GB | 34°C,   0 % |     3 / 32768 MB |

$ nvidia-smi

Mon Aug 28 17:55:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:00:09.0 Off |                    0 |
| N/A   54C    P0             293W / 300W |   5855MiB / 32768MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:00:0A.0 Off |                    0 |
| N/A   34C    P0              28W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     58166      C   python                                     3032MiB | <- this one is missing from gpustat
|    0   N/A  N/A     77684      C   python                                     2806MiB |
+---------------------------------------------------------------------------------------+

Environment information:

  • OS: CentOS 7
  • NVIDIA Driver version: 535.54.03
  • The name(s) of GPU card: V100
  • gpustat version: tried 1.1 and 1.1.1
  • pynvml version: nvidia-ml-py 11.525.112


@rxqy rxqy added the bug label Aug 28, 2023
@wookayin
Owner

I see, thanks for the report. I will make an update to support the new nvidia driver. Probably the same issue as #157.

@wookayin wookayin added this to the 1.2 milestone Oct 16, 2023
@wookayin
Owner

wookayin commented Oct 29, 2023

This bug is due to breaking changes in the NVIDIA Driver R535.xx series (affected versions are >= 535.43, < 535.98).

TL;DR:

  • Avoid NVIDIA Drivers between 535.43 and 535.86; these are broken. If you must use one of these driver versions, use pip install nvidia-ml-py==12.535.77 as a workaround.
  • If you use NVIDIA Driver 535.104.05+ and pynvml 12.535.108+, process information will be reported correctly (see the version-check sketch below).
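
If you are not sure which case applies, here is a minimal sketch (not part of gpustat) for printing both versions; it assumes the nvidia-ml-py distribution is installed and importable as pynvml:

from importlib.metadata import version

import pynvml

pynvml.nvmlInit()
try:
    driver = pynvml.nvmlSystemGetDriverVersion()
    if isinstance(driver, bytes):  # older bindings return bytes
        driver = driver.decode()
    print("NVIDIA driver:", driver)                   # e.g. 535.54.03
    print("nvidia-ml-py :", version("nvidia-ml-py"))  # e.g. 12.535.77
finally:
    pynvml.nvmlShutdown()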

NVIDIA Driver Changes:

  • 535.43.xx (NVIDIA/nvidia-settings@39c3e28) added a field usedGpuCcProtectedMemory to nvmlProcessInfo_st, which breaks the process information API; the only compatible pynvml version is 12.535.77 (see the struct sketch after this list).

  • 535.54.xx (still affected)

  • 535.86.xx (still affected)

  • 535.98.xx (NVIDIA/nvidia-settings@0cb3bef) reverts the change, removing the field usedGpuCcProtectedMemory from nvmlProcessInfo_st (v2 API).

  • 535.104.05 (NVIDIA/nvidia-settings@74cae7f): everything is fixed now. Adds nvmlProcessInfo_v2_st again without usedGpuCcProtectedMemory (which is correct). Needs pynvml >= 12.535.108.

Cross-ref: XuehaiPan/nvitop#88 (comment)
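
To see why only the first process shows up, here is a minimal ctypes sketch of the two struct layouts involved (illustrative only; field order simplified, not the actual pynvml definitions):

import ctypes

class ProcessInfoExpected(ctypes.Structure):
    # layout most nvidia-ml-py releases expect for a process-info entry
    _fields_ = [
        ("pid", ctypes.c_uint),
        ("usedGpuMemory", ctypes.c_ulonglong),
        ("gpuInstanceId", ctypes.c_uint),
        ("computeInstanceId", ctypes.c_uint),
    ]

class ProcessInfoBuggy535(ctypes.Structure):
    # layout the 535.43-535.86 drivers actually fill in: one extra field
    _fields_ = ProcessInfoExpected._fields_ + [
        ("usedGpuCcProtectedMemory", ctypes.c_ulonglong),
    ]

# The element sizes differ (24 vs. 32 bytes here), so a binding walking the
# returned array with the smaller stride decodes only the first entry
# correctly; later entries are read at the wrong offsets and get lost.
print(ctypes.sizeof(ProcessInfoExpected), ctypes.sizeof(ProcessInfoBuggy535))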

@rxqy
Author

rxqy commented Oct 30, 2023

Hi, I'll be using nvidia-ml-py 12.535.77 for now. Many thanks for the help.

@rxqy rxqy closed this as completed Oct 30, 2023
@wookayin wookayin reopened this Oct 30, 2023
wookayin added a commit that referenced this issue Oct 30, 2023
NVIDIA 535.43, 535.86 can display process information correctly only
with nvidia-ml-py==12.535.77. Display a warning message when an
incompatible combination is detected.

See #161 for more details.
@wookayin
Owner

wookayin commented Oct 30, 2023

We won't be adding monkey-patching because it is extremely complex to manage all the combinations. The buggy versions of the NVIDIA driver (535.43 and 535.86) and nvidia-ml-py 12.535.77 should be avoided, but there is a working workaround. I've added a warning message that is shown when such an incompatible driver/pynvml combination is detected.
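
For reference, a minimal sketch of this kind of check (illustrative only, not gpustat's actual implementation; the version boundaries are taken from the comment above):

from importlib.metadata import version

import pynvml

def driver_tuple(ver):
    # "535.54.03" -> (535, 54)
    return tuple(int(x) for x in ver.split(".")[:2])

pynvml.nvmlInit()
try:
    drv = pynvml.nvmlSystemGetDriverVersion()
    drv = drv.decode() if isinstance(drv, bytes) else drv
finally:
    pynvml.nvmlShutdown()

pyn = version("nvidia-ml-py")
buggy_driver = (535, 43) <= driver_tuple(drv) < (535, 98)

if buggy_driver and pyn != "12.535.77":
    print("WARNING: driver %s reports per-process info correctly only with "
          "nvidia-ml-py==12.535.77 (found %s)." % (drv, pyn))
elif not buggy_driver and pyn == "12.535.77":
    print("WARNING: nvidia-ml-py 12.535.77 matches only the broken "
          "535.43-535.86 drivers; use 12.535.108+ with driver %s." % drv)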

@wookayin wookayin pinned this issue Oct 30, 2023
wookayin added a commit that referenced this issue Nov 1, 2023
nvidia-ml-py==12.535.77 is a buggy version that breaks the struct for
process information, and should not be used (unless the NVIDIA driver is
*also* buggy: 535.43, 535.54, or 535.86). The latest version
nvidia-ml-py==12.535.108 fixes the problem and is still compatible with
our supported drivers (R450+).

To ensure that users who install gpustat 1.2.0 have a correct version
of nvidia-ml-py installed, we bump up the requirement.

See #160 and #161 for more details.
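
For reference, a minimal sketch of what such a requirement bump looks like in a setuptools-based project (hypothetical package, not gpustat's actual setup.py):

from setuptools import setup

setup(
    name="example-gpu-tool",  # hypothetical package name
    version="1.2.0",
    install_requires=[
        # require the fixed bindings; still compatible with R450+ drivers
        "nvidia-ml-py>=12.535.108",
    ],
)
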
wookayin referenced this issue in wookayin/nvidia-ml-py Nov 2, 2023