Device error on 8/31 nightlies #795

Closed
ebsmothers opened this issue Sep 2, 2024 · 3 comments

Comments

@ebsmothers
Contributor

Installing recent nightlies of PyTorch and ao is resulting in some CUDA device errors.

With the nightlies from 8/30 there are no problems:

conda create -n ao-08-30 python=3.11
conda activate ao-08-30
pip install --pre torch==2.5.0.dev20240830+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
pip install --pre torchao==0.5.0.dev20240830+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
python3
>>> import torch
>>> from torchao.dtypes.nf4tensor import NF4Tensor
>>> torch.empty(0, device=torch.device('cuda:0'))
tensor([], device='cuda:0')

But with 8/31 nightlies, I see the following:

conda create -n ao-08-31 python=3.11
conda activate ao-08-31
pip install --pre torch==2.5.0.dev20240831+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
pip install --pre torchao==0.5.0.dev20240831+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
python3
>>> import torch
>>> from torchao.dtypes.nf4tensor import NF4Tensor
>>> torch.empty(0, device=torch.device('cuda:0'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Note that if I remove the NF4Tensor import from the 8/31 case, everything works. Is this related to #790? If so, what's the recommendation: just force installation of the 8/29 PyTorch nightly? (This is relevant for our nightly builds as well.)
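
The comparison described above (same allocation, with and without the NF4Tensor import) could be scripted roughly as follows. This is only a sketch: `build_snippet` and `run_case` are hypothetical helper names, and it assumes torch and torchao are installed in the active environment with a CUDA device available. Each case runs in a fresh interpreter so that CUDA state from one case cannot leak into the other.

```python
import subprocess
import sys

# Statements taken from the report above.
NF4_IMPORT = "from torchao.dtypes.nf4tensor import NF4Tensor"
ALLOC = "torch.empty(0, device=torch.device('cuda:0'))"


def build_snippet(with_nf4_import: bool) -> str:
    """Return the one-line program for one of the two cases (hypothetical helper)."""
    statements = ["import torch"]
    if with_nf4_import:
        # Same order as the report: import torch, import NF4Tensor, then allocate.
        statements.append(NF4_IMPORT)
    statements.append(ALLOC)
    return "; ".join(statements)


def run_case(with_nf4_import: bool) -> bool:
    """Run the snippet in a fresh interpreter; True means it exited cleanly."""
    result = subprocess.run(
        [sys.executable, "-c", build_snippet(with_nf4_import)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling `run_case(False)` and `run_case(True)` in each conda environment gives four results to compare; in the report, only the 8/31 environment with the import fails.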

@msaroufim
Member

msaroufim commented Sep 2, 2024

AFK today, but the most likely culprit is a problem in core. What I chose to do in ao for now is pin to a specific PyTorch version until we figure this out. The AO nightlies are working with a pinned version of torch. The main fishy error we saw in our CI had to do with fpx, so @jerryzh168 can confirm when he comes into work (#792).

Ideally we should fix this before making a release. cc @andrewor14
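
For anyone who wants to pin on their side in the meantime, the install commands from the original report follow a fixed pattern, so a pinned command can be generated from a date. A minimal sketch, assuming the version format and index URL shown in the report stay the same; `nightly_install_command` is a hypothetical helper, not something that ships with ao or PyTorch.

```python
# Index URL pattern copied from the install commands in the report.
NIGHTLY_INDEX = "https://download.pytorch.org/whl/nightly/{cuda}"


def nightly_install_command(package: str, base_version: str, date: str,
                            cuda: str = "cu121") -> str:
    """Build a pip command pinned to one nightly date (hypothetical helper).

    `date` is YYYYMMDD, e.g. "20240830" for the nightly reported as working.
    """
    if len(date) != 8 or not date.isdigit():
        raise ValueError(f"expected a YYYYMMDD date, got {date!r}")
    version = f"{base_version}.dev{date}+{cuda}"
    index_url = NIGHTLY_INDEX.format(cuda=cuda)
    return f"pip install --pre {package}=={version} --index-url {index_url}"


# Reproduces the two commands from the working 8/30 environment.
print(nightly_install_command("torch", "2.5.0", "20240830"))
print(nightly_install_command("torchao", "0.5.0", "20240830"))
```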


@drisspg
Contributor

drisspg commented Sep 4, 2024

pytorch/pytorch#135126
The offending PR has been reverted on main

@ebsmothers
Contributor Author

Just coming back to this now. After the revert, I think this should be good to close.
