fix(docker) rocm 6.3 based image #8152
Conversation
Thanks for the contribution - left some comments to address
The image builds from this PR, but fails to start (large traceback omitted).
This is likely due to torchvision not using the right index, though I haven't dug into it. The CUDA image is broken in a similar way though. I also rebased on
Yup, updated the pins, uv.lock, and Dockerfile to ensure it's all in sync. Please give it another try.
OK, thank you - the image builds now, but it only works on CPU. I haven't been able to get it to use the HIP device. Interestingly, this PR also balloons the image size to 56GB uncompressed - we won't be able to build it in CI. I am still fairly confident we don't need the full ROCm install in the image, but we can circle back to that. As an option, maybe keeping this as a separate ROCm Dockerfile would be a better choice for those AMD users who want to build it for themselves, and we can consolidate it in the future once we have a good working image.
So I started looking at using the amd-container-kit. It was a pain to get installed into the LXC, but once I did, the Docker container still failed. I started debugging and found the following. Using these in the entrypoint script:
I get:
So something about gosu is messing it up, or a permission is missing somewhere, because only the ubuntu user can't see the GPUs. Thoughts? Proof: I removed the gosu and just ran invokeai-web as root, and:
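The permission asymmetry described above can be sketched as a small helper. The `check_gpu_access` function and the `ubuntu` user name are illustrative (the actual entrypoint commands did not survive in this thread); ROCm reaches the GPU through `/dev/kfd` and `/dev/dri/renderD*`, and access is granted via the group that owns those device nodes:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which device nodes a given user can reach.
# ROCm needs /dev/kfd and /dev/dri/renderD*; access is granted through the
# group that owns them (typically "render" or "video").

check_gpu_access() {
  local user="$1" dev dev_group
  for dev in /dev/kfd /dev/dri/renderD*; do
    [ -e "$dev" ] || continue
    # Which group owns this device node?
    dev_group="$(stat -c '%G' "$dev")"
    # Is the user a member of that group?
    if id -nG "$user" | tr ' ' '\n' | grep -qx "$dev_group"; then
      echo "$user can access $dev (owned by group $dev_group)"
    else
      echo "$user CANNOT access $dev (owned by group $dev_group)"
    fi
  done
}
```

Running `check_gpu_access root` versus `check_gpu_access ubuntu` inside the container would surface the asymmetry seen above: root sees the GPUs while a user outside the device-owning group does not.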
…ch the host's render group ID
@ebr I figured it out: the render group within the container does not match the render group on the host. This doesn't appear to be an issue with the full-ROCm install - I bet they have it forced to a certain group number to ensure things are consistent. So I made it an env input and groupmod it in the entrypoint script. Give it a read and tell me if you think of a better way to map this.
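A minimal sketch of that entrypoint logic, assuming the env input is named `RENDER_GROUP_ID` and the in-container group is named `render` (the actual names in the PR may differ):

```shell
#!/usr/bin/env bash
# Hypothetical entrypoint fragment: align the container's "render" group with
# the host's render GID so the non-root user can open /dev/dri/renderD*.
set -euo pipefail

# RENDER_GROUP_ID is expected from the user at run time, e.g.:
#   docker run -e RENDER_GROUP_ID="$(getent group render | cut -d: -f3)" ...
if [ -n "${RENDER_GROUP_ID:-}" ]; then
  # GID of the "render" group inside the container, if it exists.
  container_gid="$(getent group render | cut -d: -f3 || true)"
  if [ -n "$container_gid" ] && [ "$container_gid" != "$RENDER_GROUP_ID" ]; then
    # Remap the container group onto the host's GID.
    groupmod -g "$RENDER_GROUP_ID" render
  fi
fi

# ...then drop privileges as before, e.g.: exec gosu ubuntu invokeai-web "$@"
```

If `RENDER_GROUP_ID` is unset the fragment is a no-op, so it stays backwards compatible for users who don't need the remap.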
More details: ROCm/ROCm-docker#90
Nice, this works on my AMD GPU after the latest updates - great work! A couple of things to take care of before we're good to merge:
… docs to use uv's built-in torch support
@ebr I found a problem. I started to look into how this would package and ship via PyPI. The dependency-group info is not included in the package metadata, along with its indexes. I converted them to extras, allowing local dev to just use the extras. So this would work for things like Docker, which builds from source all the time, but not for a pip install. I went looking and 🤦 uv has built-in support for torch! https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection so I updated the manual install docs.
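For reference, the automatic backend selection from the linked uv guide looks roughly like this (a command fragment, not run here; `rocm6.3` is shown to mirror this PR's target and should be checked against the backends uv actually supports):

```shell
# Let uv detect the system's accelerator and pick the matching PyTorch index:
uv pip install torch torchvision --torch-backend=auto

# Or pin an explicit backend, e.g. the ROCm 6.3 index:
uv pip install torch torchvision --torch-backend=rocm6.3

# The same selection can be made via an environment variable:
UV_TORCH_BACKEND=auto uv pip install torch torchvision
```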
The Invoke launcher doesn't have the capacity to use these sources/markers. The launcher attempts to provide a way to install any version of Invoke, considering this file to be the source of truth. Not all versions of Invoke will have the expected sources/markers, so we cannot rely on them. Besides not being backwards compatible, the sources/markers could inadvertently cause the launcher to install the wrong versions of things. I have some ideas to improve the install strategy more generally, but I don't have time to explore it now. Are these changes required for the docker fixes? Could we just hardcode the versions/indices in the dockerfile for the time being?
This should not be impacted; with my last change, all of the indexes are tied to extras, not the base (like the groups would have been). So for all intents and purposes the old --index=URL way will continue to work (and is essentially what the --torch-backend argument does, since the pip package does not contain the index URL - only the pyproject.toml's uv settings do). To summarize it differently:
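The backwards-compatible path described above would look something like this (illustrative only; the exact PyTorch index URL depends on the torch version being pinned):

```shell
# Old-style install, pointing pip directly at the PyTorch ROCm index:
pip install InvokeAI --extra-index-url https://download.pytorch.org/whl/rocm6.3

# Roughly equivalent effect with uv's torch integration, which reroutes the
# torch packages in the resolution to the matching index:
uv pip install InvokeAI --torch-backend=rocm6.3
```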