
fix(docker) rocm 6.3 based image #8152


Open
heathen711 wants to merge 23 commits into main

Conversation

heathen711
Contributor

Summary

  1. Fix the run script to properly read the GPU_DRIVER setting (see the sketch after this list).
  2. Clone the existing Docker build and adjust it for ROCm.
  3. Adjust docker-compose.yml to use the cloned ROCm Docker build.
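A minimal sketch of what the GPU_DRIVER selection in the run script can look like; the file names, parsing, and compose profiles here are illustrative assumptions, not the PR's exact code:

  #!/usr/bin/env bash
  set -euo pipefail
  # Read GPU_DRIVER from the .env file next to docker-compose.yml, defaulting to cuda.
  GPU_DRIVER=$(grep -E '^GPU_DRIVER=' .env | cut -d= -f2 || true)
  GPU_DRIVER=${GPU_DRIVER:-cuda}
  # Use the matching compose profile so the ROCm service (and its own Dockerfile) is built when requested.
  docker compose --profile "$GPU_DRIVER" up --build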

QA Instructions

Merge Plan

  1. Talk with the devs about speed improvements to the Docker build.
  2. Investigate whether this can be conditionalized into the original Docker build (this has issues, as the uv.lock currently only supports the CUDA/CPU environments).
  3. Test the build in the production pipeline.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@heathen711 heathen711 changed the title from fix(docker) rocm 2.4.6 based image to fix(docker) rocm 6.2.4 based image on Jul 3, 2025
@heathen711 heathen711 marked this pull request as ready for review July 3, 2025 06:03
@ebr ebr left a comment (Member)

Thanks for the contribution - left some comments to address

@github-actions github-actions bot added the Root and python-deps labels Jul 3, 2025
@heathen711 heathen711 requested a review from ebr July 3, 2025 20:09
@heathen711
Contributor Author

heathen711 commented Jul 3, 2025

  Downloaded pytorch-triton-rocm
  × Failed to download `torch==2.7.1+rocm6.3`
  ├─▶ Failed to extract archive
  ╰─▶ failed to write to file
      `/home/runner/work/_temp/setup-uv-cache/.tmpOmavep/torch/lib/hipblaslt/library/TensileLibrary_HH_SH_A_Bias_SAV_Type_HS_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx90a.co`:
      No space left on device (os error 28)
  help: `torch` (v2.7.1+rocm6.3) was included because `invokeai` depends on
        `torch`

Downloading torch (4.2 GiB) is probably the culprit... I just don't understand why it's downloading the ROCm build at all, since the default is not ROCm...

@heathen711 heathen711 requested a review from jazzhaiku as a code owner July 3, 2025 21:22
@github-actions github-actions bot added the CI-CD label Jul 3, 2025
@ebr
Member

ebr commented Jul 4, 2025

The image builds from this PR, but fails to start:

Large traceback:
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2154, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2184, in _get_module
    raise e
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2182, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 27, in <module>
    from ...image_processing_utils import ImageProcessingMixin
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_processing_utils.py", line 22, in <module>
    from .image_transforms import center_crop, normalize, rescale
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_transforms.py", line 22, in <module>
    from .image_utils import (
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_utils.py", line 59, in <module>
    from torchvision.transforms import InterpolationMode
  File "/opt/venv/lib/python3.12/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torchvision/_meta_registrations.py", line 163, in <module>
    @torch.library.register_fake("torchvision::nms")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/library.py", line 1023, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/opt/venv/lib/python3.12/site-packages/torch/library.py", line 214, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operator torchvision::nms does not exist

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/loaders/single_file_model.py", line 26, in <module>
    from .single_file_utils import (
  File "/opt/venv/lib/python3.12/site-packages/diffusers/loaders/single_file_utils.py", line 52, in <module>
    from transformers import AutoImageProcessor
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2157, in __getattr__
    raise ModuleNotFoundError(
ModuleNotFoundError: Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/__init__.py", line 1, in <module>
    from .autoencoder_asym_kl import AsymmetricAutoencoderKL
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/autoencoder_asym_kl.py", line 23, in <module>
    from .vae import DecoderOutput, DiagonalGaussianDistribution, Encoder, MaskConditionDecoder
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/vae.py", line 25, in <module>
    from ..unets.unet_2d_blocks import (
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/__init__.py", line 6, in <module>
    from .unet_2d import UNet2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/unet_2d.py", line 24, in <module>
    from .unet_2d_blocks import UNetMidBlock2D, get_down_block, get_up_block
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 36, in <module>
    from ..transformers.dual_transformer_2d import DualTransformer2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/transformers/__init__.py", line 5, in <module>
    from .auraflow_transformer_2d import AuraFlowTransformer2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/transformers/auraflow_transformer_2d.py", line 23, in <module>
    from ...loaders import FromOriginalModelMixin
  File "<frozen importlib._bootstrap>", line 1412, in _handle_fromlist
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/pipelines/pipeline_utils.py", line 47, in <module>
    from ..models import AutoencoderKL
  File "<frozen importlib._bootstrap>", line 1412, in _handle_fromlist
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.models.autoencoders.autoencoder_kl because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/bin/invokeai-web", line 10, in <module>
    sys.exit(run_app())
             ^^^^^^^^^
  File "/opt/invokeai/invokeai/app/run_app.py", line 35, in run_app
    from invokeai.app.invocations.baseinvocation import InvocationRegistry
  File "/opt/invokeai/invokeai/app/invocations/baseinvocation.py", line 41, in <module>
    from invokeai.app.services.shared.invocation_context import InvocationContext
  File "/opt/invokeai/invokeai/app/services/shared/invocation_context.py", line 18, in <module>
    from invokeai.app.services.model_records.model_records_base import UnknownModelException
  File "/opt/invokeai/invokeai/app/services/model_records/__init__.py", line 3, in <module>
    from .model_records_base import (  # noqa F401
  File "/opt/invokeai/invokeai/app/services/model_records/model_records_base.py", line 15, in <module>
    from invokeai.backend.model_manager.config import (
  File "/opt/invokeai/invokeai/backend/model_manager/__init__.py", line 3, in <module>
    from invokeai.backend.model_manager.config import (
  File "/opt/invokeai/invokeai/backend/model_manager/config.py", line 39, in <module>
    from invokeai.backend.model_manager.model_on_disk import ModelOnDisk
  File "/opt/invokeai/invokeai/backend/model_manager/model_on_disk.py", line 10, in <module>
    from invokeai.backend.model_manager.taxonomy import ModelRepoVariant
  File "/opt/invokeai/invokeai/backend/model_manager/taxonomy.py", line 14, in <module>
    ModelMixin, RawModel, torch.nn.Module, Dict[str, torch.Tensor], diffusers.DiffusionPipeline, ort.InferenceSession
                                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 811, in __getattr__
    value = getattr(module, name)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.pipelines.pipeline_utils because of the following error (look up to see its traceback):
Failed to import diffusers.models.autoencoders.autoencoder_kl because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

This is likely due to torchvision not using the right index, though I haven't dug into it. The CUDA image is broken in a similar way, though. I also rebased on main as a test to be sure, with the same result.

@heathen711
Contributor Author

The image builds from this PR, but fails to start:

(large traceback omitted)

This is likely due to torchvision not using the right index, though I haven't dug into it. The CUDA image is broken in a similar way, though. I also rebased on main as a test to be sure, with the same result.

Yup, I updated the pins, uv.lock, and Dockerfile to ensure it's all in sync. Please give it another try.

@heathen711 heathen711 changed the title from fix(docker) rocm 6.2.4 based image to fix(docker) rocm 6.3 based image on Jul 5, 2025
@ebr
Member

ebr commented Jul 7, 2025

OK, thank you - the image builds now, but it only works on CPU. I haven't been able to get it to use the HIP device, either with or without the amd runtime, with the kfd/dri devices forwarded to the pod, and using either docker-compose or plain docker run. Confirmed that the CUDA image continues working as expected, though.
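For reference, a plain docker run invocation along the lines described might look like the following; the image tag, port, and volume path are illustrative assumptions (a locally built image from this PR), not a published image:

  # Forward the ROCm devices (kfd/dri) into the container and expose the web UI.
  docker run --rm -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    -p 9090:9090 \
    -v ./invokeai-data:/invokeai \
    invokeai-rocm:local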

Interestingly, rocm-smi, amd-smi, and rocminfo all detect the GPU from inside the container, so the hardware is accessible. I'm pretty sure this has something to do with PyTorch. I'm testing this on a Radeon W7900 Pro GPU, so it could also be a "me" problem because it's not common hardware (though I don't have issues with it outside of Docker, or when using other ROCm containers). I'll play with it a bit more.

This PR also balloons the image size to 56 GB uncompressed - we won't be able to build that in CI. I am still fairly confident we don't need the full ROCm stack in the image, but we can circle back to that.

As an option, maybe keeping this as a separate ROCm Dockerfile would be a better choice for those AMD users who want to build it for themselves, and we can consolidate it in the future once we have a good working image.

@heathen711
Contributor Author

heathen711 commented Jul 9, 2025

OK, thank you - the image builds now, but it only works on CPU. [...] As an option, maybe keeping this as a separate ROCm Dockerfile would be a better choice for those AMD users who want to build it for themselves, and we can consolidate it in the future once we have a good working image.

So I started looking at using the amd-container-kit. It was a pain to get it installed into the LXC, but once I did, the Docker container still failed. I started debugging and found:

Using these in the entrypoint script:

echo "Checking ROCM device availability as root..."
python -c "import torch; print('GPU available:', torch.cuda.is_available()); print('Number of GPUs:', torch.cuda.device_count())"

echo "Checking ROCM device availability as ${USER}..."
exec gosu ${USER} python -c "import torch; print('GPU available:', torch.cuda.is_available()); print('Number of GPUs:', torch.cuda.device_count())"

I get:

Attaching to invokeai-rocm-1
invokeai-rocm-1  | Checking ROCM device availability as root...
invokeai-rocm-1  | GPU available: True
invokeai-rocm-1  | Number of GPUs: 4
invokeai-rocm-1  | Checking ROCM device availability as ubuntu...
invokeai-rocm-1  | GPU available: False
invokeai-rocm-1  | Number of GPUs: 0

So either something about gosu is interfering, or a permission is missing somewhere, because only the ubuntu user is unable to see the GPUs. Thoughts?

Proof: I removed the gosu call and just ran invokeai-web as root, and:

Attaching to invokeai-rocm-1
invokeai-rocm-1  | bitsandbytes library load error: Configured CUDA binary not found at /opt/venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm63.so
invokeai-rocm-1  | Traceback (most recent call last):
invokeai-rocm-1  |   File "/opt/venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 290, in <module>
invokeai-rocm-1  |     lib = get_native_library()
invokeai-rocm-1  |           ^^^^^^^^^^^^^^^^^^^^
invokeai-rocm-1  |   File "/opt/venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 270, in get_native_library
invokeai-rocm-1  |     raise RuntimeError(f"Configured CUDA binary not found at {cuda_binary_path}")
invokeai-rocm-1  | RuntimeError: Configured CUDA binary not found at /opt/venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm63.so
invokeai-rocm-1  | [2025-07-09 06:25:57,821]::[InvokeAI]::INFO --> Using torch device: AMD Radeon Pro V620
invokeai-rocm-1  | [2025-07-09 06:25:57,822]::[InvokeAI]::INFO --> cuDNN version: 3003000
invokeai-rocm-1  | [2025-07-09 06:25:58,221]::[InvokeAI]::INFO --> Patchmatch initialized
invokeai-rocm-1  | [2025-07-09 06:25:59,919]::[InvokeAI]::INFO --> Loading node pack invoke_bria_rmbg
invokeai-rocm-1  | [2025-07-09 06:25:59,924]::[InvokeAI]::INFO --> Loaded 1 node pack from /invokeai/nodes: invoke_bria_rmbg
invokeai-rocm-1  | [2025-07-09 06:26:00,165]::[InvokeAI]::INFO --> InvokeAI version 6.0.0rc5
invokeai-rocm-1  | [2025-07-09 06:26:00,166]::[InvokeAI]::INFO --> Root directory = /invokeai
invokeai-rocm-1  | [2025-07-09 06:26:00,166]::[InvokeAI]::INFO --> Initializing database at /invokeai/databases/invokeai.db
invokeai-rocm-1  | [2025-07-09 06:26:00,204]::[ModelManagerService]::INFO --> [MODEL CACHE] Calculated model RAM cache size: 22512.00 MB. Heuristics applied: [1, 2].
invokeai-rocm-1  | [2025-07-09 06:26:00,599]::[InvokeAI]::INFO --> Invoke running on http://0.0.0.0:9090 (Press CTRL+C to quit)

@heathen711
Copy link
Contributor Author

@ebr I figured it out: the render group ID inside the container does not match the render group ID on the host. This doesn't appear to be an issue with the full-ROCm install; I bet they force it to a fixed group number to keep things consistent. So I made it an env input and run groupmod in the entrypoint script (see the sketch below). Give it a read and tell me if you can think of a better way to map this.
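A minimal sketch of that entrypoint idea, assuming the env input is named RENDER_GROUP_ID, the image already has a render group, and the container's non-root user is ubuntu (the PR's exact variable and user names may differ):

  # If the host's render GID is provided, realign the container's render group so the
  # non-root user can still open /dev/kfd and /dev/dri/renderD* after gosu drops privileges.
  if [ -n "${RENDER_GROUP_ID:-}" ]; then
      groupmod -o -g "${RENDER_GROUP_ID}" render
      usermod -aG render,video ubuntu
  fi
  exec gosu ubuntu invokeai-web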

@heathen711
Contributor Author

In #7944, @dsisco11 and I both made changes to the pyproject.toml and the uv index settings... hopefully we don't collide...

@heathen711
Contributor Author

More details: ROCm/ROCm-docker#90

@ebr
Member

ebr commented Jul 15, 2025

Nice, this works on my AMD GPU after the latest updates - great work!
Note that it only worked for me using the amd runtime, but I didn't spend a lot of time troubleshooting. Could be the device mounts.

Couple of things to take care of before we're good to merge:

  • Remove the RENDER_GROUP_ID default value from docker-compose.yml and document that it should be set in the .env file next to GPU_DRIVER (an example is sketched after this list).
  • Verify that moving pytorch to dependency groups in pyproject.toml doesn't break our official installer. It's a good change and I think we should adopt it in the installer, but we may need to orchestrate it carefully and maybe implement it in the installer first. @psychedelicious what do you think?
  • This dependency change may also require updates to the manual install instructions.
  • The image is still 26 GB, compared to 9 GB for CUDA 😬 ... that might just be the nature of torch+rocm. We might have to skip building it in CI; we'll cross that bridge if it becomes a problem.
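For the first item, the documented .env (read by the run script and docker-compose.yml) could look roughly like this; the values below are illustrative assumptions:

  # .env next to docker-compose.yml
  GPU_DRIVER=rocm
  # Match the host's render group ID, e.g. from: getent group render | cut -d: -f3
  RENDER_GROUP_ID=109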

@github-actions github-actions bot added the docs label Jul 17, 2025
@heathen711
Contributor Author

@ebr I found a problem. I started looking into how this would package and ship via PyPI: the dependency-group info is not included in the package metadata, and neither are its indexes. From the Dependency Groups specification:

This specification defines Dependency Groups, a mechanism for storing package requirements in pyproject.toml files such that they are not included in project metadata when it is built.

I converted them to extras, allowing local dev to just use .[rocm] or .[cuda].

So that works for things like Docker, which builds from source every time, but not for a pip install of the published package.

I went looking and 🤦 uv has built-in support for torch! https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection - so I updated the manual install docs.
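To make the two install paths concrete, here are hedged examples; the extra names follow this PR, --torch-backend is the uv flag documented at the link above, and the exact invocations in the updated docs may differ:

  # Local development from a repo checkout, selecting the ROCm extra:
  uv pip install -e ".[rocm]"
  # Manual install letting uv pick the matching torch backend/index automatically:
  uv pip install invokeai --torch-backend=auto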

@psychedelicious
Collaborator

psychedelicious commented Jul 17, 2025

The Invoke launcher doesn't have the capacity to use the uv source/marker syntax at this time.

The launcher attempts to provide a way to install any version of Invoke, considering this file to be the source of truth. Not all versions of Invoke will have the expected sources/markers, so we cannot rely on them.

Besides not being backwards compatible, the sources/markers could inadvertently cause the launcher to install the wrong versions of things. I have some ideas to improve the install strategy more generally, but I don't have time to explore it now.

Are these changes required for the docker fixes? Could we just hardcode the versions/indices in the dockerfile for the time being?
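For reference, the hardcoding fallback could be as small as pinning torch against the public ROCm index in the Dockerfile's install step; this is only a sketch of that option (the pin mirrors the version seen in the logs above), not what this PR ends up doing:

  # Hypothetical fallback: pin torch/torchvision explicitly against the ROCm 6.3 index.
  uv pip install "torch==2.7.1+rocm6.3" torchvision --index-url https://download.pytorch.org/whl/rocm6.3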

@heathen711
Contributor Author

heathen711 commented Jul 17, 2025

The Invoke launcher doesn't have the capacity to use the uv source/marker syntax at this time. [...] Are these changes required for the docker fixes? Could we just hardcode the versions/indices in the dockerfile for the time being?

This should not be impacted: with my last change, all of the indexes are tied to extras, not to the base dependencies (as they would have been with groups).

So for all intents and purposes the old --index=URL way will continue to work (and is essentially what the --torch-backend argument does, since the pip package does not contain the index URL; only the pyproject.toml's uv settings do).

To summarize it differently:

  1. --index=URL is still supported, but it's more complex IMO.
  2. --torch-backend is easier to use IMO, and it handles the ROCm tags correctly. The index approach had issues with 6.2.4 when I first started this whole investigation, because you need the same ROCm version as is installed on the system (something CUDA doesn't seem to care about as much), so --torch-backend=auto is nice. Note: if I install ROCm 6.2.4 on my Ubuntu host and Invoke wants to use 6.3, there are runtime issues, so the system ROCm version and the Invoke ROCm version must match. This is why I want the dockerfile-rocm-full for my rig: the Docker image then contains everything and my bare-metal version doesn't matter. Having said that, I need to add this to the install instructions somewhere...
  3. pyproject.toml has been updated to include the indexes, bound to extras (the explicit = true part), so that uv.lock understands the different install configs and their dependencies (which normally conflict).
  4. The Dockerfile uses the uv lock file to ensure the same packages are downloaded and installed on each run (it also skips the slower requirement-resolution step). This also means the Dockerfile does not need to know the indexes, so it is less prone to misaligned versions (now only the docs, pins.json, and pyproject.toml need to be updated). See the sketch after this list.
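As a rough illustration of item 4, the Dockerfile's install step can drive everything from the lock file; the exact flags and extra name here are assumptions, not the PR's literal Dockerfile:

  # Install strictly from uv.lock (no re-resolution), selecting the ROCm extra; the
  # resolved ROCm wheels recorded in the lock are used, so no index URLs are needed here.
  uv sync --frozen --no-dev --extra rocm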

Labels: CI-CD, docker, docs, python-deps, Root