
fix(docker) rocm 6.3 based image #8152


Open
heathen711 wants to merge 23 commits into main

Conversation

heathen711
Contributor

Summary

  1. Fix the run script to properly read the GPU_DRIVER setting (see the sketch after this list).
  2. Clone the existing Docker build and adjust it for ROCm.
  3. Adjust docker-compose.yml to use the cloned ROCm Docker build.
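A minimal sketch of what the GPU_DRIVER selection in the run script can look like; the file names, parsing, and compose profiles here are illustrative assumptions, not the PR's exact code:

  #!/usr/bin/env bash
  set -euo pipefail
  # Read GPU_DRIVER from the .env file next to docker-compose.yml, defaulting to cuda.
  GPU_DRIVER=$(grep -E '^GPU_DRIVER=' .env | cut -d= -f2 || true)
  GPU_DRIVER=${GPU_DRIVER:-cuda}
  # Use the matching compose profile so the ROCm service (and its own Dockerfile) is built when requested.
  docker compose --profile "$GPU_DRIVER" up --build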

QA Instructions

Merge Plan

  1. Talk with the devs about speed improvements to the Docker build.
  2. Investigate whether this can be conditionalized into the original Docker build (this has issues, as the uv.lock currently only supports the CUDA/CPU environments).
  3. Test the build in the production pipeline.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@heathen711 heathen711 changed the title from fix(docker) rocm 2.4.6 based image to fix(docker) rocm 6.2.4 based image on Jul 3, 2025
@heathen711 heathen711 marked this pull request as ready for review July 3, 2025 06:03
@ebr ebr left a comment (Member)

Thanks for the contribution - left some comments to address

@github-actions github-actions bot added the Root and python-deps labels Jul 3, 2025
@heathen711 heathen711 requested a review from ebr July 3, 2025 20:09
@heathen711
Contributor Author

heathen711 commented Jul 3, 2025

  Downloaded pytorch-triton-rocm
  × Failed to download `torch==2.7.1+rocm6.3`
  ├─▶ Failed to extract archive
  ╰─▶ failed to write to file
      `/home/runner/work/_temp/setup-uv-cache/.tmpOmavep/torch/lib/hipblaslt/library/TensileLibrary_HH_SH_A_Bias_SAV_Type_HS_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx90a.co`:
      No space left on device (os error 28)
  help: `torch` (v2.7.1+rocm6.3) was included because `invokeai` depends on
        `torch`

Downloading torch (4.2 GiB) is probably the culprit... I just don't understand why it's downloading the ROCm build at all, since the default is not ROCm...

@heathen711 heathen711 requested a review from jazzhaiku as a code owner July 3, 2025 21:22
@github-actions github-actions bot added the CI-CD label Jul 3, 2025
@ebr
Member

ebr commented Jul 4, 2025

The image builds from this PR, but fails to start:

Large traceback:
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2154, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2184, in _get_module
    raise e
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2182, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 27, in <module>
    from ...image_processing_utils import ImageProcessingMixin
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_processing_utils.py", line 22, in <module>
    from .image_transforms import center_crop, normalize, rescale
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_transforms.py", line 22, in <module>
    from .image_utils import (
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_utils.py", line 59, in <module>
    from torchvision.transforms import InterpolationMode
  File "/opt/venv/lib/python3.12/site-packages/torchvision/__init__.py", line 10, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils  # usort:skip
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torchvision/_meta_registrations.py", line 163, in <module>
    @torch.library.register_fake("torchvision::nms")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/library.py", line 1023, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/opt/venv/lib/python3.12/site-packages/torch/library.py", line 214, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operator torchvision::nms does not exist

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/loaders/single_file_model.py", line 26, in <module>
    from .single_file_utils import (
  File "/opt/venv/lib/python3.12/site-packages/diffusers/loaders/single_file_utils.py", line 52, in <module>
    from transformers import AutoImageProcessor
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 2157, in __getattr__
    raise ModuleNotFoundError(
ModuleNotFoundError: Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/__init__.py", line 1, in <module>
    from .autoencoder_asym_kl import AsymmetricAutoencoderKL
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/autoencoder_asym_kl.py", line 23, in <module>
    from .vae import DecoderOutput, DiagonalGaussianDistribution, Encoder, MaskConditionDecoder
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/autoencoders/vae.py", line 25, in <module>
    from ..unets.unet_2d_blocks import (
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/__init__.py", line 6, in <module>
    from .unet_2d import UNet2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/unet_2d.py", line 24, in <module>
    from .unet_2d_blocks import UNetMidBlock2D, get_down_block, get_up_block
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 36, in <module>
    from ..transformers.dual_transformer_2d import DualTransformer2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/transformers/__init__.py", line 5, in <module>
    from .auraflow_transformer_2d import AuraFlowTransformer2DModel
  File "/opt/venv/lib/python3.12/site-packages/diffusers/models/transformers/auraflow_transformer_2d.py", line 23, in <module>
    from ...loaders import FromOriginalModelMixin
  File "<frozen importlib._bootstrap>", line 1412, in _handle_fromlist
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 820, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/opt/venv/lib/python3.12/site-packages/diffusers/pipelines/pipeline_utils.py", line 47, in <module>
    from ..models import AutoencoderKL
  File "<frozen importlib._bootstrap>", line 1412, in _handle_fromlist
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.models.autoencoders.autoencoder_kl because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/bin/invokeai-web", line 10, in <module>
    sys.exit(run_app())
             ^^^^^^^^^
  File "/opt/invokeai/invokeai/app/run_app.py", line 35, in run_app
    from invokeai.app.invocations.baseinvocation import InvocationRegistry
  File "/opt/invokeai/invokeai/app/invocations/baseinvocation.py", line 41, in <module>
    from invokeai.app.services.shared.invocation_context import InvocationContext
  File "/opt/invokeai/invokeai/app/services/shared/invocation_context.py", line 18, in <module>
    from invokeai.app.services.model_records.model_records_base import UnknownModelException
  File "/opt/invokeai/invokeai/app/services/model_records/__init__.py", line 3, in <module>
    from .model_records_base import (  # noqa F401
  File "/opt/invokeai/invokeai/app/services/model_records/model_records_base.py", line 15, in <module>
    from invokeai.backend.model_manager.config import (
  File "/opt/invokeai/invokeai/backend/model_manager/__init__.py", line 3, in <module>
    from invokeai.backend.model_manager.config import (
  File "/opt/invokeai/invokeai/backend/model_manager/config.py", line 39, in <module>
    from invokeai.backend.model_manager.model_on_disk import ModelOnDisk
  File "/opt/invokeai/invokeai/backend/model_manager/model_on_disk.py", line 10, in <module>
    from invokeai.backend.model_manager.taxonomy import ModelRepoVariant
  File "/opt/invokeai/invokeai/backend/model_manager/taxonomy.py", line 14, in <module>
    ModelMixin, RawModel, torch.nn.Module, Dict[str, torch.Tensor], diffusers.DiffusionPipeline, ort.InferenceSession
                                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 811, in __getattr__
    value = getattr(module, name)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 810, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/diffusers/utils/import_utils.py", line 822, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import diffusers.pipelines.pipeline_utils because of the following error (look up to see its traceback):
Failed to import diffusers.models.autoencoders.autoencoder_kl because of the following error (look up to see its traceback):
Failed to import diffusers.loaders.single_file_model because of the following error (look up to see its traceback):
Could not import module 'AutoImageProcessor'. Are this object's requirements defined correctly?

This is likely due to torchvision not using the right index, though I haven't dug into it. The CUDA image is broken in a similar way, though. I also rebased on main as a test to be sure, with the same result.

@heathen711
Contributor Author

The image builds from this PR, but fails to start:

(large traceback omitted)

This is likely due to torchvision not using the right index, though I haven't dug into it. The CUDA image is broken in a similar way, though. I also rebased on main as a test to be sure, with the same result.

Yup, I updated the pins, uv.lock, and Dockerfile to ensure it's all in sync. Please give it another try.

@heathen711 heathen711 changed the title from fix(docker) rocm 6.2.4 based image to fix(docker) rocm 6.3 based image on Jul 5, 2025
@ebr
Member

ebr commented Jul 7, 2025

OK, thank you - the image builds now, but it only works on CPU. I haven't been able to get it to use the HIP device, either with or without the amd runtime, with the kfd/dri devices forwarded to the pod, and using either docker-compose or plain docker run. Confirmed that the CUDA image continues working as expected, though.
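For reference, a plain docker run invocation along the lines described might look like the following; the image tag, port, and volume path are illustrative assumptions (a locally built image from this PR), not a published image:

  # Forward the ROCm devices (kfd/dri) into the container and expose the web UI.
  docker run --rm -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    -p 9090:9090 \
    -v ./invokeai-data:/invokeai \
    invokeai-rocm:local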

Interestingly, rocm-smi, amd-smi, and rocminfo all detect the GPU from inside the container, so the hardware is accessible. I'm pretty sure this has something to do with PyTorch. I'm testing this on a Radeon W7900 Pro GPU, so it could also be a "me" problem because it's not common hardware (though I don't have issues with it outside of Docker, or when using other ROCm containers). I'll play with it a bit more.

This PR also balloons the image size to 56 GB uncompressed - we won't be able to build that in CI. I am still fairly confident we don't need the full ROCm stack in the image, but we can circle back to that.

As an option, maybe keeping this as a separate ROCm Dockerfile would be a better choice for those AMD users who want to build it for themselves, and we can consolidate it in the future once we have a good working image.

@heathen711
Contributor Author

heathen711 commented Jul 9, 2025

OK, thank you - the image builds now, but it only works on CPU. [...] As an option, maybe keeping this as a separate ROCm Dockerfile would be a better choice for those AMD users who want to build it for themselves, and we can consolidate it in the future once we have a good working image.

So I started looking at using the amd-container-kit. It was a pain to get it installed into the LXC, but once I did, the Docker container still failed. I started debugging and found:

Using these in the entrypoint script:

echo "Checking ROCM device availability as root..."
python -c "import torch; print('GPU available:', torch.cuda.is_available()); print('Number of GPUs:', torch.cuda.device_count())"

echo "Checking ROCM device availability as ${USER}..."
exec gosu ${USER} python -c "import torch; print('GPU available:', torch.cuda.is_available()); print('Number of GPUs:', torch.cuda.device_count())"

I get:

Attaching to invokeai-rocm-1
invokeai-rocm-1  | Checking ROCM device availability as root...
invokeai-rocm-1  | GPU available: True
invokeai-rocm-1  | Number of GPUs: 4
invokeai-rocm-1  | Checking ROCM device availability as ubuntu...
invokeai-rocm-1  | GPU available: False
invokeai-rocm-1  | Number of GPUs: 0

So either something about gosu is interfering, or a permission is missing somewhere, because only the ubuntu user is unable to see the GPUs. Thoughts?

Proof: I removed the gosu call and just ran invokeai-web as root, and:

Attaching to invokeai-rocm-1
invokeai-rocm-1  | bitsandbytes library load error: Configured CUDA binary not found at /opt/venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm63.so
invokeai-rocm-1  | Traceback (most recent call last):
invokeai-rocm-1  |   File "/opt/venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 290, in <module>
invokeai-rocm-1  |     lib = get_native_library()
invokeai-rocm-1  |           ^^^^^^^^^^^^^^^^^^^^
invokeai-rocm-1  |   File "/opt/venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 270, in get_native_library
invokeai-rocm-1  |     raise RuntimeError(f"Configured CUDA binary not found at {cuda_binary_path}")
invokeai-rocm-1  | RuntimeError: Configured CUDA binary not found at /opt/venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm63.so
invokeai-rocm-1  | [2025-07-09 06:25:57,821]::[InvokeAI]::INFO --> Using torch device: AMD Radeon Pro V620
invokeai-rocm-1  | [2025-07-09 06:25:57,822]::[InvokeAI]::INFO --> cuDNN version: 3003000
invokeai-rocm-1  | [2025-07-09 06:25:58,221]::[InvokeAI]::INFO --> Patchmatch initialized
invokeai-rocm-1  | [2025-07-09 06:25:59,919]::[InvokeAI]::INFO --> Loading node pack invoke_bria_rmbg
invokeai-rocm-1  | [2025-07-09 06:25:59,924]::[InvokeAI]::INFO --> Loaded 1 node pack from /invokeai/nodes: invoke_bria_rmbg
invokeai-rocm-1  | [2025-07-09 06:26:00,165]::[InvokeAI]::INFO --> InvokeAI version 6.0.0rc5
invokeai-rocm-1  | [2025-07-09 06:26:00,166]::[InvokeAI]::INFO --> Root directory = /invokeai
invokeai-rocm-1  | [2025-07-09 06:26:00,166]::[InvokeAI]::INFO --> Initializing database at /invokeai/databases/invokeai.db
invokeai-rocm-1  | [2025-07-09 06:26:00,204]::[ModelManagerService]::INFO --> [MODEL CACHE] Calculated model RAM cache size: 22512.00 MB. Heuristics applied: [1, 2].
invokeai-rocm-1  | [2025-07-09 06:26:00,599]::[InvokeAI]::INFO --> Invoke running on http://0.0.0.0:9090 (Press CTRL+C to quit)

@heathen711
Copy link
Contributor Author

@ebr I figured it out: the render group ID inside the container does not match the render group ID on the host. This doesn't appear to be an issue with the full-ROCm install; I bet they force it to a fixed group number to keep things consistent. So I made it an env input and run groupmod in the entrypoint script (see the sketch below). Give it a read and tell me if you can think of a better way to map this.
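A minimal sketch of that entrypoint idea, assuming the env input is named RENDER_GROUP_ID, the image already has a render group, and the container's non-root user is ubuntu (the PR's exact variable and user names may differ):

  # If the host's render GID is provided, realign the container's render group so the
  # non-root user can still open /dev/kfd and /dev/dri/renderD* after gosu drops privileges.
  if [ -n "${RENDER_GROUP_ID:-}" ]; then
      groupmod -o -g "${RENDER_GROUP_ID}" render
      usermod -aG render,video ubuntu
  fi
  exec gosu ubuntu invokeai-web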

@heathen711
Contributor Author

In #7944, @dsisco11 and I both made changes to the pyproject.toml and the uv index settings... hopefully we don't collide...

@heathen711
Contributor Author

More details: ROCm/ROCm-docker#90

@ebr
Member

ebr commented Jul 15, 2025

Nice, this works on my AMD GPU after the latest updates - great work!
Note that it only worked for me using the amd runtime, but I didn't spend a lot of time troubleshooting. Could be the device mounts.

Couple of things to take care of before we're good to merge:

  • Remove the RENDER_GROUP_ID default value from docker-compose.yml and document that it should be set in the .env file next to GPU_DRIVER (an example is sketched after this list).
  • Verify that moving pytorch to dependency groups in pyproject.toml doesn't break our official installer. It's a good change and I think we should adopt it in the installer, but we may need to orchestrate it carefully and maybe implement it in the installer first. @psychedelicious what do you think?
  • This dependency change may also require updates to the manual install instructions.
  • The image is still 26 GB, compared to 9 GB for CUDA 😬 ... that might just be the nature of torch+rocm. We might have to skip building it in CI; we'll cross that bridge if it becomes a problem.
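For the first item, the documented .env (read by the run script and docker-compose.yml) could look roughly like this; the values below are illustrative assumptions:

  # .env next to docker-compose.yml
  GPU_DRIVER=rocm
  # Match the host's render group ID, e.g. from: getent group render | cut -d: -f3
  RENDER_GROUP_ID=109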

@github-actions github-actions bot added the docs label Jul 17, 2025
@heathen711
Contributor Author

@ebr I found a problem. I started looking into how this would package and ship via PyPI: the dependency-group info is not included in the package metadata, and neither are its indexes. From the Dependency Groups specification:

This specification defines Dependency Groups, a mechanism for storing package requirements in pyproject.toml files such that they are not included in project metadata when it is built.

I converted them to extras, allowing local dev to just use .[rocm] or .[cuda].

So that works for things like Docker, which builds from source every time, but not for a pip install of the published package.

I went looking and 🤦 uv has built-in support for torch! https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection - so I updated the manual install docs.
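To make the two install paths concrete, here are hedged examples; the extra names follow this PR, --torch-backend is the uv flag documented at the link above, and the exact invocations in the updated docs may differ:

  # Local development from a repo checkout, selecting the ROCm extra:
  uv pip install -e ".[rocm]"
  # Manual install letting uv pick the matching torch backend/index automatically:
  uv pip install invokeai --torch-backend=auto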

@psychedelicious
Collaborator

psychedelicious commented Jul 17, 2025

The Invoke launcher doesn't have the capacity to use the uv source/marker syntax at this time.

The launcher attempts to provide a way to install any version of Invoke, considering this file to be the source of truth. Not all versions of Invoke will have the expected sources/markers, so we cannot rely on them.

Besides not being backwards compatible, the sources/markers could inadvertently cause the launcher to install the wrong versions of things. I have some ideas to improve the install strategy more generally, but I don't have time to explore it now.

Are these changes required for the docker fixes? Could we just hardcode the versions/indices in the dockerfile for the time being?
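For reference, the hardcoding fallback could be as small as pinning torch against the public ROCm index in the Dockerfile's install step; this is only a sketch of that option (the pin mirrors the version seen in the logs above), not what this PR ends up doing:

  # Hypothetical fallback: pin torch/torchvision explicitly against the ROCm 6.3 index.
  uv pip install "torch==2.7.1+rocm6.3" torchvision --index-url https://download.pytorch.org/whl/rocm6.3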

@heathen711
Contributor Author

heathen711 commented Jul 17, 2025

The Invoke launcher doesn't have the capacity to use the uv source/marker syntax at this time. [...] Are these changes required for the docker fixes? Could we just hardcode the versions/indices in the dockerfile for the time being?

This should not be impacted: with my last change, all of the indexes are tied to extras, not to the base dependencies (as they would have been with groups).

So for all intents and purposes the old --index=URL way will continue to work (and is essentially what the --torch-backend argument does, since the pip package does not contain the index URL; only the pyproject.toml's uv settings do).

To summarize it differently:

  1. --index=URL is still supported, but it's more complex IMO.
  2. --torch-backend is easier to use IMO, and it handles the ROCm tags correctly. The index approach had issues with 6.2.4 when I first started this whole investigation, because you need the same ROCm version as is installed on the system (something CUDA doesn't seem to care about as much), so --torch-backend=auto is nice. Note: if I install ROCm 6.2.4 on my Ubuntu host and Invoke wants to use 6.3, there are runtime issues, so the system ROCm version and the Invoke ROCm version must match. This is why I want the dockerfile-rocm-full for my rig: the Docker image then contains everything and my bare-metal version doesn't matter. Having said that, I need to add this to the install instructions somewhere...
  3. pyproject.toml has been updated to include the indexes, bound to extras (the explicit = true part), so that uv.lock understands the different install configs and their dependencies (which normally conflict).
  4. The Dockerfile uses the uv lock file to ensure the same packages are downloaded and installed on each run (it also skips the slower requirement-resolution step). This also means the Dockerfile does not need to know the indexes, so it is less prone to misaligned versions (now only the docs, pins.json, and pyproject.toml need to be updated). See the sketch after this list.
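As a rough illustration of item 4, the Dockerfile's install step can drive everything from the lock file; the exact flags and extra name here are assumptions, not the PR's literal Dockerfile:

  # Install strictly from uv.lock (no re-resolution), selecting the ROCm extra; the
  # resolved ROCm wheels recorded in the lock are used, so no index URLs are needed here.
  uv sync --frozen --no-dev --extra rocm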

Labels: CI-CD, docker, docs, python-deps, Root