
gguf-split: split and merge gguf per batch of tensors #6135

Merged
phymbert merged 5 commits into ggerganov:master from hp/feature/gguf-split on Mar 19, 2024

Conversation

phymbert
Collaborator

@phymbert phymbert commented Mar 18, 2024

Motivation

Distributing and storing GGUF files is difficult for 13B+ models, especially in f16. Many issues can happen during file transfers, for example:

  • temporary disk full
  • network interruption

Typically, they need to be transferred from Hugging Face to an internal storage such as S3, MinIO, Git LFS, Nexus, or Artifactory, then downloaded by the inference server and stored locally (or on a Kubernetes PVC, for example). They also cannot be stored in a Dockerfile, but IMHO that is for the best.

This PR introduces a gguf-split CLI to ease splitting a GGUF into multiple files and merging them back into one.

Examples:

  • --split
gguf-split --split --split-tensors-size 128 ggml-model-q4_0.gguf /tmp/ggml-out-q4_0-2

gguf_split: ggml-model-q4_0.gguf -> /tmp/ggml-out-q4_0-2-00001-of-00003.gguf (128 tensors per file)
split_start: /tmp/ggml-out-q4_0-2-00001-of-00003.gguf ...done
split_start: /tmp/ggml-out-q4_0-2-00002-of-00003.gguf ...done
split_start: /tmp/ggml-out-q4_0-2-00003-of-00003.gguf ...done
gguf_split: 3 gguf split written with a total of 325 tensors.
  • --merge
gguf-split --merge /tmp/ggml-out-q4_0-2-00001-of-00003.gguf /tmp/ggml-out-q4_0-2-merge.gguf

gguf_merge: /tmp/ggml-out-q4_0-2-00001-of-00003.gguf -> /tmp/ggml-out-q4_0-2-merge.gguf
gguf_merge: reading metadata /tmp/ggml-out-q4_0-2-00001-of-00003.gguf ...done
gguf_merge: reading metadata /tmp/ggml-out-q4_0-2-00002-of-00003.gguf ...done
gguf_merge: reading metadata /tmp/ggml-out-q4_0-2-00003-of-00003.gguf ...done
gguf_merge: writing tensors /tmp/ggml-out-q4_0-2-00001-of-00003.gguf ...done
gguf_merge: writing tensors /tmp/ggml-out-q4_0-2-00002-of-00003.gguf ...done
gguf_merge: writing tensors /tmp/ggml-out-q4_0-2-00003-of-00003.gguf ...done
gguf_merge: /tmp/ggml-out-q4_0-2-merge.gguf merged from 3 split with 325 tensors.

References

Notes

If this approach is accepted, we can later adapt llama_load_model_from_file and llama_load_model_from_url to support the general.split_count KV in GGUF.
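For illustration only (not code from this PR), a minimal sketch of how a loader could read such a KV with the existing gguf API; the u16 value type, the helper name and the default of 1 are assumptions:

```cpp
// Sketch only: read a hypothetical general.split_count KV with the gguf API.
// The value type (u16) and default of 1 are assumptions, not what this PR ships.
#include "ggml.h"

static int get_split_count(const char * fname) {
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (ctx == NULL) {
        return -1; // not a valid GGUF file
    }
    int split_count = 1; // no KV present: a regular, single-part GGUF
    const int kid = gguf_find_key(ctx, "general.split_count");
    if (kid >= 0) {
        split_count = gguf_get_val_u16(ctx, kid);
    }
    gguf_free(ctx);
    return split_count;
}
```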

This first implementation uses neither mmap nor copy_file_range I/O.

The only split strategy supported at the moment is --split-max-tensors N, which creates split GGUFs containing at most N tensors each, regardless of their byte size. Another split strategy based on a maximum file size can be introduced later.
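As an illustration of this strategy (not the PR's exact code), the grouping boils down to a ceiling division over the tensor count; with the 325-tensor example above and N = 128 it produces the three files shown earlier:

```cpp
// Sketch of the --split-max-tensors grouping: how many files and which tensor
// ranges go into each. Numbers follow the 325-tensor example above.
#include <algorithm>
#include <cstdio>

int main() {
    const int n_tensors = 325;             // tensors in the input GGUF
    const int max_tensors_per_split = 128; // --split-max-tensors N

    // ceiling division: number of split files to write
    const int n_split = (n_tensors + max_tensors_per_split - 1) / max_tensors_per_split;

    for (int i = 0; i < n_split; i++) {
        const int first = i * max_tensors_per_split;
        const int last  = std::min(first + max_tensors_per_split, n_tensors); // exclusive
        printf("split %05d-of-%05d: tensors [%d, %d)\n", i + 1, n_split, first, last);
    }
    return 0;
}
```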

@phymbert phymbert added demo Demonstrate some concept or idea, not intended to be merged need feedback Testing and feedback with results are needed labels Mar 18, 2024
@Artefact2
Collaborator

Interesting approach. I think allowing splitting by file size would be more intuitive (and usually more appropriate, since file size is usually the limiting factor, e.g. 4 GB for FAT or 50 GB for HF).

The current code also makes the workflow a bit awkward with a lot of extra writes. It shouldn't be too hard to call copy_file_range() or ioctl(FICLONERANGE) on supported systems, or, as an alternative, to add the splitting logic directly to the tools that produce GGUFs, like convert.py and quantize.
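For reference, a rough sketch of what such a call could look like on Linux (glibc >= 2.27); the helper name, offsets and length are placeholders, not code from this PR, and a plain read/write fallback would still be needed on other systems:

```cpp
// Sketch only: copy a byte range (e.g. a tensor-data region) between two open
// files with copy_file_range, letting the kernel avoid a userspace round trip.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <unistd.h>
#include <cstdio>

static int copy_range(int fd_in, int fd_out, off64_t src_off, off64_t dst_off, size_t len) {
    while (len > 0) {
        const ssize_t n = copy_file_range(fd_in, &src_off, fd_out, &dst_off, len, 0);
        if (n <= 0) {
            perror("copy_file_range"); // e.g. ENOSYS/EXDEV: fall back to read()/write()
            return -1;
        }
        len -= (size_t) n; // offsets were advanced by the kernel via the pointers
    }
    return 0;
}
```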

Collaborator

@ngxson ngxson left a comment

Thanks for looking into this feature. Your PR overall LGTM, just don't forget to update the Makefile.

This would be useful for my wllama, since loading 5-10 MB chunks in parallel will be faster in a web environment. So I'm looking forward to the implementation in llama_model_loader.

The syscalls that @Artefact2 proposed can be implemented in a v2 of this PR, I think; for now it's already a good start for testing whether the modification to llama_model_loader works.

@phymbert
Collaborator Author

> Interesting approach. I think allowing splitting by file size would be more intuitive (and usually more appropriate, since file size is usually the limiting factor, e.g. 4 GB for FAT or 50 GB for HF).
>
> The current code also makes the workflow a bit awkward with a lot of extra writes. It shouldn't be too hard to call copy_file_range() or ioctl(FICLONERANGE) on supported systems, or, as an alternative, to add the splitting logic directly to the tools that produce GGUFs, like convert.py and quantize.

Thanks. You cannot exactly predict the size of a GGUF because tensor sizes vary, and we want each split to be a valid GGUF (i.e. not truncated, as in your example) so that llama_model can later support tensors distributed across multiple GGUFs. But I agree file size is more intuitive; we might introduce a --split-max-size strategy later on. Feel free to implement it once this first implementation is merged.
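For illustration, a hypothetical --split-max-size boundary computation (not implemented in this PR): whole tensors are never cut, so each split stays a valid GGUF and only the file sizes are approximate:

```cpp
// Hypothetical --split-max-size planning: start a new split when the next tensor
// would push the current file past the limit. A single tensor larger than the
// limit still gets its own split. Not code from this PR.
#include <cstddef>
#include <vector>

static std::vector<size_t> plan_splits(const std::vector<size_t> & tensor_bytes, size_t max_bytes) {
    std::vector<size_t> tensors_per_split; // tensor count for each output file
    size_t cur_bytes   = 0;
    size_t cur_tensors = 0;
    for (const size_t nbytes : tensor_bytes) {
        if (cur_tensors > 0 && cur_bytes + nbytes > max_bytes) {
            tensors_per_split.push_back(cur_tensors); // close the current split
            cur_bytes   = 0;
            cur_tensors = 0;
        }
        cur_bytes   += nbytes;
        cur_tensors += 1;
    }
    if (cur_tensors > 0) {
        tensors_per_split.push_back(cur_tensors);
    }
    return tensors_per_split;
}
```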

@phymbert
Collaborator Author

@ggerganov Hi Georgi, can I merge and continue on common?

examples/gguf-split/gguf-split.cpp: review comment (outdated, resolved)
@phymbert phymbert removed demo Demonstrate some concept or idea, not intended to be merged need feedback Testing and feedback with results are needed labels Mar 19, 2024
@phymbert phymbert changed the title proposal: gguf-split: split and merge gguf per batch of tensors gguf-split: split and merge gguf per batch of tensors Mar 19, 2024
@phymbert phymbert merged commit d0d5de4 into ggerganov:master Mar 19, 2024
29 of 53 checks passed
@phymbert phymbert deleted the hp/feature/gguf-split branch March 19, 2024 11:05
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* gguf-split: split and merge gguf files per tensor

* gguf-split: build with make toolchain

* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split

* split : minor style + fix compile warnings

* gguf-split: remove --upload not implemented

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024