
Bug: unknown pre-tokenizer type: 'mistral-bpe' when running the new Mistral-Nemo model #493

Open
wingenlit opened this issue Jul 19, 2024 · 5 comments


@wingenlit

Contact Details

No response

What happened?

Hi there, I have just attempted to run the new Mistral-Nemo with llamafile on a GGUF file quantized with llama.cpp b3405. It failed with the error unknown pre-tokenizer type: 'mistral-bpe' (log shown below). Is there a replacement string I can pass with --override-kv tokenizer.ggml.pre=str:{some_tokenizer_type_here}, or should I just wait for a future version?

./llamafile-0.8.9 --cli -m /mnt/Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --temp 0.2 -p "write something here:" -ngl 999 --no-display-prompt
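For reference, the kind of override I was thinking of would look roughly like this (the unknown tokenizer type is left as a placeholder):

./llamafile-0.8.9 --cli -m /mnt/Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --override-kv tokenizer.ggml.pre=str:{some_tokenizer_type_here} --temp 0.2 -p "write something here:" -ngl 999 --no-display-prompt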

Thanks in advance.

Version

llamafile v0.8.9

What operating system are you seeing the problem on?

Linux, Windows

Relevant log output

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'mistral-bpe'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/Mistral-Nemo-Instruct-2407-Q4_K_M.gguf'
@raymondllu

Got exactly the same issue when loading the Mistral-Nemo-2407 model using LM Studio, which is also based on llama.cpp. Waiting for the fix!

Not sure if the issue reported in ggerganov/llama.cpp#8577 is related, btw.

@jart
Collaborator

jart commented Jul 22, 2024

We're excited about Nemo too. Once support is implemented upstream, we naturally intend to find a way to incorporate it here.

@wingenlit
Author

UPDATE

llama.cpp added support for Mistral-Nemo in version b3436 onwards, so llamafile should be able to pick it up in an upcoming update.

For information only: as a result, some earlier GGUF checkpoints quantized with forked versions of llama.cpp might not work with the latest llama.cpp. The GGUF I am using (thanks to bartowski) is tested and working; repos from others will likely be updated soon.
P.S. The default context size for Mistral-Nemo is huge at 128k; it tricked me into thinking a memory leak had happened the first time. I'd advise starting with a smaller context, e.g. --ctx-size 10000, and then raising it until VRAM is adequately used, as in the example below.
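A rough first-run sketch (same model path and flags as my command above; the context value is just a starting point):

./llamafile-0.8.9 --cli -m /mnt/Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --ctx-size 10000 --temp 0.2 -p "write something here:" -ngl 999 --no-display-prompt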

@jart
Collaborator

jart commented Jul 23, 2024

I can't cherry-pick ggerganov/llama.cpp@50e05353e88d50b644688caa91f5955e8bdb9eb9 because the code it touches has had a considerable amount of churn upstream recently. It'll have to wait until the next full synchronization with upstream. Right now I'm focused primarily on developing a new server. Contributions are welcome on backporting Nemo support. I know this feature is important too, so @stlhood should probably chime in on where our priorities should be. Upstream has also been making problematic changes to ggml-cuda lately that prevent us from using it the way it's written, since upstream refused our request to add #ifdef statements that would make syncing simpler by disabling features that significantly increase code size.

@jart jart reopened this Jul 23, 2024
@wingenlit
Author

Sorry about closing the issue earlier without the inside knowledge. Will wait for the problem to be resolved.
