Description
I'm trying to make long stories using a llama.cpp model (`guanaco-33B.ggmlv3.q4_0.bin` in my case) with oobabooga/text-generation-webui.
It works for short inputs, but it stops working once the number of input tokens comes close to the context size (2048).
After playing with the webui for a bit (you can count the input tokens and change `max_new_tokens` on the main page), I found that the behavior is as follows:
- If `nb_input_tokens + max_new_tokens < context_size`, it works correctly.
- If `nb_input_tokens < context_size` but `nb_input_tokens + max_new_tokens > context_size`, it fails silently, generating 0 tokens:
  ```
  Output generated in 0.25 seconds (0.00 tokens/s, 0 tokens, ...
  ```
- If `nb_input_tokens > context_size`, it fails with:
  ```
  llama_tokenize: too many tokens
  llama_tokenize: too many tokens
  llama_tokenize: too many tokens
  Output generated in 0.28 seconds (0.00 tokens/s, 0 tokens, ...
  ```
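
For reference, the same boundary can be hit directly with llama-cpp-python, without the webui. Below is a minimal sketch, assuming the model file is in the working directory and using a placeholder prompt; it only relies on `Llama`, `tokenize`, and the completion call from llama-cpp-python:

```python
# Minimal sketch to reproduce the boundary behavior with llama-cpp-python
# (model path and prompt are placeholders for my actual setup).
from llama_cpp import Llama

llm = Llama(model_path="./guanaco-33B.ggmlv3.q4_0.bin", n_ctx=2048)

prompt = "Once upon a time, " * 400  # long enough to approach the context size
max_new_tokens = 200

nb_input_tokens = len(llm.tokenize(prompt.encode("utf-8")))
print(f"input tokens: {nb_input_tokens}, requested new tokens: {max_new_tokens}")

# Works only while nb_input_tokens + max_new_tokens < 2048;
# past that point it produces 0 tokens (or the llama_tokenize error above).
out = llm(prompt, max_tokens=max_new_tokens)
print(out["choices"][0]["text"])
```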
I've seen issue #92 of llama-cpp-python, but it is closed and I'm on a recent version of llama-cpp-python (release 0.1.57).
llama-cpp-python should probably discard some input tokens at the beginning of the prompt so that the input fits inside the context and lets us continue long stories.
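
In the meantime, a caller-side workaround along these lines seems plausible: tokenize the prompt, drop the oldest tokens so that the remainder plus `max_new_tokens` fits inside the context, and detokenize before generating. This is only a sketch; the helper name and the hard-coded numbers are mine, not part of the llama-cpp-python API:

```python
# Rough caller-side workaround: keep only the most recent tokens so that
# nb_input_tokens + max_new_tokens fits inside the context window.
# The helper name and default values are illustrative, not a library API.
def truncate_prompt(llm, prompt: str, max_new_tokens: int, n_ctx: int = 2048) -> str:
    tokens = llm.tokenize(prompt.encode("utf-8"))
    keep = n_ctx - max_new_tokens
    if len(tokens) > keep:
        tokens = tokens[-keep:]  # discard the beginning of the story
    return llm.detokenize(tokens).decode("utf-8", errors="ignore")

# Usage:
#   prompt = truncate_prompt(llm, long_story, max_new_tokens=200)
#   out = llm(prompt, max_tokens=200)
```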