
Working with long stories #307

Open

Description

@leszekhanusz

I'm trying to make long stories using a llama.cpp model (guanaco-33B.ggmlv3.q4_0.bin in my case) with oobabooga/text-generation-webui.

It works for short inputs, but it stops working once the number of input tokens gets close to the context size (2048).

After playing around with the webui a bit (you can count input tokens and modify max_new_tokens on the main page), I found that the behavior is as follows (see also the sketch after the error output below):

If nb_input_tokens + max_new_tokens < context_size, then it works correctly.
If nb_input_tokens < context_size but nb_input_tokens + max_new_tokens > context_size, then it fails silently, generating 0 tokens:

Output generated in 0.25 seconds (0.00 tokens/s, 0 tokens, ...

If nb_input_tokens > context_size, then it fails with:

llama_tokenize: too many tokens
llama_tokenize: too many tokens
llama_tokenize: too many tokens
Output generated in 0.28 seconds (0.00 tokens/s, 0 tokens, ...
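In other words, the three cases come down to simple arithmetic on the token counts. A minimal sketch of that check (the function name and the default context size are only illustrative, not part of the webui or llama-cpp-python):

```python
def classify_request(nb_input_tokens: int, max_new_tokens: int, context_size: int = 2048) -> str:
    """Reproduce the three observed cases from the token counts alone."""
    if nb_input_tokens > context_size:
        # The prompt alone no longer fits: llama_tokenize refuses it outright.
        return "fails with 'llama_tokenize: too many tokens'"
    if nb_input_tokens + max_new_tokens > context_size:
        # The prompt fits, but there is no room left for generation: 0 tokens come back.
        return "fails silently (0 tokens generated)"
    return "works correctly"
```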

I've seen issue #92 of llama-cpp-python, but it is closed and I'm on a recent version of llama-cpp-python (release 0.1.57).

llama-cpp-python should probably discard some tokens at the beginning of the input so that the prompt fits inside the context window and we can continue long stories.
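For illustration, here is a rough sketch of the kind of trimming I mean, done on the caller's side against the high-level 0.1.x API. The model path, context size, and helper name are placeholders, and I'm assuming tokenize()/detokenize() operate on bytes as in the current Llama class:

```python
from llama_cpp import Llama

CTX_SIZE = 2048
llm = Llama(model_path="guanaco-33B.ggmlv3.q4_0.bin", n_ctx=CTX_SIZE)  # placeholder path

def generate_with_sliding_window(prompt: str, max_new_tokens: int = 256):
    # Tokenize the full prompt (the high-level API takes and returns bytes).
    tokens = llm.tokenize(prompt.encode("utf-8"))

    # Drop tokens from the beginning so that prompt + generation budget still
    # fits inside the context window; keep a small margin for BOS tokens and
    # re-tokenization drift after detokenizing.
    budget = CTX_SIZE - max_new_tokens - 8
    if len(tokens) > budget:
        tokens = tokens[-budget:]

    # Turn the trimmed tokens back into text and generate as usual.
    trimmed_prompt = llm.detokenize(tokens).decode("utf-8", errors="ignore")
    return llm(trimmed_prompt, max_tokens=max_new_tokens)
```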

Metadata

Labels

bug (Something isn't working), duplicate (This issue or pull request already exists), oobabooga (https://github.com/oobabooga/text-generation-webui)
