Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve summary chunking #397

Merged
merged 1 commit into from
Apr 30, 2024

Conversation

MichaelClifford
Copy link
Collaborator

@MichaelClifford MichaelClifford commented Apr 30, 2024

Document chunking in summarizer.py was still failing with some models on some texts. (There were some model vocabularies that caused the model server to fail to return the detokenized data if specific token sequences were split into separate chunks by the chunking process).

To address this issue, we've implemented a new approach to chunking. Instead of chunking the raw tokens, we will chunk the text, like in our earlier implementation, and rely on extras/tokenize/count to confirm that the text chunk is within the token limit, if not, we split the text chunk in two.

I've manually evaluated this with the problematic and non-problemtatic models. It appears to work as well as before, and avoids any of the detokenization crashes we were experiencing.

Signed-off-by: Michael Clifford <mcliffor@redhat.com>
Copy link
Collaborator

@Gregory-Pereira Gregory-Pereira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm, but I had some questions. I also haven't tested this, but if you are confident it works we can ignore that concern

Copy link
Collaborator

@Gregory-Pereira Gregory-Pereira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@MichaelClifford MichaelClifford merged commit 0b16817 into containers:main Apr 30, 2024
1 check passed
sallyom pushed a commit to sallyom/locallm that referenced this pull request May 1, 2024
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants