
Avoid hardcoding a space at the beginning of the prompt. #1315

Closed
wants to merge 1 commit

Conversation

@ivanstepanovftw (Collaborator) commented May 4, 2023

Added in #242 without a strong rationale.
Users can already insert a space manually at the beginning of the prompt if desired.

For example, I cannot get rid of token 15629 -> ' Manager'; I want it to be 3260 -> 'Manager':

main: prompt: ' Manager's Persona: Manager I work with in my company.
Manager: I am waiting.'
main: number of tokens in prompt = 22
     1 -> ''
 15629 -> ' Manager'
 29915 -> '''
 29879 -> 's'
  5196 -> ' Person'
 29874 -> 'a'
 29901 -> ':'
 15629 -> ' Manager'
   306 -> ' I'
   664 -> ' work'
   411 -> ' with'
   297 -> ' in'
   590 -> ' my'
  5001 -> ' company'
 29889 -> '.'
    13 -> '
'
  3260 -> 'Manager'
 29901 -> ':'
   306 -> ' I'
   626 -> ' am'
 10534 -> ' waiting'
 29889 -> '.'
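For context, the hardcoded prefix being discussed amounts to something like the following. This is a minimal sketch, not the actual llama.cpp code, and the function name is made up:

```cpp
#include <string>

// Sketch of the behavior in question: the main example unconditionally
// prepends a single space to the user prompt before tokenization, so
// "Manager" tokenizes as " Manager" (id 15629) rather than "Manager" (3260).
std::string add_leading_space(const std::string &prompt) {
    return " " + prompt;
}
```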

@DannyDaemonic (Collaborator)

I could be wrong about what's happening here, but I think with OpenLLaMA the BOS token is a lot more important: #1291

@ivanstepanovftw (Collaborator, Author)

I am trying pygmalion-7b model, which prompt looks like this example:

Assistant's Persona: Assistant is a highly intelligent language model trained to comply with user requests.
<START>
Assistant: Hello! How may I help you today?
You: What is Zork?
Assistant:

@slaren (Collaborator) commented May 4, 2023

The rationale is just duplicating what the SentencePiece tokenizer does.

@Green-Sky (Collaborator)

The rationale was that the LLaMA models were trained with a prefixed space. I agree that not every model has that requirement, but this was done to make it easier for users without that knowledge.

@ivanstepanovftw (Collaborator, Author)

Oh, I see:

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .
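That behavior is SentencePiece's `add_dummy_prefix` normalizer: a space is prepended and every space is mapped to the meta symbol ▁ (U+2581) before segmentation. A rough sketch of the idea, not SentencePiece's actual implementation:

```cpp
#include <string>

// Rough sketch of SentencePiece normalization with add_dummy_prefix=true:
// prepend one space, then replace every space with the "▁" meta symbol.
std::string sp_normalize(const std::string &text) {
    std::string with_prefix = " " + text;   // the dummy prefix
    std::string out;
    for (char c : with_prefix) {
        if (c == ' ') {
            out += "\xE2\x96\x81";          // UTF-8 encoding of U+2581 '▁'
        } else {
            out += c;
        }
    }
    return out;
}
```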

@ivanstepanovftw (Collaborator, Author)

This is called a dummy prefix: google/sentencepiece#282
Closing, as LLaMA and its derivatives use the default tokenizer settings.

@ivanstepanovftw ivanstepanovftw deleted the space branch May 4, 2023 14:39
@ggerganov (Owner)

You can put this functionality behind a cmd arg.
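A hypothetical sketch of that suggestion follows; the flag name `--no-prefix-space` and the `Params` struct are assumptions for illustration, not the option llama.cpp actually added:

```cpp
#include <cstring>
#include <string>

// Sketch: gate the leading space behind a command-line flag instead of
// hardcoding it. Defaulting prefix_space to true preserves current behavior.
struct Params {
    bool prefix_space = true;
    std::string prompt;
};

// Returns true if the argument was recognized and consumed.
bool parse_arg(Params &p, const char *arg) {
    if (std::strcmp(arg, "--no-prefix-space") == 0) {
        p.prefix_space = false;
        return true;
    }
    return false;
}

std::string build_prompt(const Params &p) {
    return p.prefix_space ? " " + p.prompt : p.prompt;
}
```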
