Add llama_beam_search(). #2267

Merged: 8 commits from add_beam_search merged into ggerganov:master on Aug 25, 2023

Conversation

@mattpulver (Collaborator) commented Jul 18, 2023

Related issue: #1392

This is an initial attempt at beam search. It does appear to work as intended, insofar as it generates higher-quality deterministic responses.

Currently, execution time slows down noticeably as the beams' token vectors grow. I'm going to see if ingesting the common trunk into the shared llama_context improves this, and if so I will move this PR out of draft mode.

Thoughts/feedback are welcome.

@SlyEcho (Sponsor Collaborator) commented Jul 18, 2023

This is just going to use up all memory, especially on the GPU. We have new models now like LLaMA2 with 4096 token context and some others with even 8192. The KV cache could be gigabytes.

Is it not possible to just save the tokens+n_past info in the beams?
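A minimal sketch of what that per-beam state could look like (illustrative only; these names are not the PR's actual types):

// Illustrative only: each beam stores just its generated tokens plus bookkeeping,
// while a single shared llama_context (and its KV cache) holds the evaluated prefix.
#include <vector>
#include "llama.h"

struct beam {
    std::vector<llama_token> tokens;  // tokens generated along this beam
    float p = 1.0f;                   // cumulative, renormalized beam probability
};

struct beam_search_state {
    int n_past = 0;                   // tokens already evaluated in the shared context
    std::vector<beam> beams;          // pruned back to n_beams entries each step
};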

@mattpulver (Collaborator, Author)

Is it not possible to just save the tokens+n_past info in the beams?

Thank you, that's the kind of hint/reassurance I was looking for. That does seem to work, and avoids making any copies of llama_context. Prior changes to llama-util.h have been reverted and the original PR description above has been edited to reflect the current state of the PR.

I'm noticing that as the beams grow, so does the time to get the next token for each beam. Since the beams tend to converge (that is, share a common prefix vector of token_ids), the common trunk can be "ingested" into the shared llama_context, advancing the n_past value, and I'm guessing this might improve the execution time.

I'll try that next and if that works I'll move this PR out of draft mode.

@ggerganov (Owner)

Is it not possible to just save the tokens+n_past info in the beams?

Alternatively, we can save n_past + a partial KV cache for each beam, which should be relatively small.

In any case - this change is welcome.
In the long run, I'm hoping that we will be able to implement efficient beam-search inference utilizing batched inference.

@mattpulver (Collaborator, Author)

Is it not possible to just save the tokens+n_past info in the beams?

Alternatively, we can save n_past + partial KV cache for each beam which should be relatively small

Thanks for the tip @ggerganov. I tried copying the k and v tensors in llama_kv_cache using ggml_dup_tensor, while sharing a common ggml_context* ctx.

This wasn't sufficient to satisfy the memory requirements, e.g. ggml_new_tensor_impl: not enough space in the context's memory pool (needed 134615856, available 10485760), which continued to increase as the beams grew.

Does ggml_context also need to be deep-copied for each beam? I backed out of that rabbit hole since it is defined in ggml.c, which is pure C, and adding copy/move constructors/assignment operators seemed to go against the spirit of the project.

What are your thoughts on this?

@ggerganov (Owner) commented Jul 28, 2023

Does ggml_context also need to be deep-copied for each beam?

For sure no.

Not sure about all the details. The closest thing we currently have is the beam-search decoder in whisper.cpp:

https://github.com/ggerganov/whisper.cpp/blob/4774d2feb01a772a15de81ffc34b34a1f294f020/whisper.cpp#L4389-L4423

Though it does not solve the problem of having N times the full KV cache.

I think there should be a "common" KV cache that corresponds to n_past tokens and is shared by all beams. And then, each beam has its own "partial" KV cache for the new generated tokens. For each eval, you can memcpy the "partial" KV cache for the current beam at the end of the "common" KV cache and do the eval.
Or something along these lines - not sure.
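A very rough sketch of that layout, assuming a flat per-token KV buffer (the real k/v tensors are split and may live in VRAM, so the helper names and the plain memcpy here are assumptions, not actual ggml API usage):

// Illustrative only: a "common" KV region for the shared n_past tokens plus a small
// per-beam "partial" region that is copied in just before evaluating that beam.
#include <cstdint>
#include <cstring>
#include <vector>

struct beam_kv {
    std::vector<uint8_t> partial;  // KV entries for this beam's extra tokens
    int n_extra = 0;               // number of tokens covered by `partial`
};

// kv_buf:             start of the context's KV storage (hypothetical flat view)
// kv_bytes_per_token: hypothetical size of one token's K+V entries
void prepare_beam_eval(uint8_t * kv_buf, size_t kv_bytes_per_token,
                       int n_past, const beam_kv & beam) {
    // Overwrite the slots right after the common prefix with this beam's partial cache,
    // then llama_eval() can be called with past length n_past + beam.n_extra.
    std::memcpy(kv_buf + (size_t) n_past * kv_bytes_per_token,
                beam.partial.data(),
                (size_t) beam.n_extra * kv_bytes_per_token);
}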

@mattpulver (Collaborator, Author)

A separate examples/beam_search/beam_search.cpp main program, currently hard-coded to use n_beams=2 beams, has been added to make this easier to test:

Usage: bin/beam_search MODEL_PATH [PROMPT]

I could use assistance on the following items, in order of highest priority to least:

  • Determine how best to use each beam's llama_kv_cache object. It seems that the amount of time taken by llama_eval() grows significantly as the length of the beams (i.e. token vectors) grow. There should be a smarter way of managing the memory. Might the use of ctx_guidance in examples/main/main.cpp provide a clue?
  • Find the correct mathematical function of n_beams for proper preallocation of memory. Memory usage appears to grow faster than O(n_beams).

@bullno1 (Contributor) commented Aug 1, 2023

Might the use of ctx_guidance in examples/main/main.cpp provide a clue?

For the most part, the ctx_guidance has a separate kv cache on a separate llama_context because it starts with a different negative prompt.

Also, I used the public llama_eval API and not batch eval, because:

  • I don't know how to batch eval
  • AFAIK, the KV cache is backend-dependent: CUDA allocates a separate VRAM buffer for it, so memcpy on the RAM buffer alone is not enough, and then there are other backends. So I opted to avoid interacting with the cache directly.
  • It's just one extra eval, not n extra ones like beam search, so it's probably not that bad.

Since all beams share the same prefix, it sounds like sharing the prefix of the cache with some sort of "split cache" is possible, but it's probably hard to satisfy all backends.
I don't know if #2239 would make it easier once it's done.

Alternatively, use the same cache and then sequentially eval each beam, lol. The speed would probably be awful though.
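For reference, a sketch of that sequential, shared-cache approach with the llama_eval() signature of the time (each beam simply re-evaluates its suffix after the common n_past, overwriting whatever the previous beam left in the cache):

// Brute-force sketch: one shared context; each beam re-ingests its own tokens
// after the shared prefix before its logits are read. Slow, but needs no cache copies.
#include <vector>
#include "llama.h"

void eval_beams_sequentially(llama_context * ctx,
                             std::vector<std::vector<llama_token>> & beams,
                             int n_past, int n_threads) {
    for (auto & beam_tokens : beams) {
        // KV entries beyond n_past are simply overwritten by the next beam's eval.
        llama_eval(ctx, beam_tokens.data(), (int) beam_tokens.size(), n_past, n_threads);
        const float * logits = llama_get_logits(ctx);
        // ...rank candidate next tokens for this beam from `logits` here.
        (void) logits;
    }
}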

@mattpulver (Collaborator, Author)

This is getting closer. The main bottleneck appears to be calling llama_eval() as n_tokens increases, which each beam requires. This is true whether or not beam search is being used, and it accounts for most of the increased execution time. If that performance can be improved, beam search should likewise improve.

This can be run/tested using examples/beam_search/beam_search.cpp.

After integrating this into examples/main/main.cpp I'll take this PR out of draft mode.

Thanks @bullno1 for explaining the logic re: ctx_guidance.

@bullno1 (Contributor) commented Aug 2, 2023

Warning, a lot of brain dump, might not be coherent.

On efficiency

I feel like there is a lot of cache invalidation with how the common prefix keeps getting re-evaluated: https://github.com/ggerganov/llama.cpp/pull/2267/files#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR3071

If I understand this correctly, if we have n beams there should only be at most n KV caches at a time, because after expansion we limit the list back to n beams anyway.
And each expansion only adds one token at a time, so they can all reuse the same cache.

If we utilize the cache of the original context, only n - 1 extra caches need to be created.

This sounds possible with the current public API.

If n_beam = 2, it has the same memory overhead as CFG anyway.
At n_beam = 3 it's only somewhat worse, and on a 4090 with a 13b q4 model you can afford quite a handful of caches.

On API design

I think it should return a span of tokens instead of text:

struct llama_beam {
    size_t length;
    llama_token* tokens;
};

We can add a llama_beam_free to clean up the temp allocation of the result.

Also, since this runs until completion or until it reaches a limit, I have no idea how to even "stream" the result.
The "next" token is never decided until the entire search completes.
So it probably won't integrate that well into the main example unless we are in chat mode.

Talking about chat, we often need an "antiprompt" stop condition which would take a bit more work to support.
This is to prevent the model from speaking for the "user" role too.
It also has other applications like stopping generation to execute an external tool in a chain-of-thought scenario.

We would need something like a predicate on the beam search:

typedef bool (*llama_beam_predicate_fn_t)(const struct llama_beam * beam, void * userdata);

This is actually general enough for both an antiprompt and a length limit.

Putting it all together I think the API should be:

struct llama_beam {
    size_t length;
    const llama_token* tokens;
};

typedef bool (*llama_beam_predicate_fn_t)(const struct llama_beam * beam, void * userdata);

struct llama_beam_stop_condition {
    llama_beam_predicate_fn_t fn;
    void * userdata;
};

LLAMA_API const struct llama_beam * llama_beam_search(
    struct llama_context * ctx,
    int n_past,
    int n_beams,
    struct llama_beam_stop_condition stop_condition,
    int n_threads
);

// Stop after a number of tokens has been generated
// userdata is a uintptr_t which is the number of tokens
LLAMA_API bool llama_beam_stop_at_n_tokens(const struct llama_beam * beam, void * userdata);

// Stop after a token is encountered such as eos
// userdata is a uintptr_t which is the token id
LLAMA_API bool llama_beam_stop_at_token(const struct llama_beam * beam, void * userdata);

// Stop at a suffix string for anti-prompt
// userdata is a char* pointing to a null-terminated string
LLAMA_API bool llama_beam_stop_at_suffix(const struct llama_beam * beam, void * userdata);

// Logically "OR" all conditions
struct llama_beam_stop_condition_one_of {
    int num_conditions;
    struct llama_beam_stop_condition * conditions;
};

// When any of the conditions are met
// userdata is a struct llama_stop_condition_one_of*
LLAMA_API bool llama_beam_stop_at_one_of(const struct llama_beam * beam, void * userdata);

// Free the returned beam
LLAMA_API void llama_beam_free(struct llama_beam * beam);

On cache reusing

Say n_beams=3.

At first we have 3 beams:

  • A @ cache 0
  • B @ cache 1
  • C @ cache 2

Then we expand each by 3:

  • AA @ cache 0
  • AB @ cache 0
  • AC @ cache 0
  • BA @ cache 1
  • BB @ cache 1
  • BC @ cache 1
  • CA @ cache 2
  • CB @ cache 2
  • CC @ cache 2

Prune this list down to top 3:

  • AA @ cache 0
  • AB @ cache 0
  • CC @ cache 2

Since both AA and AB are on 0, we need to split:

  • AA @ cache 0
  • AB @ cache 1: copy cache 0 into 1
  • CC @ cache 2

I think ggml_cpy can do this even for GPU-backed cache as long as both tensors are on GPU.
Need someone to confirm this.
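A sketch of the bookkeeping for that prune-and-split step, assuming each beam records which of the n_beams cache slots it currently extends (the actual KV copy, e.g. via ggml_cpy, is left as a placeholder):

// Illustrative only: after pruning to the top n_beams beams, give each surviving
// beam a distinct cache slot; the first beam on a slot keeps it, later ones get a
// free slot and a copy of the parent's KV data.
#include <vector>
#include "llama.h"

struct pruned_beam {
    std::vector<llama_token> tokens;
    int cache_slot;  // which of the n_beams KV caches this beam currently extends
};

void assign_cache_slots(std::vector<pruned_beam> & beams, int n_slots) {
    std::vector<bool> owned(n_slots, false);
    std::vector<int>  need_copy;                // beams that must move to a fresh slot
    for (int i = 0; i < (int) beams.size(); ++i) {
        if (!owned[beams[i].cache_slot]) {
            owned[beams[i].cache_slot] = true;  // e.g. AA keeps cache 0, CC keeps cache 2
        } else {
            need_copy.push_back(i);             // e.g. AB also sits on cache 0
        }
    }
    for (int i : need_copy) {
        for (int s = 0; s < n_slots; ++s) {
            if (!owned[s]) {
                owned[s] = true;
                // copy_kv_cache(/*from=*/beams[i].cache_slot, /*to=*/s);  // e.g. via ggml_cpy
                beams[i].cache_slot = s;
                break;
            }
        }
    }
}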

@mattpulver force-pushed the add_beam_search branch 3 times, most recently from 91d65a8 to fbbf0eb on August 8, 2023 at 14:33
@mattpulver (Collaborator, Author) commented Aug 8, 2023

Also, since this runs until completion or until it reaches limit, I have no idea how to even "stream" the result.
The "next" token is never decided until the entire search completes.

Actually, it does happen every once in a while that all beams partially converge to share a common token prefix of, say, m tokens. Whenever this happens, after the next call to llama_eval(), the m tokens are shifted off of each beam and n_past += m;. This results in an immediate speedup in llama_eval(). This is also one way to "stream" the result, and it is made available via the callback as described below.
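A sketch of that convergence/shift step (helper names are made up; this is not the PR's exact code):

// Find the longest token prefix shared by all beams; if non-empty, it can be
// treated as "decided", handed to the caller, shifted off every beam, and
// absorbed into the shared context by advancing n_past.
#include <vector>
#include "llama.h"

size_t common_prefix_length(const std::vector<std::vector<llama_token>> & beams) {
    if (beams.empty()) return 0;
    size_t n = beams[0].size();
    for (const auto & b : beams) {
        size_t i = 0;
        while (i < n && i < b.size() && b[i] == beams[0][i]) { ++i; }
        n = i;
    }
    return n;
}

void shift_common_prefix(std::vector<std::vector<llama_token>> & beams,
                         std::vector<llama_token> & response, int & n_past) {
    const size_t m = common_prefix_length(beams);
    if (m == 0) return;
    response.insert(response.end(), beams[0].begin(), beams[0].begin() + m);
    for (auto & b : beams) {
        b.erase(b.begin(), b.begin() + m);  // shift m tokens off each beam
    }
    n_past += (int) m;  // the shared context now accounts for these tokens
}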

On API design

Based on your API suggestions @bullno1 I've added a callback type:

typedef void (*llama_beam_search_callback_fn_t)(void* callback_state, llama_beams_state);

This is used for example:

// Put here anything you want back in beam_search_callback().
struct beam_search_callback_state {
    llama_context* ctx;
    std::vector<llama_token>* response;
};
...
std::vector<llama_token> response;
beam_search_callback_state callback_state{ctx, &response};
llama_beam_search(ctx, beam_search_callback, &callback_state, beam_width, n_past, n_predict, params.n_threads);

The idea is that beam_search_callback(callback_state, beams_state), custom-defined by the caller, is invoked on each iteration in which the beams grow. The callback can view all of the beams and can synchronously decide whether to continue or stop, choose a beam to collapse to, etc., by reading/setting beams_state. More details to come.

This seems to cover the use cases you mentioned. These struct definitions can be extended to cover future beam evolution control features.
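To make the flow concrete, here is a hedged sketch of what such a caller-defined callback might look like, continuing the snippet above. The field names assumed on llama_beams_state and llama_beam_view (n_beams, n_tokens, common_prefix_length, eob) are taken from this discussion and may not match the merged header exactly:

// Sketch only: mark beams that end in EOS, and collect tokens that all beams agree on
// (the common prefix), since those will be shifted off the beams before the next callback.
void beam_search_callback(void * callback_state, llama_beams_state beams_state) {
    auto & state = *static_cast<beam_search_callback_state *>(callback_state);
    for (size_t i = 0; i < beams_state.n_beams; ++i) {
        llama_beam_view & bv = beams_state.beam_views[i];
        if (!bv.eob && bv.n_tokens > 0 && bv.tokens[bv.n_tokens - 1] == llama_token_eos()) {
            bv.eob = true;  // this beam is done; the search stops extending it
        }
    }
    if (const size_t n = beams_state.common_prefix_length) {
        // These n tokens are now common to every beam: append them to the response.
        const llama_token * toks = beams_state.beam_views[0].tokens;
        state.response->insert(state.response->end(), toks, toks + n);
    }
}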

On efficiency + cache reuse

I did an experiment in which each beam keeps its own kv_cache (llama_kv_cache was made copyable+moveable) and swaps its kv_cache with ctx->kv_self before and after each call to llama_eval():

std::swap(ctx->kv_self, beam.kv_cache);
llama_eval(ctx, beam.tokens.data(), beam.tokens.size(), n_past, n_threads);
std::swap(ctx->kv_self, beam.kv_cache);

However, this appeared to produce gibberish (due to either a bug in the code or the kv_cache being used incorrectly). Is this how you see the beam-specific kv_cache being used, or am I not thinking about this right?

Questions

  • Is it necessary to integrate beam_search in with examples/main/main.cpp in this PR? Or can this be done in a follow-up PR? A running example is currently available in examples/beam_search/beam_search.cpp. (Uncommenting the line //params.n_gpu_layers = 200; is recommended for good performance.)
  • When running w/ GPU the performance is quite decent. Can/should the kv_cache integration also be done in a follow-up PR?

Thanks again for the helpful feedback.


Edits made on Aug 11: Replace llama_beam_search_control struct w/ reading/setting beams_state.

@mattpulver force-pushed the add_beam_search branch 2 times, most recently from db9657c to b989365 on August 16, 2023 at 15:41
@mattpulver marked this pull request as ready for review on August 23, 2023 at 15:54
@mattpulver (Collaborator, Author) commented Aug 23, 2023

Ready-for-review Notes

This can be tested using the included examples/beam_search/beam_search.cpp main program. Example:

$ bin/beam_search ~/models/open-llama-7B-open-instruct.ggmlv3.q4_0.gguf 2 "### Request:\nWhat is the capital of France?\n\n### Response:\n"
...
The capital of France is Paris.

Additionally, there is a llama-cpp-python beam_search branch that was working prior to the recent breaking changes/improvements related to the new GGUF format; I intend to open a PR for it once this PR gets merged.

The helpful stop-condition suggestions above are satisfied by a more general callback function that allows the API caller to examine the state of all beams and flag when a beam should be marked end-of-sentence.

The concerns about caching are addressed by the fact that beams often converge to a common token prefix subvector. When this happens, the callback is notified and the common prefix should be stored at that time, as the following iteration will shift it off of all beams, thereby reducing the token vector lengths. Having said this, there are likely additional optimizations w.r.t. beam token caching that are suitable for follow-up PRs. When run with CUBLAS, the qualitative experience is quite satisfactory.

@ggerganov (Owner) left a comment

Interesting work - I'm still trying to understand the details.

Some minor style changes requested and we can look into merging.
Also, make sure that the implementation in llama.cpp is ordered correctly. I see it is after the "grammar" stuff, but in the header the declarations are after the "sampling" stuff. So either move the beam search definitions after the sampling or update the header to match the order.

We'll also have to figure out some simple tests for the CI in order to keep this functionality working in the long run. We can do in separate PR.

llama.h Outdated
@@ -476,6 +476,39 @@ extern "C" {
/// @details Accepts the sampled token into the grammar
LLAMA_API void llama_grammar_accept_token(struct llama_context * ctx, struct llama_grammar * grammar, llama_token token);

struct llama_beam_view {
llama_token const* tokens;

Suggested change
llama_token const* tokens;
llama_token const * tokens;

llama.h Outdated
// (e.g. beams[0]) as they will be removed (shifted) from all beams in all subsequent callbacks.
// These pointers are valid only during the synchronous callback, so should not be saved.
struct llama_beams_state {
llama_beam_view* beam_views;

Suggested change
llama_beam_view* beam_views;
llama_beam_view * beam_views;

llama.h Outdated
/// @details Deterministically returns entire sentence constructed by a beam search.
/// @param ctx Pointer to the llama_context.
/// @param callback Invoked for each iteration of the beam_search loop, passing in beams_state.
/// The return beam_search_control can be used to control the beam_search execution.

beam_search_control should probably be changed to eos flag ?

llama.h Outdated
/// @param n_past Number of tokens already evaluated.
/// @param n_predict Maximum number of tokens to predict. EOS may occur earlier.
/// @param n_threads Number of threads as passed to llama_eval().
LLAMA_API void llama_beam_search(struct llama_context * ctx, llama_beam_search_callback_fn_t callback, void* callback_data, size_t n_beams, int n_past, int n_predict, int n_threads);

Suggested change
LLAMA_API void llama_beam_search(struct llama_context * ctx, llama_beam_search_callback_fn_t callback, void* callback_data, size_t n_beams, int n_past, int n_predict, int n_threads);
LLAMA_API void llama_beam_search(struct llama_context * ctx, llama_beam_search_callback_fn_t callback, void * callback_data, size_t n_beams, int n_past, int n_predict, int n_threads);

llama.cpp Outdated
};

// A struct for calculating logit-related info.
struct logit_info {

Suggested change
struct logit_info {
struct llama_logit_info {

llama.cpp Outdated
Comment on lines 3396 to 4388
std::vector<llama_token_data> top_k(size_t k) {
std::vector<llama_token_data> min_heap; // min-heap by logit
llama_token const k_min = std::min(static_cast<llama_token>(k), n_vocab);
min_heap.reserve(k_min);
for (llama_token token_id=0 ; token_id<k_min ; ++token_id) {
min_heap.push_back(get_token_data(token_id));
}
auto comp = [](llama_token_data const& a, llama_token_data const& b) { return a.logit > b.logit; };
std::make_heap(min_heap.begin(), min_heap.end(), comp);
for (llama_token token_id=k_min ; token_id<n_vocab ; ++token_id) {
if (min_heap.front().logit < logits[token_id]) {
std::pop_heap(min_heap.begin(), min_heap.end(), comp);
min_heap.back().id = token_id;
min_heap.back().logit = logits[token_id];
std::push_heap(min_heap.begin(), min_heap.end(), comp);
}
}

Style change examples - please apply to the rest of the non-server code:

Suggested change
std::vector<llama_token_data> top_k(size_t k) {
std::vector<llama_token_data> min_heap; // min-heap by logit
llama_token const k_min = std::min(static_cast<llama_token>(k), n_vocab);
min_heap.reserve(k_min);
for (llama_token token_id=0 ; token_id<k_min ; ++token_id) {
min_heap.push_back(get_token_data(token_id));
}
auto comp = [](llama_token_data const& a, llama_token_data const& b) { return a.logit > b.logit; };
std::make_heap(min_heap.begin(), min_heap.end(), comp);
for (llama_token token_id=k_min ; token_id<n_vocab ; ++token_id) {
if (min_heap.front().logit < logits[token_id]) {
std::pop_heap(min_heap.begin(), min_heap.end(), comp);
min_heap.back().id = token_id;
min_heap.back().logit = logits[token_id];
std::push_heap(min_heap.begin(), min_heap.end(), comp);
}
}
std::vector<llama_token_data> top_k(size_t k) {
std::vector<llama_token_data> min_heap; // min-heap by logit
const llama_token k_min = std::min(static_cast<llama_token>(k), n_vocab);
min_heap.reserve(k_min);
for (llama_token token_id = 0; token_id < k_min ; ++token_id) {
min_heap.push_back(get_token_data(token_id));
}
auto comp = [](const llama_token_data & a, const llama_token_data & b) { return a.logit > b.logit; };
std::make_heap(min_heap.begin(), min_heap.end(), comp);
for (llama_token token_id = k_min; token_id < n_vocab; ++token_id) {
if (min_heap.front().logit < logits[token_id]) {
std::pop_heap(min_heap.begin(), min_heap.end(), comp);
min_heap.back().id = token_id;
min_heap.back().logit = logits[token_id];
std::push_heap(min_heap.begin(), min_heap.end(), comp);
}
}

llama.cpp Outdated
}
};

struct beam_search {

Suggested change
struct beam_search {
struct llama_beam_search {

@@ -3354,6 +3354,253 @@ void llama_grammar_accept_token(struct llama_context * ctx, struct llama_grammar
ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
}

struct llama_beam {

Suggested change
struct llama_beam {
//
// beam search
//
struct llama_beam {

llama.h Outdated
// Type of pointer to the beam_search_callback function.
// void* callback_data is any custom data passed to llama_beam_search, that is subsequently
// passed back to beam_search_callback. This avoids having to use global variables in the callback.
typedef void (*llama_beam_search_callback_fn_t)(void* callback_data, llama_beams_state);

Suggested change
typedef void (*llama_beam_search_callback_fn_t)(void* callback_data, llama_beams_state);
typedef void (*llama_beam_search_callback_fn_t)(void * callback_data, llama_beams_state);

@@ -476,6 +476,39 @@ extern "C" {
/// @details Accepts the sampled token into the grammar
LLAMA_API void llama_grammar_accept_token(struct llama_context * ctx, struct llama_grammar * grammar, llama_token token);

struct llama_beam_view {

//
// Beam search
//

Suggested change
struct llama_beam_view {
struct llama_beam_view {

@ggerganov (Owner)

Do I understand correctly that I can also use the eos flag to limit the beam length? For example, I can set it to true if the number of tokens in the beam exceeds 8. I guess this could be a valid use case for short segments of beam search where we prevent the beams from growing too long.

If that is correct, maybe we should rename the flag to something more descriptive, like beam_end or eob?
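For illustration, a callback along those lines (hedged sketch, reusing the assumed field names from the earlier callback sketch) could cap every beam at 8 tokens:

// Hypothetical length cap: flag a beam as finished once it reaches 8 tokens.
void beam_length_cap_callback(void * callback_state, llama_beams_state beams_state) {
    (void) callback_state;  // no per-call state needed for a pure length cap
    for (size_t i = 0; i < beams_state.n_beams; ++i) {
        llama_beam_view & bv = beams_state.beam_views[i];
        if (!bv.eob && bv.n_tokens >= 8) {
            bv.eob = true;  // the flag later renamed from eos to eob in this thread
        }
    }
}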

@mattpulver (Collaborator, Author)

Changes

  • Add // Beam search heading to llama.{h,cpp} after llama_grammar_accept_token(). abe0829
  • Add space around * pointers and & references. 9bedaf4
  • Add spaces around comparison and assignment operators. e46a8b5
  • Prefer west const. 93daad7
  • Use llama_ prefix for structs in global namespace. fa33614
  • Delete obsolete comment from an earlier revision. b619cfc
  • Change eos to eob in llama_beam and llama_beam_view structs. 5fa1ea2

Responses

Thanks for the feedback.

I wasn't able to rename struct beam_search to struct llama_beam_search due to a conflict with the function of the same name, so I named it struct llama_beam_search_data instead. Let me know if different names are preferred.

either move the beam search definitions after the sampling or update the header to match the order.

The beam search code is immediately after these 3 functions:

/// @details Selects the token with the highest probability.
LLAMA_API llama_token llama_sample_token_greedy(struct llama_context * ctx, llama_token_data_array * candidates);

/// @details Randomly selects a token from the candidates based on their probabilities.
LLAMA_API llama_token llama_sample_token(struct llama_context * ctx, llama_token_data_array * candidates);

/// @details Accepts the sampled token into the grammar
LLAMA_API void llama_grammar_accept_token(struct llama_context * ctx, struct llama_grammar * grammar, llama_token token);

in both llama.h and llama.cpp. Though the function llama_grammar_accept_token() is grammar-related, it was placed in the sampling functions section prior to this PR.

Thus the 3 sections are ordered as follows, in both the llama.h and llama.cpp files, with their respective 3-line comment header lines:

  • Grammar
  • Sampling
  • Beam Search

Do I understand correctly that I can also use the eos flag to limit the beam length? For example, I can set it to true if the number of tokens in the beam exceeds 8. I guess this could be a valid use case for short-segments of beam-search where we avoid the beams to grow too long.

Yes. The llama_beams_state struct passed to the callback function contains the member variable llama_beam_view * beam_views, which is read back after the callback is invoked. Any beams whose eos (now called eob) flag has been set to true are recognized.

Wondering if this is correct, maybe we should rename the flag to something more descriptive. Like beam_end or eob?

Yes, good point. End-of-beam (eob) should be distinguished from the llama_token_eos() since it can occur for other reasons than the eos token. The eos flag has been renamed to eob in commit 5fa1ea2.

I've attempted to address every issue. Please feel free to point out anything I have missed, as it was not intended.

p.s. Food for thought: clang-format with a custom configuration file can automatically enforce many/most of the style preferences for this project.

@ggerganov merged commit c82742a into ggerganov:master on Aug 25, 2023
5 of 24 checks passed
@mattpulver deleted the add_beam_search branch on August 25, 2023 at 15:33
mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request Aug 26, 2023
* master: (773 commits)
  server : add `/detokenize` endpoint (ggerganov#2802)
  convert.py : advanced option (ggerganov#2753)
  llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814)
  flake.nix : add rocm support and cleanup (ggerganov#2808)
  llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817)
  main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528)
  llama : use std::abs in llama_sample_tail_free (ggerganov#2800)
  k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811)
  Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807)
  Fix HellaSwag (ggerganov#2805)
  flake : build llama.cpp on Intel with nix (ggerganov#2795)
  Handle null rope scaling value (ggerganov#2793)
  Fix spm whitespaces (ggerganov#2806)
  examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804)
  llama : fix struct decl (ggerganov#2790)
  Faster perplexity computation (ggerganov#2786)
  llama : add llama_beam_search() (ggerganov#2267)
  convert.py : Get rope scale from HuggingFace models (ggerganov#2772)
  llama-bench : add model sizes (ggerganov#2771)
  convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773)
  ...
@ejones (Collaborator) commented Aug 27, 2023

Ah yeah, sorry, I put llama_grammar_accept_token there because I saw it as logically part of the sampling flow, but it probably makes more sense with the other grammar functions.

akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* Add llama_beam_search().

* Add '// Beam search' heading to llama.{h,cpp} after llama_grammar_accept_token().

* Add space around * pointers and & references.

* Add spaces around comparison and assignment operators.

* Prefer west const.

* Use llama_ prefix for structs in global namespace.

* Delete obsolete comment from an earlier revision.

* Change eos to eob in llama_beam and llama_beam_view structs.
@@ -1291,22 +1347,30 @@ int main(int argc, char **argv)
llama.beginCompletion();

if (!llama.stream) {
size_t stop_pos = std::string::npos;
if (llama.params.n_beams) {
(Contributor)

@mattpulver I noticed this check for llama.params.n_beams, but n_beams param doesn't seem to be set anywhere. Am I misinterpreting? If I set it myself, will it work along with the grammar for this server example?

@mattpulver (Collaborator, Author) commented Sep 5, 2023

Here is an example of where it is set and used:

params.n_beams = 2 < argc ? std::stoi(argv[2]) : 2;

In examples/server/server.cpp it should be set by server_params_parse(), but it seems that was not yet done. Feel free to submit that as a PR.

I don't think beam search and grammar will currently work together. That is currently an open item: #2923
