Add llama_beam_search(). #2267

Merged: 8 commits from add_beam_search merged into ggerganov:master on Aug 25, 2023

Conversation

@mattpulver (Collaborator) commented Jul 18, 2023

Related issue: #1392

This is an initial attempt at beam search. It does appear to work as intended, insofar as it generates higher-quality deterministic responses.

Currently, execution time slows down noticeably as the beams' token vectors grow. I'm going to see if ingesting the common trunk into the shared llama_context improves this, and if so I will move this PR out of draft mode.

Thoughts/feedback are welcome.

@SlyEcho (Sponsor Collaborator) commented Jul 18, 2023

This is just going to use up all memory, especially on the GPU. We have new models now like LLaMA2 with 4096 token context and some others with even 8192. The KV cache could be gigabytes.

Is it not possible to just save the tokens+n_past info in the beams?
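A minimal sketch of what that per-beam state could look like (illustrative only; these names are not the PR's actual types):

// Illustrative only: each beam stores just its generated tokens plus bookkeeping,
// while a single shared llama_context (and its KV cache) holds the evaluated prefix.
#include <vector>
#include "llama.h"

struct beam {
    std::vector<llama_token> tokens;  // tokens generated along this beam
    float p = 1.0f;                   // cumulative, renormalized beam probability
};

struct beam_search_state {
    int n_past = 0;                   // tokens already evaluated in the shared context
    std::vector<beam> beams;          // pruned back to n_beams entries each step
};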

@mattpulver (Collaborator, Author)

Is it not possible to just save the tokens+n_past info in the beams?

Thank you, that's the kind of hint/reassurance I was looking for. That does seem to work, and avoids making any copies of llama_context. Prior changes to llama-util.h have been reverted and the original PR description above has been edited to reflect the current state of the PR.

I'm noticing that as the beams grow, so does the time to get the next token for each beam. Since the beams tend to converge (that is, share a common prefix vector of token_ids), the common trunk can be "ingested" into the shared llama_context, advancing the n_past value, and I'm guessing this might improve the execution time.

I'll try that next and if that works I'll move this PR out of draft mode.

@ggerganov (Owner)

Is it not possible to just save the tokens+n_past info in the beams?

Alternatively, we can save n_past + a partial KV cache for each beam, which should be relatively small.

In any case - this change is welcome.
In the long run, I'm hoping that we will be able to implement efficient beam-search inference utilizing batched inference.

@mattpulver (Collaborator, Author)

Is it not possible to just save the tokens+n_past info in the beams?

Alternatively, we can save n_past + partial KV cache for each beam which should be relatively small

Thanks for the tip @ggerganov. I tried copying the k and v tensors in llama_kv_cache using ggml_dup_tensor, while sharing a common ggml_context* ctx.

This wasn't sufficient to satisfy the memory requirements, e.g. ggml_new_tensor_impl: not enough space in the context's memory pool (needed 134615856, available 10485760), which continued to increase as the beams grew.

Does ggml_context also need to be deep-copied for each beam? I backed out of that rabbit hole since it is defined in ggml.c, which is pure C, and adding copy/move constructors/assignment operators seemed to go against the spirit of the project.

What are your thoughts on this?

@ggerganov (Owner) commented Jul 28, 2023

Does ggml_context also need to be deep-copied for each beam?

For sure no.

Not sure about all the details. The closest thing we currently have is the beam-search decoder in whisper.cpp:

https://github.com/ggerganov/whisper.cpp/blob/4774d2feb01a772a15de81ffc34b34a1f294f020/whisper.cpp#L4389-L4423

Though it does not solve the problem of having N times the full KV cache.

I think there should be a "common" KV cache that corresponds to n_past tokens and is shared by all beams. And then, each beam has its own "partial" KV cache for the new generated tokens. For each eval, you can memcpy the "partial" KV cache for the current beam at the end of the "common" KV cache and do the eval.
Or something along these lines - not sure.
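A very rough sketch of that layout, assuming a flat per-token KV buffer (the real k/v tensors are split and may live in VRAM, so the helper names and the plain memcpy here are assumptions, not actual ggml API usage):

// Illustrative only: a "common" KV region for the shared n_past tokens plus a small
// per-beam "partial" region that is copied in just before evaluating that beam.
#include <cstdint>
#include <cstring>
#include <vector>

struct beam_kv {
    std::vector<uint8_t> partial;  // KV entries for this beam's extra tokens
    int n_extra = 0;               // number of tokens covered by `partial`
};

// kv_buf:             start of the context's KV storage (hypothetical flat view)
// kv_bytes_per_token: hypothetical size of one token's K+V entries
void prepare_beam_eval(uint8_t * kv_buf, size_t kv_bytes_per_token,
                       int n_past, const beam_kv & beam) {
    // Overwrite the slots right after the common prefix with this beam's partial cache,
    // then llama_eval() can be called with past length n_past + beam.n_extra.
    std::memcpy(kv_buf + (size_t) n_past * kv_bytes_per_token,
                beam.partial.data(),
                (size_t) beam.n_extra * kv_bytes_per_token);
}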

@mattpulver (Collaborator, Author)

A separate examples/beam_search/beam_search.cpp main program, currently hard-coded to use n_beams=2 beams, has been added to make this easier to test:

Usage: bin/beam_search MODEL_PATH [PROMPT]

I could use assistance on the following items, in order of highest priority to least:

  • Determine how best to use each beam's llama_kv_cache object. It seems that the amount of time taken by llama_eval() grows significantly as the length of the beams (i.e. token vectors) grow. There should be a smarter way of managing the memory. Might the use of ctx_guidance in examples/main/main.cpp provide a clue?
  • Find the correct mathematical function of n_beams for proper preallocation of memory. Memory usage appears to grow faster than O(n_beams).

@bullno1 (Contributor) commented Aug 1, 2023

Might the use of ctx_guidance in examples/main/main.cpp provide a clue?

For the most part, the ctx_guidance has a separate kv cache on a separate llama_context because it starts with a different negative prompt.

Also, I used the public llama_eval API and not batch eval, because:

  • I don't know how to batch eval
  • AFAIK, the KV cache is backend-dependent: CUDA allocates a separate VRAM buffer for it, so memcpy on the RAM buffer alone is not enough, and then there are other backends. So I opted to avoid interacting with the cache directly.
  • It's just one extra eval, not n extra ones like beam search, so it's probably not that bad.

Since all beams share the same prefix, it sounds like sharing the prefix of the cache with some sort of "split cache" is possible, but it's probably hard to satisfy all backends.
I don't know if #2239 would make it easier once it's done.

Alternatively, use the same cache and then sequentially eval each beam, lol. The speed would probably be awful though.
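For reference, a sketch of that sequential, shared-cache approach with the llama_eval() signature of the time (each beam simply re-evaluates its suffix after the common n_past, overwriting whatever the previous beam left in the cache):

// Brute-force sketch: one shared context; each beam re-ingests its own tokens
// after the shared prefix before its logits are read. Slow, but needs no cache copies.
#include <vector>
#include "llama.h"

void eval_beams_sequentially(llama_context * ctx,
                             std::vector<std::vector<llama_token>> & beams,
                             int n_past, int n_threads) {
    for (auto & beam_tokens : beams) {
        // KV entries beyond n_past are simply overwritten by the next beam's eval.
        llama_eval(ctx, beam_tokens.data(), (int) beam_tokens.size(), n_past, n_threads);
        const float * logits = llama_get_logits(ctx);
        // ...rank candidate next tokens for this beam from `logits` here.
        (void) logits;
    }
}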

@mattpulver (Collaborator, Author)

This is getting closer. The main bottleneck appears to be calling llama_eval() as n_tokens increases, which each beam requires. This is true whether or not beam search is being used, and it accounts for most of the increased execution time. If that performance can be improved, beam search should likewise improve.

This can be run/tested using examples/beam_search/beam_search.cpp.

After integrating this into examples/main/main.cpp I'll take this PR out of draft mode.

Thanks @bullno1 for explaining the logic re: ctx_guidance.

@bullno1 (Contributor) commented Aug 2, 2023

Warning, a lot of brain dump, might not be coherent.

On efficiency

I feel like there is a lot of cache invalidation with how the common prefix keeps getting re-evaluated: https://github.com/ggerganov/llama.cpp/pull/2267/files#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR3071

If I understand this correctly, if we have n beams there should only be at most n KV caches at a time, because after expansion we limit the list back to n beams anyway.
And each expansion only adds one token at a time, so they can all reuse the same cache.

If we utilize the cache of the original context, only n - 1 extra caches need to be created.

This sounds possible with the current public API.

If n_beam = 2, it has the same memory overhead as CFG anyway.
At n_beam = 3 it's only somewhat worse, and on a 4090 with a 13b q4 model you can afford quite a handful of caches.

On API design

I think it should return a span of tokens instead of text:

struct llama_beam {
    size_t length;
    llama_token* tokens;
};

We can add a llama_beam_free to clean up the temp allocation of the result.

Also, since this runs until completion or until it reaches a limit, I have no idea how to even "stream" the result.
The "next" token is never decided until the entire search completes.
So it probably won't integrate that well into the main example unless we are in chat mode.

Talking about chat, we often need an "antiprompt" stop condition which would take a bit more work to support.
This is to prevent the model from speaking for the "user" role too.
It also has other applications like stopping generation to execute an external tool in a chain-of-thought scenario.

We would need something like a predicate on the beam search:

typedef bool (*llama_beam_predicate_fn_t)(const struct llama_beam * beam, void * userdata);

This is actually general enough for both an antiprompt and a length limit.

Putting it all together I think the API should be:

struct llama_beam {
    size_t length;
    const llama_token* tokens;
};

typedef bool (*llama_beam_predicate_fn_t)(const struct llama_beam * beam, void * userdata);

struct llama_beam_stop_condition {
    llama_beam_predicate_fn_t fn;
    void * userdata;
};

LLAMA_API const struct llama_beam * llama_beam_search(
    struct llama_context * ctx,
    int n_past,
    int n_beams,
    struct llama_beam_stop_condition stop_condition,
    int n_threads
);

// Stop after a number of tokens has been generated
// userdata is a uintptr_t which is the number of tokens
LLAMA_API bool llama_beam_stop_at_n_tokens(const struct llama_beam * beam, void * userdata);

// Stop after a token is encountered such as eos
// userdata is a uintptr_t which is the token id
LLAMA_API bool llama_beam_stop_at_token(const struct llama_beam * beam, void * userdata);

// Stop at a suffix string for anti-prompt
// userdata is a char* pointing to a null-terminated string
LLAMA_API bool llama_beam_stop_at_suffix(const struct llama_beam * beam, void * userdata);

// Logically "OR" all conditions
struct llama_beam_stop_condition_one_of {
    int num_conditions;
    struct llama_beam_stop_condition * conditions;
};

// When any of the conditions are met
// userdata is a struct llama_stop_condition_one_of*
LLAMA_API bool llama_beam_stop_at_one_of(const struct llama_beam * beam, void * userdata);

// Free the returned beam
LLAMA_API void llama_beam_free(struct llama_beam * beam);

On cache reusing

Say n_beams=3.

At first we have 3 beams:

  • A @ cache 0
  • B @ cache 1
  • C @ cache 2

Then we expand each by 3:

  • AA @ cache 0
  • AB @ cache 0
  • AC @ cache 0
  • BA @ cache 1
  • BB @ cache 1
  • BC @ cache 1
  • CA @ cache 2
  • CB @ cache 2
  • CC @ cache 2

Prune this list down to top 3:

  • AA @ cache 0
  • AB @ cache 0
  • CC @ cache 2

Since both AA and AB are on 0, we need to split:

  • AA @ cache 0
  • AB @ cache 1: copy cache 0 into 1
  • CC @ cache 2

I think ggml_cpy can do this even for GPU-backed cache as long as both tensors are on GPU.
Need someone to confirm this.
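A sketch of the bookkeeping for that prune-and-split step, assuming each beam records which of the n_beams cache slots it currently extends (the actual KV copy, e.g. via ggml_cpy, is left as a placeholder):

// Illustrative only: after pruning to the top n_beams beams, give each surviving
// beam a distinct cache slot; the first beam on a slot keeps it, later ones get a
// free slot and a copy of the parent's KV data.
#include <vector>
#include "llama.h"

struct pruned_beam {
    std::vector<llama_token> tokens;
    int cache_slot;  // which of the n_beams KV caches this beam currently extends
};

void assign_cache_slots(std::vector<pruned_beam> & beams, int n_slots) {
    std::vector<bool> owned(n_slots, false);
    std::vector<int>  need_copy;                // beams that must move to a fresh slot
    for (int i = 0; i < (int) beams.size(); ++i) {
        if (!owned[beams[i].cache_slot]) {
            owned[beams[i].cache_slot] = true;  // e.g. AA keeps cache 0, CC keeps cache 2
        } else {
            need_copy.push_back(i);             // e.g. AB also sits on cache 0
        }
    }
    for (int i : need_copy) {
        for (int s = 0; s < n_slots; ++s) {
            if (!owned[s]) {
                owned[s] = true;
                // copy_kv_cache(/*from=*/beams[i].cache_slot, /*to=*/s);  // e.g. via ggml_cpy
                beams[i].cache_slot = s;
                break;
            }
        }
    }
}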

@mattpulver force-pushed the add_beam_search branch 3 times, most recently from 91d65a8 to fbbf0eb on August 8, 2023 at 14:33
@mattpulver (Collaborator, Author) commented Aug 8, 2023

Also, since this runs until completion or until it reaches limit, I have no idea how to even "stream" the result.
The "next" token is never decided until the entire search completes.

Actually, it does happen every once in a while that all beams partially converge to share a common token prefix of, say, m tokens. Whenever this happens, after the next call to llama_eval(), the m tokens are shifted off of each beam and n_past += m;. This results in an immediate speedup in llama_eval(). This is also one way to "stream" the result, and it is made available via the callback as described below.
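A sketch of that convergence/shift step (helper names are made up; this is not the PR's exact code):

// Find the longest token prefix shared by all beams; if non-empty, it can be
// treated as "decided", handed to the caller, shifted off every beam, and
// absorbed into the shared context by advancing n_past.
#include <vector>
#include "llama.h"

size_t common_prefix_length(const std::vector<std::vector<llama_token>> & beams) {
    if (beams.empty()) return 0;
    size_t n = beams[0].size();
    for (const auto & b : beams) {
        size_t i = 0;
        while (i < n && i < b.size() && b[i] == beams[0][i]) { ++i; }
        n = i;
    }
    return n;
}

void shift_common_prefix(std::vector<std::vector<llama_token>> & beams,
                         std::vector<llama_token> & response, int & n_past) {
    const size_t m = common_prefix_length(beams);
    if (m == 0) return;
    response.insert(response.end(), beams[0].begin(), beams[0].begin() + m);
    for (auto & b : beams) {
        b.erase(b.begin(), b.begin() + m);  // shift m tokens off each beam
    }
    n_past += (int) m;  // the shared context now accounts for these tokens
}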

On API design

Based on your API suggestions @bullno1 I've added a callback type:

typedef void (*llama_beam_search_callback_fn_t)(void* callback_state, llama_beams_state);

This is used for example:

// Put here anything you want back in beam_search_callback().
struct beam_search_callback_state {
    llama_context* ctx;
    std::vector<llama_token>* response;
};
...
std::vector<llama_token> response;
beam_search_callback_state callback_state{ctx, &response};
llama_beam_search(ctx, beam_search_callback, &callback_state, beam_width, n_past, n_predict, params.n_threads);

The idea is that beam_search_callback(callback_state, beams_state), custom-defined by the caller, is invoked on each iteration in which the beams grow. The callback can view all of the beams and can synchronously decide whether to continue or stop, choose a beam to collapse to, etc., by reading/setting beams_state. More details to come.

This seems to cover the use cases you mentioned. These struct definitions can be extended to cover future beam evolution control features.
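To make the flow concrete, here is a hedged sketch of what such a caller-defined callback might look like, continuing the snippet above. The field names assumed on llama_beams_state and llama_beam_view (n_beams, n_tokens, common_prefix_length, eob) are taken from this discussion and may not match the merged header exactly:

// Sketch only: mark beams that end in EOS, and collect tokens that all beams agree on
// (the common prefix), since those will be shifted off the beams before the next callback.
void beam_search_callback(void * callback_state, llama_beams_state beams_state) {
    auto & state = *static_cast<beam_search_callback_state *>(callback_state);
    for (size_t i = 0; i < beams_state.n_beams; ++i) {
        llama_beam_view & bv = beams_state.beam_views[i];
        if (!bv.eob && bv.n_tokens > 0 && bv.tokens[bv.n_tokens - 1] == llama_token_eos()) {
            bv.eob = true;  // this beam is done; the search stops extending it
        }
    }
    if (const size_t n = beams_state.common_prefix_length) {
        // These n tokens are now common to every beam: append them to the response.
        const llama_token * toks = beams_state.beam_views[0].tokens;
        state.response->insert(state.response->end(), toks, toks + n);
    }
}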

On efficiency + cache reuse

I did an experiment in which each beam keeps its own kv_cache (llama_kv_cache was made copyable+moveable) and swaps its kv_cache with ctx->kv_self before and after each call to llama_eval():

std::swap(ctx->kv_self, beam.kv_cache);
llama_eval(ctx, beam.tokens.data(), beam.tokens.size(), n_past, n_threads);
std::swap(ctx->kv_self, beam.kv_cache);

However, this appeared to produce gibberish (due to either a bug in the code or the kv_cache being used incorrectly). Is this how you see the beam-specific kv_cache being used, or am I not thinking about this right?

Questions

  • Is it necessary to integrate beam_search in with examples/main/main.cpp in this PR? Or can this be done in a follow-up PR? A running example is currently available in examples/beam_search/beam_search.cpp. (Uncommenting the line //params.n_gpu_layers = 200; is recommended for good performance.)
  • When running w/ GPU the performance is quite decent. Can/should the kv_cache integration also be done in a follow-up PR?

Thanks again for the helpful feedback.


Edits made on Aug 11: Replace llama_beam_search_control struct w/ reading/setting beams_state.

@mattpulver force-pushed the add_beam_search branch 2 times, most recently from db9657c to b989365 on August 16, 2023 at 15:41
@mattpulver marked this pull request as ready for review on August 23, 2023 at 15:54
@mattpulver (Collaborator, Author) commented Aug 23, 2023

Ready-for-review Notes

This can be tested using the included examples/beam_search/beam_search.cpp main program. Example:

$ bin/beam_search ~/models/open-llama-7B-open-instruct.ggmlv3.q4_0.gguf 2 "### Request:\nWhat is the capital of France?\n\n### Response:\n"
...
The capital of France is Paris.

Additionally, there is a llama-cpp-python beam_search branch that was working prior to the recent breaking changes/improvements related to the new GGUF format; I intend to open a PR for it once this PR gets merged.

The helpful stop-condition suggestions above are satisfied by a more general callback function that allows the API caller to examine the state of all beams and flag when a beam should be marked end-of-sentence.

The concerns about caching are addressed by the fact that beams often converge to a common token prefix subvector. When this happens, the callback is notified and the common prefix should be stored at that time, as the following iteration will shift it off of all beams, thereby reducing the token vector lengths. Having said this, there are likely additional optimizations w.r.t. beam token caching that are suitable for follow-up PRs. When run with CUBLAS, the qualitative experience is quite satisfactory.

@ggerganov (Owner) left a comment

Interesting work - I'm still trying to understand the details.

Some minor style changes requested and we can look into merging.
Also, make sure that the implementation in llama.cpp is ordered correctly. I see it is after the "grammar" stuff, but in the header the declarations are after the "sampling" stuff. So either move the beam search definitions after the sampling or update the header to match the order.

We'll also have to figure out some simple tests for the CI in order to keep this functionality working in the long run. We can do in separate PR.

llama.h Outdated
@@ -476,6 +476,39 @@ extern "C" {
/// @details Accepts the sampled token into the grammar
LLAMA_API void llama_grammar_accept_token(struct llama_context * ctx, struct llama_grammar * grammar, llama_token token);

struct llama_beam_view {
llama_token const* tokens;

Suggested change
llama_token const* tokens;
llama_token const * tokens;

llama.h Outdated
// (e.g. beams[0]) as they will be removed (shifted) from all beams in all subsequent callbacks.
// These pointers are valid only during the synchronous callback, so should not be saved.
struct llama_beams_state {
llama_beam_view* beam_views;

Suggested change
llama_beam_view* beam_views;
llama_beam_view * beam_views;

llama.h Outdated
/// @details Deterministically returns entire sentence constructed by a beam search.
/// @param ctx Pointer to the llama_context.
/// @param callback Invoked for each iteration of the beam_search loop, passing in beams_state.
/// The return beam_search_control can be used to control the beam_search execution.

beam_search_control should probably be changed to eos flag ?

llama.h Outdated
/// @param n_past Number of tokens already evaluated.
/// @param n_predict Maximum number of tokens to predict. EOS may occur earlier.
/// @param n_threads Number of threads as passed to llama_eval().
LLAMA_API void llama_beam_search(struct llama_context * ctx, llama_beam_search_callback_fn_t callback, void* callback_data, size_t n_beams, int n_past, int n_predict, int n_threads);

Suggested change
LLAMA_API void llama_beam_search(struct llama_context * ctx, llama_beam_search_callback_fn_t callback, void* callback_data, size_t n_beams, int n_past, int n_predict, int n_threads);
LLAMA_API void llama_beam_search(struct llama_context * ctx, llama_beam_search_callback_fn_t callback, void * callback_data, size_t n_beams, int n_past, int n_predict, int n_threads);

llama.cpp Outdated
};

// A struct for calculating logit-related info.
struct logit_info {

Suggested change
struct logit_info {
struct llama_logit_info {

llama.cpp Outdated
Comment on lines 3396 to 4388
std::vector<llama_token_data> top_k(size_t k) {
std::vector<llama_token_data> min_heap; // min-heap by logit
llama_token const k_min = std::min(static_cast<llama_token>(k), n_vocab);
min_heap.reserve(k_min);
for (llama_token token_id=0 ; token_id<k_min ; ++token_id) {
min_heap.push_back(get_token_data(token_id));
}
auto comp = [](llama_token_data const& a, llama_token_data const& b) { return a.logit > b.logit; };
std::make_heap(min_heap.begin(), min_heap.end(), comp);
for (llama_token token_id=k_min ; token_id<n_vocab ; ++token_id) {
if (min_heap.front().logit < logits[token_id]) {
std::pop_heap(min_heap.begin(), min_heap.end(), comp);
min_heap.back().id = token_id;
min_heap.back().logit = logits[token_id];
std::push_heap(min_heap.begin(), min_heap.end(), comp);
}
}

Style change examples - please apply to the rest of the non-server code:

Suggested change
std::vector<llama_token_data> top_k(size_t k) {
std::vector<llama_token_data> min_heap; // min-heap by logit
llama_token const k_min = std::min(static_cast<llama_token>(k), n_vocab);
min_heap.reserve(k_min);
for (llama_token token_id=0 ; token_id<k_min ; ++token_id) {
min_heap.push_back(get_token_data(token_id));
}
auto comp = [](llama_token_data const& a, llama_token_data const& b) { return a.logit > b.logit; };
std::make_heap(min_heap.begin(), min_heap.end(), comp);
for (llama_token token_id=k_min ; token_id<n_vocab ; ++token_id) {
if (min_heap.front().logit < logits[token_id]) {
std::pop_heap(min_heap.begin(), min_heap.end(), comp);
min_heap.back().id = token_id;
min_heap.back().logit = logits[token_id];
std::push_heap(min_heap.begin(), min_heap.end(), comp);
}
}
std::vector<llama_token_data> top_k(size_t k) {
std::vector<llama_token_data> min_heap; // min-heap by logit
const llama_token k_min = std::min(static_cast<llama_token>(k), n_vocab);
min_heap.reserve(k_min);
for (llama_token token_id = 0; token_id < k_min ; ++token_id) {
min_heap.push_back(get_token_data(token_id));
}
auto comp = [](const llama_token_data & a, const llama_token_data & b) { return a.logit > b.logit; };
std::make_heap(min_heap.begin(), min_heap.end(), comp);
for (llama_token token_id = k_min; token_id < n_vocab; ++token_id) {
if (min_heap.front().logit < logits[token_id]) {
std::pop_heap(min_heap.begin(), min_heap.end(), comp);
min_heap.back().id = token_id;
min_heap.back().logit = logits[token_id];
std::push_heap(min_heap.begin(), min_heap.end(), comp);
}
}

llama.cpp Outdated
}
};

struct beam_search {

Suggested change
struct beam_search {
struct llama_beam_search {

@@ -3354,6 +3354,253 @@ void llama_grammar_accept_token(struct llama_context * ctx, struct llama_grammar
ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
}

struct llama_beam {

Suggested change
struct llama_beam {
//
// beam search
//
struct llama_beam {

llama.h Outdated
// Type of pointer to the beam_search_callback function.
// void* callback_data is any custom data passed to llama_beam_search, that is subsequently
// passed back to beam_search_callback. This avoids having to use global variables in the callback.
typedef void (*llama_beam_search_callback_fn_t)(void* callback_data, llama_beams_state);

Suggested change
typedef void (*llama_beam_search_callback_fn_t)(void* callback_data, llama_beams_state);
typedef void (*llama_beam_search_callback_fn_t)(void * callback_data, llama_beams_state);

@@ -476,6 +476,39 @@ extern "C" {
/// @details Accepts the sampled token into the grammar
LLAMA_API void llama_grammar_accept_token(struct llama_context * ctx, struct llama_grammar * grammar, llama_token token);

struct llama_beam_view {

//
// Beam search
//

Suggested change
struct llama_beam_view {
struct llama_beam_view {

@ggerganov (Owner)

Do I understand correctly that I can also use the eos flag to limit the beam length? For example, I can set it to true if the number of tokens in the beam exceeds 8. I guess this could be a valid use case for short segments of beam search where we prevent the beams from growing too long.

If that is correct, maybe we should rename the flag to something more descriptive, like beam_end or eob?
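For illustration, a callback along those lines (hedged sketch, reusing the assumed field names from the earlier callback sketch) could cap every beam at 8 tokens:

// Hypothetical length cap: flag a beam as finished once it reaches 8 tokens.
void beam_length_cap_callback(void * callback_state, llama_beams_state beams_state) {
    (void) callback_state;  // no per-call state needed for a pure length cap
    for (size_t i = 0; i < beams_state.n_beams; ++i) {
        llama_beam_view & bv = beams_state.beam_views[i];
        if (!bv.eob && bv.n_tokens >= 8) {
            bv.eob = true;  // the flag later renamed from eos to eob in this thread
        }
    }
}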

@mattpulver (Collaborator, Author)

Changes

  • Add // Beam search heading to llama.{h,cpp} after llama_grammar_accept_token(). abe0829
  • Add space around * pointers and & references. 9bedaf4
  • Add spaces around comparison and assignment operators. e46a8b5
  • Prefer west const. 93daad7
  • Use llama_ prefix for structs in global namespace. fa33614
  • Delete obsolete comment from an earlier revision. b619cfc
  • Change eos to eob in llama_beam and llama_beam_view structs. 5fa1ea2

Responses

Thanks for the feedback.

I wasn't able to rename struct beam_search to struct llama_beam_search due to a conflict with the function of the same name, so I named it struct llama_beam_search_data instead. Let me know if different names are preferred.

either move the beam search definitions after the sampling or update the header to match the order.

The beam search code is immediately after these 3 functions:

/// @details Selects the token with the highest probability.
LLAMA_API llama_token llama_sample_token_greedy(struct llama_context * ctx, llama_token_data_array * candidates);

/// @details Randomly selects a token from the candidates based on their probabilities.
LLAMA_API llama_token llama_sample_token(struct llama_context * ctx, llama_token_data_array * candidates);

/// @details Accepts the sampled token into the grammar
LLAMA_API void llama_grammar_accept_token(struct llama_context * ctx, struct llama_grammar * grammar, llama_token token);

in both llama.h and llama.cpp. Though the function llama_grammar_accept_token() is grammar-related, it was placed in the sampling functions section prior to this PR.

Thus the 3 sections are ordered as follows, in both the llama.h and llama.cpp files, with their respective 3-line comment header lines:

  • Grammar
  • Sampling
  • Beam Search

Do I understand correctly that I can also use the eos flag to limit the beam length? For example, I can set it to true if the number of tokens in the beam exceeds 8. I guess this could be a valid use case for short-segments of beam-search where we avoid the beams to grow too long.

Yes. The llama_beams_state struct passed to the callback function contains the member variable llama_beam_view * beam_views, which is read back after the callback is invoked. Any beams whose eos (now called eob) flag has been set to true are recognized.

Wondering if this is correct, maybe we should rename the flag to something more descriptive. Like beam_end or eob?

Yes, good point. End-of-beam (eob) should be distinguished from the llama_token_eos() since it can occur for other reasons than the eos token. The eos flag has been renamed to eob in commit 5fa1ea2.

I've attempted to address every issue. Please feel free to point out anything I have missed, as it was not intended.

p.s. Food for thought: clang-format with a custom configuration file can automatically enforce many/most of the style preferences for this project.

@ggerganov merged commit c82742a into ggerganov:master on Aug 25, 2023
5 of 24 checks passed
@mattpulver deleted the add_beam_search branch on August 25, 2023 at 15:33
mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request Aug 26, 2023
* master: (773 commits)
  server : add `/detokenize` endpoint (ggerganov#2802)
  convert.py : advanced option (ggerganov#2753)
  llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814)
  flake.nix : add rocm support and cleanup (ggerganov#2808)
  llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817)
  main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528)
  llama : use std::abs in llama_sample_tail_free (ggerganov#2800)
  k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811)
  Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807)
  Fix HellaSwag (ggerganov#2805)
  flake : build llama.cpp on Intel with nix (ggerganov#2795)
  Handle null rope scaling value (ggerganov#2793)
  Fix spm whitespaces (ggerganov#2806)
  examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804)
  llama : fix struct decl (ggerganov#2790)
  Faster perplexity computation (ggerganov#2786)
  llama : add llama_beam_search() (ggerganov#2267)
  convert.py : Get rope scale from HuggingFace models (ggerganov#2772)
  llama-bench : add model sizes (ggerganov#2771)
  convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773)
  ...
@ejones (Collaborator) commented Aug 27, 2023

Ah yeah, sorry, I put llama_grammar_accept_token there because I saw it as logically part of the sampling flow, but it probably makes more sense with the other grammar functions.

akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* Add llama_beam_search().

* Add '// Beam search' heading to llama.{h,cpp} after llama_grammar_accept_token().

* Add space around * pointers and & references.

* Add spaces around comparison and assignment operators.

* Prefer west const.

* Use llama_ prefix for structs in global namespace.

* Delete obsolete comment from an earlier revision.

* Change eos to eob in llama_beam and llama_beam_view structs.
@@ -1291,22 +1347,30 @@ int main(int argc, char **argv)
llama.beginCompletion();

if (!llama.stream) {
size_t stop_pos = std::string::npos;
if (llama.params.n_beams) {
(Contributor)

@mattpulver I noticed this check for llama.params.n_beams, but n_beams param doesn't seem to be set anywhere. Am I misinterpreting? If I set it myself, will it work along with the grammar for this server example?

@mattpulver (Collaborator, Author) commented Sep 5, 2023

Here is an example of where it is set and used:

params.n_beams = 2 < argc ? std::stoi(argv[2]) : 2;

In examples/server/server.cpp it should be set by server_params_parse(), but it seems that was not yet done. Feel free to submit that as a PR.

I don't think beam search and grammar will currently work together. That is currently an open item: #2923
