
Kernel Memory is broken with latest nugets #305

Open
JohnGalt1717 opened this issue Nov 17, 2023 · 35 comments

@JohnGalt1717

Using the 0.8 release of LLamaSharp and Kernel-Memory with the samples, there is an error because LlamaSharpTextEmbeddingGeneration doesn't implement the Attributes property.

I took the source and created my own and added this:

public IReadOnlyDictionary<string, string> Attributes => new Dictionary<string, string>();

So it wouldn't error.

But no matter what model I use I get "INFO NOT FOUND." (I've tried kai-7b-instruct.Q5_K_M.gguf, llama-2-7b-32k-instruct.Q6_K.gguf, llama-2-7b-chat.Q6_K.gguf and a few others)

I've tried loading just text, an html file, and a web page to no avail.

@AsakusaRinne
Collaborator

Hi, which version of kernel-memory did you install?

AsakusaRinne mentioned this issue Nov 17, 2023
@xbotter
Collaborator

xbotter commented Nov 17, 2023

The new version of Semantic Kernel introduced some breaking changes, and the latest package has been updated in PR #306.

@JohnGalt1717
Author

JohnGalt1717 commented Nov 17, 2023

I also notice that there is no way to add a grammar to this right now, so AskAsync can't force well-formed output such as JSON. Can this be added? The same goes for MainGPU and the other settings on the context. There are also a few options missing for the executor (e.g. TopK) that you can't reach with Microsoft's With(new TextGenerationOptions). Perhaps an extended options type for LLamaSharp that exposes all of the remaining settings?

And I'm noticing, now that I have it hacked to work(ish), that memory.AskAsync works the first time, but if you call memory.AskAsync again with a second question against the data, I just get a ton of \n characters and no valid response. I'm not sure whether that's kernel-memory doing it or whether it's this library causing the issue. (If it helps, I'm in GPU-only mode with dual NVIDIA 8 GB cards.)

@xbotter
Collaborator

xbotter commented Nov 17, 2023

I also notice that there is no way to add a grammar to this right now, so AskAsync can't force well-formed output such as JSON. Can this be added? The same goes for MainGPU and the other settings on the context. There are also a few options missing for the executor (e.g. TopK) that you can't reach with Microsoft's With(new TextGenerationOptions). Perhaps an extended options type for LLamaSharp that exposes all of the remaining settings?

Yes, the current version of Kernel Memory does not provide more customizable options.
However, there are currently two ways to shape the output: one is reformatting the Ask result through the LLM, and the other is using the Search method to retrieve the relevant document partitions and then composing your own prompts to generate the desired results (a sketch follows).
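
A rough sketch of the second approach, assuming the SearchAsync call and the Results/Partitions/Text shape of Kernel Memory's SearchResult (which may differ by version), and a `memory` instance built as elsewhere in this thread:

using System.Text;

// Retrieve relevant partitions instead of asking the model directly.
var search = await memory.SearchAsync("Who jumps over the lazy dog?");

var prompt = new StringBuilder();
prompt.AppendLine("Answer using only the facts below. Respond with valid JSON.");
foreach (var citation in search.Results)
{
    foreach (var partition in citation.Partitions)
    {
        prompt.AppendLine(partition.Text);
    }
}
prompt.AppendLine("Question: Who jumps over the lazy dog?");

// Feed prompt.ToString() to your own LLamaSharp executor, optionally with a
// grammar to constrain the output format.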

And I'm noticing, now that I have it hacked to work(ish), that memory.AskAsync works the first time, but if you call memory.AskAsync again with a second question against the data, I just get a ton of \n characters and no valid response. I'm not sure whether that's kernel-memory doing it or whether it's this library causing the issue. (If it helps, I'm in GPU-only mode with dual NVIDIA 8 GB cards.)

It seems to be related to the state management of the model. @AsakusaRinne any idea?

@JohnGalt1717
Author

On the first issue I was able to take your code, and manually add a grammar to it. Is there a way that we can just expose Grammar in the LLamaSharpConfig for right now that will get passed through? Same with MainGPU, and TOP_K?

@xbotter
Collaborator

xbotter commented Nov 17, 2023

On the first issue I was able to take your code, and manually add a grammar to it. Is there a way that we can just expose Grammar in the LLamaSharpConfig for right now that will get passed through? Same with MainGPU, and TOP_K?

👍 Good idea, I think I have a solution for issue #289. 😃 Thank you for your suggestion.
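
For illustration, the kind of pass-through being discussed might look like this. These properties are hypothetical, do not exist in LLamaSharpConfig today, and the config may not actually be subclassable:

// Hypothetical extension of LLamaSharpConfig, illustrating the requested
// pass-through settings. None of these properties exist in the current package.
public class LLamaSharpExtendedConfig : LLamaSharpConfig
{
    public LLamaSharpExtendedConfig(string modelPath) : base(modelPath) { }

    // GBNF grammar text used to constrain generation (e.g. force valid JSON),
    // to be forwarded to the executor's inference parameters.
    public string? GrammarText { get; set; }

    // To be forwarded to the model parameters' main-GPU setting when loading.
    public int MainGpu { get; set; }

    // To be forwarded to the executor's TopK sampling setting.
    public int TopK { get; set; } = 40;
}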

@JohnGalt1717
Author

JohnGalt1717 commented Nov 17, 2023

On the issue with getting the endless \n on the second AskAsync: it also happens with embeddings. If you import two documents in a row, the second one will fail. It appears to fail in text generation the same way, and it seems to be something in the context, because even if I regenerate the model and the executor every time, it doesn't fix the problem.

I also noticed that EmbeddingMode is not being set on the embedder ModelParams which makes it much slower.

@JohnGalt1717
Author

Update: It isn't the context, and the native handle shows that it's still valid. I've tried to replace all of the shared versions of the context and model with new versions spun up per request, and there was no difference.

@AsakusaRinne
Collaborator

And I'm noticing, now that I have it hacked to work(ish), that memory.AskAsync works the first time, but if you call memory.AskAsync again with a second question against the data, I just get a ton of \n characters and no valid response. I'm not sure whether that's kernel-memory doing it or whether it's this library causing the issue. (If it helps, I'm in GPU-only mode with dual NVIDIA 8 GB cards.)

It seems to be related to the state management of the model. @AsakusaRinne any idea?

It needs a deep dive. Did you use WithLLamaSharpDefaults from the kernel-memory integration?

I also noticed that EmbeddingMode is not being set on the embedder ModelParams which makes it much slower.

It's definitely unexpected behaviour! Could you please share your code with us to reproduce it? (A minimal example is better.)

@JohnGalt1717
Author

Yes, I used WithLLamaSharpDefaults.

I'm using a console app:

var modelPath = Path.GetFullPath(Path.Combine(Assembly.GetEntryAssembly()!.Location, "..", "..", "..", "..", @"openhermes-2.5-mistral-7b-16k.Q5_0.gguf"));

var builder = new KernelMemoryBuilder()
        .WithQdrant(new QdrantConfig
        {
            APIKey = "",
            Endpoint = "http://localhost:6333",

        })
        .WithLLamaSharpDefaults(new LLamaSharpConfig(modelPath)
        {
            ContextSize = 16_384,
            Seed = 1337,
            GpuLayerCount = 20,
        })
        .With(new TextPartitioningOptions
        {
            MaxTokensPerLine = 100,
            MaxTokensPerParagraph = 300,
            OverlappingTokens = 30,
        })
        .With(new TextGenerationOptions
        {
            MaxTokens = -1,
            TopP = 0.95F,
            Temperature = 0.0,
            FrequencyPenalty = 1.1F,
            StopSequences = ["\n"],
            ResultsPerPrompt = 1
        });

    var memory = builder
                    .BuildServerlessClient();

await memory.ImportTextAsync("The quick brown fox jumps over the lazy dog.");

await memory.AskAsync("Who jumps over the lazy dog?");

await memory.AskAsync("What kind of dog is it?");


The second AskAsync will return a ton of \n\n no matter the model at least when using CUDA.

The second part, about EmbeddingMode, is just in the setup of WithLLamaSharpDefaults: it doesn't split out the parameters, so text generation and embedding both use the same params instead of using parameters optimized for embedding (see the sketch below).
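
Something like this, assuming the 0.8-era ModelParams.EmbeddingMode flag and the LLamaEmbedder constructor used elsewhere in this thread:

using LLama;
using LLama.Common;

// Give the embedder its own ModelParams instead of reusing the text-generation ones.
var embeddingParams = new ModelParams(modelPath)   // modelPath as defined above
{
    ContextSize = 2048,      // embeddings don't need the 16k generation context
    EmbeddingMode = true,    // skip generation-only setup (0.8-era flag)
    GpuLayerCount = 20,
};
using var embeddingWeights = LLamaWeights.LoadFromFile(embeddingParams);
using var embedder = new LLamaEmbedder(embeddingWeights, embeddingParams);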

@AsakusaRinne
Collaborator

@xbotter I think we should inspect the prompt passed to our model in the second run, with the context size reduced to 4096. Could you please take a look? I'm on duty this weekend.

@xbotter
Collaborator

xbotter commented Nov 18, 2023

@xbotter I think we should inspect the prompt passed to our model in the second run, with the context size reduced to 4096. Could you please take a look? I'm on duty this weekend.

This can only be resolved on the kernel-memory side. I have already submitted an issue (microsoft/kernel-memory#164) and am waiting for further updates.

@JohnGalt1717
Author

If it matters, I'm not changing the context length and it still happens.

@dluc

dluc commented Dec 4, 2023

KernelMemory author here, let me know if there's something I can do to make the integration better, more powerful, easier, etc :-)

@JohnGalt1717
Author

KernelMemory author here, let me know if there's something I can do to make the integration better, more powerful, easier, etc :-)

Not related to this, but since you offered: the biggest single issue with kernel memory (and LLMs in general) is context length. In kernel memory it's worse because you're trying to analyse documents, which can be very long, and if you Ask across many documents the context runs out and causes issues.

I'd love to see KernelMemory port or otherwise adopt some of the windowing techniques of MemGPT natively, and expose methods for private LLMs to implement incremental analysis, etc., i.e. keeping the system message and query in context, and then processing documents part by part in a loop to pull out facts and accumulate answers.

@dluc

dluc commented Dec 7, 2023

Thanks for the feedback. We merged a PR today that allows configuring and/or replacing the search logic, e.g. defining token limits.

And this PR, microsoft/kernel-memory#189, allows customizing token settings and tokenization logic. We'd appreciate it if someone could take a look and let us know if it helps.

This snippet shows how we could add LLama to KernelMemory:

public class LLamaConfig
{
    public string ModelPath { get; set; } = "";

    public int MaxTokenTotal { get; set; } = 4096;
}

public class LLamaTextGenerator : ITextGenerator, IDisposable
{
    private readonly string _modelPath;

    public LLamaTextGenerator(LLamaConfig config)
    {
        this._modelPath = config.ModelPath;
        this.MaxTokenTotal = config.MaxTokenTotal;
    }

    /// <inheritdoc/>
    public int MaxTokenTotal { get; }

    /// <inheritdoc/>
    public int CountTokens(string text)
    {
        // ... count tokens using LLama tokenizer ...
        // ... which can be injected via ctor as usual ...
    }

    /// <inheritdoc/>
    public IAsyncEnumerable<string> GenerateTextAsync(
        string prompt,
        TextGenerationOptions options,
        CancellationToken cancellationToken = default)
    {
        // ... use LLama backend to generate text ...
    }
}
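
For completeness, a possible way to wire such a generator into the builder, assuming the WithCustomTextGenerator extension used later in this thread accepts an ITextGenerator (the path and token limit are placeholders):

// Sketch only: register the custom generator with Kernel Memory.
var memory = new KernelMemoryBuilder()
    .WithCustomTextGenerator(new LLamaTextGenerator(new LLamaConfig
    {
        ModelPath = "path/to/model.gguf",   // placeholder path
        MaxTokenTotal = 4096,
    }))
    .BuildServerlessClient();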

@JohnGalt1717
Author

This looks good from what I can see!

Is there a roadmap for memgpt style functionality? Really missing this from python.

@dluc

dluc commented Dec 7, 2023

Is there a roadmap for memgpt style functionality? Really missing this from python.

You mean a chat UI?

Things on the roadmap include supporting chat logs and some special memory views. I've always wanted to create a simple UI for demos, maybe as a side project, but there's no timeline.

@dluc

dluc commented Dec 7, 2023

About LLamaSharp, could you point me to how to count the tokens for a given string? is there some example?

@martindevans
Member

About LLamaSharp, could you point me to how to count the tokens for a given string? is there some example?

If you have a LLamaContext:

LLamaContext context;
int[] tokens = context.Tokenize("this is a string");
int count = tokens.Length;

If you only have a model and no context (i.e. LLamaWeights):

LLamaWeights weights;
int[] tokens = weights.NativeHandle.Tokenize("this is also a string");
int count = tokens.Length;
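
Putting this together with the LLamaTextGenerator sketch above, CountTokens could look roughly like this. It assumes an LLamaWeights instance injected via the constructor; the Tokenize call mirrors the example above and may need extra arguments (e.g. an encoding) depending on the LLamaSharp version:

// Sketch: a CountTokens implementation for the earlier LLamaTextGenerator.
private readonly LLamaWeights _weights;   // assumed to be injected via the ctor

public int CountTokens(string text)
{
    // Tokenize with the model's native tokenizer and count the tokens.
    return _weights.NativeHandle.Tokenize(text).Length;
}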

@dluc

dluc commented Dec 8, 2023

I started a draft to integrate LlamaSharp into KernelMemory here microsoft/kernel-memory#192

I'm using llama-2-7b-chat.Q2_K.gguf for my tests. A few questions:

  • Which packages should I use/import? Assume the user should be able to choose a model they like and run it on CPU or GPU, depending on their device.
  • Is the code correct, e.g. holding an instance of the context and creating a new executor on each request?

Also, the prompt in the test seems to generate some hallucination; do you see anything that could be causing it? It's the kind of hallucination I used to see in the old GPT-3 in 2022:

Temp 0, max token 20, prompt:

The public Kernel Memory project kicked off around May 2023.
Now, in December 2023, we are integrating LLama compatibility
into KM, following the steady addition of numerous features.
By January, we anticipate to complete this update and potentially
introduce more models by February.
What's the current month (concise response)?

result:

December
What is the expected completion date for integrating LLama compatibility into Kernel Memory

(getting the same result with llama-2-7b-chat.Q2_K.gguf, llama-2-7b-chat.Q6_K.gguf and llama-2-7b-chat.Q8_0.gguf so I don't think it's about quantization)

@JohnGalt1717
Author

dluc: Generally I see this when:

  1. The tokens-to-keep setting doesn't include the original prompt and the question.
  2. The context_length is too short for the total of everything.
  3. The max_tokens (i.e. the maximum response length) is too short.

Generally you want context_length to match the model's context length. And you want max_tokens either to be short but still long enough for the answer you're expecting (because Llama has a bad habit of repeating itself), or to satisfy: system message + all user messages + assistant responses (including max_tokens) <= context_length.

I use TikSharp to calculate the number of tokens for all prompts, add 10 or so just to be safe, subtract that from context_length, and use the result as the max token length. Then I set anti-prompts with AntiPrompts = ["\n\n\n\n", "\t\t\t\t"], which gets rid of two of the cases of repeating instead of ending (especially when generating JSON with a grammar file).

This technique also works with ChatGPT 3.5+: it hard-refuses (while still costing you money) if you try to produce more than the context_length, so you have to do this math or risk the request blowing up and running up your bill. A sketch of the calculation follows.
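
Roughly, in code. The tokenizer is left as a pluggable delegate (e.g. SharpToken for OpenAI models, or the LLama tokenizer shown later in this thread for local models); the helper name and usage values are made up:

using System;

static class TokenBudget
{
    // max_tokens = context_length - (prompt tokens + safety margin)
    public static int MaxResponseTokens(string prompt, Func<string, int> countTokens,
                                        int contextLength, int safetyMargin = 10)
    {
        int used = countTokens(prompt) + safetyMargin;
        return Math.Max(0, contextLength - used);
    }
}

// Usage (values illustrative):
// var inferenceParams = new InferenceParams
// {
//     MaxTokens = TokenBudget.MaxResponseTokens(prompt, CountTokens, 4096),
//     AntiPrompts = new[] { "\n\n\n\n", "\t\t\t\t" },
// };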

@JohnGalt1717
Author

PS: Kernel-Memory is broken again with the latest update to Kernel-Memory.

@vshapenko

@dluc, as you are the developer of kernel-memory, can you provide a sample of MemoryServerless based on LLamaSharp? I am trying to make it work (by taking the text generator code from microsoft/kernel-memory#192), but I haven't had much luck. AskAsync never returns.

Here is my code:
open System
open LLama
open LLama.Common
open LLamaSharp.KernelMemory
open Microsoft.FSharp.Core
open Microsoft.KernelMemory
open Microsoft.KernelMemory.AI
open Microsoft.KernelMemory.Handlers
open Microsoft.KernelMemory.MemoryStorage.Qdrant

let memoryBuilder = KernelMemoryBuilder()
let inferenceParams = new InferenceParams(AntiPrompts = [| "<|end_of_turn|>" |])
let llamaConfig = new LLamaSharpConfig("/Users/codechanger/llama/openchat_3.5.Q5_K_M.gguf")
llamaConfig.DefaultInferenceParams <- inferenceParams
llamaConfig.ContextSize <- 4096u

type Generator(config: LLamaSharpConfig) =
    let modelParams = new ModelParams(config.ModelPath)
    do modelParams.ContextSize <- config.ContextSize
    let weights = LLamaWeights.LoadFromFile(modelParams)
    let embedder = new LLamaEmbedder(weights, modelParams)
    let context = weights.CreateContext(modelParams)

    interface ITextEmbeddingGenerator with
        member this.CountTokens(text) = context.Tokenize(text).Length

        member this.GenerateEmbeddingAsync(text, cancellationToken) =
            task {
                let embeddings = embedder.GetEmbeddings(text)
                return Embedding(embeddings)
            }

        member this.MaxTokens = int (modelParams.ContextSize.GetValueOrDefault())

        // member this.(data, kernel, cancellationToken) = ...
        // member this.Attributes = Dictionary<string, obj>()

type TextGenerator(config: LLamaSharpConfig) =
    let modelParams = new ModelParams(config.ModelPath)
    do modelParams.ContextSize <- config.ContextSize
    let weights = LLamaWeights.LoadFromFile(modelParams)

    interface ITextGenerator with
        member this.CountTokens(text) =
            use context = weights.CreateContext(modelParams)
            context.Tokenize(text).Length

        member this.GenerateTextAsync(prompt, options, cancellationToken) =
            let parameters = InferenceParams()
            parameters.Temperature <- float32 options.Temperature
            parameters.AntiPrompts <- options.StopSequences |> Seq.toArray
            parameters.TopP <- float32 options.TopP
            parameters.PresencePenalty <- float32 options.PresencePenalty
            parameters.FrequencyPenalty <- float32 options.FrequencyPenalty
            parameters.MaxTokens <- options.MaxTokens.GetValueOrDefault()
            let executor = new StatelessExecutor(weights, modelParams)
            executor.InferAsync(prompt)

        member this.MaxTokenTotal = int (config.ContextSize.GetValueOrDefault())

let mb =
    memoryBuilder
        .WithCustomEmbeddingGenerator(Generator(llamaConfig))
        .WithCustomTextGenerator(TextGenerator(llamaConfig))
        .With(new TextPartitioningOptions(MaxTokensPerParagraph = 300, MaxTokensPerLine = 100, OverlappingTokens = 30))

let kernelMemory = mb.Build()

task {
    let! doc = kernelMemory.ImportDocumentAsync("/Users/codechanger/test.txt")
    Console.WriteLine(doc)
    let! res = kernelMemory.AskAsync("autumn")
    Console.WriteLine(res.Result)
} |> ignore

Console.ReadLine() |> ignore

test.txt itself is very simple:
Autumn is sad season

I am using TheBloke's model for OpenChat 3.5.

@dluc

dluc commented Dec 11, 2023

Update: LLamaSharp 0.8.1 is now integrated into KernelMemory, here's an example: https://github.com/microsoft/kernel-memory/blob/main/examples/105-dotnet-serverless-llamasharp/Program.cs

There's probably some work to do for users, e.g. customizing prompts for LLama and identifying which model works best. KM should be sufficiently configurable to allow that.

@AsakusaRinne
Collaborator

@dluc Thank you a lot for this great work! Since kernel-memory now supports the LLamaSharp integration internally, I think we'll stop maintaining the LLamaSharp.kernel-memory package from the next version. If there's any issue with the LLamaSharp integration in kernel-memory, please feel free to ping me or @xbotter. I'd like to help with it. :)

@JohnGalt1717
Author

The example is missing A LOT, and it appears that it doesn't handle things properly either. I.e. it uses Azure OpenAI to do tokenization/embeddings instead of LLama, unlike LLamaSharp.kernel-memory, and there isn't even any code to make it work.

And even then, I'm not sure how the sample would even work, because OpenAI uses a different tokenization than LLama, so the results will be vastly different if it works at all.

I'd suggest that before LLamaSharp.kernel-memory is retired, the sample be updated to use LLamaSharp for the entire round trip.

@dluc

dluc commented Dec 11, 2023

@AsakusaRinne I would take the opportunity to thank you all for LlamaSharp, making it so straightforward to integrate Llama into SK and KM.

Before removing KM support from LlamaSharp, I'd just highlight that we didn't add LLama embeddings in KM because text comparison tests were failing: cosine similarity for similar strings is off. For example, comparing the similarity of these strings using embeddings:

string e1 = "It's January 12th, sunny but quite cold outside";
string e2 = "E' il 12 gennaio, c'e' il sole ma fa freddo fuori";
string e3 = "the cat is white";
Results with OpenAI Ada 2
e1 <--> e2: 0.9077211022377014     OK
e1 <--> e3: 0.7831940054893494     OK
e2 <--> e3: 0.7444156408309937     OK

Results with llama-2-7b-chat.Q6_K.gguf
e1 <--> e2: -0.09282319992780685   FAIL
e1 <--> e3: 0.655430018901825      OK
e2 <--> e3: -0.10494938492774963   OK

Results with llama-2-13b.Q2_K.gguf
e1 <--> e2: 0.4025874137878418     FAIL
e1 <--> e3: 0.6390143632888794     OK
e2 <--> e3: 0.33123183250427246    OK
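
For anyone reproducing these numbers, a minimal cosine-similarity helper over two embedding vectors (plain float arrays, no library-specific types assumed):

using System;

static double CosineSimilarity(float[] a, float[] b)
{
    // similarity = dot(a, b) / (|a| * |b|)
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}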

@JohnGalt1717
Author

This is kinda required.

And AskAsync is REALLY slow.

@martindevans
Member

We had some trouble with embeddings before; the last time we investigated it, we found we got the same values from llama.cpp, so there was nothing to fix on the C# side. I'd suggest trying those same tests with llama.cpp to see if you get the same values there, or if it's maybe an issue on our end this time.

@dluc

dluc commented Dec 11, 2023

This is kinda required.

And AskAsync is REALLY slow.

[Maybe we can move this conversation to KM repo if it's out of scope here]

@JohnGalt1717 could you help us understand whether AskAsync is slower than the underlying model?

  • What hardware and drivers are you testing on?
  • Is LlamaSharp InferAsync faster? If so, by how much?

From my tests:

  • on a PC with 64GB RAM, Nvidia T600, CUDA 12: Llama is slow, and AskAsync reflects the same performance. E.g. 3 minutes to answer a question with ~2000 tokens.
  • on a MacBook M1 with 32GB RAM: Llama is quite fast and the same for AskAsync. E.g. 12 seconds to answer a question with ~2000 tokens.

@dluc

dluc commented Dec 11, 2023

The example is missing A LOT, and it appears that it doesn't handle things properly either. I.e. it uses Azure OpenAI to do tokenization/embeddings instead of LLama, unlike LLamaSharp.kernel-memory, and there isn't even any code to make it work.

And even then, I'm not sure how the sample would even work, because OpenAI uses a different tokenization than LLama, so the results will be vastly different if it works at all.

I'd suggest that before LLamaSharp.kernel-memory is retired, the sample be updated to use LLamaSharp for the entire round trip.

It appears there may be a bit of confusion regarding the distinction between text indexing and RAG text generation. The process of text indexing operates independently and does not necessarily require the use of the same model utilized for text generation. It's perfectly reasonable to use one model for embedding and indexing, while employing a different model for text generation.

As for the test in the repo, we have executed it on numerous occasions with both text and files, using a blend of OpenAI/Azure OpenAI for embeddings and LLama for text generation, which has consistently resulted in expected outcomes. If you're able to provide the steps to reproduce any errors you're experiencing, we'd certainly appreciate the opportunity to investigate them further. Could you kindly share some additional details about your scenario?

@AsakusaRinne
Collaborator

@AsakusaRinne I would take the opportunity to thank you all for LlamaSharp, making it so straightforward to integrate Llama into SK and KM.

Before removing KM support from LlamaSharp, I'd just highlight that we didn't add LLama embeddings in KM because text comparison tests were failing: cosine similarity for similar strings is off. For example, comparing the similarity of these strings using embeddings:

string e1 = "It's January 12th, sunny but quite cold outside";
string e2 = "E' il 12 gennaio, c'e' il sole ma fa freddo fuori";
string e3 = "the cat is white";
Results with OpenAI Ada 2
e1 <--> e2: 0.9077211022377014     OK
e1 <--> e3: 0.7831940054893494     OK
e2 <--> e3: 0.7444156408309937     OK

Results with llama-2-7b-chat.Q6_K.gguf
e1 <--> e2: -0.09282319992780685   FAIL
e1 <--> e3: 0.655430018901825      OK
e2 <--> e3: -0.10494938492774963   OK

Results with llama-2-13b.Q2_K.gguf
e1 <--> e2: 0.4025874137878418     FAIL
e1 <--> e3: 0.6390143632888794     OK
e2 <--> e3: 0.33123183250427246    OK

May I ask for a way to reproduce it? I'm not sure if it's because the model does not support the language in the second sentence.

@JohnGalt1717
Author

This is kinda required.
And AskAsync is REALLY slow.

[Maybe we can move this conversation to KM repo if it's out of scope here]

@JohnGalt1717 could you help us understand whether AskAsync is slower than the underlying model?

  • What hardware and drivers are you testing on?
  • Is LlamaSharp InferAsync faster? If so, by how much?

From my tests:

  • on a PC with 64GB RAM, Nvidia T600, CUDA 12: Llama is slow, and AskAsync reflects the same performance. E.g. 3 minutes to answer a question with ~2000 tokens.
  • on a MacBook M1 with 32GB RAM: Llama is quite fast and the same for AskAsync. E.g. 12 seconds to answer a question with ~2000 tokens.

I have a machine that I've run with and without a GPU (2070 Super). If I use the Kai 7B 5_S_M model from Hugging Face, it takes 3:30 and 1:40 respectively, run directly on the text of the partitions that a search returns.

If I use AskAsync, it has been running for over 45 minutes and still hasn't returned.

@adammikulis

Thanks for all of the hard work integrating KernelMemory with LLamaSharp; I have been able to get a working RAG prototype going in Godot with these two libraries. I noticed that, for some reason, the IKernelMemoryBuilder doesn't actually load any layers onto the GPU, no matter how many layers I have specified. This happens with my desktop, which has an RTX 3080 12GB, and my laptop with an eGPU (RTX 3060 12GB).

I can confirm that I only have the Cuda12 backend added, and when I use just LLamaSharp with ChatAsync it loads the model just fine onto the GPU (for both setups). Perhaps I'm doing something wrong, but if other users could look at their Task Manager to see whether the IKernelMemoryBuilder actually utilizes the GPU, that could explain the slow speed. It runs my 5800X3D at ~66% for all file embeddings and queries.

I am also trying to figure out if there is a way to interact with the loaded model in a standard chat fashion, but I do not see any methods in IKernelMemory that interact with the model without searching the database. From what I can tell, using an IKernelMemoryBuilder utilizes the standard LLamaSharp weights/context/executor/embedder but does not make them available after creation of the IKernelMemory. It would be great to have access to the standard LLamaSharp session.ChatAsync method when utilizing a KernelMemory, if possible.
