
spike: small model presets experiments #976

Open · jalling97 opened this issue Sep 3, 2024 · 6 comments

jalling97 commented Sep 3, 2024

Model presets

LeapfrogAI currently has two primary models used on the backend, but more should be added and tested. By implementing a handful of small models and evaluating their efficacy from a human perspective, we can make better decisions about which models to use and evaluate against.

Goal

To determine a list of models and model configs that work well in LFAI from a human-in-the-loop perspective (no automated evals).

Methodology

  • Determine a short list of models (up to 5) to test in LFAI (sourcing from Hugging Face is the simplest way to do this)
  • Run LFAI in a local dev context, replacing the model backend with one of these choices
  • Vary the config parameters and gauge performance
  • Record the config used for each run (a minimal recording sketch follows this list)
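
For the "record config" step, here is a minimal sketch of one way to sweep parameters and track configs alongside human scores. The parameter names, values, file names, and CSV layout are illustrative assumptions, not LFAI conventions:

```python
# Minimal sketch: enumerate candidate configs and record human-in-the-loop scores.
# Model filenames, parameter grids, and the CSV schema are assumptions for illustration.
import csv
import itertools
from datetime import datetime, timezone

MODELS = ["model-a.Q4_K_M.gguf", "model-b.Q4_K_M.gguf"]  # hypothetical filenames
TEMPERATURES = [0.2, 0.7]
CONTEXT_LENGTHS = [4096, 8192]

with open("preset_experiments.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "model", "temperature", "n_ctx", "subjective_score", "notes"])
    for model, temp, n_ctx in itertools.product(MODELS, TEMPERATURES, CONTEXT_LENGTHS):
        # In practice: swap the LFAI backend to `model`, run the prompt set,
        # and have a human reviewer fill in the score (e.g. 1-5) and notes.
        score, notes = "", ""
        writer.writerow([datetime.now(timezone.utc).isoformat(), model, temp, n_ctx, score, notes])
```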

Limitations

  • Model licenses must be very permissive (e.g. MIT, Apache-2.0)
  • Models must be compatible with vLLM or llama-cpp-python (the two frameworks currently supported by LFAI)
  • Limited VRAM requirements (12-16 GB including model weights and context)

Delivery

  • A list of models, different config options, and a corresponding set of subjective scores gauging their performance within LFAI.
  • A report of the methodology used to evaluate these models (so the experiment can be replicated)
  • Potentially a repository that contains code used to run these evaluations (and what was evaluated on)
jalling97 added the spike label Sep 5, 2024
jxtngx commented Sep 6, 2024

I can take this.

Here's an example from AWS on collecting HIL feedback for LM evals:

https://github.com/aws-samples/human-in-the-loop-llm-eval-blog

Are the questions in that example appropriate for the intent of this experiment?

jxtngx commented Sep 6, 2024

I realize that not all of the models may be supported by llama-cpp-python or vLLM, and that I may need to add support for a custom model, especially for CPU-only use on macOS.

Here are two GGUF conversion examples for reference:

  1. https://github.com/ggerganov/llama.cpp/discussions/2948
  2. https://brev.dev/blog/convert-to-llamacpp
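
Once a model is converted to GGUF, a quick compatibility sanity check is to load it directly with llama-cpp-python and run a single chat completion. A minimal sketch; the file path and settings below are placeholders, not LFAI defaults:

```python
# Sanity check: load a converted GGUF with llama-cpp-python and generate once.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/candidate-model.Q4_K_M.gguf",  # hypothetical path to the converted model
    n_ctx=4096,       # context window to test against
    n_gpu_layers=0,   # CPU-only, e.g. on macOS
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what LeapfrogAI does in one sentence."}]
)
print(result["choices"][0]["message"]["content"])
```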

jxtngx commented Sep 6, 2024

As for the models that have proprietary licenses: if anything, it will be beneficial to run these subjective evals and provide feedback for the community.

From a user perspective, I'd probably opt for these models on my own and forgo the default selections, though I understand the need for the LFAI team to ship with a default model that is very permissive in terms of licensing.

jalling97 (Contributor, Author) commented:

For reference, I've updated the issue description a bit to help clarify a few things.

For the AWS HIL example, that framework looks like it makes sense for what we're asking for, so if you would like to use it as a basis, go for it!

I added these to the description, but we have a few limitations I didn't outline originally:

  • Model licenses must be very permissive (e.g. MIT, Apache-2.0)
  • Models must be compatible with vLLM or llama-cpp-python (the two frameworks currently supported by LFAI)
  • Limited VRAM requirements (12-16 GB including model weights and context)

If you're on macOS, we won't ask you to work outside of the deployment context available to you, so anything that can be run on llama-cpp-python is great. For simplicity, let's stick to model licenses that are at least as permissive as Apache-2.0.

The VRAM requirements are ideally less than 12 GB, but anything that fits under 16 GB is worth checking for our purposes (i.e. single-GPU laptop deployment scenarios); a rough estimate is sketched below. That likely means quantizations, so if you can find quantizations for models you want to test, great! We're also open to managing our own quantized models, so feel free to experiment with your own quantizations if you want to, but it's certainly not required.
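
As a rough illustration of why quantization makes the 12-16 GB budget workable, here is a back-of-envelope estimate. The architecture numbers are assumptions for an 8B-class model, and real deployments add runtime overhead on top:

```python
# Rough VRAM estimate: quantized weights plus an fp16 KV cache.
# Architecture constants below are illustrative, not tied to any LFAI default.
def estimate_vram_gb(n_params_b, bits_per_weight, n_layers, n_kv_heads, head_dim, n_ctx):
    weights_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9
    # K and V, one entry per layer per token, 2 bytes per fp16 element
    kv_cache_gb = 2 * n_layers * n_ctx * n_kv_heads * head_dim * 2 / 1e9
    return weights_gb + kv_cache_gb

# e.g. an 8B model quantized to ~4.5 bits/weight with an 8k context window
print(round(estimate_vram_gb(8, 4.5, 32, 8, 128, 8192), 1), "GB (plus runtime overhead)")
```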

It would be very helpful to compare any models you test against the current defaults from the docs (as you already listed); that would be a fantastic point of comparison.

jalling97 added this to the EVERGREEN milestone Sep 6, 2024
jxtngx commented Sep 6, 2024

Important

RWKV is a recurrent model architecture (paper)

RWKV is a different architecture from transformer-based models. The model arch is not available in llama-cpp, but it can be made available for use with llama-cpp-python by using the gguf-my-repo HF Space.

With regard to the HF Space, a user must understand the quantization types documented here:

https://huggingface.co/docs/hub/en/gguf#quantization-types

Note

Quantized, llama-cpp-python-compatible models will be made available in this HF collection.
