Add common benchmarks #50
Comments
Can I work on this?
@LuvvAggarwal Sure thing. The scope of this one is a bit large because we currently don't have any common benchmarks. I think a simple case would be the following: some benchmarks to start with would be HellaSwag and TruthfulQA, or perhaps simpler ones like ROUGE and BLEU. Feel free to deviate from this plan; it's just a suggestion for how to get started.
Thanks @steventkrawczyk for the guidance. Based on my initial research, I found a package called "Evaluate" that can provide the methods for evaluating the model. Please feel free to suggest better ways, as I am new to ML but would love to contribute.
@steventkrawczyk, can we use the "Datasets" library for loading the benchmark datasets instead of creating a separate directory? It can also be used for quick tests on a prebuilt dataset.
@LuvvAggarwal Using datasets sounds like a good start. As for evaluate, we want to write our own eval methods that support more than just Hugging Face (e.g. OpenAI, Anthropic).
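For example, loading a benchmark split with the `datasets` library could look like the sketch below. The dataset name and field names follow the public "hellaswag" dataset on the Hugging Face Hub; this is illustrative, not existing prompttools code.

```python
# Minimal sketch: pull the HellaSwag validation split and inspect one example.
# Assumes the public "hellaswag" dataset on the Hugging Face Hub.
from datasets import load_dataset

hellaswag = load_dataset("hellaswag", split="validation")

example = hellaswag[0]
print(example["ctx"])      # the context / prompt
print(example["endings"])  # candidate endings to choose from
print(example["label"])    # index of the correct ending (stored as a string)
```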
@steventkrawczyk Sure, but I have no idea about eval methods. It would be great if you could share some references so I can start coding.
For example, if you are using the hellaswag dataset, we need to compute the accuracy of the predictions, e.g. https://github.com/openai/evals/blob/main/evals/metrics.py#L12
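A minimal sketch of such an accuracy metric, similar in spirit to the openai/evals helper linked above (the result dictionaries and their "correct" field are hypothetical examples, not an existing prompttools API):

```python
from typing import Dict, List

def get_accuracy(results: List[Dict]) -> float:
    """Return the fraction of results whose 'correct' field is truthy."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("correct")) / len(results)

# Hypothetical per-example results for a HellaSwag-style multiple-choice task.
results = [
    {"prediction": 2, "label": 2, "correct": True},
    {"prediction": 0, "label": 3, "correct": False},
]
print(get_accuracy(results))  # 0.5
```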
@LuvvAggarwal I kick-started the code for benchmarks here if you would like to branch from it: #72
Thanks @HashemAlsaket, I will branch from it.
🚀 The feature
We need to add benchmark test sets so folks can run them on models / embeddings / systems.
A few essentials:
Motivation, pitch
Users have told us that they want to run academic benchmarks as "smoke tests" on new models.
Alternatives
No response
Additional context
No response