A wrapper around lm-eval-harness and Unitxt, designed to evaluate a local inference endpoint.

Prerequisites:
- Python 3.10 or newer
- An OpenAI API-compatible inference server, such as vLLM (a serving sketch follows this list)
- A directory containing the datasets required by the benchmark (see the example below)
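For instance, vLLM can serve a model behind an OpenAI API-compatible endpoint. A minimal sketch, assuming vLLM is installed and the model weights are available; the model name and port are illustrative, not requirements of this tool:

# Serve a model with vLLM on the port used in the example below
vllm serve meta-llama/Llama-3.1-8B --port 8080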
# Create a virtual environment
python -m venv venv
source venv/bin/activate
# Install the package
pip install git+https://github.com/sjmonson/llm-eval-test.git
# View download options
llm-eval-test download --help
usage: llm-eval-test download [-h] [--catalog-path PATH] [--tasks-path PATH] [--offline | --no-offline] [-v | -q] -t TASKS [-d DATASETS] [-f | --force-download | --no-force-download]
download datasets for open-llm-v1 tasks
options:
-h, --help show this help message and exit
-t TASKS, --tasks TASKS
comma-separated list of tasks to download, for example: arc_challenge,hellaswag
-d DATASETS, --datasets DATASETS
Dataset directory
-f, --force-download, --no-force-download
Force download of datasets even if they already exist
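Putting the documented flags together, a typical download might look like this (task names and the dataset path are illustrative):

# Re-download two tasks into a shared dataset directory
llm-eval-test download --datasets ./datasets --tasks arc_challenge,hellaswag --force-download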
# View run options
llm-eval-test run --help
usage: llm-eval-test run [-h] [--catalog-path PATH] [--tasks-path PATH] [--offline | --no-offline] [-v | -q] -H ENDPOINT -m MODEL -t TASKS -d PATH [-T TOKENIZER] [-b INT] [-r INT] [-o OUTPUT | --no-output] [--format {full,summary}] [--chat-template | --no-chat-template]
Run tasks
options:
-h, --help show this help message and exit
--catalog-path PATH unitxt catalog directory
--tasks-path PATH lm-eval tasks directory
--offline, --no-offline
Disable/enable updating datasets from the internet
-v, --verbose set loglevel to DEBUG
-q, --quiet set loglevel to ERROR
-T, --tokenizer TOKENIZER
path or Hugging Face tokenizer name; if unset, the model name is used (default: None)
-b, --batch INT per-request batch size
-r, --retry INT max number of times to retry a single request
-o, --output OUTPUT results output file
--no-output disable results output file
--format {full,summary}
format of output file
required:
-H, --endpoint ENDPOINT
OpenAI API-compatible endpoint
-m, --model MODEL name of the model under test
-t, --tasks TASKS comma separated list of tasks
-d, --datasets PATH path to dataset storage
prompt parameters:
these modify the prompt sent to the server and thus will affect the results
--chat-template, --no-chat-template
use chat template for requests
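As a sketch of how the optional flags combine (the values here are illustrative, not recommendations):

# Run with retries, a per-request batch size, a chat template, and a summary-only output file
llm-eval-test run --endpoint http://127.0.0.1:8080/v1/completions \
    --model meta-llama/Llama-3.1-8B \
    --tasks mmlu_pro --datasets ./datasets \
    --batch 8 --retry 3 --chat-template --format summary --output results.json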
A full example, end to end:

# Create the dataset directory
DATASETS_DIR=$(pwd)/datasets
mkdir -p "$DATASETS_DIR"
# Download the MMLU-Pro dataset (TIGER-Lab/MMLU-Pro on Hugging Face)
llm-eval-test download --datasets "$DATASETS_DIR" --tasks mmlu_pro
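The exact on-disk layout depends on the task, but you can inspect what was fetched:

# List the downloaded dataset files (layout varies by task)
find "$DATASETS_DIR" -maxdepth 2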
# Run the benchmark
ENDPOINT=http://127.0.0.1:8080/v1/completions # An OpenAI API-compatible completions endpoint
MODEL_NAME=meta-llama/Llama-3.1-8B # Name of the model hosted on the inference server
# Optionally override the tokenizer (defaults to the model name)
TOKENIZER=ibm-granite/granite-3.1-8b-instruct
llm-eval-test run --endpoint $ENDPOINT --model $MODEL_NAME --tokenizer $TOKENIZER --datasets "$DATASETS_DIR" --tasks mmlu_pro
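If the run fails immediately, it can help to confirm that the endpoint answers a plain completions request first. A minimal check using the standard OpenAI completions payload (the prompt and token count are arbitrary):

# Sanity-check the endpoint with a single completions request
curl -s $ENDPOINT \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$MODEL_NAME\", \"prompt\": \"Hello\", \"max_tokens\": 8}"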