
# LLM360 Evaluation

## Amber

Before running this script, please run `wandb login`.

```bash
cd scripts
python evaluate.py \
    --experiment_ckpt /lustre/scratch/users/<home directory>/<model directory>/workdir_7b/ \
    --experiment_name <experiment name> \
    --output_folder ../output/<output folder> \
    --run_every 5
```

Parameter definitions:

- `experiment_ckpt` is the path to the directory containing all experiment checkpoints.
- `experiment_name` is the experiment name used for wandb.
- `output_folder` is the path to the output folder.
- `run_every` evaluates only every Nth checkpoint, i.e. checkpoints whose index is a multiple of this value.

This script should be run inside a tmux session: it runs in an infinite loop, checking for new checkpoints and periodically uploading evaluation scores to wandb.
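The actual logic lives in `evaluate.py`; the snippet below is only a minimal sketch of that polling pattern, assuming checkpoints appear as numbered subdirectories under `--experiment_ckpt`. The directory naming, the wandb project name, and the `run_model_eval` helper are illustrative assumptions, not the script's real API.

```python
import os
import time
import wandb


def run_model_eval(ckpt_path):
    """Placeholder for the real evaluation of a single checkpoint."""
    return {"dummy_score": 0.0}  # replace with actual benchmark results


def poll_and_evaluate(ckpt_root, experiment_name, run_every=5, sleep_s=600):
    """Watch ckpt_root for new checkpoints and log scores to wandb."""
    wandb.init(project="llm360-eval", name=experiment_name)  # project name is an assumption
    evaluated = set()
    while True:  # runs forever inside tmux; new checkpoints are picked up as they appear
        # Assume checkpoint dirs are named like "ckpt_100", "ckpt_200", ...
        ckpts = sorted(d for d in os.listdir(ckpt_root) if d.startswith("ckpt_"))
        for i, ckpt in enumerate(ckpts):
            # Skip already-evaluated checkpoints and keep only every run_every-th one
            if ckpt in evaluated or i % run_every != 0:
                continue
            scores = run_model_eval(os.path.join(ckpt_root, ckpt))
            wandb.log({"checkpoint": ckpt, **scores})
            evaluated.add(ckpt)
        time.sleep(sleep_s)  # wait before checking for new checkpoints again
```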

## CrystalCoder

We rely on the BigCode evaluation harness and the EleutherAI lm-evaluation-harness to run evaluations for CrystalCoder. `crystalcoder_eval.py` records all the configurations for the tests we have run so far.

Sample commands:

- For the BigCode harness:

  ```bash
  CUDA_VISIBLE_DEVICES=0 python bigcode-evaluation-harness/main.py \
      --model <MODEL> \
      --batch_size=1 \
      --max_length_generation <MAX_LEN> \
      --n_samples 1 \
      --temperature <TEMP> \
      --tasks humaneval \
      --allow_code_execution \
      --trust_remote_code \
      --save_generations \
      --save_generations_path <YOUR_PATH>/<SOME_OUTPUT_NAME>.json \
      --metric_output_path <YOUR_PATH>/<SOME_OUTPUT_NAME>.json \
      --precision bf16
  ```

- For the lm-evaluation-harness:

  ```bash
  CUDA_VISIBLE_DEVICES=0 python lm-evaluation-harness/main.py \
      --no_cache \
      --model=hf-causal-experimental \
      --batch_size=2 \
      --model_args="pretrained=<MODEL>,trust_remote_code=True,dtype=bfloat16" \
      --tasks=<TASK> \
      --num_fewshot=<KSHOT> \
      --output_path=<YOUR_PATH>/<SOME_OUTPUT_NAME>.json
  ```
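When sweeping the lm-harness command over several tasks and few-shot settings, a small driver script can save repetition. The sketch below simply shells out to the command above; the model name, task list, k-shot values, and output locations are placeholder assumptions for illustration, not settings taken from `crystalcoder_eval.py`.

```python
import os
import subprocess

# Placeholder values; substitute your own model and paths.
MODEL = "LLM360/CrystalCoder"
OUTPUT_DIR = "results"
TASKS = [("arc_challenge", 25), ("hellaswag", 10), ("truthfulqa_mc", 0)]  # (task, k-shot) assumptions

os.makedirs(OUTPUT_DIR, exist_ok=True)
for task, kshot in TASKS:
    out_path = f"{OUTPUT_DIR}/{task}_{kshot}shot.json"
    cmd = [
        "python", "lm-evaluation-harness/main.py",
        "--no_cache",
        "--model=hf-causal-experimental",
        "--batch_size=2",
        f"--model_args=pretrained={MODEL},trust_remote_code=True,dtype=bfloat16",
        f"--tasks={task}",
        f"--num_fewshot={kshot}",
        f"--output_path={out_path}",
    ]
    # Pin the run to one GPU, mirroring CUDA_VISIBLE_DEVICES=0 in the command above.
    subprocess.run(cmd, check=True, env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"})
```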