
Incorrect scoring due to answer format mismatch in MMLU evaluation #2939

Open
DerryChan opened this issue Aug 16, 2024 · 1 comment

@DerryChan

Description

During the MMLU evaluation of our LLM, we encountered an issue where correct answers are being marked as incorrect due to format mismatches. Specifically, when the model outputs the correct numerical answer but includes a preceding letter (e.g., "C. 12" instead of just "12"), the scoring system fails to recognize it as correct.

Current Behavior

The scoring system marks answers as incorrect if they don't exactly match the reference answer format, even if the numerical value is correct.

Expected Behavior

The scoring system should be able to correctly identify and score answers that are numerically correct, regardless of minor formatting differences such as preceding letters.
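
For illustration, a lenient comparison along these lines would treat "C. 12" and "12" as equivalent (a minimal Python sketch; normalize_answer and its regex are hypothetical, not part of the existing scoring code):

```python
import re

def normalize_answer(text: str) -> str:
    """Strip an optional leading choice letter such as "C." or "(C)".

    Hypothetical sketch of the normalization the scorer could apply
    before comparing against the reference answer.
    """
    # Drop a leading "C.", "C)", or "(C)" prefix when one is present.
    return re.sub(r"^\s*\(?[A-D]\)?[.)]?\s*", "", text.strip())

assert normalize_answer("C. 12") == normalize_answer("12") == "12"
```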

@yifanmai
Collaborator

Hi @DerryChan, unfortunately this is something we don't plan to support for the default built-in MMLU scenario.

Some suggestions that you could try for your use case:

  • You could change your model to respect the max_tokens parameter, which is set to 1 for MMLU. This will usually cause the model to output only the letter.
  • If your model is instruction-tuned, you can try adding an additional prompt that tells the model to only respond with a single letter. In particular, adding output_format_instructions=mmlu to your run entry (e.g. mmlu:output_format_instructions=mmlu,model=text) will add "Answer with only a single letter." to the prompt.
  • You could implement your own MMLU variant with a modified metric that performs the additional post-processing needed to interpret your model's output (see the sketch after this list).
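
For the last option, the extra post-processing might look something like the following (an illustrative Python sketch; extract_choice_letter and lenient_exact_match are hypothetical names, not part of HELM's metric interface):

```python
import re
from typing import Optional

def extract_choice_letter(completion: str) -> Optional[str]:
    """Pull the chosen letter out of outputs like "C. 12" or "(C) 12".

    Naive sketch: it only inspects the start of the completion, so
    free-form text that happens to begin with A-D would also match.
    """
    match = re.match(r"\s*\(?([A-D])\)?[.):]?", completion)
    return match.group(1) if match else None

def lenient_exact_match(completion: str, reference_letter: str) -> bool:
    # Count the completion as correct if its extracted letter matches.
    return extract_choice_letter(completion) == reference_letter

assert lenient_exact_match("C. 12", "C")
assert not lenient_exact_match("D. 7", "C")
```

A custom scenario would then wire a function like this into its metric in place of the default exact match.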
