
Incorrect scoring due to answer format mismatch in MMLU evaluation #2939

Open
DerryChan opened this issue Aug 16, 2024 · 1 comment

@DerryChan

Description

During the MMLU evaluation of our LLM, we encountered an issue where correct answers are being marked as incorrect due to format mismatches. Specifically, when the model outputs the correct numerical answer but includes a preceding letter (e.g., "C. 12" instead of just "12"), the scoring system fails to recognize it as correct.

Current Behavior

The scoring system marks answers as incorrect if they don't exactly match the reference answer format, even if the numerical value is correct.

Expected Behavior

The scoring system should be able to correctly identify and score answers that are numerically correct, regardless of minor formatting differences such as preceding letters.
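
For illustration, a lenient comparison along these lines would treat "C. 12" and "12" as equivalent (a minimal Python sketch; normalize_answer and its regex are hypothetical, not part of the existing scoring code):

```python
import re

def normalize_answer(text: str) -> str:
    """Strip an optional leading choice letter such as "C." or "(C)".

    Hypothetical sketch of the normalization the scorer could apply
    before comparing against the reference answer.
    """
    # Drop a leading "C.", "C)", or "(C)" prefix when one is present.
    return re.sub(r"^\s*\(?[A-D]\)?[.)]?\s*", "", text.strip())

assert normalize_answer("C. 12") == normalize_answer("12") == "12"
```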

@yifanmai
Collaborator

Hi @DerryChan, unfortunately this is something we don't plan to support for the default built-in MMLU scenario.

Some suggestions that you could try for your use case:

  • You could change your model to respect the max_tokens parameter, which is set to 1 for MMLU. This will usually cause the model to output only the letter.
  • If your model is instruction-tuned, you can try adding an additional prompt that tells the model to only respond with a single letter. In particular, adding output_format_instructions=mmlu to your run entry (e.g. mmlu:output_format_instructions=mmlu,model=text) will add "Answer with only a single letter." to the prompt.
  • You could implement your own MMLU variant with a modified metric that performs the additional post-processing needed to interpret your model's output (see the sketch after this list).
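
For the last option, the extra post-processing might look something like the following (an illustrative Python sketch; extract_choice_letter and lenient_exact_match are hypothetical names, not part of HELM's metric interface):

```python
import re
from typing import Optional

def extract_choice_letter(completion: str) -> Optional[str]:
    """Pull the chosen letter out of outputs like "C. 12" or "(C) 12".

    Naive sketch: it only inspects the start of the completion, so
    free-form text that happens to begin with A-D would also match.
    """
    match = re.match(r"\s*\(?([A-D])\)?[.):]?", completion)
    return match.group(1) if match else None

def lenient_exact_match(completion: str, reference_letter: str) -> bool:
    # Count the completion as correct if its extracted letter matches.
    return extract_choice_letter(completion) == reference_letter

assert lenient_exact_match("C. 12", "C")
assert not lenient_exact_match("D. 7", "C")
```

A custom scenario would then wire a function like this into its metric in place of the default exact match.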
