
Getting different scores for medalpaca 7B and medalpaca 13b #47

Open
anand-subu opened this issue Oct 24, 2023 · 2 comments
anand-subu commented Oct 24, 2023

Hi there!

Great work with medalpaca! I was trying to reproduce your scores on the USMLE eval sets for medalpaca 7B and medalpaca 13B. However, when I run the notebook shared in #40, I'm getting the following scores:

[screenshot: accuracy scores computed by the notebook]

To double-check, I also calculated the scores directly myself, ignoring the questions with images, and I got the same scores as the notebook.

I ran the eval code as follows:

python eval_usmle.py --model_name medalpaca/medalpaca-7b --prompt_template ../medalpaca/prompt_templates/medalpaca.json --base_model False --peft False --load_in_8bit False --path_to_exams medical_meadow_usmle_self_assessment
python eval_usmle.py --model_name medalpaca/medalpaca-13b --prompt_template ../medalpaca/prompt_templates/medalpaca.json --base_model False --peft False --load_in_8bit False --path_to_exams medical_meadow_usmle_self_assessment

But I'm still seeing considerable differences. The medalpaca-7B scores are quite close to those reported in the GitHub README, but that's not the case for medalpaca-13B. Could you let me know if I might be doing something wrong on my side?

Thank you!

samuelvkwong commented:

I am also having trouble reproducing the scores on the USMLE eval sets for medalpaca-13b, following the same steps as noted above. I had to make changes to the notebook used to compute the scores, because there is a discrepancy between how the notebook assumes the answers in the generated JSON files should look and how they actually look.
In the evaluation script, if the response is not in the correct format, the LM is re-prompted up to 5 times, and only the final response is saved as the answer (none of the earlier responses are kept):

question["answer"] = response

The scoring notebook, however, assumes that all responses up to the final answer are saved as "answer_1", "answer_2", etc., which does not match the setup in the evaluation script, where only "answer" contains the final answer. Here is the adjusted notebook: eval_usmle_edited.ipynb.zip
Here are the scores:
[screenshot: step 1/2/3 scores from the adjusted notebook]
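For reference, the scoring logic I ended up with boils down to something like this (a minimal Python sketch, not the exact notebook cells; the file names and the "gold_answer"/"image" field names are illustrative):

```python
import json

def score(generated_path: str, gold_path: str) -> float:
    """Exact-match accuracy over the non-image questions."""
    with open(generated_path) as f:
        generated = json.load(f)   # output written by eval_usmle.py
    with open(gold_path) as f:
        gold = json.load(f)        # original exam file with the reference letters

    correct, total = 0, 0
    for pred, ref in zip(generated, gold):
        if ref.get("image"):       # skip image-based questions, as in the notebook
            continue
        total += 1
        # Only the final "answer" field is populated by the eval script,
        # so take its first character as the predicted option letter.
        letter = pred["answer"].strip()[:1].upper()
        if letter == ref["gold_answer"].strip()[:1].upper():
            correct += 1
    return correct / total

print(score("medalpaca-13b_step1_generated.json", "step1.json"))
```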

When I took a look at the generated answers in the JSON files, I noticed that a lot of the answers were unintelligible (the answer did not start with a letter option, or multiple letter options were given).
So I prepared my own evaluation script aiming to fix what I thought were formatting errors. In my script I load medalpaca-13b in a Hugging Face pipeline and use few-shot prompting to make it more likely that the answer comes back in the correct format. To ensure the answer maps to a valid option, I also pass the answer and the list of options to an OpenAI LM, asking it to select the option closest to the provided answer. With my script, the EM scores for Step 1, Step 2, and Step 3 are 0.250, 0.257, and 0.290 respectively. There is an improvement (most likely from fixing the formatting issues), but the results are still quite different from the reported scores for medalpaca-13b.
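For context, the core of my script is roughly the following (a simplified sketch: the few-shot examples, generation settings, and the OpenAI model used to map answers onto options are illustrative, not my exact code):

```python
from transformers import pipeline
from openai import OpenAI

generator = pipeline(
    "text-generation",
    model="medalpaca/medalpaca-13b",
    device_map="auto",
)
client = OpenAI()  # used only to map free-text answers onto the closest option letter

# Placeholder few-shot block; the real prompt contains full worked examples.
FEWSHOT = "Question: ...\nOptions: A) ... B) ... C) ... D) ...\nAnswer: B\n\n"

def answer_question(question: str, options: dict) -> str:
    opts = " ".join(f"{k}) {v}" for k, v in options.items())
    prompt = f"{FEWSHOT}Question: {question}\nOptions: {opts}\nAnswer:"
    raw = generator(prompt, max_new_tokens=32, do_sample=False)[0]["generated_text"]
    raw = raw[len(prompt):].strip()

    # Ask an OpenAI model to pick the option letter closest to the raw answer.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Options: {opts}\nModel answer: {raw}\n"
                "Reply with the single letter of the option closest to the model answer."
            ),
        }],
    )
    return resp.choices[0].message.content.strip()[:1].upper()
```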

From what I read in your paper, the reported scores were obtained zero-shot, with no additional prompting techniques. Could you let me know if there is something I am missing, or whether you've been able to reproduce the scores recently?

jzy-dyania commented:

I tried inference with the Hugging Face pipeline, which gave higher results, but they are still 4–7% lower than the reported USMLE scores. Does anyone have an idea about the discrepancy between the two methods?
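For reference, this is roughly how I ran it (a minimal sketch; the zero-shot prompt format is an assumption and may not match the template the authors used):

```python
from transformers import pipeline

pl = pipeline(
    "text-generation",
    model="medalpaca/medalpaca-13b",
    tokenizer="medalpaca/medalpaca-13b",
    device_map="auto",
)

def ask(question: str, options: dict) -> str:
    opts = " ".join(f"{k}) {v}" for k, v in options.items())
    prompt = f"Question: {question}\nOptions: {opts}\nAnswer:"
    out = pl(prompt, max_new_tokens=16, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()
```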
