
Getting different scores for medalpaca 7B and medalpaca 13b #47

Open
anand-subu opened this issue Oct 24, 2023 · 2 comments
anand-subu commented Oct 24, 2023

Hi there!

Great work with medalpaca! I was trying to reproduce your scores on the USMLE eval sets for medalpaca 7B and medalpaca 13B. However, when I run the notebook shared in #40, I'm getting the following scores:

[screenshot: accuracy scores computed by the notebook]

To double-check, I also calculated the scores directly myself, ignoring the questions with images, and I got the same scores as the notebook.

I ran the eval code as follows:

python eval_usmle.py --model_name medalpaca/medalpaca-7b --prompt_template ../medalpaca/prompt_templates/medalpaca.json --base_model False --peft False --load_in_8bit False --path_to_exams medical_meadow_usmle_self_assessment
python eval_usmle.py --model_name medalpaca/medalpaca-13b --prompt_template ../medalpaca/prompt_templates/medalpaca.json --base_model False --peft False --load_in_8bit False --path_to_exams medical_meadow_usmle_self_assessment

But I'm still seeing considerable differences. The medalpaca-7B scores are quite close to those reported in the GitHub README, but that's not the case for medalpaca-13B. Could you let me know if I might be doing something wrong on my side?

Thank you!

samuelvkwong commented:

I am also having trouble reproducing the scores on the USMLE eval sets for medalpaca-13b, following the same steps as noted above. I had to make changes to the notebook used to compute the scores, because there is a discrepancy between how the notebook assumes the answers in the generated JSON files should look and how they actually look.
In the evaluation script, if the response is not in the correct format, the LM is re-prompted up to 5 times, and only the final response is saved as the answer (none of the earlier responses are kept):

question["answer"] = response

The scoring notebook, however, assumes that all responses up to the final answer are saved as "answer_1", "answer_2", etc., which does not match the setup in the evaluation script, where only "answer" contains the final answer. Here is the adjusted notebook: eval_usmle_edited.ipynb.zip
Here are the scores:
[screenshot: step 1/2/3 scores from the adjusted notebook]
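For reference, the scoring logic I ended up with boils down to something like this (a minimal Python sketch, not the exact notebook cells; the file names and the "gold_answer"/"image" field names are illustrative):

```python
import json

def score(generated_path: str, gold_path: str) -> float:
    """Exact-match accuracy over the non-image questions."""
    with open(generated_path) as f:
        generated = json.load(f)   # output written by eval_usmle.py
    with open(gold_path) as f:
        gold = json.load(f)        # original exam file with the reference letters

    correct, total = 0, 0
    for pred, ref in zip(generated, gold):
        if ref.get("image"):       # skip image-based questions, as in the notebook
            continue
        total += 1
        # Only the final "answer" field is populated by the eval script,
        # so take its first character as the predicted option letter.
        letter = pred["answer"].strip()[:1].upper()
        if letter == ref["gold_answer"].strip()[:1].upper():
            correct += 1
    return correct / total

print(score("medalpaca-13b_step1_generated.json", "step1.json"))
```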

When I took a look at the generated answers in the JSON files, I noticed that a lot of the answers were unintelligible (the answer did not start with a letter option, or multiple letter options were given).
So I prepared my own evaluation script aiming to fix what I thought were formatting errors. In my script I load medalpaca-13b in a Hugging Face pipeline and use few-shot prompting to make it more likely that the answer comes back in the correct format. To ensure the answer maps to a valid option, I also pass the answer and the list of options to an OpenAI LM, asking it to select the option closest to the provided answer. With my script, the EM scores for Step 1, Step 2, and Step 3 are 0.250, 0.257, and 0.290 respectively. There is an improvement (most likely from fixing the formatting issues), but the results are still quite different from the reported scores for medalpaca-13b.
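For context, the core of my script is roughly the following (a simplified sketch: the few-shot examples, generation settings, and the OpenAI model used to map answers onto options are illustrative, not my exact code):

```python
from transformers import pipeline
from openai import OpenAI

generator = pipeline(
    "text-generation",
    model="medalpaca/medalpaca-13b",
    device_map="auto",
)
client = OpenAI()  # used only to map free-text answers onto the closest option letter

# Placeholder few-shot block; the real prompt contains full worked examples.
FEWSHOT = "Question: ...\nOptions: A) ... B) ... C) ... D) ...\nAnswer: B\n\n"

def answer_question(question: str, options: dict) -> str:
    opts = " ".join(f"{k}) {v}" for k, v in options.items())
    prompt = f"{FEWSHOT}Question: {question}\nOptions: {opts}\nAnswer:"
    raw = generator(prompt, max_new_tokens=32, do_sample=False)[0]["generated_text"]
    raw = raw[len(prompt):].strip()

    # Ask an OpenAI model to pick the option letter closest to the raw answer.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Options: {opts}\nModel answer: {raw}\n"
                "Reply with the single letter of the option closest to the model answer."
            ),
        }],
    )
    return resp.choices[0].message.content.strip()[:1].upper()
```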

From what I read in your paper, the reported scores were obtained zero-shot, with no additional prompting techniques. Could you let me know if there is something I am missing, or whether you've been able to reproduce the scores recently?

jzy-dyania commented:

I tried inference with the Hugging Face pipeline, which gave higher results, but they are still 4–7% lower than the reported USMLE scores. Does anyone have an idea about the discrepancy between the two methods?
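For reference, this is roughly how I ran it (a minimal sketch; the zero-shot prompt format is an assumption and may not match the template the authors used):

```python
from transformers import pipeline

pl = pipeline(
    "text-generation",
    model="medalpaca/medalpaca-13b",
    tokenizer="medalpaca/medalpaca-13b",
    device_map="auto",
)

def ask(question: str, options: dict) -> str:
    opts = " ".join(f"{k}) {v}" for k, v in options.items())
    prompt = f"Question: {question}\nOptions: {opts}\nAnswer:"
    out = pl(prompt, max_new_tokens=16, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()
```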
