Getting different scores for medalpaca 7B and medalpaca 13b #47
Comments
I am also having trouble reproducing the scores on the USMLE eval sets for medalpaca-13b, following the same steps as noted above. I had to modify the scoring notebook because there is a discrepancy between how the notebook assumes the answers in the generated JSON files look and how they actually look (see line 169 in commit 63448c5).

The scoring notebook assumes that all responses up to the final answer are saved as "answer_1", "answer_2", etc., which does not match the setup in the evaluation script, where only "answer" contains the final answer. Here is the adjusted notebook: eval_usmle_edited.ipynb.zip

Here are the scores:

When I took a look at the generated answers in the JSON files, I noticed that a lot of them were unintelligible (the answer did not start with a letter option, or multiple letter options were given). From what I read in your paper, the reported scores were achieved zero-shot with no additional prompting techniques. Could you let me know if there is something I am missing, or whether you've been able to reproduce the scores recently?
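For context, the mismatch described above can be worked around with a small scoring helper. This is a minimal sketch, not the repo's actual notebook code; the record keys `"answer"` (raw model output, as the eval script writes it) and `"gold"` (gold letter) are assumptions for illustration, and the regex treats an answer as unintelligible exactly when it contains no letter option or more than one, as observed above.

```python
import re

# Hypothetical record layout (field names are assumptions, not from the repo):
# {"answer": "<raw model output>", "gold": "<gold letter A-E>"}
LETTER_RE = re.compile(r"\b([A-E])\b")

def extract_choice(raw: str):
    """Return the single letter option found near the start of the model
    output, or None if the answer is unintelligible (no letter option,
    or multiple conflicting letter options)."""
    letters = set(LETTER_RE.findall(raw.strip()[:40]))
    return letters.pop() if len(letters) == 1 else None

def score(records):
    """Accuracy over parseable answers, plus the count of unintelligible ones."""
    correct, parsed, garbled = 0, 0, 0
    for rec in records:
        choice = extract_choice(rec["answer"])
        if choice is None:
            garbled += 1
            continue
        parsed += 1
        correct += choice == rec["gold"]
    return (correct / parsed if parsed else 0.0), garbled
```

Counting `garbled` separately makes it easy to see how much of a score gap comes from unparseable generations rather than wrong answers.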
I tried inference with the Hugging Face pipeline, which gave higher results, but still 4–7% lower than the reported USMLE scores. Does anyone have an idea about the discrepancy between the two methods?
Hi there!
Great work with medalpaca! I was trying to reproduce your scores on the USMLE eval sets for medalpaca 7B and medalpaca 13B. However, when I run the notebook shared in #40, I'm getting the following scores:
To double-check, I also calculated the scores directly myself, ignoring the questions with images, and I got the same scores as the notebook.
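The direct calculation can be sketched as follows. This is an illustrative helper, not the repo's code; the field names `"pred"`, `"gold"`, and the `"image"` marker for image-based questions are assumptions.

```python
def accuracy_ignoring_images(records):
    """Accuracy over text-only questions; records carrying a truthy
    'image' field (hypothetical marker for image-based questions)
    are skipped, matching the notebook's behaviour."""
    kept = [r for r in records if not r.get("image")]
    correct = sum(r["pred"] == r["gold"] for r in kept)
    return correct / len(kept) if kept else 0.0
```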
I ran the eval code as follows:
But I'm still seeing considerable differences. medalpaca-7B is quite close to its reported scores in the GitHub README, but medalpaca-13B is not. Could you let me know if I might be doing something wrong on my side?
Thank you!