
Replace Claude Safety Judge with Llama3.1 #2974

Merged: 3 commits, merged on Sep 5, 2024.


@farzaank (Contributor) commented on Sep 4, 2024

Replaces Claude with Llama 3.1 as the safety judge.

The motivation is that Claude would refuse to act as a judge for some model-scenario pairs. As an alternative, we tried two open models as judges: Llama Guard 3 first, with Llama 3.1 as the fallback if that didn't work out.

Llama Guard 3 occasionally rates refusals as unsafe when they mention unsafe subject matter. This false-positive behavior is well documented: "while deploying Llama Guard 3 will likely improve the safety of your system, it might increase refusals to benign prompts (False Positives)". To keep behavior consistent with our other judge (GPT-4o), we use Llama 3.1 instead. For now, I set the model size to the largest one.
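To make the false-positive concern concrete, here is a minimal sketch of how one might measure a judge's false-positive rate, FPR = FP / (FP + TN), where a false positive is a response that is actually safe but that the judge rates unsafe. The `Judgment` type, both helper functions, and the sample verdicts are hypothetical illustrations, not HELM code or measured results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    """One judged model response (hypothetical structure for illustration)."""
    is_refusal: bool     # the model refused the request
    truly_safe: bool     # ground-truth safety label, e.g. from a human annotator
    judged_unsafe: bool  # verdict returned by the judge model


def false_positive_rate(judgments: List[Judgment]) -> float:
    """FPR = FP / (FP + TN), computed over responses that are actually safe."""
    safe = [j for j in judgments if j.truly_safe]
    if not safe:
        return 0.0
    false_positives = sum(1 for j in safe if j.judged_unsafe)
    return false_positives / len(safe)


def refusal_false_positive_rate(judgments: List[Judgment]) -> float:
    """The same metric restricted to refusals, the failure mode described above."""
    return false_positive_rate([j for j in judgments if j.is_refusal])


# Made-up verdicts from a judge that flags one refusal mentioning an unsafe topic.
sample = [
    Judgment(is_refusal=True, truly_safe=True, judged_unsafe=True),
    Judgment(is_refusal=True, truly_safe=True, judged_unsafe=False),
    Judgment(is_refusal=False, truly_safe=True, judged_unsafe=False),
    Judgment(is_refusal=False, truly_safe=False, judged_unsafe=True),
]
print(false_positive_rate(sample))          # 1 FP among 3 safe responses: 0.3333333333333333
print(refusal_false_positive_rate(sample))  # 1 FP among 2 safe refusals: 0.5
```

Restricting the metric to refusals isolates the behavior that would make one judge disagree with another on otherwise identical safe outputs.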

@@ -2,7 +2,7 @@
 from helm.benchmark.adaptation.request_state import RequestState
 from helm.benchmark.annotation.annotator import Annotator
-from helm.benchmark.annotation.score_util import score_with_reasoning
+from helm.benchmark.annotation.model_as_judge import score_with_reasoning
Comment from the contributor (PR author) on this change:

Missed updating this import in some files in the last PR, so here it is.

@yifanmai (Collaborator) left a comment:

Looks good, thanks!

@yifanmai merged commit 13b718a into main on Sep 5, 2024 (6 checks passed).
@yifanmai deleted the farzaan/llama-safety branch on September 5, 2024 at 22:18.