
Commit 28aa20c

Update ArmoRM blog
1 parent 4849e42 commit 28aa20c

File tree

1 file changed: +1 −1 lines changed

  • content/posts/2024-05-29-multi-objective-reward-modeling

content/posts/2024-05-29-multi-objective-reward-modeling/index.md

Lines changed: 1 addition & 1 deletion
@@ -130,7 +130,7 @@ The gating layer is trained on top of the ArmoRM obtained from stage-1. Here we

  - **Gating Layer Architecture**: A ReLU MLP with 3 hidden layers of 1024 hidden units
  - **Training:** Train the gating layer only, with the rest of the parameters (backbone & regression layer) frozen.
- - **Reward Adjustment (for verbosity bias mitigation):** We use the Spearman correlation coefficient as the correlation metric, $\mathrm{Corr}$, and adopt a [binarized UltraFeedback dataset](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) of 61k examples as the reference data distribution, $\mathcal D$. The penalty coefficients, $\{\lambda_i\}$, are chosen such that $\mathbb{E}_{\mathcal D}[\mathrm{Corr}(r_i', r_{\mathrm{verbose}})] \approx 0$.
+ - **Reward Adjustment (for verbosity bias mitigation):** We use the Spearman correlation coefficient as the correlation metric, $\mathrm{Corr}$, and adopt a [binarized UltraFeedback dataset](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) of 61k examples as the reference data distribution, $\mathcal D$. The penalty coefficients, $\{\lambda_i\}$, are chosen such that $\mathbb{E}_ {\mathcal D}[\mathrm{Corr}(r_i', r_{\mathrm{verbose}})] \approx 0$.
  - **Datasets**: [HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer), [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), [SHP](https://huggingface.co/datasets/stanfordnlp/SHP?row=0), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), [PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K), [Argilla-Capybara](https://huggingface.co/datasets/argilla/Capybara-Preferences-Filtered), [Argilla-Math-Preferences](https://huggingface.co/datasets/argilla/distilabel-math-preference-dpo), [CodeUltraFeedback](https://huggingface.co/datasets/coseal/CodeUltraFeedback), [PRM-Phase-2](https://github.com/openai/prm800k), [Prometheus2-Preference-Collection](https://huggingface.co/datasets/prometheus-eval/Preference-Collection)
  - For datasets that are not binarized into response pairs (e.g., HelpSteer, UltraFeedback, SHP), we take the binarized versions pre-processed in [RLHF Workflow](https://arxiv.org/abs/2405.07863).
