
Commit 28aa20c

Update ArmoRM blog
1 parent 4849e42 commit 28aa20c

File tree

1 file changed: +1 −1 lines changed

  • content/posts/2024-05-29-multi-objective-reward-modeling

content/posts/2024-05-29-multi-objective-reward-modeling/index.md

Lines changed: 1 addition & 1 deletion
@@ -130,7 +130,7 @@ The gating layer is trained on top of the ArmoRM obtained from stage-1. Here we

  - **Gating Layer Architecture**: A ReLU MLP with 3 hidden layers of 1024 hidden units
  - **Training:** Train the gating layer only, with the rest of the parameters (backbone & regression layer) frozen.
- - **Reward Adjustment (for verbosity bias mitigation):** We use the Spearman correlation coefficient as the correlation metric, $\mathrm{Corr}$, and adopt a [binarized UltraFeedback dataset](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) of 61k examples as the reference data distribution, $\mathcal D$. The penalty coefficients, $\{\lambda_i\}$, are chosen such that $\mathbb{E}_{\mathcal D}[\mathrm{Corr}(r_i', r_{\mathrm{verbose}})] \approx 0$.
+ - **Reward Adjustment (for verbosity bias mitigation):** We use the Spearman correlation coefficient as the correlation metric, $\mathrm{Corr}$, and adopt a [binarized UltraFeedback dataset](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) of 61k examples as the reference data distribution, $\mathcal D$. The penalty coefficients, $\{\lambda_i\}$, are chosen such that $\mathbb{E}_ {\mathcal D}[\mathrm{Corr}(r_i', r_{\mathrm{verbose}})] \approx 0$.
  - **Datasets**: [HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer), [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), [SHP](https://huggingface.co/datasets/stanfordnlp/SHP?row=0), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), [PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K), [Argilla-Capybara](https://huggingface.co/datasets/argilla/Capybara-Preferences-Filtered), [Argilla-Math-Preferences](https://huggingface.co/datasets/argilla/distilabel-math-preference-dpo), [CodeUltraFeedback](https://huggingface.co/datasets/coseal/CodeUltraFeedback), [PRM-Phase-2](https://github.com/openai/prm800k), [Prometheus2-Preference-Collection](https://huggingface.co/datasets/prometheus-eval/Preference-Collection)
  - For datasets that are not binarized into response pairs (e.g., HelpSteer, UltraFeedback, SHP), we take the binarized versions pre-processed in [RLHF Workflow](https://arxiv.org/abs/2405.07863).
