content/posts/2024-05-29-multi-objective-reward-modeling/index.md (12 additions, 1 deletion)
@@ -86,7 +86,18 @@ We provide the implementation details of our ArmoRM model, including the architecture
- **Parameter Initialization**: [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1), a [Bradley-Terry reward model](https://rlhflow.github.io/posts/2024-03-23-bradley-terry-reward-model/) trained from [Llama3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), using our [RLHFlow codebase for Reward Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/main/bradley-terry-rm).
- **Training**: Linear Probing (training the newly initialized linear layer only while keeping all transformer layers frozen)
  - We tried full fine-tuning from [Llama3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) using the approach of [Linear-Probing then Full Fine-Tuning](https://arxiv.org/abs/2202.10054) (LP-FT), but we did not find a notable performance improvement over this simple linear probing approach. Therefore, we stick to linear probing for its efficiency (in terms of compute costs and memory requirements); a minimal sketch of the linear-probing setup appears after this list.
- The [Argilla-Math-Preferences](https://huggingface.co/datasets/argilla/distilabel-math-preference-dpo) dataset shares the objective `ultrafeedback-instruction_following`
- **Data Processing**: When merging multiple datasets with absolute ratings (e.g., [UltraFeedback](https://arxiv.org/abs/2310.01377) and [HelpSteer](https://arxiv.org/abs/2311.09528)), we observe some issues with the data. Here, we present the issues and our approach to tackling them:
1. **Different Rating Scales**: Different datasets may use different rating scales. For instance, [HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer?row=0) has a rating scale of 0-4, while [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)'s is 1-10. We linearly transform all ratings so that they fall between 0 and 1. For [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails), with True/False ratings (indicating safe or unsafe), we treat True as 1 and False as 0 (see the normalization sketch after this list).
2. **Similar Objectives**: Some objectives from different datasets are very similar. For example, the `Helpfulness` objective appears in both HelpSteer and UltraFeedback, and the `Correctness` objective of HelpSteer is quite similar to the `Truthfulness` objective of UltraFeedback. After carefully examining the datasets, we decided to treat similar objectives as separate objectives, as they are rated by different judges following different rubrics. For instance, data from HelpSteer are rated by 200 U.S.-based human annotators following customized rubrics, while UltraFeedback data are labeled by GPT-4 following another set of rubrics.
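
To make the linear-probing setup above concrete, here is a minimal sketch. It assumes the frozen backbone is loaded from the [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) checkpoint with `transformers`, that the new head is a plain `nn.Linear` regressing per-objective ratings with an MSE loss, and that the last token's hidden state represents the prompt-response pair; the variable names, the number of objectives, and the training loop are illustrative and not taken from the RLHFlow codebase.

```python
# Illustrative linear-probing sketch (not the exact RLHFlow implementation):
# freeze every transformer layer and train only a newly initialized linear
# regression head that maps the last hidden state to per-objective ratings.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

BACKBONE = "sfairXC/FsfairX-LLaMA3-RM-v0.1"  # Bradley-Terry RM used for initialization
NUM_OBJECTIVES = 19  # placeholder: set to the number of rating dimensions after merging datasets

tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
backbone = AutoModel.from_pretrained(BACKBONE)  # loads the transformer trunk only
backbone.requires_grad_(False)                  # keep all transformer layers frozen
backbone.eval()

head = nn.Linear(backbone.config.hidden_size, NUM_OBJECTIVES)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def predict_rewards(text: str) -> torch.Tensor:
    """Embed a prompt-response string and return per-objective predictions."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():                       # frozen backbone: no gradients needed
        hidden = backbone(**inputs).last_hidden_state
    return head(hidden[:, -1, :])               # use the final token's representation

def train_step(text: str, ratings: torch.Tensor) -> float:
    """One regression step on a single example with 0-1 normalized ratings."""
    loss = nn.functional.mse_loss(predict_rewards(text).squeeze(0), ratings)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since gradients flow only through the head, the backbone's hidden states can in principle be precomputed and cached, which is where the compute and memory savings mentioned above come from.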
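
As a concrete reading of the rating-scale normalization in point 1, the snippet below linearly maps each dataset's raw ratings onto [0, 1]; the scale ranges are the ones quoted above, while the function and dictionary names are ours for illustration.

```python
# Illustrative rating normalization (names are ours, not the actual
# preprocessing code): linearly map each dataset's rating scale to [0, 1].
RATING_RANGES = {
    "helpsteer": (0.0, 4.0),       # HelpSteer ratings are on a 0-4 scale
    "ultrafeedback": (1.0, 10.0),  # UltraFeedback ratings are on a 1-10 scale
}

def normalize_rating(dataset: str, rating: float) -> float:
    """Linearly rescale a raw rating to the [0, 1] interval."""
    lo, hi = RATING_RANGES[dataset]
    return (rating - lo) / (hi - lo)

def normalize_beavertails(is_safe: bool) -> float:
    """BeaverTails safety labels: True (safe) -> 1.0, False (unsafe) -> 0.0."""
    return 1.0 if is_safe else 0.0

assert normalize_rating("helpsteer", 4) == 1.0      # top of the 0-4 scale
assert normalize_rating("ultrafeedback", 1) == 0.0  # bottom of the 1-10 scale
assert normalize_beavertails(False) == 0.0
```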