content/posts/2024-05-29-multi-objective-reward-modeling/index.md (12 additions, 1 deletion)
@@ -86,7 +86,18 @@ We provide the implementation details of our ArmoRM model, including the architecture
- **Parameter Initialization**: [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1), a [Bradley-Terry reward model](https://rlhflow.github.io/posts/2024-03-23-bradley-terry-reward-model/) trained from [Llama3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), using our [RLHFlow codebase for Reward Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/main/bradley-terry-rm).
- **Training**: Linear Probing (training the newly initialized linear layer only while keeping all transformer layers frozen)
  - We tried full fine-tuning from [Llama3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) using the approach of [Linear-Probing then Full Fine-Tuning](https://arxiv.org/abs/2202.10054) (LP-FT), but we did not find a notable performance improvement over this simple linear probing approach. Therefore, we stick to linear probing for its efficiency (in terms of compute costs and memory requirements); a minimal sketch of the linear-probing setup appears after this list.
- The [Argilla-Math-Preferences](https://huggingface.co/datasets/argilla/distilabel-math-preference-dpo) dataset shares the objective `ultrafeedback-instruction_following`
- **Data Processing**: When merging multiple datasets with absolute ratings (e.g., [UltraFeedback](https://arxiv.org/abs/2310.01377) and [HelpSteer](https://arxiv.org/abs/2311.09528)), we observe some issues with the data. Here, we present the issues and our approach to tackling them:
1. **Different Rating Scales**: Different datasets may use different rating scales. For instance, [HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer?row=0) has a rating scale of 0-4, while [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)'s is 1-10. We linearly transform all ratings so that they fall between 0 and 1. For [BeaverTails](https://huggingface.co/datasets/PKU-Alignment/BeaverTails), with True/False ratings (indicating safe or unsafe), we treat True as 1 and False as 0 (see the normalization sketch after this list).
2. **Similar Objectives**: Some objectives from different datasets are very similar. For example, the `Helpfulness` objective appears in both HelpSteer and UltraFeedback, and the `Correctness` objective of HelpSteer is quite similar to the `Truthfulness` objective of UltraFeedback. After carefully examining the datasets, we decided to treat similar objectives as separate objectives, as they are rated by different judges following different rubrics. For instance, data from HelpSteer are rated by 200 U.S.-based human annotators following customized rubrics, while UltraFeedback data are labeled by GPT-4 following another set of rubrics.
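
To make the linear-probing setup above concrete, here is a minimal sketch. It assumes the frozen backbone is loaded from the [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) checkpoint with `transformers`, that the new head is a plain `nn.Linear` regressing per-objective ratings with an MSE loss, and that the last token's hidden state represents the prompt-response pair; the variable names, the number of objectives, and the training loop are illustrative and not taken from the RLHFlow codebase.

```python
# Illustrative linear-probing sketch (not the exact RLHFlow implementation):
# freeze every transformer layer and train only a newly initialized linear
# regression head that maps the last hidden state to per-objective ratings.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

BACKBONE = "sfairXC/FsfairX-LLaMA3-RM-v0.1"  # Bradley-Terry RM used for initialization
NUM_OBJECTIVES = 19  # placeholder: set to the number of rating dimensions after merging datasets

tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
backbone = AutoModel.from_pretrained(BACKBONE)  # loads the transformer trunk only
backbone.requires_grad_(False)                  # keep all transformer layers frozen
backbone.eval()

head = nn.Linear(backbone.config.hidden_size, NUM_OBJECTIVES)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def predict_rewards(text: str) -> torch.Tensor:
    """Embed a prompt-response string and return per-objective predictions."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():                       # frozen backbone: no gradients needed
        hidden = backbone(**inputs).last_hidden_state
    return head(hidden[:, -1, :])               # use the final token's representation

def train_step(text: str, ratings: torch.Tensor) -> float:
    """One regression step on a single example with 0-1 normalized ratings."""
    loss = nn.functional.mse_loss(predict_rewards(text).squeeze(0), ratings)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since gradients flow only through the head, the backbone's hidden states can in principle be precomputed and cached, which is where the compute and memory savings mentioned above come from.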
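
As a concrete reading of the rating-scale normalization in point 1, the snippet below linearly maps each dataset's raw ratings onto [0, 1]; the scale ranges are the ones quoted above, while the function and dictionary names are ours for illustration.

```python
# Illustrative rating normalization (names are ours, not the actual
# preprocessing code): linearly map each dataset's rating scale to [0, 1].
RATING_RANGES = {
    "helpsteer": (0.0, 4.0),       # HelpSteer ratings are on a 0-4 scale
    "ultrafeedback": (1.0, 10.0),  # UltraFeedback ratings are on a 1-10 scale
}

def normalize_rating(dataset: str, rating: float) -> float:
    """Linearly rescale a raw rating to the [0, 1] interval."""
    lo, hi = RATING_RANGES[dataset]
    return (rating - lo) / (hi - lo)

def normalize_beavertails(is_safe: bool) -> float:
    """BeaverTails safety labels: True (safe) -> 1.0, False (unsafe) -> 0.0."""
    return 1.0 if is_safe else 0.0

assert normalize_rating("helpsteer", 4) == 1.0      # top of the 0-4 scale
assert normalize_rating("ultrafeedback", 1) == 0.0  # bottom of the 1-10 scale
assert normalize_beavertails(False) == 0.0
```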