
Commit 4849e42

Update ArmoRM blog
1 parent add0c99 commit 4849e42

File tree (1 file changed: +3, -3 lines):
  • content/posts/2024-05-29-multi-objective-reward-modeling


content/posts/2024-05-29-multi-objective-reward-modeling/index.md

Lines changed: 3 additions & 3 deletions
@@ -20,7 +20,7 @@ This work is authored by [Haoxiang Wang*](https://haoxiang-wang.github.io/), [We
 - **Code:** [https://github.com/RLHFlow/RLHF-Reward-Modeling](https://github.com/RLHFlow/RLHF-Reward-Modeling)
 - **Model:** [https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1)
 - **Technical Report:** To be released in June, 2024
-
+- **Contact:** Haoxiang Wang ([hwang264@illinois.edu](mailto:wx13@illinois.edu))
 ---
 # Abstract

@@ -73,7 +73,7 @@ As the training examples come with multi-objective ratings, the straightforward
 We consider each example to consist of a prompt $x$ (including contexts from previous conversation turns), response $y$, and a $k$-dimensional rating vector $r\in \mathbb{R}^{k}$, where each dimension corresponds to a reward objective such as helpfulness and truthfulness. Now, we take a pre-trained decoder-only LLM without the original output linear layer as the feature extractor $f_\theta$, and pass $(x,y)$ through the decoder layers, taking the hidden state of the final decoder layer on the last token as a $d$-dimensional feature. Also, we attach a new linear regression layer $w\in \mathbb{R}^{d \times k}$ on top of $f_\theta$, which outputs a $k$-dimensional rating prediction. The model can be straightforwardly trained with a regression loss:

 $$
-\min_{\theta, w} \mathbb{E}_ {x,y,r} \| w^\top f_\theta(x,y) - r \|_2^2
+\min_{\theta, w} \mathbb{E}_ {x,y,r} || w^\top f_\theta(x,y) - r ||_2^2
 $$

 <img src="Regression.png" alt="Regression" width="625">
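
The context lines above describe the multi-objective regression stage: a decoder-only backbone acts as the feature extractor and a new linear head maps the last token's final hidden state to $k$ ratings. As a rough orientation, the following is a minimal PyTorch sketch of that setup, not the commit's or the blog's actual training code; the backbone checkpoint, the toy $(x, y, r)$ example, and the single-example loss computation are illustrative assumptions.

```python
# Minimal sketch of the multi-objective regression stage described above.
# Illustrative assumptions (not from this commit): the backbone checkpoint
# and the toy (x, y, r) example below.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

backbone_name = "meta-llama/Meta-Llama-3-8B"  # placeholder decoder-only LLM
tokenizer = AutoTokenizer.from_pretrained(backbone_name)
backbone = AutoModel.from_pretrained(backbone_name, torch_dtype=torch.bfloat16)  # f_theta, no LM head

k = 19                                         # number of reward objectives (ArmoRM uses 19)
d = backbone.config.hidden_size
regression_head = nn.Linear(d, k, bias=False)  # w in R^{d x k}

def predict_ratings(prompt: str, response: str) -> torch.Tensor:
    """Feed (x, y) through the decoder and map the last token's final
    hidden state to a k-dimensional rating prediction."""
    inputs = tokenizer(prompt + response, return_tensors="pt")
    hidden = backbone(**inputs).last_hidden_state   # (1, seq_len, d)
    last_token_feature = hidden[:, -1, :].float()   # (1, d)
    return regression_head(last_token_feature)      # (1, k)

# Squared-error regression loss for a single (x, y, r) example, matching
# min_{theta, w} E ||w^T f_theta(x, y) - r||_2^2 from the diff above.
r = torch.rand(1, k)                                # placeholder rating vector
pred = predict_ratings("What is RLHF?", " RLHF stands for ...")
loss = ((pred - r) ** 2).sum(dim=-1).mean()
loss.backward()                                     # gradients flow to both theta (backbone) and w (head)
```

The point of the paragraph is that one forward pass yields a single $d$-dimensional last-token feature, and the $d \times k$ linear head predicts all $k$ objective ratings at once; the decoder and the head are trained jointly under the squared-error objective shown in the diff.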
@@ -154,7 +154,7 @@ The table above presents the evaluation results of our ArmoRM-MoE model on the R
 2. Our model also outperforms the LLM-as-a-Judge approach with a GPT-4 judge by a considerable margin, indicating that our model could be used as a replacement for GPT-4 in many annotation jobs or even serve as a judge model for benchmarks (e.g., [MT-Bench](https://arxiv.org/abs/2306.05685), [AlpacaEval-2.0](https://tatsu-lab.github.io/alpaca_eval/), [ArenaHard](https://lmsys.org/blog/2024-04-19-arena-hard/)).
 3. The Cohere May 2024 model, developed by Cohere AI, is a closed model with unknown size and training details. Despite the lack of information about this model, our ArmoRM-Llama3-8B-v0.1 still manages to outperform it on the Reward-Bench benchmark.

-# Usage Examples (Demo Code)
+# Usage Example (Code Demo)

 ```python
 import torch

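The hunk above only shows the first line of the renamed usage example (`import torch`); the rest of the demo is outside the diff context. For orientation, here is a hedged sketch of how the released ArmoRM-Llama3-8B-v0.1 checkpoint is typically queried via `transformers`; the `trust_remote_code` head and its output attributes (`rewards`, `score`) are assumptions based on the model card's conventions, not content of this commit.

```python
# Hedged sketch of a typical usage example for the released checkpoint.
# The custom output attributes (.rewards, .score) come from the model's
# trust_remote_code classification head and are assumptions here,
# not part of this commit's diff.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"

model = AutoModelForSequenceClassification.from_pretrained(
    path, device_map=device, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)

with torch.no_grad():
    output = model(input_ids)
    multi_obj_rewards = output.rewards.cpu().float()  # per-objective rewards (assumed attribute)
    preference_score = output.score.cpu().float()     # scalar preference score (assumed attribute)

print(multi_obj_rewards.shape, preference_score)
```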