
Commit add0c99

Update ArmoRM blog
1 parent c4e8d43 commit add0c99

File tree

1 file changed (+3, -3 lines)
  • content/posts/2024-05-29-multi-objective-reward-modeling


content/posts/2024-05-29-multi-objective-reward-modeling/index.md

Lines changed: 3 additions & 3 deletions
@@ -73,7 +73,7 @@ As the training examples come with multi-objective ratings, the straightforward
We consider each example to consist of a prompt $x$ (including contexts from previous conversation turns), a response $y$, and a $k$-dimensional rating vector $r\in \mathbb{R}^{k}$, where each dimension corresponds to a reward objective such as helpfulness and truthfulness. We take a pre-trained decoder-only LLM without the original output linear layer as the feature extractor $f_\theta$, and pass $(x,y)$ through the decoder layers, taking the hidden state of the final decoder layer on the last token as a $d$-dimensional feature. We then attach a new linear regression layer $w\in \mathbb{R}^{d \times k}$ on top of $f_\theta$, which outputs a $k$-dimensional rating prediction. The model can be trained straightforwardly with a regression loss:

$$
-\min_{\theta, w} \mathbb{E}_{x,y,r\in D}\|w^\top f_\theta(x,y) - r\|_2^2
+\min_{\theta, w} \mathbb{E}_ {x,y,r} \| w^\top f_\theta(x,y) - r \|_2^2
$$

<img src="Regression.png" alt="Regression" width="625">
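
The regression stage described in this hunk amounts to attaching a single linear head to the backbone features and minimizing a mean-squared error against the rating vectors. The snippet below is a minimal PyTorch sketch of that idea, not the blog's released code: random tensors stand in for $f_\theta(x,y)$ and the ratings $r$, and names such as `RewardHead`, `d_model`, and `k_objectives` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Linear regression layer w in R^{d x k} attached on top of the feature extractor."""

    def __init__(self, d_model: int, k_objectives: int):
        super().__init__()
        self.w = nn.Linear(d_model, k_objectives, bias=False)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, d) last-token hidden state of the final decoder layer
        # returns:  (batch, k) predicted multi-objective ratings, i.e. w^T f_theta(x, y)
        return self.w(features)

def regression_loss(pred: torch.Tensor, ratings: torch.Tensor) -> torch.Tensor:
    # ||w^T f_theta(x, y) - r||_2^2, averaged over the batch
    return ((pred - ratings) ** 2).sum(dim=-1).mean()

# Toy usage with random tensors standing in for f_theta(x, y) and the rating vectors r.
head = RewardHead(d_model=4096, k_objectives=5)   # dimensions chosen for illustration only
features = torch.randn(8, 4096)                   # (batch, d)
ratings = torch.rand(8, 5)                        # (batch, k), e.g. helpfulness, truthfulness, ...
loss = regression_loss(head(features), ratings)
loss.backward()
```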
@@ -113,15 +113,15 @@ $$
where the penalty coefficient $\lambda_i$ is chosen such that, for a proper correlation metric (e.g., Pearson or Spearman correlation coefficient) and a reference data distribution $\mathcal D$,

$$
-\mathbb{E}_{\mathcal D}[\mathrm{Corr}(r_i', r_{\mathrm{verbose}})] = 0
+\mathbb{E}_ {\mathcal D}\mathrm{Corr}(r_i', r_{\mathrm{verbose}}) = 0
$$

The adjusted reward vector is denoted as $r'\in \mathbb{R}^k$.

Finally, we multiply the gating coefficients with the multi-objective rewards to obtain a scalar score $s$ for the response $y$ given the prompt $x$:

$$
-s = g_\phi(f_\theta(x))^\top r'
+\mathrm{score} = g_\phi(f_\theta(x))^\top r'
$$

### Implementation of ArmoRM-MoE
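
A short sketch of the two steps touched in this hunk may help make the zero-correlation condition concrete. It assumes the adjustment takes the linear form $r_i' = r_i - \lambda_i\, r_{\mathrm{verbose}}$ and that the gating network $g_\phi$ outputs softmax coefficients over the $k$ objectives; neither detail is shown in this diff. With Pearson correlation, $\mathrm{Corr}(r_i', r_{\mathrm{verbose}}) = 0$ has the closed form $\lambda_i = \mathrm{Cov}(r_i, r_{\mathrm{verbose}}) / \mathrm{Var}(r_{\mathrm{verbose}})$; for Spearman, $\lambda_i$ would instead need a small numerical search. Names such as `fit_lambdas` and `verbose_idx` are illustrative.

```python
import torch

def fit_lambdas(rewards: torch.Tensor, verbose_idx: int) -> torch.Tensor:
    """Pick lambda_i so that Pearson Corr(r_i - lambda_i * r_verbose, r_verbose) = 0.

    rewards: (n, k) multi-objective rewards on a reference batch drawn from D.
    Zero Pearson correlation is equivalent to lambda_i = Cov(r_i, r_verbose) / Var(r_verbose).
    """
    centered = rewards - rewards.mean(dim=0, keepdim=True)   # (n, k)
    r_v = centered[:, verbose_idx]                            # centered verbosity reward
    cov = (centered * r_v.unsqueeze(1)).mean(dim=0)           # Cov(r_i, r_verbose), shape (k,)
    return cov / (r_v * r_v).mean()                           # lambda_i for each objective

# Toy usage with random stand-ins for the rewards and the prompt feature f_theta(x).
n, k, d = 256, 5, 64                       # reference-set size, #objectives, feature dim
rewards = torch.rand(n, k)                 # multi-objective rewards r on reference data D
lambdas = fit_lambdas(rewards, verbose_idx=k - 1)
adjusted = rewards - lambdas * rewards[:, k - 1:k]   # r' = r - lambda * r_verbose (assumed form)

# Gating over the prompt-only feature; a linear layer + softmax stands in for g_phi here.
prompt_features = torch.randn(n, d)
gate = torch.nn.Sequential(torch.nn.Linear(d, k), torch.nn.Softmax(dim=-1))
scores = (gate(prompt_features) * adjusted).sum(dim=-1)       # score = g_phi(f_theta(x))^T r'
```

Note that this choice drives the verbosity objective itself to zero ($\lambda_{\mathrm{verbose}} = 1$), which is harmless for the sketch but worth keeping in mind if the verbosity reward is also meant to be gated.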
