Commit 51778ee

Update ArmoRM blog
1 parent 23d1188 commit 51778ee

content/posts/2024-05-29-multi-objective-reward-modeling/index.md

Lines changed: 9 additions & 1 deletion
@@ -132,9 +132,17 @@ The adjusted reward vector is denoted as $r'\in \mathbb{R}^k$.
 Finally, we multiply the gating coefficients with the multi-objective rewards to obtain a scalar score $R$ for the response $y$ given the prompt $x$,
 
 $$
-\mathrm{score} = g_\phi(f_\theta(x))^\top r'
+R = g_\phi(f_\theta(x))^\top r'
 $$
 
+To train the gating layer, we freeze the parameters of the backbone and the regression layer, and train only the gating layer with the Bradley-Terry loss,
+
+$$
+\min_\phi \mathbb{E} \left[ -\log \frac{\exp(R_{\mathrm{chosen}})}{\exp(R_{\mathrm{chosen}}) + \exp(R_{\mathrm{rejected}})} \right]
+$$
+
+where $R_{\mathrm{chosen}}$ and $R_{\mathrm{rejected}}$ are the preference scores for the chosen and rejected responses in each pairwise example $(x, y_{\mathrm{chosen}}, y_{\mathrm{rejected}})$.
+
 ### Implementation of ArmoRM-MoE
 
 The gating layer is trained on top of the ArmoRM obtained from stage-1. Here we provide implementation details:
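
For concreteness, here is a minimal PyTorch sketch of the scoring formula $R = g_\phi(f_\theta(x))^\top r'$ from the diff above. The dimensions, module names, and the softmax gating head are illustrative assumptions, not details taken from the ArmoRM code:

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not from the ArmoRM release):
# d = hidden size of the backbone f_theta, k = number of reward objectives.
d, k = 4096, 19

# g_phi: maps the prompt embedding f_theta(x) to k mixing coefficients.
# A softmax head (non-negative weights summing to 1) is one plausible choice.
gating = nn.Sequential(nn.Linear(d, k), nn.Softmax(dim=-1))

prompt_emb = torch.randn(2, d)  # stands in for f_theta(x), batch of 2 prompts
r_adjusted = torch.randn(2, k)  # stands in for the adjusted reward vector r'

# R = g_phi(f_theta(x))^T r': a weighted combination of the k reward values.
R = (gating(prompt_emb) * r_adjusted).sum(dim=-1)  # shape (2,)
```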
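
And a rough sketch of the stage-2 step the added paragraph describes: freeze the backbone $f_\theta$ and the regression layer, and update only the gating parameters $\phi$ with the Bradley-Terry loss. All modules, shapes, data, and optimizer settings below are placeholders (the verbosity adjustment that produces $r'$ is also glossed over), so this is not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, k = 4096, 19  # illustrative dimensions, as above

# Stand-ins for the stage-1 components (the real ones are a decoder LM
# backbone and a multi-objective regression head trained in stage 1).
backbone = nn.Linear(100, d)       # plays the role of f_theta
regression_head = nn.Linear(d, k)  # plays the role of the reward head
gating = nn.Sequential(nn.Linear(d, k), nn.Softmax(dim=-1))  # g_phi

# Stage 2: freeze the backbone and the regression layer; train only g_phi.
for module in (backbone, regression_head):
    for p in module.parameters():
        p.requires_grad_(False)
optimizer = torch.optim.AdamW(gating.parameters(), lr=1e-4)  # placeholder lr

def preference_score(x_feats: torch.Tensor, y_feats: torch.Tensor) -> torch.Tensor:
    """R = g_phi(f_theta(x))^T r' for a batch of (prompt, response) pairs."""
    coeffs = gating(backbone(x_feats))            # gating depends on the prompt
    rewards = regression_head(backbone(y_feats))  # multi-objective rewards
    return (coeffs * rewards).sum(dim=-1)

# A dummy batch of pairwise examples (x, y_chosen, y_rejected).
x, y_chosen, y_rejected = (torch.randn(8, 100) for _ in range(3))

R_chosen = preference_score(x, y_chosen)
R_rejected = preference_score(x, y_rejected)

# Bradley-Terry negative log-likelihood:
# -log(exp(R_c) / (exp(R_c) + exp(R_r))) = -logsigmoid(R_c - R_r)
loss = -F.logsigmoid(R_chosen - R_rejected).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Writing the loss as $-\log\sigma(R_{\mathrm{chosen}} - R_{\mathrm{rejected}})$ is algebraically equivalent to the fraction form in the diff and avoids overflow in the exponentials.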
