Commit 08f2951

Update decision tree blog
1 parent 90326a2 commit 08f2951

File tree

1 file changed (+3 −3 lines)
  • content/posts/2025-01-22-decision-tree-reward-model


content/posts/2025-01-22-decision-tree-reward-model/index.md

Lines changed: 3 additions & 3 deletions
@@ -76,7 +76,7 @@ Thanks to decent instruction-following capabilities in modern LLMs, this templat
 Here we show the probability distribution of each attribute and the Pearson correlation between each pair of attributes.
 ![Distribution](helpsteer_distribution.png)
 <p align="center">
-<img src="./helpsteer_correlation.png" alt="HelpSteer Correlation" width="50%">
+<img src="/content/posts/2025-01-22-decision-tree-reward-model/helpsteer_correlation.png" alt="HelpSteer Correlation" width="50%">
 </p>
 **Models:** We applied this methodology across a comprehensive set of 34 LLMs, encompassing both closed and open-source models. Our selection includes 9 closed-source models from industry leaders (OpenAI's GPT series, Anthropic's Claude series, and Google's Gemini series) and 25 open-source models (including variants of the Llama-3, Mistral, Gemma, Qwen, and DeepSeek families). For closed-source models, we utilized their official APIs, while open-source model inference was conducted through the Together API platform. This diverse model selection enables us to examine preference patterns across different architectures, scales, and training approaches.
 * **Open-Source models:**
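
As a side note to the hunk above: a minimal sketch of how the attribute distributions and the pairwise Pearson correlations in those two figures could be reproduced, assuming the ratings are loaded from the `nvidia/HelpSteer2` dataset on Hugging Face (the dataset id, split, and 0–4 integer rating scale are assumptions, not taken from this commit):

```python
# Illustrative sketch, not code from the post: attribute distributions and
# Pearson correlations over the five HelpSteer2 rating attributes.
from datasets import load_dataset

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# Load the human-annotated ratings (assumed dataset id and split) as a DataFrame.
ratings = load_dataset("nvidia/HelpSteer2", split="train").to_pandas()[ATTRIBUTES]

# Probability distribution of each attribute (assumed 0-4 integer scale).
distributions = {
    attr: ratings[attr].value_counts(normalize=True).sort_index()
    for attr in ATTRIBUTES
}

# Pairwise Pearson correlation between attributes.
correlation = ratings.corr(method="pearson")
print(correlation.round(2))
```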
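
For the open-source models listed in the **Models:** paragraph of this hunk, inference through the Together API can be run against its OpenAI-compatible endpoint. Here is a hedged sketch of collecting one attribute rating from one model; the rating prompt, model id, and answer parsing are placeholders rather than the post's actual template:

```python
# Illustrative sketch, not the post's pipeline: query an open-source model via
# Together's OpenAI-compatible endpoint and ask it to rate one attribute.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",      # Together's OpenAI-compatible API
    api_key=os.environ["TOGETHER_API_KEY"],
)

def rate_attribute(prompt: str, response: str, attribute: str) -> int:
    """Ask the model for a 0-4 score of one attribute (placeholder template)."""
    instruction = (
        f"Rate the {attribute} of the assistant response on a 0-4 scale. "
        f"Reply with a single integer.\n\nPrompt: {prompt}\n\nResponse: {response}"
    )
    completion = client.chat.completions.create(
        model="meta-llama/Llama-3-70b-chat-hf",  # placeholder open-source model id
        messages=[{"role": "user", "content": instruction}],
        temperature=0.0,
    )
    return int(completion.choices[0].message.content.strip())
```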
@@ -210,9 +210,9 @@ Once our multi-objective reward model can output a 5D rating vector $\hat{r} \in
 
 1. **Compute Rating Differences**. For each pair $(a^1, a^2)$ in HelpSteer2, we feed both responses into the fine-tuned reward model to obtain
 $$
-\hat{r}^1 \;=\; (\hat{r}^1_{\text{helpfulness}}, \ldots, \hat{r}^1_{\text{verbosity}}),
+\hat{r}^1 = (\hat{r}^1_{\text{helpfulness}}, \ldots, \hat{r}^1_{\text{verbosity}}),
 \quad
-\hat{r}^2 \;=\; (\hat{r}^2_{\text{helpfulness}}, \ldots, \hat{r}^2_{\text{verbosity}}).
+\hat{r}^2 = (\hat{r}^2_{\text{helpfulness}}, \ldots, \hat{r}^2_{\text{verbosity}}).
 $$
 2. **Fit a Decision Tree**. Finally, we train a depth-3 decision tree $f(\hat{r}^1 - \hat{r}^2)$ to predict the pairwise preference label $y$ on the training set of HelpSteer2-Preference. This matches the procedure in our earlier analysis of human-labeled data, except that the multi-objective rewards come from **model**-predicted ratings ($\hat{r}^1, \hat{r}^2$) rather than human-annotated ratings ($r^1, r^2$).
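
The two numbered steps in this hunk translate directly into a short scikit-learn snippet. A minimal sketch with stand-in data (the random ratings and labels below are placeholders; only the depth-3 tree on rating differences mirrors the procedure described above):

```python
# Illustrative sketch, not the post's code: fit a depth-3 decision tree on the
# difference of 5D rating vectors to predict the pairwise preference label.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_pairs = 1000

# Stand-ins for model-predicted ratings of each response pair, ordered
# (helpfulness, correctness, coherence, complexity, verbosity).
r1_hat = rng.uniform(0, 4, size=(n_pairs, 5))
r2_hat = rng.uniform(0, 4, size=(n_pairs, 5))
# Stand-in preference labels: 1 if response 1 is preferred, else 0.
y = (r1_hat[:, 0] > r2_hat[:, 0]).astype(int)

# Step 1: rating differences are the tree's input features.
X = r1_hat - r2_hat

# Step 2: depth-3 decision tree f(r1_hat - r2_hat) -> preference label y.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print("training accuracy:", tree.score(X, y))
```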
