**Models:** We applied this methodology across a comprehensive set of 34 LLMs, spanning both closed-source and open-source models. Our selection includes 9 closed-source models from industry leaders (OpenAI's GPT series, Anthropic's Claude series, and Google's Gemini series) and 25 open-source models (including variants from the Llama-3, Mistral, Gemma, Qwen, and DeepSeek families). For the closed-source models, we used their official APIs, while open-source model inference was conducted through the Together API platform. This diverse selection enables us to examine preference patterns across different architectures, scales, and training approaches.
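For concreteness, the snippet below sketches what a single open-source inference call through the Together API platform could look like. The client usage follows the Together Python SDK, but the model name and prompt are illustrative placeholders, not the paper's actual configuration.

```python
# Illustrative sketch of one inference call via the Together API platform.
# The model name and prompt are placeholders, not the paper's actual setup.
from together import Together

client = Together()  # expects TOGETHER_API_KEY in the environment

response = client.chat.completions.create(
    model="meta-llama/Llama-3-70b-chat-hf",  # hypothetical open-source model
    messages=[{"role": "user", "content": "Compare the two candidate responses ..."}],
)
print(response.choices[0].message.content)
```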
- For instance, $r^1 = (r^1_{\text{helpfulness}}, r^1_{\text{correctness}}, r^1_{\text{coherence}}, r^1_{\text{complexity}}, r^1_{\text{verbosity}})$.
- $y$: Preference label, indicating whether $a^1$ is better than $a^2$, defined as $y = \mathbb{I}(a^1 \succ a^2)$, where $\mathbb{I}(\cdot)$ is the indicator function (a concrete construction is sketched below).
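To make the input representation concrete, here is a minimal sketch of assembling one training example, assuming the two responses have already been scored on the five attributes (all values and variable names are illustrative):

```python
import numpy as np

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# Hypothetical attribute scores for the two candidate responses a^1 and a^2,
# ordered as in ATTRIBUTES.
r1 = np.array([4.0, 3.5, 4.0, 2.0, 1.5])  # r^1
r2 = np.array([3.0, 4.0, 3.5, 2.5, 3.0])  # r^2

x = r1 - r2  # feature vector: per-attribute differences (r^1 - r^2)
y = 1        # preference label y = I(a^1 > a^2), from a human or LLM judge
```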
We grow a decision tree $f$ by recursively splitting nodes based on the attribute differences $(r^1 - r^2)$ in the 5 dimensions (helpfulness, correctness, coherence, complexity, verbosity). At each potential split point, we use the log-loss criterion:
$$
\mathcal{L}_{\text{split}} = -\sum_{i \in S_{\text{left}}} \Big[ y_i \log p_{\text{left}} + (1 - y_i) \log\big(1 - p_{\text{left}}\big) \Big] \; - \sum_{i \in S_{\text{right}}} \Big[ y_i \log p_{\text{right}} + (1 - y_i) \log\big(1 - p_{\text{right}}\big) \Big]
$$
where $S_{\text{left}}$ and $S_{\text{right}}$ are the sets of samples in the left and right child nodes after the split, and $p_{\text{left}}$ and $p_{\text{right}}$ are the proportions of positive samples in the respective child nodes. The tree selects the split that minimizes this loss at each step. The scalar output $f(\cdot)$ at each leaf node is then the proportion of positive samples (i.e., the predicted probability that $a^1$ is the better response) among the training samples that reached that leaf.
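One way to realize this procedure is with scikit-learn's `DecisionTreeClassifier`, whose `criterion="log_loss"` option selects splits by this same criterion and whose `predict_proba` returns the leaf-level proportion of positive samples. The sketch below uses synthetic stand-in data; the depth cap and the data itself are illustrative assumptions, not the paper's actual setup.

```python
# Illustrative sketch (not the authors' released code): fitting the decision
# tree f on attribute-difference features with a log-loss split criterion.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Placeholder data: n preference pairs, each with a 5-dim difference vector
# (r^1 - r^2) over helpfulness, correctness, coherence, complexity, verbosity.
n = 1000
X = rng.normal(size=(n, 5))                    # stand-in for real (r^1 - r^2)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in preference labels

# criterion="log_loss" picks the split minimizing the log-loss described
# above; max_depth=4 is a hypothetical regularization choice.
f = DecisionTreeClassifier(criterion="log_loss", max_depth=4, random_state=0)
f.fit(X, y)

# f(.) as defined in the text: each leaf's proportion of positive samples,
# i.e. the predicted probability that a^1 is the better response.
p_a1_better = f.predict_proba(X[:5])[:, 1]
print(p_a1_better)
```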
Notably, the preference label $y$ can be provided either by humans or by a particular LLM. By swapping in different preference labels, we can train a separate decision tree for each “judge” (whether human or a specific LLM). This allows us to compare how different judges structure their decision criteria.
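As a sketch of this per-judge comparison, one could fit one tree per label source over the same difference features and inspect each tree's structure. The judges, label vectors, and hyperparameters below are all hypothetical stand-ins:

```python
# Hypothetical sketch: one decision tree per judge over the same features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # stand-in (r^1 - r^2) difference features

# Placeholder label vectors; in practice each comes from a human annotator
# or a specific LLM judge.
labels_by_judge = {
    "human":     (X[:, 0] > 0).astype(int),  # stand-in: leans on helpfulness
    "llm_judge": (X[:, 4] < 0).astype(int),  # stand-in: penalizes verbosity
}

trees = {}
for judge, y_judge in labels_by_judge.items():
    tree = DecisionTreeClassifier(criterion="log_loss", max_depth=3, random_state=0)
    trees[judge] = tree.fit(X, y_judge)
    # Print each judge's tree to compare which attribute differences it splits on.
    print(f"=== {judge} ===")
    print(export_text(tree, feature_names=ATTRIBUTES))
```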