content/posts/2025-01-22-decision-tree-reward-model/index.md
@@ -20,12 +20,21 @@ math: true
**Dataset**:

**Tech Report**: To release soon

---

# Abstract
Modern Large Language Models (LLMs) are typically aligned with human preferences through Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). While the goal is for aligned LLMs to exhibit human-like preferences, the internal decision-making mechanisms driving these preferences remain opaque. We present a novel interpretability framework that leverages decision trees to model and analyze how LLMs make preference judgments between pairs of responses to the same prompt.

Using the HelpSteer2 dataset, which provides 5-dimensional human ratings for responses, we train decision trees to predict the preferences of modern LLMs of different sizes and compare them with human preference structures. Our analysis reveals that leading models such as GPT-4 and Claude-3.5 exhibit decision-making patterns that closely mirror human preference structure, while some other LLMs diverge systematically. A key finding is that while humans show minimal dependence on response verbosity in their judgments, certain LLMs rely strongly on verbosity when judging preferences.

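To make the setup concrete, here is a minimal sketch of fitting such a tree on attribute-wise rating differences with scikit-learn. It is an illustration under assumptions, not the released code: the attribute names follow HelpSteer2, while the toy pairs and labels are invented.

```python
# Sketch: fit a decision tree that predicts pairwise preference from
# attribute-wise rating differences. Variable names and toy data are
# illustrative; they are not taken from the released codebase.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

def rating_diff(ratings_a, ratings_b):
    """Feature vector: rating of response A minus rating of response B, per attribute."""
    return [ratings_a[k] - ratings_b[k] for k in ATTRIBUTES]

# Toy preference pairs: (ratings of A, ratings of B, label), with label 1 if A is
# preferred and 0 if B is preferred. Real labels would come from human annotators
# or from an LLM judge.
pairs = [
    ({"helpfulness": 4, "correctness": 4, "coherence": 4, "complexity": 2, "verbosity": 2},
     {"helpfulness": 2, "correctness": 3, "coherence": 3, "complexity": 2, "verbosity": 4}, 1),
    ({"helpfulness": 1, "correctness": 2, "coherence": 3, "complexity": 1, "verbosity": 1},
     {"helpfulness": 3, "correctness": 4, "coherence": 4, "complexity": 2, "verbosity": 2}, 0),
]

X = np.array([rating_diff(a, b) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])

# A shallow tree keeps the learned preference structure easy to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Feature importances indicate how much each attribute (e.g. verbosity)
# drives the tree's preference predictions.
print(dict(zip(ATTRIBUTES, tree.feature_importances_.round(3))))
```
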
We extend this decision-tree framework to develop interpretable reward models that serve as proxies for human preferences. We fine-tune Skywork-Reward-Llama-3.1-8B-v0.2 on HelpSteer2 to predict multi-objective ratings, then fit a decision tree on top of it using human preference data. The resulting model, Llama-3.1-8B-Decision-Tree-Reward, achieves state-of-the-art performance on RewardBench while providing explicit decision-making paths for its preferences. To facilitate further research in interpretable preference learning, we release our codebase, collected LLM preference data, and trained models.

**Decision-tree reward model architecture**

Architecture of our decision-tree reward model. It consists of a regression-based rating module that scores response quality along multiple dimensions on a continuous scale, and a decision tree that grows recursively via greedy splitting on reward differences to yield interpretable decision paths.

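The snippet below sketches this two-stage design under explicit assumptions: `score_attributes` is a hypothetical placeholder for the fine-tuned regression rating head (here it just returns pseudo-random scores), and the tree is fit on synthetic score differences so the example runs end to end; only the overall wiring mirrors the described architecture.

```python
# Sketch of the two-stage design: a multi-objective rating module followed by a
# decision tree over attribute score differences. `score_attributes` is a
# hypothetical stand-in for the fine-tuned reward model's regression head.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

def score_attributes(prompt: str, response: str) -> dict:
    """Placeholder scorer returning pseudo-random attribute scores.
    The real module would be the fine-tuned Llama-3.1-8B rating model."""
    rng = np.random.default_rng(abs(hash((prompt, response))) % (2**32))
    return dict(zip(ATTRIBUTES, rng.uniform(0.0, 4.0, size=len(ATTRIBUTES))))

# Fit a toy tree on synthetic score differences so the sketch is runnable;
# the real tree is grown greedily on human preference data.
rng = np.random.default_rng(0)
X_train = rng.uniform(-4.0, 4.0, size=(64, len(ATTRIBUTES)))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # toy rule on the first two attributes
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

def preference_with_path(prompt: str, resp_a: str, resp_b: str):
    """Return which response the tree prefers and the tree's split rules."""
    a = score_attributes(prompt, resp_a)
    b = score_attributes(prompt, resp_b)
    diff = np.array([[a[k] - b[k] for k in ATTRIBUTES]])
    choice = "A" if tree.predict(diff)[0] == 1 else "B"
    return choice, export_text(tree, feature_names=ATTRIBUTES)

choice, rules = preference_with_path("Explain RLHF briefly.", "Response A ...", "Response B ...")
print(choice)
print(rules)
```
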
# Motivation
The alignment of Large Language Models (LLMs) with human preferences has become a central challenge in AI development. While methods like RLHF have shown impressive results in making LLMs more helpful and aligned with human values, they often operate as black boxes, making it difficult to understand how these models actually make preference decisions between different responses.
@@ -73,11 +82,11 @@ Thanks to decent instruction-following capabilities in modern LLMs, this templat
* `coherence`: Consistency and clarity of expression.
* `complexity`: Intellectual depth required to write the response (i.e., whether the response could be written by anyone with basic language competency or requires deep domain expertise).
* `verbosity`: Amount of detail included in the response, relative to what is asked for in the prompt.

Here we show the probability distribution of each attribute and the Pearson correlation between each pair of attributes.
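As a small illustration of how these statistics can be computed (assuming the ratings sit in a dataframe with one column per attribute; the values below are made up):

```python
# Sketch: per-attribute rating distributions and pairwise Pearson correlations.
# The ratings below are made-up placeholders for HelpSteer2-style annotations.
import pandas as pd

ratings = pd.DataFrame({
    "helpfulness": [4, 2, 3, 1, 4, 3],
    "correctness": [4, 3, 3, 2, 4, 3],
    "coherence":   [4, 3, 4, 3, 4, 4],
    "complexity":  [2, 2, 1, 1, 3, 2],
    "verbosity":   [2, 4, 2, 1, 3, 2],
})

# Empirical probability distribution of each attribute's rating values.
for column in ratings.columns:
    print(column, ratings[column].value_counts(normalize=True).sort_index().to_dict())

# Pearson correlation between every pair of attributes.
print(ratings.corr(method="pearson").round(2))
```
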
**Models:** We applied this methodology across a comprehensive set of 34 LLMs, encompassing both closed-source and open-source models. Our selection includes 9 closed-source models from industry leaders (OpenAI's GPT series, Anthropic's Claude series, and Google's Gemini series) and 25 open-source models (including variants from the Llama-3, Mistral, Gemma, Qwen, and DeepSeek families). For closed-source models we used their official APIs, while open-source model inference was conducted through the Together API platform. This diverse selection lets us examine preference patterns across different architectures, scales, and training approaches.
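As a rough sketch of the collection step (an assumption about the setup, not the paper's actual template or code): a pairwise judgment can be requested from an OpenAI-compatible chat endpoint, which Together also exposes for open-source models. The judging prompt and parsing below are simplified placeholders, and the model identifier in the example call is illustrative.

```python
# Sketch: eliciting one pairwise preference judgment from an OpenAI-compatible
# chat endpoint. The judging prompt is a simplified placeholder, not the
# template used in the study.
import os
from openai import OpenAI

# Together exposes an OpenAI-compatible endpoint for open-source models;
# closed-source models would instead be queried through their official APIs.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

def judge_preference(model: str, prompt: str, response_a: str, response_b: str) -> str:
    """Ask `model` which of two responses to `prompt` it prefers ('A' or 'B')."""
    judging_prompt = (
        f"Prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is better? Answer with exactly 'A' or 'B'."
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judging_prompt}],
        temperature=0.0,
    )
    return completion.choices[0].message.content.strip()

# Example call (the model identifier is illustrative):
# judge_preference("meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo", "Explain RLHF.", "...", "...")
```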