
Commit f5c4ee9: update blog
1 parent 08f2951 commit f5c4ee9

2 files changed: +13 additions, -4 deletions


content/posts/2025-01-22-decision-tree-reward-model/index.md

Lines changed: 13 additions & 4 deletions
@@ -20,12 +20,21 @@ math: true
**Dataset**:

**Tech Report**: To be released soon

---

# Abstract

Modern Large Language Models (LLMs) are typically aligned with human preferences through Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). While the goal is for aligned LLMs to exhibit human-like preferences, the internal decision-making mechanisms driving these preferences remain opaque. We present a novel interpretability framework that leverages decision trees to model and analyze how LLMs make preference judgments between response pairs to the same prompt.
Using the HelpSteer2 dataset, which provides 5-dimensional human ratings for responses, we train decision trees to predict preferences across modern LLMs of different sizes and compare them with human preference structures. Our analysis reveals that leading models like GPT-4 and Claude-3.5 demonstrate decision-making patterns that closely mirror human preference architectures, while some other LLMs exhibit systematic divergences. A key finding is that humans show minimal dependence on response verbosity in their judgments, whereas certain LLMs rely heavily on verbosity when choosing between responses.
We extend this decision-tree framework to develop interpretable reward models that serve as human preference proxies. By fine-tuning Skywork-Reward-Llama-3.1-8B-v0.2 on HelpSteer2 to predict multi-objective ratings, we further fit a decision tree on top of it with human preference data. Our resulting model, Llama-3.1-8B-Decision-Tree-Reward, achieves state-of-the-art performance on RewardBench while providing explicit decision-making paths for its preferences. To facilitate further research in interpretable preference learning, we release our codebase, collected LLM preference data, and trained models.

![Workflow](decision_tree_workflow.png)
**Decision-tree reward model architecture**
Architecture of our decision tree reward model. It consists of a regression score rating module that continuously evaluates response quality across multiple dimensions, and a decision tree that recursively grows via greedy splitting on reward differences to yield interpretable decision paths.
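
To make the decision-tree component concrete, here is a minimal sketch under stated assumptions: per-attribute scores for each response come from a multi-objective reward model (random placeholders below), the features fed to the tree are the per-attribute score differences between the paired responses, and the label is the human preference. The attribute names follow HelpSteer2; everything else (array names, tree depth, thresholds) is illustrative rather than the authors' exact setup.

```python
# Minimal sketch (not the released training code) of the decision-tree half of
# the architecture: features are per-attribute reward differences between the
# two responses in a pair, labels say which response the human preferred.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# The five HelpSteer2 attributes used as rating dimensions.
attributes = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# Placeholder data standing in for reward-model scores and human labels.
rng = np.random.default_rng(0)
ratings_a = rng.uniform(0.0, 4.0, size=(1000, len(attributes)))
ratings_b = rng.uniform(0.0, 4.0, size=(1000, len(attributes)))
prefers_a = (ratings_a[:, 0] > ratings_b[:, 0]).astype(int)

# Feature vector: per-attribute reward difference between response A and B.
X = ratings_a - ratings_b

# A shallow tree keeps every decision path short and human-readable;
# splits are chosen greedily, as in standard CART training.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50, random_state=0)
tree.fit(X, prefers_a)

# Print the explicit decision paths, e.g. "helpfulness_diff > 0.21 -> prefer A".
print(export_text(tree, feature_names=[f"{a}_diff" for a in attributes]))
```

Because the tree is shallow, each prediction reduces to a few threshold tests on attribute differences, which is exactly the kind of explicit decision path described above.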
# Motivation

The alignment of Large Language Models (LLMs) with human preferences has become a central challenge in AI development. While methods like RLHF have shown impressive results in making LLMs more helpful and aligned with human values, they often operate as black boxes, making it difficult to understand how these models actually make preference decisions between different responses.
@@ -73,11 +82,11 @@ Thanks to decent instruction-following capabilities in modern LLMs, this template
* `coherence`: Consistency and clarity of expression.
* `complexity`: Intellectual depth required to write the response (i.e., whether the response can be written by anyone with basic language competency or requires deep domain expertise).
* `verbosity`: Amount of detail included in the response, relative to what is asked for in the prompt.
Here we show the probability distribution of each attribute and the Pearson correlation between each pair of attributes.
![Distribution](helpsteer_distribution.png)
![Correlation](helpsteer_correlation.png)
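
For readers who want to reproduce these statistics, the sketch below computes the attribute distributions and the pairwise Pearson correlations from the public HelpSteer2 release on the Hugging Face Hub. The dataset ID `nvidia/HelpSteer2` and the 0-4 integer rating columns are assumptions based on the official release, not code taken from this post.

```python
# Minimal sketch: attribute distributions and pairwise Pearson correlations
# from HelpSteer2. Assumes the public nvidia/HelpSteer2 dataset, which stores
# one integer rating column per attribute.
from datasets import load_dataset

attributes = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# Load the training split into a pandas DataFrame for easy aggregation.
df = load_dataset("nvidia/HelpSteer2", split="train").to_pandas()

# Probability distribution of each attribute's ratings.
for attr in attributes:
    dist = df[attr].value_counts(normalize=True).sort_index().round(3)
    print(attr, dist.to_dict())

# Pearson correlation between each pair of attributes.
print(df[attributes].corr(method="pearson").round(2))
```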
**Models:** We applied this methodology across a comprehensive set of 34 LLMs, encompassing both closed and open-source models. Our selection includes 9 closed-source models from industry leaders (OpenAI's GPT series, Anthropic's Claude series, and Google's Gemini series) and 25 open-source models (including model variants of the Llama-3, Mistral, Gemma, Qwen, and DeepSeek families). For closed-source models, we utilized their official APIs, while open-source model inference was conducted through the Together API platform. This diverse model selection enables us to examine preference patterns across different architectures, scales, and training approaches.
* **Open-Source models:**
  * **Llama-3**: Llama-3-8B, Llama-3-70B, Llama-3.1-8B, Llama-3.1-70B, Llama-3.1-405B, Llama-3.1-Nemotron-70B, Llama-3.2-3B, Llama-3.2-11B-Vision, Llama-3.2-90B-Vision, Llama-3.3-70B
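
To make the collection setup concrete, the sketch below sends the same response pair to one closed-source judge (via the OpenAI API) and one open-source judge (via Together's OpenAI-compatible endpoint) and parses a single-letter verdict. The prompt wording, the `judge` helper, the parsing convention, and the specific model identifiers are illustrative assumptions, not the exact template or code used in the study.

```python
# Minimal sketch of preference collection. PREFERENCE_TEMPLATE, the parsing
# convention, and the model identifiers are illustrative placeholders only.
from openai import OpenAI

PREFERENCE_TEMPLATE = (
    "Prompt:\n{prompt}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}\n\n"
    "Which response is better? Answer with a single letter: A or B."
)

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
together_client = OpenAI(
    base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
    api_key="YOUR_TOGETHER_API_KEY",
)

def judge(client: OpenAI, model: str, prompt: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' according to the model's stated preference."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": PREFERENCE_TEMPLATE.format(
                prompt=prompt, response_a=response_a, response_b=response_b
            ),
        }],
    )
    return completion.choices[0].message.content.strip()[:1].upper()

# Example: query one closed-source and one open-source judge on the same pair.
pair = ("How long should I boil an egg?",
        "About 7 minutes in boiling water for a firm yolk.",
        "Eggs are a kind of food.")
print(judge(openai_client, "gpt-4o", *pair))
print(judge(together_client, "meta-llama/Llama-3.3-70B-Instruct-Turbo", *pair))
```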
