Our evaluation reveals several key findings:
- The strong performance across all categories suggests that our decision-tree approach accurately captures human preference patterns while maintaining interpretability.
# Usage Code
Before using the model, ensure you have the following dependencies installed:
- `transformers==4.45.2`
- `torch>=2.5.0`
- `flash-attn>=2.6.3`
Note: this code requires a GPU with NVIDIA Ampere architecture or newer, since `flash-attn` 2 does not support older GPU architectures.
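The pinned versions can be installed with pip, for example (`flash-attn` builds against the already-installed `torch`, hence the separate step):

```bash
pip install transformers==4.45.2 "torch>=2.5.0"
pip install "flash-attn>=2.6.3" --no-build-isolation
```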
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name ="Decision-Tree-Reward-Llama-3.1-8B"# Another choice is "Decision-Tree-Reward-Gemma-2-27B"
271
+
repo_id =f"RLHFlow/{model_name}"
272
+
device ="cuda"
273
+
# Initialize the model and tokenizer
274
+
model = AutoModelForSequenceClassification.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# A worked example: one prompt with a correct and a flawed response
prompt ="Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
281
+
response1 ="1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
282
+
response2 ="1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."
This paper introduces a novel framework for interpreting LLM preference mechanisms through decision trees, demonstrating that leading models like GPT-4o, Claude-3.5-Sonnet, and Llama-3.1-405B closely mirror human decision-making patterns while some open-source models show systematic biases. Building on these insights, we develop an interpretable reward modeling approach that achieves state-of-the-art performance on RewardBench while providing explicit decision paths. Our analysis reveals consistent preference patterns within model families, suggesting that architecture and training methodology significantly influence how models make decisions. By making preference mechanisms more transparent and interpretable, this work provides valuable tools for improving LLM alignment. To facilitate further research, we release our codebase, preference data, and trained models that combine strong performance with clear interpretability.