@@ -233,26 +234,26 @@ Looking at the decision trees fitted to our multi-objective reward models:
We evaluate our decision-tree-based reward models on [Reward-Bench](https://huggingface.co/spaces/allenai/reward-bench), a comprehensive benchmark designed to assess reward model performance across multiple dimensions of LLM alignment. Reward-Bench evaluates models on four key aspects: general chat quality, challenging chat scenarios, safety considerations, and reasoning capabilities.
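
Concretely, Reward-Bench scores are pairwise accuracies: for each prompt, the reward model assigns a scalar score to a chosen and a rejected response, and the pair counts as correct when the chosen response scores higher. Below is a minimal sketch of that comparison, assuming a generic sequence-classification reward model with a chat template; the checkpoint id is a placeholder, not one of the models in the table.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint id -- substitute the reward model actually under test.
MODEL_ID = "my-org/my-reward-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=1)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Scalar reward assigned to one prompt/response pair."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        return model(input_ids).logits[0, 0].item()

# A pair is counted as correct when the chosen response outscores the
# rejected one; each category score is the accuracy over its pairs.
prompt = "What is the capital of France?"
chosen, rejected = "The capital of France is Paris.", "It is Berlin."
pair_correct = reward(prompt, chosen) > reward(prompt, rejected)
```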

| Rank | Model | Base Model | Method | Overall Score | Chat | Chat Hard | Safety | Reasoning |
|------|-------|------------|--------|---------------|------|-----------|--------|-----------|

-The Gemma-2-27B version achieves state-of-the-art performance with a 95.3 overall score, leading in both reasoning tasks (99.1) and challenging chat scenarios (91.4).
-Both decision tree models show substantial improvements over their base Skywork versions, with relative error reductions of 26.3% for Gemma-2-27B and 17.4% for Llama-3.1-8B.
-The strong performance across all categories suggests that our decision-tree approach successfully captures nuanced preference patterns while maintaining interpretability.
+Our Gemma-2-27B version achieves state-of-the-art performance with a 95.4 overall score, leading in both the Chat Hard (91.4) and Reasoning (99.2) categories.
+Both decision tree models show substantial improvements over their base Skywork versions, with relative error reductions of 19.3% for the Gemma-2-27B version and 20.3% for the Llama-3.1-8B version.
+The strong performance across all categories suggests that our decision-tree approach accurately captures human preference patterns while maintaining interpretability.
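
For reference, the relative error reduction figures compare error rates rather than raw scores, treating the error as 100 minus the overall Reward-Bench score. A small worked check, assuming a hypothetical base overall score of 94.3 (the Skywork base scores are not shown in this excerpt):

```python
def relative_error_reduction(base_score: float, new_score: float) -> float:
    """Relative error reduction between two 0-100 overall scores.

    Treats 100 - score as the error rate and measures the drop in
    error relative to the base model's error.
    """
    base_error = 100.0 - base_score
    new_error = 100.0 - new_score
    return (base_error - new_error) / base_error

# Hypothetical base overall score of 94.3 vs. the 95.4 reported above:
# (5.7 - 4.6) / 5.7 ≈ 0.193, i.e. the ~19.3% quoted for Gemma-2-27B.
print(f"{relative_error_reduction(94.3, 95.4):.1%}")  # 19.3%
```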