arxiv:2510.01857

Learning Reasoning Reward Models from Expert Demonstration via Inverse Reinforcement Learning

Published on Oct 2, 2025

Abstract

An inverse reinforcement learning framework learns dense token-level reasoning rewards from expert demonstrations, enabling both policy optimization and inference-time improvement while providing interpretable diagnostics for logical errors.

AI-generated summary

Reasoning in large language models is typically trained via supervised fine-tuning (SFT) on expert traces, often framed as knowledge distillation, or reinforcement learning (RL) with outcome-based verifiable rewards. However, SFT focuses on imitation rather than optimisation, while outcome-based RL requires a well-defined reward function. We propose an inverse reinforcement learning (IRL) framework that learns (partially) dense token-level reasoning reward models directly from expert demonstrations. We show that this learned reward serves a dual purpose: (1) as a dense training signal that optimises policies to reason more effectively, outperforming SFT baselines on GSM8K (79% vs. 56%) and MedReason (74% vs. 65%); and (2) as an inference-time assistant that improves performance via reward-guided reranking, yielding gains of up to 12 percentage points on Llama3 architectures. Furthermore, our dense rewards provide interpretable, step-wise diagnostics that can indicate the location of logical errors. This work proposes a process-level reasoning learning framework from data, bridging the gap between imitation and reinforcement learning for reasoning.
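The inference-time use of the learned reward described above amounts to best-of-N reranking: sample several candidate reasoning traces, score each token of each trace with the dense reward model, and return the trace with the highest total score. The sketch below illustrates the idea only; it is not the authors' implementation, and `token_reward`, `toy_reward`, and the candidate traces are hypothetical stand-ins for a learned token-level reward model and sampled generations.

```python
from typing import Callable, List

def rerank_by_dense_reward(
    candidates: List[List[str]],
    token_reward: Callable[[List[str], int], float],
) -> List[str]:
    """Return the candidate trace with the highest summed token-level reward.

    candidates: sampled reasoning traces, each a list of tokens.
    token_reward: hypothetical learned reward, scoring token i of a trace.
    """
    def trace_score(trace: List[str]) -> float:
        # Dense reward: every token position contributes to the score,
        # unlike outcome-based RL, which scores only the final answer.
        return sum(token_reward(trace, i) for i in range(len(trace)))

    return max(candidates, key=trace_score)

# Toy stand-in reward that favours traces with explicit verification tokens.
def toy_reward(trace: List[str], i: int) -> float:
    return 1.0 if trace[i] == "check" else 0.0

traces = [
    ["step1", "step2", "answer"],
    ["step1", "check", "step2", "check", "answer"],
]
best = rerank_by_dense_reward(traces, toy_reward)
```

Because the reward is per-token, the same `trace_score` breakdown can also be inspected position by position, which is how step-wise diagnostics of logical errors would fall out of this setup.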

