arXiv:2602.07670

Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation

Published on Feb 7 · Submitted by Jarrod Barnes on Feb 11

Abstract

AI-generated summary: Test-time training fails in verifiable execution-grounded tasks due to over-sharpening, while surprisal-guided selection improves performance by favoring diverse, low-confidence samples.

Test-time training (TTT) adapts language models through gradient-based updates at inference. But is adaptation the right strategy? We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks: domains like GPU kernel optimization where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B-parameter model (GPT-OSS-120B with LoRA adaptation), we find that search outperforms minimal adaptation (1-5 gradient steps): Best-of-N sampling achieves 90% task success (18/20 tasks) at K=64 across the full KernelBench L1 eval set, while TTT's best checkpoint reaches only 30.6% (3-seed mean), with TTT's "equivalent K" falling below 1, i.e. worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal-guided selection: choosing the highest-surprisal (lowest-confidence) correct sample yields 80% success vs. 50% for most-confident selection, a 30-percentage-point improvement. Extending to surprisal-guided top-3 selection matches oracle performance at 100%, and this zero-cost strategy holds up under length-controlled analysis. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal-guided selection principle may generalize to other execution-grounded domains where optimal solutions occupy the distribution tail.
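
One way to make the "equivalent K" claim concrete (a back-of-the-envelope sketch under my own assumption that samples are i.i.d.; the paper's formal definition may differ): if each independent sample succeeds with probability p, Best-of-K succeeds with probability 1 - (1 - p)^K, which can be inverted at TTT's success rate.

```python
import math

def equivalent_k(method_rate: float, single_sample_rate: float) -> float:
    """Best-of-K success under i.i.d. sampling is 1 - (1 - p)**K.
    Solving 1 - (1 - p)**K = method_rate for K gives an 'equivalent K'."""
    return math.log(1.0 - method_rate) / math.log(1.0 - single_sample_rate)

# Rates reported in the abstract and the author's comment below:
# TTT best checkpoint 30.6%, single random sample 53.3%.
print(equivalent_k(0.306, 0.533))  # ~0.48, i.e. below one sample's worth
```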

Community

Paper author · Paper submitter

Standard practice selects the most confident model output. I tested the opposite on GPU kernel optimization and found that selecting by surprisal (the model's least confident correct solution) achieves 80% success vs 50% for confidence-guided, with a 3.5x mean speedup advantage. Evaluating just the top 3 by surprisal matches oracle performance at 100%.

The key insight: a model's probability distribution maps frequency, not quality. Expert-level CUDA kernels are rare in training data, so the model assigns them low probability despite high performance. That signal is already in the logprobs at zero additional inference cost.
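
Here is a minimal sketch of how surprisal-guided selection could look in code (my illustration, not the paper's released implementation), assuming each sampled kernel comes back with per-token logprobs and a pass/fail verdict from the deterministic evaluator; the per-token length normalization is also my assumption:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    code: str
    token_logprobs: list[float]  # per-token log-probabilities from decoding
    correct: bool                # verdict from the execution-grounded evaluator

def surprisal(s: Sample) -> float:
    # Mean negative log-probability per token; higher = less confident.
    # Length-normalizing keeps long samples from dominating (cf. the
    # length-controlled analysis).
    return -sum(s.token_logprobs) / len(s.token_logprobs)

def surprisal_guided_select(samples: list[Sample], top_k: int = 1) -> list[Sample]:
    # Keep only evaluator-verified samples, then prefer the LEAST confident
    # ones -- the opposite of standard most-confident selection.
    passing = [s for s in samples if s.correct]
    return sorted(passing, key=surprisal, reverse=True)[:top_k]
```

With top_k=3 this would correspond to the surprisal-guided-top3 variant from the abstract.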

I also find that test-time training (gradient adaptation) is worse than random on dense-reward tasks. TTT's best checkpoint (30.6%) falls below a single random sample (53.3%). Gradient updates over-sharpen the distribution, destroying the expert tail where optimal solutions live.

Code, model weights, and a detailed write-up are available.
