FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation Paper • 2410.22257 • Published Oct 29, 2024
Logit Arithmetic Elicits Long Reasoning Capabilities Without Training Paper • 2507.12759 • Published Jul 17
From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models Paper • 2511.10899 • Published 30 days ago • 3
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists Paper • 2506.01241 • Published Jun 2 • 9