Junxiao Yang

yangjunxiao2021

https://yangjunx21.github.io/

yangjunx21

AI & ML interests

Alignment/AI safety

Recent Activity

upvoted a paper 26 days ago

HaluMem: Evaluating Hallucinations in Memory Systems of Agents

Paper • 2511.03506 • Published Nov 5 • 93

upvoted 3 papers about 1 month ago

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Paper • 2510.25726 • Published Oct 29 • 45

DeepAgent: A General Reasoning Agent with Scalable Toolsets

Paper • 2510.21618 • Published Oct 24 • 99

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Paper • 2509.09677 • Published Sep 11 • 34

upvoted an article about 1 month ago

DABStep: Data Agent Benchmark for Multi-step Reasoning

Article • Feb 4 • 121

upvoted a paper about 2 months ago

It Takes Two: Your GRPO Is Secretly DPO

Paper • 2510.00977 • Published Oct 1 • 31

upvoted a collection about 2 months ago

Agent & RL

Collection • 55 items • Updated 10 days ago • 20

upvoted 5 papers about 2 months ago

Glyph: Scaling Context Windows via Visual-Text Compression

Paper • 2510.17800 • Published Oct 20 • 67

A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning

Paper • 2510.15444 • Published Oct 17 • 147

Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

Paper • 2508.07976 • Published Aug 11 • 51

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Paper • 2510.05592 • Published Oct 7 • 105

Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs

Paper • 2509.24107 • Published Sep 28 • 78

upvoted 2 papers 2 months ago

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Paper • 2504.12516 • Published Apr 16 • 1

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Paper • 2504.19314 • Published Apr 27 • 7

upvoted 4 papers 3 months ago

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated

Paper • 2509.05739 • Published Sep 6 • 2

Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

Paper • 2509.09666 • Published Sep 11 • 34

SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

Paper • 2509.07968 • Published Sep 9 • 14

Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

Paper • 2509.03059 • Published Sep 3 • 24

upvoted a paper 6 months ago

MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems

Paper • 2505.18943 • Published May 25 • 24

upvoted a paper 7 months ago

BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs

Paper • 2505.13529 • Published May 18 • 11
