Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
Abstract
Deep Forcing is a training-free method that enhances real-time video diffusion, addressing temporal repetition, drift, and motion deceleration through Deep Sink and Participative Compression to achieve high-quality, long-duration video generation.
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that require no fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV-cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution generation lengths. Together, these components enable over 12x extrapolation (e.g., a model trained on 5s clips generating 60s+ videos) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, near-parity in overall consistency, and substantial gains in dynamic degree, all while sustaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches to autoregressive streaming long-video generation.
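The abstract only names the Deep Sink mechanism, so the following is a minimal PyTorch sketch of one plausible reading: keep the first half of the sliding window as persistent sink keys and shift their RoPE phase so they sit contiguously in front of the current local window. The helper names (`rope_rotate`, `realign_sinks`), the window bookkeeping, and the layout of the cache are illustrative assumptions, not the authors' implementation.

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate token features with 1-D temporal RoPE at the given positions.
    x: (T, dim) with even dim; positions: (T,). RoPE rotations compose
    additively, so rotating by (new_pos - old_pos) re-aligns a key that was
    already rotated at old_pos."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=x.dtype) / dim))
    angles = positions[:, None].to(x.dtype) * inv_freq[None, :]  # (T, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def realign_sinks(k_cache: torch.Tensor, cached_pos: torch.Tensor,
                  window: int, t_now: int) -> torch.Tensor:
    """Keep the first window//2 cached keys as persistent sink tokens and
    re-align their RoPE phase to positions just before the recent half-window,
    so the current timeline sees them at in-distribution offsets."""
    num_sink = window // 2
    k_sink = k_cache[:num_sink]
    new_pos = torch.arange(t_now - window, t_now - window + num_sink)
    delta = new_pos.to(cached_pos.dtype) - cached_pos[:num_sink]
    return rope_rotate(k_sink, delta)

# Toy usage: 32 cached keys of dim 64, window of 16 -> 8 re-aligned sinks.
k = torch.randn(32, 64)                      # keys already RoPE-rotated at pos
pos = torch.arange(32, dtype=torch.float32)  # their original temporal positions
k_sink = realign_sinks(k, pos, window=16, t_now=500)  # shape (8, 64)
```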
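Likewise, Participative Compression is described only at the level of "importance-aware KV-cache pruning." The sketch below shows one straightforward instantiation under that description: score each cached token by the attention mass it receives from recent-frame queries and keep only the top participants. The scoring rule, the `budget` parameter, and the single-head layout are assumptions for illustration.

```python
import torch

def prune_by_participation(k_cache: torch.Tensor, v_cache: torch.Tensor,
                           recent_q: torch.Tensor, budget: int):
    """k_cache, v_cache: (T, d) cached keys/values; recent_q: (Q, d) queries
    from recent frames. Returns the `budget` cache entries with the highest
    total recent attention, preserving their original temporal order."""
    scale = k_cache.shape[-1] ** -0.5
    # Attention weights of recent queries over the whole cache: (Q, T).
    attn = torch.softmax(recent_q @ k_cache.T * scale, dim=-1)
    # Participation score: total attention mass each cached token receives.
    participation = attn.sum(dim=0)                              # (T,)
    budget = min(budget, k_cache.shape[0])
    keep = participation.topk(budget).indices.sort().values
    return k_cache[keep], v_cache[keep]

# Toy usage: prune a 128-token cache down to 64 entries.
k, v = torch.randn(128, 64), torch.randn(128, 64)
q_recent = torch.randn(16, 64)
k_kept, v_kept = prune_by_participation(k, v, q_recent, budget=64)
```

Summing attention over recent queries is a common proxy for token importance; tokens that no recent frame attends to are exactly the "redundant and degraded history" the abstract says can be discarded safely.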
Community
The following related papers were recommended by the Semantic Scholar API:
- Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout (2025)
- Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models (2025)
- MotionStream: Real-Time Video Generation with Interactive Motion Controls (2025)
- BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation (2025)
- MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation (2025)
- FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution (2025)
- StreamingTOM: Streaming Token Compression for Efficient Video Understanding (2025)