LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
This is the official model repository for LiveTalk.
LiveTalk enables real-time multimodal interactive avatar video generation through an improved on-policy distillation approach. By distilling bidirectional diffusion models into causal, few-step autoregressive models, LiveTalk achieves a more than 20× speedup, enabling a seamless real-time interactive experience.
Highlights
- Real-Time Generation: Achieves 24.82 FPS throughput with 0.33s first-frame latency.
- Multimodal Conditioning: Supports text, image, and audio inputs for flexible avatar control.
- Efficient Inference: Reduces inference time from ~83s to real-time through 4-step diffusion distillation.
- Multi-Turn Coherence: Demonstrates competitive performance against state-of-the-art models in multi-round interaction benchmarks.
- End-to-End System: Provides integration with audio language models for conversational AI applications.
Get started
Installation
For detailed setup instructions, including environment configuration and dependency installation, please refer to the official GitHub repository.
Inference
Once the environment and checkpoints are prepared, you can execute the inference script:
```bash
bash ./scripts/inference.sh
```
Input Requirements (a minimal preparation sketch follows this list):
- Image: Reference image of the person (JPG/PNG format).
- Audio: Speech audio file (WAV format, 16 kHz sample rate recommended).
- Text Prompt: Description of the desired video characteristics.
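For convenience, here is a minimal Python sketch of how these inputs might be loaded and normalized before invoking the script. The file names and the `prepare_inputs` helper are hypothetical and not part of the repository; the 16 kHz resampling simply follows the recommendation above.

```python
# Hypothetical input-preparation sketch (not part of the official repo):
# load the reference image and resample speech audio to 16 kHz.
import torchaudio
from PIL import Image


def prepare_inputs(image_path: str, audio_path: str, target_sr: int = 16_000):
    # Load and sanity-check the reference image (JPG/PNG).
    image = Image.open(image_path).convert("RGB")

    # Load the speech waveform and resample to 16 kHz if needed.
    waveform, sr = torchaudio.load(audio_path)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)

    return image, waveform


image, waveform = prepare_inputs("reference.png", "speech.wav")
prompt = "A person speaking naturally to the camera."  # free-form text prompt
```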
Method Overview
LiveTalk tackles the difficulty of distilling multimodal video diffusion models with an improved on-policy distillation recipe. It combines curated multimodal conditions, converged ODE initialization, and aggressive optimization to eliminate training instabilities that surface as artifacts such as flickering or black frames, while preserving high-quality, lip-synced results.
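To make the recipe concrete, here is a minimal sketch of what a few-step causal autoregressive sampling loop looks like at inference time. The `DistilledCausalGenerator` stub, the chunking scheme, and the 4-step timestep schedule are illustrative assumptions, not the repository's actual API.

```python
# Illustrative sketch of few-step autoregressive video sampling after
# distillation. The module, chunking, and timestep schedule below are
# assumptions for exposition only, not LiveTalk's actual interfaces.
import torch


class DistilledCausalGenerator(torch.nn.Module):
    """Stand-in for the distilled causal denoiser."""

    def forward(self, noisy_chunk, t, context, conditions):
        # A real model would refine the chunk at noise level t while
        # attending causally to previously generated frames (`context`)
        # and to the multimodal conditions (text, image, audio).
        return noisy_chunk  # placeholder


@torch.no_grad()
def generate(model, conditions, num_chunks=8, chunk_frames=4,
             timesteps=(1.0, 0.75, 0.5, 0.25)):
    context, video = [], []
    for _ in range(num_chunks):
        # Each chunk starts from pure noise.
        x = torch.randn(1, chunk_frames, 3, 256, 256)
        # Only 4 denoising steps, versus many steps for the bidirectional teacher.
        for t in timesteps:
            x = model(x, t, context, conditions)
        context.append(x)  # causal history conditions the next chunk
        video.append(x)
    return torch.cat(video, dim=1)
```

Because each chunk depends only on past frames, generation can stream: a chunk can be displayed as soon as its few denoising steps finish, which is what makes low first-frame latency achievable.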
Citation
If you find this work useful for your research, please cite:
```bibtex
@article{chern2025livetalk,
  title={LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation},
  author={Chern, Ethan and Hu, Zhulin and Tang, Bohao and Su, Jiadi and Chern, Steffi and Deng, Zhijie and Liu, Pengfei},
  journal={arXiv preprint arXiv:2512.23576},
  year={2025}
}
```