💡 DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Paper on arXiv · GitHub

DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger, demonstrating that massive scaling is not the sole path to high-performance multimodal generation.

🧠 Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with, or even surpassing, much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance (see the sketch after the list below). We further design a data-centric training strategy spanning three progressive stages:

1. Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations.
2. Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities.
3. Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, yielding substantial gains in generation quality and alignment with human preferences while keeping training stable and free of visual artifacts.
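The following is a minimal, hypothetical PyTorch sketch of how an SCB-style bridge could be wired: hidden states from several VLM layers are stacked along the channel dimension, projected into the DiT conditioning space, and prepended with learnable think tokens. The layer selection, dimensions, module names, and fusion operator here are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class StackedChannelBridge(nn.Module):
    """Hypothetical SCB-style bridge (illustrative only, not the official code).

    Channel-stacks hidden states from several VLM layers, projects them into
    the DiT conditioning space, and prepends learnable "think tokens".
    """

    def __init__(self, vlm_dim=2048, dit_dim=1536, num_vlm_layers=4, num_think_tokens=64):
        super().__init__()
        # Learnable "think tokens" that carry reasoning-oriented guidance.
        self.think_tokens = nn.Parameter(torch.randn(num_think_tokens, dit_dim) * 0.02)
        # Project the channel-stacked VLM features into the DiT conditioning space.
        self.proj = nn.Linear(vlm_dim * num_vlm_layers, dit_dim)
        self.norm = nn.LayerNorm(dit_dim)

    def forward(self, vlm_hidden_states):
        # vlm_hidden_states: list of [batch, seq, vlm_dim] tensors taken from
        # several intermediate VLM layers (hierarchical features).
        stacked = torch.cat(vlm_hidden_states, dim=-1)          # [B, S, vlm_dim * L]
        cond = self.norm(self.proj(stacked))                    # [B, S, dit_dim]
        think = self.think_tokens.expand(cond.size(0), -1, -1)  # [B, T, dit_dim]
        # Conditioning sequence handed to the DiT (e.g. via cross-attention).
        return torch.cat([think, cond], dim=1)                  # [B, T + S, dit_dim]

if __name__ == "__main__":
    bridge = StackedChannelBridge()
    feats = [torch.randn(2, 77, 2048) for _ in range(4)]  # dummy features from 4 VLM layers
    print(bridge(feats).shape)  # torch.Size([2, 141, 1536])
```

In the actual model, the fused sequence conditions the 2B DiT; refer to the GitHub repository for the real SCB definition.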

📊 Benchmarks

1. General Image Generation

| Model | Params | Geneval ↑ | DPGBench ↑ | UniGenBench ↑ |
|---|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | – |
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | – |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | – | 84.78 | – |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.86 🥉 | 87.05 | 74.18 🥉 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |

2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
|---|---|---|---|
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| DeepGen 1.0 (SFT) | 3B + 2B | 7.12 | 4.09 |
| DeepGen 1.0 (RL) | 3B + 2B | 7.17 🥉 | 4.14 🥉 |

3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | – | 43.7 |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.72 🥈 | 45.7 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.73 🥇 | 46.5 🥈 |

4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | – | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| DeepGen 1.0 (SFT) | 3B + 2B | 13.3 🥇 | 77.5 🥇 |
| DeepGen 1.0 (RL) | 3B + 2B | 10.8 🥉 | 75.7 🥈 |

🎨 Qualitative results

πŸ› οΈ Usage

Merge ZIP Files

To use the DeepGen checkpoints, please merge the sharded model files first. We release Pre-training, Supervised Fine-Tuning, and Reinforcement Learning checkpoints.

```bash
# Merge zip parts into a single archive
cat DeepGen_CKPT.zip.part-* > DeepGen_CKPT.zip
# Unzip DeepGen checkpoints
unzip DeepGen_CKPT.zip
```
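If `cat` is unavailable (for example on Windows), the shards can also be joined with a short Python script. This is an optional convenience sketch, not part of the official release; it assumes the part suffixes sort lexicographically, as produced by the standard `split` tool.

```python
# merge_ckpt.py: optional helper for joining the checkpoint shards without `cat`.
import glob
import shutil

# Assumption: shard suffixes sort lexicographically (the default `split` naming).
parts = sorted(glob.glob("DeepGen_CKPT.zip.part-*"))
assert parts, "No DeepGen_CKPT.zip.part-* files found in the current directory"

with open("DeepGen_CKPT.zip", "wb") as merged:
    for part in parts:
        with open(part, "rb") as shard:
            shutil.copyfileobj(shard, merged)

print(f"Merged {len(parts)} parts into DeepGen_CKPT.zip")
```

After merging, extract the archive with `unzip DeepGen_CKPT.zip` as shown above.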