---
license: other
library_name: diffusers
pipeline_tag: image-to-video
tags:
- wan
- image-to-video
- video-generation
- gguf
---

# Wan 2.2 Image-to-Video (I2V-A14B) - GGUF FP16 Quantized Models

This repository contains GGUF quantized versions of the **Wan 2.2 Image-to-Video A14B** model, optimized for efficient inference with reduced VRAM requirements while maintaining high-quality video generation.

## Model Description

Wan 2.2 is an advanced large-scale video generative model that uses a **Mixture-of-Experts (MoE) architecture** designed for image-to-video synthesis. The A14B variant features a dual-expert design with approximately 14 billion parameters per expert:

- **High-Noise Expert**: Optimized for early denoising stages, focusing on overall layout and composition
- **Low-Noise Expert**: Specialized for later denoising stages, refining video details and quality

The model generates videos at **480P and 720P resolutions** from static images, with support for text prompts to guide the generation process. Wan 2.2 incorporates meticulously curated aesthetic data with detailed labels for lighting, composition, contrast, and color tone, enabling precise cinematic-style video generation.

## Repository Contents

This repository contains three GGUF model files optimized for different use cases:

```
diffusion_models/wan/
├── wan22-i2v-a14b-high.gguf          (15 GB)  - Full FP16 high-noise expert
├── wan22-i2v-a14b-high-q4-k-s.gguf   (8.2 GB) - Q4_K_S quantized high-noise expert
└── wan22-i2v-a14b-low-q4-k-s.gguf    (8.2 GB) - Q4_K_S quantized low-noise expert
```

**Total Repository Size**: 31 GB

### Model Files Explained

- **wan22-i2v-a14b-high.gguf**: Full-precision FP16 high-noise expert for maximum quality
- **wan22-i2v-a14b-high-q4-k-s.gguf**: Q4_K_S quantized high-noise expert (~46% size reduction)
- **wan22-i2v-a14b-low-q4-k-s.gguf**: Q4_K_S quantized low-noise expert (~46% size reduction)

**Quantization Format**: Q4_K_S (4-bit K-quant Small) provides a good balance between model size, memory usage, and generation quality.

## Hardware Requirements

### Minimum Requirements

| Configuration | VRAM | Disk Space | RAM |
|--------------|------|------------|-----|
| **Full FP16** | 24 GB | 31 GB | 32 GB |
| **Q4_K_S Quantized** | 12 GB | 31 GB | 16 GB |
| **Mixed (FP16 + Q4_K_S)** | 18 GB | 31 GB | 24 GB |

### Recommended Requirements

- **GPU**: NVIDIA RTX 4090 (24 GB), RTX 6000 Ada (48 GB), or A6000 (48 GB)
- **CPU**: Modern multi-core processor (8+ cores recommended)
- **Storage**: SSD for faster model loading
- **Operating System**: Windows 10/11, Linux (Ubuntu 22.04+)

### Performance Notes

- **FP16 models** provide the highest quality but require more VRAM
- **Q4_K_S quantization** reduces VRAM usage by ~50% with minimal quality loss (a simple way to pick a variant is sketched after this list)
- Video generation time depends on resolution (480P ~30-60 s, 720P ~60-120 s per video)
- Batch processing can improve throughput but requires additional VRAM
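If you are unsure which variant to download, the choice follows directly from the minimum-requirements table above. The following is a minimal sketch, assuming PyTorch is installed and the file names from this repository are unchanged; the 24 GB threshold simply mirrors the table.

```python
# Minimal sketch: pick the high-noise expert variant based on total GPU memory.
# Assumes PyTorch is installed and the GGUF files sit under diffusion_models/wan/.
from pathlib import Path

import torch

MODEL_DIR = Path("diffusion_models/wan")
FP16_MODEL = MODEL_DIR / "wan22-i2v-a14b-high.gguf"        # ~24 GB VRAM class
Q4_MODEL = MODEL_DIR / "wan22-i2v-a14b-high-q4-k-s.gguf"   # ~12 GB VRAM class


def pick_high_noise_model() -> Path:
    """Return the FP16 file on >=24 GB GPUs, otherwise the Q4_K_S file."""
    if not torch.cuda.is_available():
        # Without a CUDA GPU, the quantized file is the only realistic option.
        return Q4_MODEL
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return FP16_MODEL if total_gb >= 24 else Q4_MODEL


if __name__ == "__main__":
    print(f"Selected high-noise model: {pick_high_noise_model()}")
```

The same reasoning applies to the low-noise expert, for which this repository only provides the Q4_K_S file.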
## Usage Examples

### ComfyUI Integration

The most common way to use these GGUF models is through [ComfyUI](https://github.com/comfyanonymous/ComfyUI) with the [ComfyUI-GGUF](https://github.com/city96/ComfyUI-GGUF) custom node.

**Installation**:

```bash
# Navigate to the ComfyUI custom nodes directory
cd ComfyUI/custom_nodes

# Clone the GGUF node
git clone https://github.com/city96/ComfyUI-GGUF

# Install dependencies
cd ComfyUI-GGUF
pip install -r requirements.txt
```

**Model Setup**:

```bash
# Copy models to the ComfyUI models directory
# (Windows-style source path shown; adjust to your local download location)
cp E:\huggingface\wan22-fp16-i2v-gguf\diffusion_models\wan\*.gguf ComfyUI\models\unet\
```

**Workflow Configuration**:

1. Load image input node
2. Add a **GGUF Model Loader** node
3. Select `wan22-i2v-a14b-high-q4-k-s.gguf` (for the high-noise expert)
4. Add prompt conditioning (optional)
5. Configure the video sampler with:
   - Steps: 50-100
   - CFG Scale: 7-9
   - Resolution: 480P or 720P
6. Connect to a video output node

### Python Usage (Diffusers)

For direct Python usage with absolute paths:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Note: GGUF models require conversion or specialized loaders.
# For native Diffusers support, use the base model:
# pipe = DiffusionPipeline.from_pretrained("Wan-AI/Wan2.2-I2V-A14B-Diffusers")

# For GGUF files, use ComfyUI or a llama.cpp-based loader.
# Example using a hypothetical GGUF loader interface (requires a compatible library;
# no such package ships with this repository):
from comfyui_gguf_loader import load_gguf_model

model_path = r"E:\huggingface\wan22-fp16-i2v-gguf\diffusion_models\wan\wan22-i2v-a14b-high-q4-k-s.gguf"
model = load_gguf_model(model_path, device="cuda", dtype=torch.float16)

# Generate video from an input image
image = load_image("input_image.jpg")
video = model.generate(
    image=image,
    prompt="A serene landscape with gentle wind moving through grass",
    num_frames=48,
    resolution="720p",
    guidance_scale=8.0,
    num_inference_steps=75,
)

# Save video
video.save("output_video.mp4")
```

### Advanced Configuration

```python
# Memory-optimized configuration for 12 GB VRAM
config = {
    "model_path": r"E:\huggingface\wan22-fp16-i2v-gguf\diffusion_models\wan\wan22-i2v-a14b-high-q4-k-s.gguf",
    "vae_tiling": True,              # Reduce VAE memory usage
    "enable_xformers": True,         # Memory-efficient attention
    "gradient_checkpointing": True,
    "low_vram_mode": True,
    "chunk_size": 2,                 # Process video in chunks
}
```

## Model Specifications

### Architecture

- **Base Model**: Wan 2.2 I2V-A14B (Image-to-Video)
- **Parameters**: 14.3 billion per expert (~27B total, ~14B active per step)
- **Architecture**: Mixture-of-Experts (MoE) Diffusion Transformer
- **Experts**: Dual-expert design (high-noise + low-noise)
- **Precision**: FP16 (full) / Q4_K_S (quantized)
- **Format**: GGUF (GPT-Generated Unified Format)

### Capabilities

- **Input**: Static images (any resolution; 512x512 or higher recommended)
- **Output**: Video sequences at 480P (854x480) or 720P (1280x720)
- **Frame Count**: Configurable (typically 24-96 frames)
- **Frame Rate**: 24 FPS (configurable)
- **Duration**: Typically 1-4 seconds of output
- **Text Conditioning**: Optional prompt-guided generation
- **Style Control**: Lighting, composition, contrast, color tone

### Quantization Details

**Q4_K_S Quantization**:

- **Bit Depth**: 4 bits per weight (mixed with some 6-bit components)
- **Method**: K-quant Small (balanced quality/size trade-off)
- **Size Reduction**: ~46% compared to FP16
- **Quality Loss**: Minimal (~2-5% perceptual difference)
- **Speed**: Similar or faster inference due to reduced memory bandwidth
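To confirm what a given file actually contains, the GGUF header and per-tensor quantization types can be inspected with the `gguf` Python package from the llama.cpp project (`pip install gguf`). The snippet below is a small sketch under that assumption; the metadata keys written by different conversion tools vary.

```python
# Sketch: inspect a GGUF file's metadata and per-tensor quantization types.
# Assumes the `gguf` package (pip install gguf) and a local copy of the file.
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("diffusion_models/wan/wan22-i2v-a14b-high-q4-k-s.gguf")

# Header key/value metadata written by the conversion tool
for name in reader.fields:
    print(name)

# Count tensors by quantization type (e.g. Q4_K, Q6_K, F16, F32)
type_counts = Counter(tensor.tensor_type.name for tensor in reader.tensors)
for qtype, count in type_counts.most_common():
    print(f"{qtype}: {count} tensors")
```

In a Q4_K_S file, most weight tensors typically report `Q4_K`, with a smaller number kept at higher precision (e.g. `F16`/`F32`) for quality-sensitive layers.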
## Performance Tips and Optimization

### Memory Optimization

1. **Use Quantized Models**: Start with the Q4_K_S versions on 12 GB VRAM systems
2. **Enable VAE Tiling**: Reduces memory usage by processing the image in tiles
3. **Lower Resolution**: Generate at 480P first, upscale if needed
4. **Reduce Batch Size**: Process one video at a time on limited VRAM
5. **Model Offloading**: Move models to CPU between inference steps

### Quality Optimization

1. **Inference Steps**: Use 75-100 steps for best quality (50 minimum)
2. **Guidance Scale**: CFG 7-9 provides good prompt adherence
3. **Prompt Engineering**: Describe motion, lighting, and camera movement
4. **Input Image Quality**: Higher-quality input yields better video output
5. **Resolution Matching**: Match the input aspect ratio to the output resolution

### Speed Optimization

1. **Use Quantized Models**: Q4_K_S inference is 10-20% faster
2. **Enable xFormers**: Memory-efficient attention for faster processing
3. **Optimize Steps**: Balance quality vs. speed (50-75 steps for faster generation)
4. **Compile Model**: Use `torch.compile()` for a 15-25% speedup on PyTorch 2.0+ (see the sketch after this list)
5. **GPU Warmup**: Run one generation to compile kernels before batch processing
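As a rough illustration of item 4 above, the sketch below wraps whatever expert transformers a loaded pipeline exposes with `torch.compile`. The attribute names `transformer` and `transformer_2` are assumptions about the dual-expert Diffusers pipeline; adapt them to the loader you actually use.

```python
# Sketch: compile the denoising transformer(s) with torch.compile (PyTorch 2.0+).
# `pipe` stands in for an already-loaded pipeline; the attribute names below
# are assumptions and may differ in GGUF-based loaders.
import torch


def compile_denoisers(pipe):
    """Compile whichever expert transformers the pipeline exposes."""
    for attr in ("transformer", "transformer_2"):
        module = getattr(pipe, attr, None)
        if isinstance(module, torch.nn.Module):
            # "reduce-overhead" uses CUDA graphs to cut per-call launch cost;
            # the first call after compilation is slow while kernels are built.
            setattr(pipe, attr, torch.compile(module, mode="reduce-overhead"))
    return pipe
```

After compiling, run one throwaway generation (the GPU warmup in item 5) so kernel compilation does not count against the first real video.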
### Example Prompts

**Good Prompts**:

- "Gentle camera pan right, golden hour lighting, soft wind through trees"
- "Slow zoom in, dramatic lighting from left, subtle motion in background"
- "Static camera, clouds moving across sky, soft ambient lighting"

**Avoid**:

- Overly complex multi-action prompts
- Conflicting motion directions
- Unrealistic physics or transformations

## License

This model is released under a **custom Wan license**. Please refer to the original Wan 2.2 model repository for the complete licensing terms.

### Usage Terms

Users are accountable for the content they generate and must not:

- Violate laws or regulations
- Cause harm to individuals or groups
- Generate or spread misinformation or disinformation
- Target or harm vulnerable populations

### Commercial Use

Please consult the original Wan 2.2 license for commercial use terms and conditions.

## Citation

If you use Wan 2.2 models in your research or applications, please cite:

```bibtex
@article{wan2025,
  title={Wan: Open and Advanced Large-Scale Video Generative Models},
  author={Team Wan and Contributors},
  journal={arXiv preprint arXiv:2503.20314},
  year={2025}
}
```

## Related Resources

### Official Resources

- **Original Model**: [Wan-AI/Wan2.2-I2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B)
- **Diffusers Version**: [Wan-AI/Wan2.2-I2V-A14B-Diffusers](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers)
- **GGUF Collection**: [QuantStack/Wan2.2-I2V-A14B-GGUF](https://huggingface.co/QuantStack/Wan2.2-I2V-A14B-GGUF)
- **GitHub Repository**: [Wan-Video/Wan2.2](https://github.com/Wan-Video/Wan2.2)
- **Research Paper**: [arXiv:2503.20314](https://arxiv.org/abs/2503.20314)

### Community Resources

- **ComfyUI Integration**: [ComfyUI-GGUF](https://github.com/city96/ComfyUI-GGUF)
- **Tutorial**: [Wan 2.2 VideoGen in ComfyUI](https://www.stablediffusiontutorials.com/2025/08/wan-2.2-video-generation.html)
- **Low VRAM Guide**: [Running Wan 2.2 GGUF with Low VRAM](https://www.nextdiffusion.ai/tutorials/how-to-run-wan22-image-to-video-gguf-models-in-comfyui-low-vram)

### Other Wan 2.2 Variants

- **Text-to-Video**: [Wan2.2-T2V-A14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B)
- **Text+Image-to-Video**: [Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B)
- **Speech-to-Video**: [Wan2.2-S2V-14B](https://huggingface.co/Wan-AI/Wan2.2-S2V-14B)

## Troubleshooting

### Common Issues

**Issue**: Out-of-memory errors
**Solution**: Use the Q4_K_S quantized models, enable VAE tiling, reduce resolution to 480P

**Issue**: Slow generation speed
**Solution**: Use quantized models, enable xFormers, reduce inference steps to 50-75

**Issue**: Poor video quality
**Solution**: Increase inference steps to 75-100, use a higher guidance scale (8-9), improve input image quality

**Issue**: Model fails to load
**Solution**: Verify GGUF loader compatibility, check file integrity, ensure sufficient disk space

**Issue**: Inconsistent motion
**Solution**: Use clearer motion prompts, adjust the guidance scale, increase inference steps

## Support and Contact

For issues, questions, or contributions:

- **Model Issues**: [Wan-AI on Hugging Face](https://huggingface.co/Wan-AI)
- **GGUF Issues**: [ComfyUI-GGUF GitHub](https://github.com/city96/ComfyUI-GGUF/issues)
- **General Discussion**: [Hugging Face Forums](https://discuss.huggingface.co/)

---

**Model Version**: v2.2
**README Version**: v1.3
**Last Updated**: 2025-10-14
**Format**: GGUF (FP16 + Q4_K_S)
**Base Model**: Wan-AI/Wan2.2-I2V-A14B