Simplified t2i Workflow for Flux2D (also Workaround for broken MultiGPU Nodes)
Workaround on the fly: Due to ComfyUI updates, the "large" DisTorch2MultiGPU nodes are no longer functional. This workflow has been modified to use the older "small" MultiGPU nodes from the first version.
The workflow is meant to run on the DisTorch2MultiGPUv2 nodes, which control VRAM/RAM allocation and prevent VRAM overflow and the resulting swapping. Unfortunately, these nodes are currently broken due to the latest ComfyUI updates. As a fallback, the older MultiGPUv1 nodes are used (with up to 10-15% lower inference speed compared to DisTorch2).
🔗 GitHub (right-click to open in new tab)
pollockjj/ComfyUI-MultiGPU
Model Loading & GPU|CPU Distribution
UnetLoaderGGUFMultiGPU and VAELoaderMultiGPU are assigned directly to cuda:0.
ClipLoaderGGUFMultiGPU is fixed to cpu (CPU offloading).
Pinning fixed devices like this prevents noticeable VRAM swapping (see the sketch below).
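To illustrate the principle, here is a minimal PyTorch sketch with stand-in modules. This is not the MultiGPU nodes' actual loader code; it only shows why fixed device assignment avoids swapping:

```python
import torch

# Minimal sketch of the device-pinning idea (stand-in modules, NOT the
# MultiGPU nodes' real loader code): each model is moved to its device
# exactly once at load time, so nothing gets shuffled between RAM and
# VRAM during inference.
def load_pinned(module: torch.nn.Module, device: str) -> torch.nn.Module:
    return module.to(torch.device(device))

gpu = "cuda:0" if torch.cuda.is_available() else "cpu"
unet = load_pinned(torch.nn.Linear(8, 8), gpu)    # UnetLoaderGGUFMultiGPU -> cuda:0
vae  = load_pinned(torch.nn.Linear(8, 8), gpu)    # VAELoaderMultiGPU -> cuda:0
clip = load_pinned(torch.nn.Linear(8, 8), "cpu")  # ClipLoaderGGUFMultiGPU -> cpu
```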
Test System: RTX 3090 (24GB VRAM) + 32GB RAM
- RAM usage: ~65-77%
- VRAM usage: ~22-23GB
- Virtual VRAM: not used
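To compare your own system against these numbers, a small hedged helper (VRAM via torch; RAM via psutil, which you may need to install first):

```python
import torch

# Reports current VRAM and RAM usage, analogous to the figures above.
def report_memory() -> None:
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()  # bytes on the current device
        print(f"VRAM: {(total - free) / 1024**3:.1f} / {total / 1024**3:.1f} GB used")
    try:
        import psutil  # pip install psutil
        vm = psutil.virtual_memory()
        print(f"RAM:  {vm.percent:.0f}% used ({vm.used / 1024**3:.1f} / {vm.total / 1024**3:.1f} GB)")
    except ImportError:
        print("psutil not installed - RAM stats skipped")

report_memory()
```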
Quick Reference: FLUX.2 + Mistral-3-Small GGUF
| VRAM | FLUX.2 UNet | Mistral Text Encoder (CPU) | Notes |
|---|---|---|---|
| 24GB | Q8_0 (35GB) | Q8_K (29GB) | Best quality setup |
| 16GB | Q5_K_M (24.1GB) - try Q6_K (27.4GB) | Q6_K (19.3GB) or Q4_K_M (14.3GB) | Balanced quality |
| 12GB | Q4_K_M (20.1GB) | Q4_K_M (14.3GB) or Q3_K_M (11.5GB) | Speed priority |
🔗 HF (right-click to open in new tab)
city96/FLUX.2-dev-gguf
unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF
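If you want the table above as a lookup in a script, a convenience sketch is below; the function name and the hard VRAM cutoffs are my own choices, not part of the workflow:

```python
# Encodes the quick-reference table above as a simple lookup.
def recommend_quants(vram_gb: float) -> dict:
    if vram_gb >= 24:
        return {"unet": "Q8_0", "text_encoder": "Q8_K", "notes": "best quality"}
    if vram_gb >= 16:
        return {"unet": "Q5_K_M (try Q6_K)", "text_encoder": "Q6_K or Q4_K_M", "notes": "balanced quality"}
    return {"unet": "Q4_K_M", "text_encoder": "Q4_K_M or Q3_K_M", "notes": "speed priority"}

print(recommend_quants(24))  # {'unet': 'Q8_0', 'text_encoder': 'Q8_K', ...}
```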
Active Configuration
VRAM (24GB GPU):
- ~18-20GB: UNet: `flux2-dev-Q8_0.gguf` (alternative: `FP8-Mixed.safetensors`)
- ~1-1.5GB: VAE: `flux2-vae.safetensors`
- ~1-2GB: Overhead
- Total: ~22-23GB
RAM (32GB CPU):
- ~25GB: Text Encoder: `Mistral-Small-3.2-24B-Instruct-2506-UD-Q8_K_XL.gguf` (alternative: `fp8.safetensors`)
🔗 HF (right-click to open in new tab)
Comfy-Org/flux2-dev
Memory Management
RAMCleanup Node (active):
- Required: On first run
- Optional: For resolutions ~1MP (e.g. 832×1216px)
- Required: From >1MP, especially ≥2MP onwards (see Performance section)
Settings:
- Clean File Cache: ✅
- Clean Processes: ✅
- Clean DLLs: ✅
- Retry attempts: 3
- Runs between VAE-Decode and Save
🔗 GitHub (right-click to open in new tab)
LAOGOU-666/Comfyui-Memory_Cleanup
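If you cannot install that node, a rough generic fallback in Python looks like the sketch below. It only releases Python garbage and PyTorch's cached CUDA blocks; it does not replicate the node's Windows file-cache/DLL cleanup:

```python
import gc
import torch

# Generic cleanup between VAE decode and save: collect Python garbage and
# return PyTorch's cached CUDA blocks to the driver. NOT equivalent to the
# RAMCleanup node's Windows-specific file-cache / DLL cleanup.
def basic_cleanup() -> None:
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()

basic_cleanup()
```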
run_nvidia_gpu.bat
.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --use-sage-attention --fast
Important:
- ❌ Remove `--lowvram`, `--disable-smart-memory` or similar flags
- ⚠️ No VRAM/Memory flags! Let ComfyUI breathe.
The details of this memory management are handled intelligently by ComfyUI → city96's GGUF nodes → the MultiGPU nodes built on top of them.
SageAttention Setup
With GGUF: The acceleration provided by SageAttention is significantly lower than with FP8 models and may sometimes not take effect at all, since GGUF formats are primarily designed for CPU-optimized inference and do not fully leverage SageAttention's GPU kernels.
❌ Disable `--use-sage-attention`. Instead, use `--fast` (standard PyTorch optimization) and rely on the internal optimizations of the GGUF nodes (backend).
Check Triton installation (Windows):
.\python_embeded\python.exe -m pip show triton
If not installed:
.\python_embeded\python.exe -m pip install triton-windows
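A quick sanity check that the embedded Python can actually import Triton (run with `.\python_embeded\python.exe`):

```python
# Confirms Triton is importable and prints the installed version.
import triton
print("Triton", triton.__version__)
```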
GPU-specific SageAttention versions:
| GPU Series | Version | Reason |
|---|---|---|
| RTX 30xx (Ampere) | 1.0.6 | Version 2.x offers no performance improvement |
| RTX 40xx (Ada) | 2.2.0 | Primarily optimized for these or newer architectures |
Installation:
# RTX 30xx
.\python_embeded\python.exe -m pip install sageattention==1.0.6
# RTX 40xx
.\python_embeded\python.exe -m pip install sageattention==2.2.0
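To verify that the installed SageAttention version matches your GPU series (again with `.\python_embeded\python.exe`):

```python
# Prints the pip-installed version of the "sageattention" package.
from importlib.metadata import version
print("sageattention", version("sageattention"))
```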
💡 Tip: New to SageAttention 2.2.0 installation?
🔗 GitHub (right-click to open in new tab)
Check out this 🔧 Installation Guide: SageAttention + Triton for ComfyUI
Performance
Test Setup:
Guidance: 4 | Steps: 20 (Production: 30-40 Steps)
Based on: ~80 runs with different resolutions
First Run: Initial loading of the required layers into VRAM/RAM plus the first inference. Subsequent inferences are significantly faster because the memory management is already initialized. For exact timings, partial-loading details etc., see the console output / screenshots.
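As a back-of-the-envelope check on the figures below: total runtime is roughly steps × s/it plus a fixed overhead for text encoding, VAE decode and saving. The 5s overhead default here is my own rough assumption:

```python
# Relates total runtime to the per-iteration timings reported below.
def estimate_total(steps: int, s_per_it: float, overhead_s: float = 5.0) -> float:
    return steps * s_per_it + overhead_s

print(estimate_total(20, 3.7))  # ~79 s, in line with the 75-80 s at 832x1216
```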
FP8 Format
First Run
- 832×1216px: ~380-400s
Subsequent Runs:
- 832×1216px: 75-80s (~3.70-3.90s/it)
- 1080×1920px: 135-150s (~6.75-7.50s/it)
- 1440×2160px: 225-240s (~11.00-11.50s/it)
GGUF Format (higher runtimes, as expected)
First Run
- 832×1216px: ~420-440s
Subsequent Runs:
- 832×1216px: 105-120s (~5.30-5.50s/it)
- 1440×2160px: 250-260s (~12.00-12.75s/it)
Model tree for GegenDenTag/comfyUI-Flux2D-t2i-workflow
Base model: black-forest-labs/FLUX.2-dev