Simplified t2i Workflow for Flux2D (also Workaround for broken MultiGPU Nodes)

Workaround on the fly:
Recent ComfyUI updates broke the "large" DisTorch2MultiGPU nodes, so this workflow has been switched to the older "small" MultiGPU nodes from the first release.

Ideally the workflow would run on the DisTorch2MultiGPUv2 nodes, which control VRAM/RAM allocation and prevent VRAM overflow and the resulting swapping. Since those nodes are currently broken by the latest ComfyUI updates, the older MultiGPUv1 nodes are used as a fallback (up to 10-15% lower inference speed than DisTorch2).

🔗 GitHub (right-click to open in new tab)
pollockjj/ComfyUI-MultiGPU


Workflow Flux2 t2i

Flux2 t2i.json


Model Loading & GPU|CPU Distribution

UnetLoaderGGUFMultiGPU and VAELoaderMultiGPU are assigned directly to cuda:0.
ClipLoaderGGUFMultiGPU is fixed to cpu (CPU offloading).

By setting fixed devices, noticeable VRAM swapping is prevented.
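
In plain PyTorch terms, this fixed assignment simply means pinning each model to one device and leaving it there. A minimal sketch with placeholder modules, assuming a CUDA device is present (this shows only the concept, not the MultiGPU node API):

```python
import torch
import torch.nn as nn

# Placeholder modules; in the workflow the real models are loaded by
# UnetLoaderGGUFMultiGPU, VAELoaderMultiGPU and ClipLoaderGGUFMultiGPU.
unet = nn.Linear(64, 64)          # stand-in for the FLUX.2 UNet
vae = nn.Linear(64, 64)           # stand-in for the VAE
text_encoder = nn.Linear(64, 64)  # stand-in for the Mistral text encoder

gpu = "cuda:0" if torch.cuda.is_available() else "cpu"

unet.to(gpu)            # UnetLoaderGGUFMultiGPU -> device: cuda:0
vae.to(gpu)             # VAELoaderMultiGPU      -> device: cuda:0
text_encoder.to("cpu")  # ClipLoaderGGUFMultiGPU -> device: cpu (offloading)
```

Because nothing migrates between devices mid-run, ComfyUI never has to shuffle weights back and forth, which is exactly what causes the swapping mentioned above.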

Test System: RTX 3090 (24GB VRAM) + 32GB RAM

  • RAM usage: ~65-77%
  • VRAM usage: ~22-23GB
  • Virtual VRAM: not used
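
To reproduce these readings on your own system during a run, a small monitoring snippet is enough (assumes the psutil package is installed; it is not part of the workflow):

```python
import psutil
import torch

# System RAM, the same numbers the Task Manager shows.
ram = psutil.virtual_memory()
print(f"RAM:  {ram.percent:.0f}% used ({ram.used / 2**30:.1f} / {ram.total / 2**30:.1f} GiB)")

# Device-wide VRAM on cuda:0 (includes other processes, like the readings above).
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info(0)
    print(f"VRAM: {(total - free) / 2**30:.1f} / {total / 2**30:.1f} GiB used")
```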

Quick Reference: FLUX.2 + Mistral-3-Small GGUF

| VRAM | FLUX.2 UNet | Mistral Text Encoder (CPU) | Notes |
|------|-------------|----------------------------|-------|
| 24GB | Q8_0 (35GB) | Q8_K (29GB) | Best quality setup |
| 16GB | Q5_K_M (24.1GB), try Q6_K (27.4GB) | Q6_K (19.3GB) or Q4_K_M (14.3GB) | Balanced quality |
| 12GB | Q4_K_M (20.1GB) | Q4_K_M (14.3GB) or Q3_K_M (11.5GB) | Speed priority |

🔗 HF (right-click to open in new tab)
city96/FLUX.2-dev-gguf
unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF
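
If you prefer to pick a setup programmatically, the table can be mirrored as a small lookup helper (purely illustrative; all tiers and file sizes are copied from the table above):

```python
# Quant recommendations per VRAM tier, mirroring the Quick Reference table.
RECOMMENDATIONS = {
    24: {"unet": "Q8_0 (35GB)",
         "text_encoder": "Q8_K (29GB)",
         "note": "Best quality setup"},
    16: {"unet": "Q5_K_M (24.1GB), try Q6_K (27.4GB)",
         "text_encoder": "Q6_K (19.3GB) or Q4_K_M (14.3GB)",
         "note": "Balanced quality"},
    12: {"unet": "Q4_K_M (20.1GB)",
         "text_encoder": "Q4_K_M (14.3GB) or Q3_K_M (11.5GB)",
         "note": "Speed priority"},
}

def recommend(vram_gb: int) -> dict:
    """Return the entry for the largest tier that the given VRAM still covers."""
    for tier in sorted(RECOMMENDATIONS, reverse=True):
        if vram_gb >= tier:
            return RECOMMENDATIONS[tier]
    raise ValueError("Cards below 12GB VRAM are not covered by this table.")

print(recommend(16))  # -> the 16GB row
```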

Active Configuration

VRAM (24GB GPU):

  • ~18-20GB: UNet: flux2-dev-Q8_0.gguf (Alternative: FP8-Mixed.safetensors)
  • ~1-1.5GB: VAE: flux2-vae.safetensors
  • ~1-2GB: Overhead
  • = ~22-23GB total

RAM (32GB CPU):

  • ~25GB: Text Encoder: Mistral-Small-3.2-24B-Instruct-2506-UD-Q8_K_XL.gguf (Alternative: fp8.safetensors)

🔗 HF (right-click to open in new tab)
Comfy-Org/flux2-dev
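
The totals above are plain sums. A quick back-of-the-envelope check with midpoints of the listed ranges (illustrative arithmetic only; the UNet figure is the resident part reported here, not the full Q8_0 file size):

```python
# Midpoints of the ranges listed under "Active Configuration".
vram_gb = {
    "UNet (flux2-dev-Q8_0.gguf, resident layers)": 19.0,   # ~18-20GB
    "VAE (flux2-vae.safetensors)":                 1.25,   # ~1-1.5GB
    "Overhead":                                    1.5,    # ~1-2GB
}
ram_gb = {
    "Text encoder (Mistral Q8_K_XL GGUF)":         25.0,   # ~25GB
}

print(f"VRAM: ~{sum(vram_gb.values()):.1f} GB of 24 GB")  # ~21.8 GB, matches the observed 22-23GB
print(f"RAM:  ~{sum(ram_gb.values()):.1f} GB of 32 GB")   # leaves ~7 GB for ComfyUI and the OS
```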


Memory Management

RAMCleanup Node (active):

  • Required: On first run
  • Optional: for resolutions of ~1MP (e.g. 832×1216px)
  • Required: above 1MP, and definitely from ≥2MP (see Performance section)

Settings:

  • Clean File Cache: ✓
  • Clean Processes: ✓
  • Clean dlls
  • Retry attempts: 3
  • Runs between VAE-Decode and Save

🔗 GitHub (right-click to open in new tab)
LAOGOU-666/Comfyui-Memory_Cleanup
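
The node above does this inside the workflow. If you ever need a comparable manual cleanup from your own Python script, the generic equivalent is garbage collection plus emptying the CUDA caching allocator. This is only a sketch of the idea, not the RAMCleanup node's implementation, and it does not cover the Windows-side file-cache/DLL options listed above:

```python
import gc
import torch

def cleanup() -> None:
    """Drop unreachable Python objects and return cached CUDA blocks to the driver."""
    gc.collect()                      # frees Python-side RAM
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # releases cached, currently unused VRAM
        torch.cuda.ipc_collect()      # cleans up stale CUDA IPC handles

cleanup()  # e.g. between VAE decode and saving, as the node does
```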


run_nvidia_gpu.bat

.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --use-sage-attention --fast

Important:

  • ❌ Remove --lowvram, --disable-smart-memory, or similar flags
  • ⚠️ No VRAM/memory flags! Let ComfyUI breathe.
    The details of memory management are handled intelligently by
    ComfyUI → city96's GGUF nodes → the MultiGPU nodes built on top of them.

SageAttention Setup

With GGUF: the acceleration from SageAttention is significantly lower than with FP8 models and sometimes does not take effect at all, since GGUF formats are primarily designed for CPU-optimized inference and do not fully leverage SageAttention's GPU kernels.
→ Disable --use-sage-attention. Instead, use --fast (standard PyTorch optimization) and rely on the internal optimizations of the GGUF nodes (backend).

Check Triton installation (Windows):

.\python_embeded\python.exe -m pip show triton

If not installed:

.\python_embeded\python.exe -m pip install triton-windows

GPU-specific SageAttention versions:

| GPU Series | Version | Reason |
|------------|---------|--------|
| RTX 30xx (Ampere) | 1.0.6 | Version 2.x offers no performance improvement |
| RTX 40xx (Ada) | 2.2.0 | Primarily optimized for these or newer architectures |

Installation:

# RTX 30xx
.\python_embeded\python.exe -m pip install sageattention==1.0.6

# RTX 40xx
.\python_embeded\python.exe -m pip install sageattention==2.2.0
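
To confirm that both packages actually landed in the embedded environment, a short check script helps (save it under any name, e.g. check_attention.py, and run it with the same python_embeded interpreter used above):

```python
from importlib import metadata

# Triton: the import works for both the "triton" and "triton-windows" distributions.
try:
    import triton
    print("triton:", getattr(triton, "__version__", "installed"))
except ImportError:
    print("triton: NOT installed")

# SageAttention: the pip distribution is named "sageattention".
try:
    print("sageattention:", metadata.version("sageattention"))
except metadata.PackageNotFoundError:
    print("sageattention: NOT installed")
```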

💡 Tip: New to SageAttention 2.2.0 installation?
🔗 GitHub (right-click to open in new tab)
Check out this 🔧 Installation Guide: SageAttention + Triton for ComfyUI


Performance

Test Setup:
Guidance: 4 | Steps: 20 (Production: 30-40 Steps)
Based on: ~80 runs with different resolutions

First Run: initial loading of the required layers into VRAM|RAM plus the first inference. Subsequent inferences are significantly faster because memory management is already initialized. For exact timings, partial loading details, etc., see the console output / screenshots.


FP8 Format

First Run

  • 832×1216px: ~380-400s

Subsequent Runs:

  • 832×1216px: 75-80s (~3.70-3.90s/it)
  • 1080×1920px: 135-150s (~6.75-7.50s/it)
  • 1440×2160px: 225-240s (~11.00-11.50s/it)

GGUF Format (higher runtimes expected)

First Run

  • 832×1216px: ~420-440s

Subsequent Runs:

  • 832×1216px: 105-120s (~5.30-5.50s/it)
  • 1440×2160px: 250-260s (~12.00-12.75s/it)
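
As a rule of thumb, total time ≈ steps × s/it plus a fairly constant overhead for text encoding and VAE decode. Using midpoints of the ranges above, this also gives a rough estimate for production runs at 30-40 steps (a back-of-the-envelope sketch, not a measurement):

```python
# s/it midpoints taken from the ranges listed above.
timings = {
    "FP8  832x1216":   3.80,
    "FP8  1440x2160": 11.25,
    "GGUF 832x1216":   5.40,
    "GGUF 1440x2160": 12.40,
}

for label, s_it in timings.items():
    print(f"{label}: ~{20 * s_it:.0f}s @ 20 steps, "
          f"~{30 * s_it:.0f}-{40 * s_it:.0f}s @ 30-40 steps (plus decode overhead)")
```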