16GB Is Not Enough: The FLUX OOM Journey and Why VRAM Rules Everything
FLUX.1-schnell needs ~12GB just for the transformer. My RTX 5070 Ti has 16GB. Here's the three-attempt journey from crash to working generation.
The VRAM Reality Check
My RTX 5070 Ti has 16GB of VRAM. On paper, that's a lot. In practice, it's tight — especially when you're running other things. Ollama with a 9B model takes ~8GB. The display compositor grabs ~500MB. Chrome headless for the browser tools? Another 600MB. You turn around and you've got 6-7GB free in a "16GB" GPU.
So when I set out to run FLUX.1-schnell locally, the numbers should have been a warning sign. The transformer alone is ~12GB in bfloat16. The VAE, text encoders, and other pipeline components add another ~2.5GB. That's 14.5GB before inference even starts. Inference needs working memory — activations, attention caches, intermediate tensors. There's no room.
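The budget above can be tallied in a few lines. This is a minimal sketch that only uses the rough figures quoted in this post; the dictionary keys and the `free_vram_gb` helper are names made up for illustration, and real numbers will vary with driver, desktop, and workload.

```python
# Rough VRAM budget using the approximate figures quoted in this post (GB).
TOTAL_VRAM_GB = 16.0

# Background consumers on this machine; your numbers will differ.
background_gb = {
    "ollama_9b_model": 8.0,
    "display_compositor": 0.5,
    "headless_chrome": 0.6,
}

# FLUX.1-schnell pipeline footprint in bfloat16, as estimated above.
flux_gb = {
    "transformer": 12.0,
    "vae_text_encoders_other": 2.5,
}


def free_vram_gb(total_gb, consumers):
    """Return what is left after subtracting every listed consumer."""
    return total_gb - sum(consumers.values())


free_gb = free_vram_gb(TOTAL_VRAM_GB, background_gb)
flux_weights_gb = sum(flux_gb.values())

print(f"free before FLUX: {free_gb:.1f} GB")
print(f"FLUX weights:     {flux_weights_gb:.1f} GB")
print(f"shortfall:        {flux_weights_gb - free_gb:.1f} GB")
```

With these figures it reports about 6.9GB free against 14.5GB of weights, before any working memory is counted.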
I tried anyway. Three times.
Attempt 1: pipe.to("cuda") — The Obvious Approach
Every tutorial, every blog post, every HuggingFace example shows the same thing:
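import torch
from diffusers import FluxPipeline

# Load every pipeline component in bfloat16, then move them all to the GPU.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")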
Result:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB.
GPU 0 has a total capacity of 15.46 GiB of which 22.25 MiB is free.
This process: 14.73 GiB memory in use.
14.73 GiB in use, 22 MiB free, and it couldn't allocate 18 MiB. The model loaded, barely, but there was zero headroom for inference. A model that "fits in VRAM" isn't a model that runs in VRAM.
This is the fundamental misunderstanding: model size ≠ VRAM requirement. You need the model weights plus working memory. For FLUX, that gap is at least 2-3GB.
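To see how thin the margin was, the numbers can be pulled straight out of the error text. The sketch below is a hypothetical helper (`parse_oom` is a name invented here, not a PyTorch API), and its regular expressions assume the message wording shown above, which may differ between PyTorch versions.

```python
import re

# Unit multipliers, expressed in MiB.
_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def _mib(value, unit):
    """Convert a (number, unit) pair from the message into MiB."""
    return float(value) * _TO_MIB[unit]


def parse_oom(message):
    """Extract requested, total and free memory (in MiB) from a CUDA OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"total capacity of ([\d.]+) (MiB|GiB)",
        "free": r"of which ([\d.]+) (MiB|GiB) is free",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            result[key] = _mib(*match.groups())
    return result


# The first two lines of the error from attempt 1, joined into one string.
msg = (
    "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB. "
    "GPU 0 has a total capacity of 15.46 GiB of which 22.25 MiB is free."
)
info = parse_oom(msg)
print(info)
print(f"free share of the card: {info['free'] / info['total']:.2%}")
```

Note that the free figure is slightly larger than the request. An allocation can still fail in that situation, for example when the remaining memory is fragmented.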
Attempt 2: enable_model_cpu_offload() — The "Smart" Approach
Diffusers has a built-in feature for this. Model CPU offload keeps pipeline components in RAM and moves them to GPU one at a time during inference:
pipe.enable_model_cpu_offload()
Result:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB.
GPU 0 has a total capacity of 15.46 GiB of which 34.25 MiB is free.
This process: 14.72 GiB memory in use.
Same error. Slightly more free memory (34 MiB vs 22 MiB), but still dead. The problem? Model CPU offload moves entire components to the GPU. The FLUX transformer is a single component of ~12GB. Moving it to the GPU all at once is the same as pipe.to("cuda") for that component. The GPU fills up, inference can't start, OOM.
Model CPU offload works great when your components are small (SD 1.5 has ~4GB components). It doesn't work when one component is 75% of your VRAM.
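Why whole-component offload barely helps here can be shown with a toy calculation. This is only a sketch of the bookkeeping, not how Diffusers implements offloading; the sizes are the rough figures from this post, and lumping the VAE and text encoders into one entry is a simplification made for the example.

```python
# Approximate component sizes in GB (figures from this post).
components_gb = {
    "transformer": 12.0,
    "vae_text_encoders_other": 2.5,  # lumped together for simplicity
}


def peak_weights_all_on_gpu(components):
    """pipe.to("cuda"): every component is resident at once."""
    return sum(components.values())


def peak_weights_component_offload(components):
    """Whole-component offload: at most one component is resident at a time."""
    return max(components.values())


all_on_gpu = peak_weights_all_on_gpu(components_gb)
one_at_a_time = peak_weights_component_offload(components_gb)

print(f"everything on GPU:       {all_on_gpu:.1f} GB of weights")
print(f"one component at a time: {one_at_a_time:.1f} GB of weights")
print(f"largest component as share of 16 GB: {one_at_a_time / 16:.0%}")
```

The peak only drops by the size of the smaller components, because the largest one still has to sit on the GPU in one piece.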
Attempt 3: 8-bit quantisation + sequential offload — the one that worked
Two changes, both necessary:
1. 8-bit quantisation shrinks the transformer from ~12GB to ~6GB:
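import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

# hf_token: your Hugging Face access token, defined elsewhere.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    token=hf_token,
    # BitsAndBytesConfig is the documented way to request 8-bit loading;
    # a plain dict may not be accepted by every Diffusers version.
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)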
This requires bitsandbytes and accelerate: uv pip install bitsandbytes accelerate. The 8-bit quantisation happens at load time, not as a separate conversion step, so there is no extra setup.
2. Sequential CPU offload moves layers one at a time instead of entire components:
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
    token=hf_token,
)
pipe.enable_sequential_cpu_offload()
Now each layer of the transformer moves to GPU, does its forward pass, and moves back to RAM. Peak VRAM usage drops to ~6-8GB instead of 14.7GB.
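The move-in, run, move-out pattern can be illustrated with a toy model. This is a sketch of the idea only: `ToyLayer` and `run_sequential_offload` are hypothetical names, the "devices" are just strings, and the real mechanism in Diffusers and accelerate works through hooks on actual modules rather than an explicit loop like this.

```python
class ToyLayer:
    """Stand-in for one transformer block; 'device' is just a label here."""

    def __init__(self, name, scale):
        self.name = name
        self.scale = scale
        self.device = "cpu"

    def to(self, device):
        self.device = device
        return self

    def forward(self, x):
        assert self.device == "gpu", "layer must be on the GPU to run"
        return x * self.scale


def run_sequential_offload(layers, x):
    """Run layers in order, keeping only the active one on the 'GPU'."""
    peak_resident = 0
    for layer in layers:
        layer.to("gpu")  # bring in just this layer
        resident = sum(1 for other in layers if other.device == "gpu")
        peak_resident = max(peak_resident, resident)
        x = layer.forward(x)
        layer.to("cpu")  # send it back before touching the next one
    return x, peak_resident


layers = [ToyLayer(f"block_{i}", scale=2) for i in range(4)]
output, peak_layers_on_gpu = run_sequential_offload(layers, 1)

print(f"output: {output}, peak layers on GPU at once: {peak_layers_on_gpu}")
```

The price of this pattern is the constant traffic between RAM and VRAM, which is where the extra seconds in the timings come from.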
One subtle footgun: the random generator must use device="cpu", not "cuda":
# WRONG with sequential offload:
generator=torch.Generator(device="cuda").manual_seed(42)
# CORRECT:
generator=torch.Generator(device="cpu").manual_seed(42)
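For context, here is one way the generator might be passed into the actual call. This is a hedged sketch: `generate` is a hypothetical wrapper, and `guidance_scale=0.0` is taken from the commonly published FLUX.1-schnell example rather than from this post, so treat it as an assumption.

```python
def generate(pipe, prompt, generator, steps=4):
    """Call a Diffusers-style pipeline and return the first image.

    `pipe` is any callable that accepts these keyword arguments and
    returns an object with an `.images` list.
    """
    result = pipe(
        prompt,
        num_inference_steps=steps,  # schnell is tuned for 4 steps
        guidance_scale=0.0,  # assumption: value used in published schnell examples
        generator=generator,  # build this on the CPU, as noted above
    )
    return result.images[0]
```

With the pipeline from attempt 3 this would be called as generate(pipe, "your prompt", torch.Generator(device="cpu").manual_seed(42)).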
The Performance Trade-off
| Method | Time (4 steps, 1024×1024) | Peak VRAM | Works on 16GB? |
|---|---|---|---|
| pipe.to("cuda") | ~3-5 sec (expected with enough VRAM) | ~15GB | No, OOM |
| enable_model_cpu_offload() | ~5-8 sec (expected with enough VRAM) | ~15GB | No, OOM |
| 8-bit + sequential offload | ~13 sec | ~6-8GB | Yes |
13 seconds vs 3-5 seconds. That's the cost of making it work at all. On a 24GB card (4090, A5000), you'd use pipe.to("cuda") and get the fast path. On 16GB, you take the slower path or you don't generate images.
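The trade-off can be written down as a small decision helper. This is a sketch under stated assumptions: the timings are the rough figures from the table, `pick_strategy` is a hypothetical function, and the 24GB cut-off simply mirrors the cards mentioned above. Nothing in between 16GB and 24GB was tested.

```python
# Rough timings from the table above (seconds for 4 steps at 1024x1024).
FAST_PATH_SECONDS = (3.0, 5.0)
SLOW_PATH_SECONDS = 13.0

# Assumption: cards with at least this much VRAM can take the fast path.
FAST_PATH_MIN_VRAM_GB = 24


def pick_strategy(total_vram_gb):
    """Return the loading strategy this post would suggest for a card."""
    if total_vram_gb >= FAST_PATH_MIN_VRAM_GB:
        return 'pipe.to("cuda")'
    return "8-bit + sequential offload"


slowdown_best = SLOW_PATH_SECONDS / FAST_PATH_SECONDS[1]
slowdown_worst = SLOW_PATH_SECONDS / FAST_PATH_SECONDS[0]

print(pick_strategy(16))
print(pick_strategy(24))
print(f"slowdown: {slowdown_best:.1f}x to {slowdown_worst:.1f}x")
```

On these numbers the slow path costs roughly 2.6x to 4.3x in generation time.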
The quality impact of 8-bit quantisation is negligible for FLUX.1-schnell. It's already a distilled model optimised for 4-step generation. You're not losing meaningful precision: the bottleneck is step count, not weight precision.
The Broader Lesson
VRAM is the scarcest resource in local AI. Not CPU, not RAM, not storage. A 64GB RAM machine with a 16GB GPU is still constrained by that 16GB.
The math is unforgiving:
- 16GB GPU - 1GB OS/display = 15GB available
- 15GB - 8GB Ollama model = 7GB free
- 7GB is less than half of the ~14.5GB that FLUX.1-schnell's weights need in bfloat16
You must manage VRAM consciously. Unload text models before loading image models. Use quantisation. Accept slower generation for the ability to generate at all.
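For the "unload text models first" step, Ollama can be asked to drop a model by sending a request with keep_alive set to 0. The sketch below assumes Ollama is listening on its default local port; the helper names and the model name are placeholders, so substitute whatever model you actually have loaded.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if yours is configured differently.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_unload_request(model_name, url=OLLAMA_URL):
    """Build a request asking Ollama to unload a model (keep_alive=0)."""
    payload = json.dumps({"model": model_name, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_model(model_name, url=OLLAMA_URL, timeout=10):
    """Send the unload request; raises URLError if Ollama is not reachable."""
    request = build_unload_request(model_name, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Building the request has no side effects; "my-9b-model" is a placeholder.
request = build_unload_request("my-9b-model")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Calling unload_model("my-9b-model") before loading the image pipeline should free the VRAM the text model was holding, at the cost of a reload the next time the text model is needed.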
And always remember: "loads in VRAM" and "runs in VRAM" are different things. Plan for working memory, not just model weights. The gap between them is where your OOM errors live.
Found this useful? Follow Raf_VRS on X for more from the VRS Computing trenches and support the work: ko-fi.com/rafvrs.