Reducing WAN VRAM requirements by pipelining the encoding -> high noise -> low noise stages #1129
rendang-github started this conversation in Ideas
Replies: 2 comments · 1 reply
- Did you use
1 reply
- Feel free to code review #1059
-
I have been perplexed by my inability to run WAN 2.2 renders on my RTX 3090 (24 GB VRAM), even when using sd-cli with the demo prompts from docs/wan.md and the matching Q8 GGUFs. So I dug into the codebase with vim and gdb to figure out what was going on, and it appears that the T5 encoder, the high noise model, and the low noise model are all loaded into VRAM concurrently.
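For a rough sense of why that can't fit, here is my own back-of-envelope arithmetic, assuming the Wan 2.2 A14B pairing of two ~14 B-parameter experts plus the umt5-xxl encoder, all at Q8_0 (roughly 8.5 bits per weight); the exact sizes will vary by file:

```
umt5-xxl encoder  : ~5.7 B params x ~8.5 bit/weight ≈  6 GB
high noise expert : ~14 B  params x ~8.5 bit/weight ≈ 15 GB
low noise expert  : ~14 B  params x ~8.5 bit/weight ≈ 15 GB
loaded together                                     ≈ 36 GB, well past 24 GB
```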
My understanding of how WAN 2.2 works is that the text encoding, the high noise pass, and the low noise pass take place in distinct stages, with no overlap in model usage between them. Is there a reason why all three are pre-loaded into VRAM, instead of loading and unloading each one sequentially in lower-VRAM environments? I get that this would be slower than holding everything in VRAM from startup, but my budget for H200s appears to have run dry (I checked for spare change under the sofa too), so I'm happy to trade slightly longer run times for swapping models in and out as they are needed.
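To make the idea concrete, here is a minimal C++ sketch of the staging I have in mind. Nothing in it is real stable-diffusion.cpp API; the Model type and the encode/denoise helpers are hypothetical stand-ins, and RAII scopes stand in for whatever explicit load/free calls the backend actually uses, so that at most one large model occupies VRAM at a time:

```cpp
// Hypothetical sketch only: none of these types or functions exist in the
// real codebase. RAII scopes model the load -> use -> free ordering so
// that at most one large model sits in VRAM at a time.
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

struct Model {                      // stand-in for one loaded GGUF model
    std::string name;
    explicit Model(std::string n) : name(std::move(n)) {
        std::printf("load   %s -> VRAM\n", name.c_str());
    }
    ~Model() { std::printf("unload %s\n", name.c_str()); }
};

using Cond   = std::vector<float>;  // text-conditioning embedding
using Latent = std::vector<float>;  // latent video tensor

// Stage helpers; real ones would build and run ggml graphs.
Cond   encode(Model&, const std::string&)                { return Cond(4096, 0.0f); }
Latent denoise(Model&, const Cond&, Latent x, int steps) { (void)steps; return x; }

int main() {
    const std::string prompt = "a cat surfing a wave at sunset";

    Cond cond;
    {   // stage 1: text encoding; the T5 encoder is freed at scope exit
        Model t5("umt5-xxl");
        cond = encode(t5, prompt);
    }

    Latent x(16 * 64 * 64, 0.0f);   // initial noise latent
    {   // stage 2: high noise expert runs the early sampler steps
        Model high("wan2.2 high-noise");
        x = denoise(high, cond, std::move(x), /*steps=*/10);
    }
    {   // stage 3: low noise expert finishes the remaining steps
        Model low("wan2.2 low-noise");
        x = denoise(low, cond, std::move(x), /*steps=*/10);
    }

    // VAE decode of x would follow here.
    return 0;
}
```

The per-render cost would be a couple of extra model loads from disk, but for my use case that seems like a fair trade against not being able to render at all.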