Description
--vae-tiling causes GGML_ASSERT crash during VAE encode for Wan 2.2 I2V A14B
Environment
- sd-cli version: master-504-636d3cb (Feb 10, 2026)
- GPU: AMD Radeon RX 7900 XTX (24GB VRAM)
- Backend: ROCm (gfx1100)
- OS: Windows
- Models: Wan2.2-I2V-A14B-HighNoise-Q5_K_M.gguf / LowNoise-Q5_K_M.gguf (QuantStack)
- VAE: wan_2.1_vae.safetensors
Bug
Using --vae-tiling with Wan 2.2 I2V A14B causes an assertion failure immediately after the VAE encode step completes. The tiled encode finishes all 8 tiles successfully, but the reassembled latent has an unexpected channel dimension:
GGML_ASSERT(latent->ne[channel_dim] == 16 || latent->ne[channel_dim] == 48 || latent->ne[channel_dim] == 128) failed
Removing --vae-tiling resolves the crash and generation completes normally.
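For reference, the failing check boils down to a membership test on the channel dimension of the reassembled latent. The sketch below mimics it in Python; the concrete `ne[]` values and the `channel_dim` index are illustrative assumptions, not taken from a debugger (ggml stores dimensions innermost-first in `ne[]`, so the channel index depends on the latent layout):

```python
# Python mimic of the GGML_ASSERT from the log; the ne[] values below are
# illustrative assumptions, not observed values.
ALLOWED_VAE_CHANNELS = (16, 48, 128)

def latent_channels_ok(ne, channel_dim):
    """True iff the latent's channel count is one the pipeline expects."""
    return ne[channel_dim] in ALLOWED_VAE_CHANNELS

assert latent_channels_ok([60, 104, 16, 9], channel_dim=2)       # expected shape passes
assert not latent_channels_ok([60, 104, 144, 9], channel_dim=2)  # unexpected count -> GGML_ASSERT fires
```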
Command (crashes)
sd-cli.exe -M vid_gen ^
--diffusion-model Wan2.2-I2V-A14B-LowNoise-Q5_K_M.gguf ^
--high-noise-diffusion-model Wan2.2-I2V-A14B-HighNoise-Q5_K_M.gguf ^
--vae wan_2.1_vae.safetensors ^
--t5xxl umt5-xxl-encoder-Q8_0.gguf ^
-i input.png ^
-p "a lovely cat" ^
-W 480 -H 832 --video-frames 33 ^
--steps 10 --high-noise-steps 8 ^
--sampling-method euler --high-noise-sampling-method euler ^
--cfg-scale 3.5 --high-noise-cfg-scale 3.5 ^
--flow-shift 3.0 --diffusion-fa --offload-to-cpu ^
--vae-tiling ^
-v
Command (works — identical but without --vae-tiling)
The same command without --vae-tiling completes successfully; the full (untiled) VAE decode at 33 frames uses a ~20 GB VRAM compute buffer.
Relevant log output (crash)
[INFO ] stable-diffusion.cpp:3894 - IMG2VID
[DEBUG] stable-diffusion.cpp:2546 - VAE Tile size: 41x41
[DEBUG] src\ggml_extend.hpp:838 - num tiles : 2, 4
[DEBUG] src\ggml_extend.hpp:839 - optimal overlap : 0.536585, 0.487805 (targeting 0.500000)
[DEBUG] src\ggml_extend.hpp:870 - tile work buffer size: 40.94 MB
[DEBUG] src\ggml_extend.hpp:883 - processing 8 tiles
[INFO ] src\ggml_extend.hpp:1865 - wan_vae offload params (242.10 MB, 194 tensors) to runtime backend (ROCm0), taking 1.13s
[DEBUG] src\ggml_extend.hpp:1765 - wan_vae compute buffer size: 4177.32 MB(VRAM)
|======> | 1/8 - 4.65s/it
[... tiles 2-7 ...]
|==================================================| 8/8 - 3.00s/it
[DEBUG] stable-diffusion.cpp:2570 - computing vae encode graph completed, taking 25.67s
D:/a/stable-diffusion.cpp/stable-diffusion.cpp/src/stable-diffusion.cpp:2340: GGML_ASSERT(latent->ne[channel_dim] == 16 || latent->ne[channel_dim] == 48 || latent->ne[channel_dim] == 128) failed
All 8 tiles encode without error, but the reassembled output latent has a channel count that doesn't match any expected value (16, 48, or 128).
Why this matters
Without tiling, the VAE decode buffer scales linearly with frame count:
- 33 frames → ~20 GB (fits 24GB VRAM)
- 81 frames → ~49 GB (doesn't fit)
This means users with 24GB GPUs cannot generate videos longer than ~2 seconds at 480x832 using the full-quality VAE. The only workaround is using the tiny VAE decoder (--tae taew2_1.safetensors), which works but has reduced decode quality.
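The scaling above can be checked with a rough linear extrapolation from the one measured data point (33 frames → ~20 GB); the per-frame constant is a measurement from this run, not a specification:

```python
# Linear extrapolation of the untiled VAE decode compute buffer at 480x832,
# anchored on the observed ~20 GB at 33 frames.
def decode_buffer_gb(frames, gb_per_frame=20.0 / 33):
    """Estimate the full (untiled) VAE decode buffer size in GB."""
    return frames * gb_per_frame

print(round(decode_buffer_gb(33)))  # ~20 GB: fits in 24 GB VRAM
print(round(decode_buffer_gb(81)))  # ~49 GB: does not fit
```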
T2V is not affected since it has no VAE encode step.
Analysis
The Wan VAE is a 3D spatiotemporal autoencoder (as opposed to the 2D VAEs used by SD/SDXL/FLUX). The tiling logic may be assembling tiles using 2D assumptions that produce incorrect channel dimensions when applied to the 3D latent structure. The I2V pipeline triggers this because it VAE-encodes the input image, while T2V starts from noise and skips encoding.
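To make the hypothesis concrete, here is a hypothetical illustration (not the actual ggml_extend.hpp code) of how 2D reassembly assumptions over a 3D latent could yield a channel count that matches none of the expected values. The shape (C=16, T=9, H, W) assumes Wan's 4x temporal compression of 33 frames; all dimensions are assumptions for the sketch:

```python
import numpy as np

# Hypothetical failure mode: the Wan latent is spatiotemporal, e.g.
# (C, T, H, W). A reassembly path written for 2D latents that flattens
# everything in front of (H, W) folds the temporal axis into channels,
# giving C*T channels instead of C.
C, T, H, W = 16, 9, 104, 60   # assumed: 33 frames -> 1 + 32/4 = 9 latent frames
latent_3d = np.zeros((C, T, H, W))

reassembled = latent_3d.reshape(-1, H, W)  # 2D-style flatten
channels_seen = reassembled.shape[0]       # 16 * 9 = 144

# 144 matches none of the expected channel counts -> the observed assert failure
assert channels_seen not in (16, 48, 128)
```

If this is what happens, the fix would be for the tiling path to preserve (or tile over) the temporal axis rather than collapsing it.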