
--vae-tiling causes GGML_ASSERT crash during VAE encode for Wan 2.2 I2V A14B #1284

@BetaDoctor


Environment

  • sd-cli version: master-504-636d3cb (Feb 10, 2026)
  • GPU: AMD Radeon RX 7900 XTX (24GB VRAM)
  • Backend: ROCm (gfx1100)
  • OS: Windows
  • Models: Wan2.2-I2V-A14B-HighNoise-Q5_K_M.gguf / LowNoise-Q5_K_M.gguf (QuantStack)
  • VAE: wan_2.1_vae.safetensors

Bug

Using --vae-tiling with Wan 2.2 I2V A14B causes an assertion failure immediately after the VAE encode step completes. The tiled encode finishes all 8 tiles successfully, but the reassembled latent has an unexpected channel dimension:

GGML_ASSERT(latent->ne[channel_dim] == 16 || latent->ne[channel_dim] == 48 || latent->ne[channel_dim] == 128) failed

Removing --vae-tiling resolves the crash and generation completes normally.

Command (crashes)

sd-cli.exe -M vid_gen ^
  --diffusion-model Wan2.2-I2V-A14B-LowNoise-Q5_K_M.gguf ^
  --high-noise-diffusion-model Wan2.2-I2V-A14B-HighNoise-Q5_K_M.gguf ^
  --vae wan_2.1_vae.safetensors ^
  --t5xxl umt5-xxl-encoder-Q8_0.gguf ^
  -i input.png ^
  -p "a lovely cat" ^
  -W 480 -H 832 --video-frames 33 ^
  --steps 10 --high-noise-steps 8 ^
  --sampling-method euler --high-noise-sampling-method euler ^
  --cfg-scale 3.5 --high-noise-cfg-scale 3.5 ^
  --flow-shift 3.0 --diffusion-fa --offload-to-cpu ^
  --vae-tiling ^
  -v

Command (works: identical, but without --vae-tiling)

The same command without --vae-tiling completes successfully. Without tiling, the full VAE decode at 33 frames needs a compute buffer of about 20 GB of VRAM.

Relevant log output (crash)

[INFO ] stable-diffusion.cpp:3894 - IMG2VID
[DEBUG] stable-diffusion.cpp:2546 - VAE Tile size: 41x41
[DEBUG] src\ggml_extend.hpp:838  - num tiles : 2, 4
[DEBUG] src\ggml_extend.hpp:839  - optimal overlap : 0.536585, 0.487805 (targeting 0.500000)
[DEBUG] src\ggml_extend.hpp:870  - tile work buffer size: 40.94 MB
[DEBUG] src\ggml_extend.hpp:883  - processing 8 tiles
[INFO ] src\ggml_extend.hpp:1865 - wan_vae offload params (242.10 MB, 194 tensors) to runtime backend (ROCm0), taking 1.13s
[DEBUG] src\ggml_extend.hpp:1765 - wan_vae compute buffer size: 4177.32 MB(VRAM)
  |======>                                           | 1/8 - 4.65s/it
  [... tiles 2-7 ...]
  |==================================================| 8/8 - 3.00s/it
[DEBUG] stable-diffusion.cpp:2570 - computing vae encode graph completed, taking 25.67s
D:/a/stable-diffusion.cpp/stable-diffusion.cpp/src/stable-diffusion.cpp:2340: GGML_ASSERT(latent->ne[channel_dim] == 16 || latent->ne[channel_dim] == 48 || latent->ne[channel_dim] == 128) failed

All 8 tiles encode without error, but the reassembled output latent has a channel count that doesn't match any expected value (16, 48, or 128).
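
The overlap values in the log can be reproduced by hand, which is consistent with the tile layout itself being fine and the problem appearing only once the tiles are merged. The sketch below is a back-of-the-envelope check, not the project's code. It assumes the tiler works on a 60x104 grid (480x832 divided by the Wan VAE's 8x spatial downsampling), takes the tile size (41) and the tile counts (2 and 4) straight from the log, and uses a hypothetical helper named overlap_factor.

```python
def overlap_factor(dim: int, tile: int, num_tiles: int) -> float:
    """Fraction of a tile that must overlap its neighbour so that
    num_tiles tiles of size `tile` span exactly `dim` cells."""
    if num_tiles < 2:
        return 0.0  # a single tile has nothing to overlap with
    return (num_tiles * tile - dim) / ((num_tiles - 1) * tile)


TILE = 41  # "VAE Tile size: 41x41" from the log

# Assumption: grid size = pixel size / 8 (Wan VAE spatial downsampling).
GRID_W, GRID_H = 480 // 8, 832 // 8  # 60, 104

overlap_w = overlap_factor(GRID_W, TILE, 2)  # log: "num tiles : 2, 4"
overlap_h = overlap_factor(GRID_H, TILE, 4)

print(f"{overlap_w:.6f}, {overlap_h:.6f}")  # prints "0.536585, 0.487805"
```

Both numbers match the "optimal overlap" line in the log, so the 2x4 tile grid looks consistent with a 60x104 spatial layout.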

Why this matters

Without tiling, the VAE decode buffer scales linearly with frame count:

  • 33 frames → ~20 GB (fits 24GB VRAM)
  • 81 frames → ~49 GB (doesn't fit)

This means users with 24GB GPUs cannot generate videos longer than ~2 seconds at 480x832 using the full-quality VAE. The only workaround is using the tiny VAE decoder (--tae taew2_1.safetensors), which works but has reduced decode quality.
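
As a rough check of the numbers above, the sketch below fits a proportional model through the 33-frame measurement and uses it to estimate other frame counts. It is only an approximation under that assumption: the names GB_PER_FRAME, estimate_decode_gb and max_frames are made up for illustration, and the model ignores everything else that occupies VRAM.

```python
# One measured data point from this report: 33 frames -> ~20 GB.
MEASURED_FRAMES = 33
MEASURED_GB = 20.0

# Assumption: the decode buffer is proportional to the frame count.
GB_PER_FRAME = MEASURED_GB / MEASURED_FRAMES


def estimate_decode_gb(frames: int) -> float:
    """Estimated untiled VAE decode buffer in GB for `frames` frames."""
    return frames * GB_PER_FRAME


def max_frames(vram_gb: float) -> int:
    """Largest frame count whose estimated buffer fits in `vram_gb`."""
    return int(vram_gb / GB_PER_FRAME)


print(f"81 frames: ~{estimate_decode_gb(81):.0f} GB")
print(f"24 GB budget: at most {max_frames(24.0)} frames")
```

The estimate for 81 frames lands at about 49 GB, in line with the figure above, and a 24 GB budget tops out just under 40 frames even before counting any other allocations.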

T2V is not affected by this crash, since it has no VAE encode step.

Analysis

The Wan VAE is a 3D spatiotemporal autoencoder (as opposed to the 2D VAEs used by SD/SDXL/FLUX). The tiling logic may be assembling tiles using 2D assumptions that produce incorrect channel dimensions when applied to the 3D latent structure. The I2V pipeline triggers this because it VAE-encodes the input image, while T2V starts from noise and skips encoding.
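
To make the hypothesis concrete, here is a purely illustrative sketch of how a 2D layout assumption could trip the channel check. Everything about the layout is an assumption and not taken from the stable-diffusion.cpp source: it supposes a video latent is stored as (W, H, T, C) while an image latent is stored as (W, H, C, N), that 33 frames compress to 9 latent frames (4x temporal compression), and that the check reads one axis as the channel count. Only the accepted values 16, 48 and 128 come from the assertion message.

```python
# Accepted channel counts, taken from the GGML_ASSERT message.
EXPECTED_CHANNELS = (16, 48, 128)


def channel_check(shape: tuple, channel_dim: int) -> bool:
    """Mimics the assertion: is the size at channel_dim an accepted value?"""
    return shape[channel_dim] in EXPECTED_CHANNELS


# Hypothetical video latent for 480x832, 33 frames, 16 channels,
# laid out as (W, H, T, C). This ordering is an assumption.
latent_frames = (33 - 1) // 4 + 1  # 9, assuming 4x temporal compression
video_latent = (480 // 8, 832 // 8, latent_frames, 16)

ok_3d = channel_check(video_latent, channel_dim=3)  # reads C = 16
ok_2d = channel_check(video_latent, channel_dim=2)  # reads T = 9

print(ok_3d)  # True
print(ok_2d)  # False
```

If something like this is happening, logging the full shape of the reassembled latent just before the assertion would show which axis ended up with the unexpected size.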
