Describe the bug
While working on the modular pipeline for HunyuanVideo 1.5 (#13389), I found a bug in prepare_cond_latents_and_mask in pipeline_hunyuan_video1_5_image2video.py.
Line 614 shadows the pixel height/width parameters with latent dims from latents.shape:
def prepare_cond_latents_and_mask(self, latents, image, batch_size, height, width, dtype, device):
    batch, channels, frames, height, width = latents.shape  # overwrites pixel h/w with latent h/w
    image_latents = self._get_image_latents(..., height=height, width=width)
_get_image_latents then calls image_processor.preprocess(image, height=height, width=width) with latent dims (e.g. 30x44 instead of 480x704). After snapping to the nearest vae_scale_factor (16) multiple, the image gets resized to ~16x32 pixels before VAE encoding, producing a ~1x2 latent instead of the expected 30x44.
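The snapping arithmetic can be sketched as follows (a minimal illustration, assuming the image processor floors height/width to a multiple of its scale factor, 16 here; `snap_down` is a hypothetical helper, not a diffusers API):

```python
def snap_down(dim, scale_factor=16):
    # Round dim down to the nearest multiple of scale_factor,
    # mirroring how the preprocessor adjusts requested dimensions.
    return dim - dim % scale_factor

# Latent dims passed where pixel dims were expected:
print(snap_down(30), snap_down(44))    # 16 32  -> image resized to ~16x32 px
# The intended pixel dims would have passed through unchanged:
print(snap_down(480), snap_down(704))  # 480 704
```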
The original Tencent implementation (HunyuanVideo-1.5) resizes at pixel resolution before encoding.
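A minimal sketch of the intended behavior (names hypothetical, not the actual PR): unpack the latent dims under distinct names so the pixel-space height/width parameters survive to the preprocessing call.

```python
def unpack_latent_dims(latent_shape, height, width):
    # Hypothetical illustration of the fix: bind the latent dims to new
    # names instead of reusing the pixel-space height/width parameters.
    batch, channels, frames, latent_height, latent_width = latent_shape
    # height/width still hold the pixel dims and can be passed to
    # image_processor.preprocess(...) unchanged.
    return (height, width), (latent_height, latent_width)

# 480x704 pixels with a spatial compression factor of 16 -> 30x44 latents
pixel_dims, latent_dims = unpack_latent_dims((1, 16, 21, 30, 44), 480, 704)
```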
I will open a PR for this.
Reproduction
import inspect

from diffusers.pipelines.hunyuan_video1_5.pipeline_hunyuan_video1_5_image2video import (
    HunyuanVideo15ImageToVideoPipeline,
)

# Line 614 of prepare_cond_latents_and_mask shadows pixel height/width with latent dims
source = inspect.getsource(HunyuanVideo15ImageToVideoPipeline.prepare_cond_latents_and_mask)
# "batch, channels, frames, height, width = latents.shape" overwrites the pixel h/w params
assert "batch, channels, frames, height, width = latents.shape" in source
print("height/width parameters are shadowed by latent dims from latents.shape")
Logs
System Info
diffusers 0.38.0.dev0, Python 3.12, PyTorch 2.6
Who can help?
@sayakpaul @DN6 @yiyixuxu