Describe the bug
While working on the modular pipeline for HunyuanVideo 1.5 (#13389), I found a bug in prepare_cond_latents_and_mask in pipeline_hunyuan_video1_5_image2video.py.
Line 614 shadows the pixel height/width parameters with latent dims from latents.shape:
def prepare_cond_latents_and_mask(self, latents, image, batch_size, height, width, dtype, device):
    batch, channels, frames, height, width = latents.shape  # overwrites pixel h/w with latent h/w
    image_latents = self._get_image_latents(..., height=height, width=width)
_get_image_latents then calls image_processor.preprocess(image, height=height, width=width) with latent dims (e.g. 30x44 instead of 480x704). After snapping to the nearest vae_scale_factor (16) multiple, the image gets resized to ~16x32 pixels before VAE encoding, producing a ~1x2 latent instead of the expected 30x44.
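The snapping arithmetic can be sketched as follows (a minimal illustration, assuming the image processor floors height/width to a multiple of its scale factor, 16 here; `snap_down` is a hypothetical helper, not a diffusers API):

```python
def snap_down(dim, scale_factor=16):
    # Round dim down to the nearest multiple of scale_factor,
    # mirroring how the preprocessor adjusts requested dimensions.
    return dim - dim % scale_factor

# Latent dims passed where pixel dims were expected:
print(snap_down(30), snap_down(44))    # 16 32  -> image resized to ~16x32 px
# The intended pixel dims would have passed through unchanged:
print(snap_down(480), snap_down(704))  # 480 704
```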
The original Tencent implementation (HunyuanVideo-1.5) resizes at pixel resolution before encoding.
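A minimal sketch of the intended behavior (names hypothetical, not the actual PR): unpack the latent dims under distinct names so the pixel-space height/width parameters survive to the preprocessing call.

```python
def unpack_latent_dims(latent_shape, height, width):
    # Hypothetical illustration of the fix: bind the latent dims to new
    # names instead of reusing the pixel-space height/width parameters.
    batch, channels, frames, latent_height, latent_width = latent_shape
    # height/width still hold the pixel dims and can be passed to
    # image_processor.preprocess(...) unchanged.
    return (height, width), (latent_height, latent_width)

# 480x704 pixels with a spatial compression factor of 16 -> 30x44 latents
pixel_dims, latent_dims = unpack_latent_dims((1, 16, 21, 30, 44), 480, 704)
```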
I will open a PR for this.
Reproduction
import inspect

from diffusers.pipelines.hunyuan_video1_5.pipeline_hunyuan_video1_5_image2video import (
    HunyuanVideo15ImageToVideoPipeline,
)

# Line 614 of prepare_cond_latents_and_mask shadows pixel height/width with latent dims
source = inspect.getsource(HunyuanVideo15ImageToVideoPipeline.prepare_cond_latents_and_mask)
# "batch, channels, frames, height, width = latents.shape" overwrites the pixel h/w params
assert "batch, channels, frames, height, width = latents.shape" in source
print("height/width parameters are shadowed by latent dims from latents.shape")
Logs
System Info
diffusers 0.38.0.dev0, Python 3.12, PyTorch 2.6
Who can help?
@sayakpaul @DN6 @yiyixuxu