Description
I have a question about the `expand_timesteps` option in `WanImageToVideoPipeline`. I see that the mechanism hard-clamps only the first frame. For first+last frame (FLF) conditioning, is there a reason we can't (or shouldn't) apply the same mechanism to both endpoints?
The related code is in:
`diffusers/src/diffusers/pipelines/wan/pipeline_wan_i2v.py`
In `prepare_latents`, when `expand_timesteps` is enabled, `video_condition` ignores `last_image`, and `first_frame_mask` zeroes out only frame 0:
```python
# prepare_latents
if self.config.expand_timesteps:
    video_condition = image
...
if self.config.expand_timesteps:
    first_frame_mask = torch.ones(1, 1, num_latent_frames, latent_height, latent_width, ...)
    first_frame_mask[:, :, 0] = 0
return latents, latent_condition, first_frame_mask
```

In the denoising loop, the clamp/mix uses only `first_frame_mask`:
```python
if self.config.expand_timesteps:
    latent_model_input = (1 - first_frame_mask) * condition + first_frame_mask * latents
    temp_ts = (first_frame_mask[0][0][:, ::2, ::2] * t).flatten()
```

Question
- Is it a limitation of the Wan2.2 I2V checkpoint's training (only first-frame conditioning)?
- Or is it a design choice because per-token timestep masking doesn't extend cleanly to a last-frame constraint?
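For concreteness, here is a minimal sketch of what a symmetric two-endpoint version of the mask might look like. This is purely illustrative and not part of the diffusers API: `endpoint_mask` and the literal shape values are hypothetical, and the mix/timestep lines just mirror the snippet above with the extended mask.

```python
import torch

# Hypothetical sketch: extend the expand_timesteps mask to clamp BOTH endpoints.
# Shape mirrors the pipeline's (1, 1, num_latent_frames, H, W) mask; the names
# and dimensions below are illustrative only.
num_latent_frames, latent_height, latent_width = 21, 30, 52

endpoint_mask = torch.ones(1, 1, num_latent_frames, latent_height, latent_width)
endpoint_mask[:, :, 0] = 0    # clamp the first latent frame to the condition
endpoint_mask[:, :, -1] = 0   # additionally clamp the last latent frame (FLF)

# The mix would then keep both endpoints fixed to the condition:
# latent_model_input = (1 - endpoint_mask) * condition + endpoint_mask * latents

# And the per-token timesteps would be zero for both clamped frames:
t = torch.tensor(999.0)
temp_ts = (endpoint_mask[0][0][:, ::2, ::2] * t).flatten()
```

Mechanically this seems symmetric to the first-frame case, which is why I'm asking whether the blocker is the checkpoint's training rather than the masking machinery itself.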
I see that the corresponding part was written by @yiyixuxu, thanks!