Wan I2V expand_timesteps: why hard-clamp only first frame (no last-frame clamp)? #13167

@emirks

Description

I have a question about the expand_timesteps option in WanImageToVideoPipeline. As far as I can see, the mechanism hard-clamps only the first frame. For FLF (first + last frame) conditioning, is there a reason we can't (or shouldn't) apply the same mechanism to both endpoints?

The related code is in:

  • diffusers/src/diffusers/pipelines/wan/pipeline_wan_i2v.py

In prepare_latents, when expand_timesteps is enabled, video_condition ignores last_image and first_frame_mask is defined only for frame 0:

# prepare_latents
if self.config.expand_timesteps:
    video_condition = image

...
if self.config.expand_timesteps:
    first_frame_mask = torch.ones(1, 1, num_latent_frames, latent_height, latent_width, ...)
    first_frame_mask[:, :, 0] = 0
    return latents, latent_condition, first_frame_mask
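To make the question concrete, here is a minimal sketch of what a mask covering both endpoints could look like. It uses NumPy as a stand-in for torch tensors, and make_endpoint_mask with its clamp_last flag is a hypothetical helper of mine, not something that exists in diffusers. Whether the checkpoint would actually respect a clamped last frame is exactly what I'm asking.

```python
import numpy as np

def make_endpoint_mask(num_latent_frames, latent_height, latent_width, clamp_last=False):
    """Hypothetical helper: build a (1, 1, F, H, W) mask.

    0 marks a latent frame that is clamped to the condition,
    1 marks a latent frame that is denoised normally.
    """
    mask = np.ones((1, 1, num_latent_frames, latent_height, latent_width), dtype=np.float32)
    mask[:, :, 0] = 0  # same as the current first_frame_mask
    if clamp_last:
        mask[:, :, -1] = 0  # assumption: clamp the last latent frame the same way
    return mask

first_only = make_endpoint_mask(5, 4, 4)
both_ends = make_endpoint_mask(5, 4, 4, clamp_last=True)

# Mask value per latent frame (constant over height and width)
print(first_only[0, 0, :, 0, 0])
print(both_ends[0, 0, :, 0, 0])
```

Note that video_condition would also need to carry the encoded last_image at the final latent frame for such a mask to have anything to clamp to, which the current expand_timesteps branch does not do.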

In the denoising loop, the clamp/mix uses only first_frame_mask:

if self.config.expand_timesteps:
    latent_model_input = (1 - first_frame_mask) * condition + first_frame_mask * latents
    temp_ts = (first_frame_mask[0][0][:, ::2, ::2] * t).flatten()
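For reference, this is how I read those two lines, as a toy NumPy sketch rather than the actual pipeline code. The shapes, the random data, and the value of t are made up for illustration, and I'm assuming the ::2 strides correspond to a spatial patch size of 2, so each latent frame contributes (H // 2) * (W // 2) tokens.

```python
import numpy as np

# Toy sizes (assumptions for illustration only)
C, F, H, W = 2, 3, 4, 4
t = 999.0

rng = np.random.default_rng(0)
latents = rng.standard_normal((1, C, F, H, W)).astype(np.float32)
condition = rng.standard_normal((1, C, F, H, W)).astype(np.float32)

# Same construction as in prepare_latents: 0 on frame 0, 1 elsewhere
first_frame_mask = np.ones((1, 1, F, H, W), dtype=np.float32)
first_frame_mask[:, :, 0] = 0

# Frame 0 is taken from the condition, all other frames from the noisy latents
latent_model_input = (1 - first_frame_mask) * condition + first_frame_mask * latents

# One timestep per token: 0 for tokens of frame 0, t for every other token
temp_ts = (first_frame_mask[0][0][:, ::2, ::2] * t).flatten()

tokens_per_frame = (H // 2) * (W // 2)
print(temp_ts.shape)               # one entry per token
print(temp_ts[:tokens_per_frame])  # timesteps seen by the first-frame tokens
```

So the first frame is both overwritten with the condition and presented to the model at timestep 0, i.e. as already clean. Mechanically, nothing in these two lines looks specific to frame 0, which is why I'm wondering whether the restriction comes from training instead.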

Question
Is this a limitation of how the Wan2.2 I2V checkpoint was trained (first-frame conditioning only)?
Or is it a design choice, because per-token timestep masking doesn't extend cleanly to a last-frame constraint?

I see that the corresponding part was written by @yiyixuxu, thanks!
