Description
I have a question about the `expand_timesteps` option in `WanImageToVideoPipeline`. I see that the mechanism hard-clamps only the first frame. For first+last frame (FLF) conditioning, is there a reason we can't (or shouldn't) apply the same mechanism to both endpoints?
The related code is in:
`diffusers/src/diffusers/pipelines/wan/pipeline_wan_i2v.py`
In `prepare_latents`, when `expand_timesteps` is enabled, `video_condition` ignores `last_image`, and `first_frame_mask` zeroes out only frame 0:
```python
# prepare_latents
if self.config.expand_timesteps:
    video_condition = image
...
if self.config.expand_timesteps:
    first_frame_mask = torch.ones(1, 1, num_latent_frames, latent_height, latent_width, ...)
    first_frame_mask[:, :, 0] = 0
return latents, latent_condition, first_frame_mask
```

In the denoising loop, the clamp/mix uses only `first_frame_mask`:
```python
if self.config.expand_timesteps:
    latent_model_input = (1 - first_frame_mask) * condition + first_frame_mask * latents
    temp_ts = (first_frame_mask[0][0][:, ::2, ::2] * t).flatten()
```

Question
- Is it a limitation of the Wan2.2 I2V checkpoint's training (only first-frame conditioning)?
- Or is it a design choice because per-token timestep masking doesn't extend cleanly to a last-frame constraint?
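For concreteness, here is a minimal sketch of what a symmetric two-endpoint version of the mask might look like. This is purely illustrative and not part of the diffusers API: `endpoint_mask` and the literal shape values are hypothetical, and the mix/timestep lines just mirror the snippet above with the extended mask.

```python
import torch

# Hypothetical sketch: extend the expand_timesteps mask to clamp BOTH endpoints.
# Shape mirrors the pipeline's (1, 1, num_latent_frames, H, W) mask; the names
# and dimensions below are illustrative only.
num_latent_frames, latent_height, latent_width = 21, 30, 52

endpoint_mask = torch.ones(1, 1, num_latent_frames, latent_height, latent_width)
endpoint_mask[:, :, 0] = 0    # clamp the first latent frame to the condition
endpoint_mask[:, :, -1] = 0   # additionally clamp the last latent frame (FLF)

# The mix would then keep both endpoints fixed to the condition:
# latent_model_input = (1 - endpoint_mask) * condition + endpoint_mask * latents

# And the per-token timesteps would be zero for both clamped frames:
t = torch.tensor(999.0)
temp_ts = (endpoint_mask[0][0][:, ::2, ::2] * t).flatten()
```

Mechanically this seems symmetric to the first-frame case, which is why I'm asking whether the blocker is the checkpoint's training rather than the masking machinery itself.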
I see that the corresponding part was written by @yiyixuxu, thanks!