Does the Cosmos-Predict2.5 pipeline generate new video, or reconstruct the input video? #12995

@rebel-dkhong

Description

Describe the bug

Hi,

I tested the Cosmos-Predict2.5 Video2World pipeline, but the generated video is quite different from the one produced by NVIDIA's original implementation.

It seems that the pipeline reconstructs the input video rather than generating the next frames.

Could you check whether the pipeline is functioning properly, or whether there is a problem somewhere (e.g., in the example code or the implementation)?

I attached the result files generated with the HuggingFace implementation and NVIDIA's original implementation.

(To generate a video with NVIDIA's implementation, I followed the example script in NVIDIA's official repository.)

Reproduction

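The snippet below assumes that a pipeline object `pipe` and a `negative_prompt` string were created beforehand. The following is a minimal sketch of that setup, assuming the generic `DiffusionPipeline.from_pretrained` loader; the checkpoint id shown is a placeholder rather than the verified Cosmos-Predict2.5 Video2World checkpoint name, and the negative prompt is the one from the documented example (elided here).

>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from diffusers.utils import export_to_video, load_video
>>>
>>> # Placeholder checkpoint id (assumption); substitute the actual
>>> # Cosmos-Predict2.5 Video2World checkpoint from the Hub.
>>> model_id = "nvidia/Cosmos-Predict2.5-2B"
>>> pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>>
>>> # Negative prompt from the documented Cosmos example (elided here).
>>> negative_prompt = "..."

With this setup in place, the reproduction was run as follows: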
>>> # Video2World: condition on an input clip and predict a 93-frame world video.
>>> prompt = (
... "The video opens with an aerial view of a large-scale sand mining construction operation, showcasing extensive piles "
... "of brown sand meticulously arranged in parallel rows. A central water channel, fed by a water pipe, flows through the "
... "middle of these sand heaps, creating ripples and movement as it cascades down. The surrounding area features dense green "
... "vegetation on the left, contrasting with the sandy terrain, while a body of water is visible in the background on the right. "
... "As the video progresses, a piece of heavy machinery, likely a bulldozer, enters the frame from the right, moving slowly along "
... "the edge of the sand piles. This machinery's presence indicates ongoing construction work in the operation. The final frame "
... "captures the same scene, with the water continuing its flow and the bulldozer still in motion, maintaining the dynamic yet "
... "steady pace of the construction activity."
... )
>>> input_video = load_video(
... "https://github.com/nvidia-cosmos/cosmos-predict2.5/raw/refs/heads/main/assets/base/sand_mining.mp4"
... )
>>> video = pipe(
... image=None,
... video=input_video,
... prompt=prompt,
... negative_prompt=negative_prompt,
... num_frames=93,
... generator=torch.Generator().manual_seed(1),
... ).frames[0]
>>> export_to_video(video, "video2world.mp4", fps=16)

HuggingFace: https://github.com/user-attachments/assets/10c2a085-519f-46b5-957b-36d3b83955dd
NVIDIA: https://github.com/user-attachments/assets/b9bb3f02-c5ad-467e-9dd5-8503507e5f9c

Logs

System Info

  • 🤗 Diffusers version: 0.37.0.dev0
  • Platform: Linux-6.8.0-60-generic-x86_64-with-glibc2.35
  • Running on Google Colab?: No
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.9.1+cu128 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.36.0
  • Transformers version: 4.57.3
  • Accelerate version: 1.12.0
  • PEFT version: 0.18.0
  • Bitsandbytes version: not installed
  • Safetensors version: 0.7.0
  • xFormers version: not installed
  • Accelerator: NVIDIA A100 80GB PCIe, 81920 MiB
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@yiyixuxu @DN6 @a-r-r-o-w
