Does the Cosmos-Predict2.5 pipeline generate new video, or reconstruct the input video? #12995

@rebel-dkhong

Description

Describe the bug

Hi,

I tested the Cosmos-Predict2.5 Video2World pipeline, but the generated video is quite different from the one produced by NVIDIA's original implementation.

It seems that the pipeline reconstructs the input video rather than generating the next frames.

Could you check whether the pipeline is functioning properly, or whether there is a problem somewhere (e.g., in the example code or the implementation)?

I attached the result files generated with the HuggingFace implementation and NVIDIA's original implementation.

(To generate a video with NVIDIA's implementation, I followed the example script in NVIDIA's official repository.)

Reproduction

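The snippet below assumes that a pipeline object `pipe` and a `negative_prompt` string were created beforehand. The following is a minimal sketch of that setup, assuming the generic `DiffusionPipeline.from_pretrained` loader; the checkpoint id shown is a placeholder rather than the verified Cosmos-Predict2.5 Video2World checkpoint name, and the negative prompt is the one from the documented example (elided here).

>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from diffusers.utils import export_to_video, load_video
>>>
>>> # Placeholder checkpoint id (assumption); substitute the actual
>>> # Cosmos-Predict2.5 Video2World checkpoint from the Hub.
>>> model_id = "nvidia/Cosmos-Predict2.5-2B"
>>> pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>>
>>> # Negative prompt from the documented Cosmos example (elided here).
>>> negative_prompt = "..."

With this setup in place, the reproduction was run as follows: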
>>> # Video2World: condition on an input clip and predict a 93-frame world video.
>>> prompt = (
... "The video opens with an aerial view of a large-scale sand mining construction operation, showcasing extensive piles "
... "of brown sand meticulously arranged in parallel rows. A central water channel, fed by a water pipe, flows through the "
... "middle of these sand heaps, creating ripples and movement as it cascades down. The surrounding area features dense green "
... "vegetation on the left, contrasting with the sandy terrain, while a body of water is visible in the background on the right. "
... "As the video progresses, a piece of heavy machinery, likely a bulldozer, enters the frame from the right, moving slowly along "
... "the edge of the sand piles. This machinery's presence indicates ongoing construction work in the operation. The final frame "
... "captures the same scene, with the water continuing its flow and the bulldozer still in motion, maintaining the dynamic yet "
... "steady pace of the construction activity."
... )
>>> input_video = load_video(
... "https://github.com/nvidia-cosmos/cosmos-predict2.5/raw/refs/heads/main/assets/base/sand_mining.mp4"
... )
>>> video = pipe(
... image=None,
... video=input_video,
... prompt=prompt,
... negative_prompt=negative_prompt,
... num_frames=93,
... generator=torch.Generator().manual_seed(1),
... ).frames[0]
>>> export_to_video(video, "video2world.mp4", fps=16)

HuggingFace: https://github.com/user-attachments/assets/10c2a085-519f-46b5-957b-36d3b83955dd
NVIDIA: https://github.com/user-attachments/assets/b9bb3f02-c5ad-467e-9dd5-8503507e5f9c

Logs

System Info

  • 🤗 Diffusers version: 0.37.0.dev0
  • Platform: Linux-6.8.0-60-generic-x86_64-with-glibc2.35
  • Running on Google Colab?: No
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.9.1+cu128 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.36.0
  • Transformers version: 4.57.3
  • Accelerate version: 1.12.0
  • PEFT version: 0.18.0
  • Bitsandbytes version: not installed
  • Safetensors version: 0.7.0
  • xFormers version: not installed
  • Accelerator: NVIDIA A100 80GB PCIe, 81920 MiB
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@yiyixuxu @DN6 @a-r-r-o-w
