
[modular] Add LTX Video modular pipeline#13378

Open
akshan-main wants to merge 21 commits into huggingface:main from akshan-main:modular-ltx

Conversation

@akshan-main
Contributor

@akshan-main akshan-main commented Apr 1, 2026

What does this PR do?

Adds modular pipeline support for LTX Video, covering both text-to-video and image-to-video. The implementation follows the same structure as the existing Wan modular pipeline.

Text-to-video

LTXBlocks (SequentialPipelineBlocks)
  text_encoder      LTXTextEncoderStep
  denoise           LTXCoreDenoiseStep
    input               LTXTextInputStep
    set_timesteps       LTXSetTimestepsStep
    prepare_latents     LTXPrepareLatentsStep
    denoise             LTXDenoiseStep (LoopSequentialPipelineBlocks)
      before_denoiser       LTXLoopBeforeDenoiser
      denoiser              LTXLoopDenoiser
      after_denoiser        LTXLoopAfterDenoiser
  decode            LTXVaeDecoderStep

Image-to-video

LTXImage2VideoBlocks (SequentialPipelineBlocks)
  text_encoder      LTXTextEncoderStep
  denoise           LTXImage2VideoCoreDenoiseStep
    input               LTXTextInputStep
    set_timesteps       LTXSetTimestepsStep
    prepare_latents     LTXImage2VideoPrepareLatentsStep
    denoise             LTXImage2VideoDenoiseStep (LoopSequentialPipelineBlocks)
      before_denoiser       LTXImage2VideoLoopBeforeDenoiser
      denoiser              LTXImage2VideoLoopDenoiser
      after_denoiser        LTXImage2VideoLoopAfterDenoiser
  decode            LTXVaeDecoderStep

Verification

Parity tested against standard pipelines with identical parameters (H100, bfloat16, 297 frames, 30 steps, seed 42):

|     | Standard shape        | Modular shape         | MAD      |
| --- | --------------------- | --------------------- | -------- |
| T2V | (1, 297, 512, 704, 3) | (1, 297, 512, 704, 3) | 0.021609 |
| I2V | (1, 297, 512, 704, 3) | (1, 297, 512, 704, 3) | 0.016330 |

T2V - Standard vs Modular:

ltx_standard.mp4
ltx_modular.mp4
T2V reproduction code
import torch
import numpy as np
from diffusers import LTXPipeline, LTXBlocks
from diffusers.utils import export_to_video

model_id = "Lightricks/LTX-Video-0.9.7-dev"
prompt = "A cat walking across a sunlit garden"
height, width, num_frames = 512, 704, 297
steps, cfg, seed = 30, 3.0, 42

# Standard pipeline
std_pipe = LTXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output_type="np",
).frames
export_to_video(std_result[0], "ltx_standard.mp4", fps=25)

del std_pipe
torch.cuda.empty_cache()

# Modular pipeline
blocks = LTXBlocks()
mod_pipe = blocks.init_pipeline(model_id)
mod_pipe.load_components(torch_dtype=torch.bfloat16)
mod_pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = mod_pipe(
    prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output="videos",
)
export_to_video(mod_result[0], "ltx_modular.mp4", fps=25)

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"Mean absolute difference: {diff:.6f}")

I2V - Standard vs Modular:

ltx_i2v_standard.mp4
ltx_i2v_modular.mp4
I2V reproduction code
import torch
import numpy as np
from diffusers import LTXImageToVideoPipeline, LTXImage2VideoBlocks
from diffusers.utils import export_to_video, load_image

model_id = "Lightricks/LTX-Video-0.9.7-dev"
image = load_image("https://cdn.pixabay.com/photo/2014/11/30/14/11/cat-551554_640.jpg").resize((704, 512))
prompt = "A cat slowly turns its head"
height, width, num_frames = 512, 704, 297
steps, cfg, seed = 30, 3.0, 42

# Standard pipeline
std_pipe = LTXImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    image=image, prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen, output_type="np",
).frames
export_to_video(std_result[0], "ltx_i2v_standard.mp4", fps=25)

del std_pipe
torch.cuda.empty_cache()

# Modular pipeline
blocks = LTXImage2VideoBlocks()
pipe = blocks.init_pipeline(model_id)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = pipe(
    image=image, prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen, output="videos",
)
export_to_video(mod_result[0], "ltx_i2v_modular.mp4", fps=25)

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"Mean absolute difference: {diff:.6f}")

Files added

src/diffusers/modular_pipelines/ltx/
  __init__.py
  encoders.py              LTXTextEncoderStep
  before_denoise.py        LTXTextInputStep, LTXSetTimestepsStep, LTXPrepareLatentsStep, LTXImage2VideoPrepareLatentsStep
  denoise.py               T2V and I2V denoise loop blocks
  decoders.py              LTXVaeDecoderStep
  modular_blocks_ltx.py    LTXBlocks, LTXImage2VideoBlocks
  modular_pipeline.py      LTXModularPipeline, LTXImage2VideoModularPipeline

tests/modular_pipelines/ltx/
  test_modular_pipeline_ltx.py

Files modified

  • src/diffusers/__init__.py
  • src/diffusers/modular_pipelines/__init__.py
  • src/diffusers/modular_pipelines/modular_pipeline.py

Note: the tiny test model lives at akshan-main/tiny-ltx-modular-pipe on the Hub; it will have to be moved to hf-internal-testing/ before merge if this is approved.

Contribution to #13295

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you read our philosophy doc (important for complex PRs)?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Modular Diffusers 🧨 #13295
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@sayakpaul @yiyixuxu @asomoza

@akshan-main akshan-main marked this pull request as ready for review April 1, 2026 10:58
@yiyixuxu
Collaborator

yiyixuxu commented Apr 1, 2026

cc @asomoza
can you help check if our current LTX (0.9.7) is broken? the output does not seem right, especially the T2V one

@akshan-main
Contributor Author

akshan-main commented Apr 1, 2026

Reran with the official example params (Lightricks/LTX-Video instead of 0.9.7): 480x704, 161 frames, 50 steps, and a negative prompt. Updated videos:

T2V standard:

ltx_t2v_standard.mp4

T2V modular:

ltx_t2v_modular.mp4
T2V code
import torch
import numpy as np
from diffusers import LTXPipeline, LTXBlocks
from diffusers.utils import export_to_video

model_id = "Lightricks/LTX-Video"
prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
height, width, num_frames = 480, 704, 161
steps, cfg, seed = 50, 3.0, 42

print("=== Standard T2V ===")
std_pipe = LTXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    prompt=prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output_type="np",
).frames
export_to_video(std_result[0], "/content/ltx_t2v_standard.mp4", fps=24)
print(f"Standard shape: {np.array(std_result).shape}")

del std_pipe
torch.cuda.empty_cache()

print("\n=== Modular T2V ===")
blocks = LTXBlocks()
mod_pipe = blocks.init_pipeline(model_id)
mod_pipe.load_components(torch_dtype=torch.bfloat16)
mod_pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = mod_pipe(
    prompt=prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output="videos",
)
export_to_video(mod_result[0], "/content/ltx_t2v_modular.mp4", fps=24)
print(f"Modular shape: {np.array(mod_result).shape}")

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"\nT2V MAD: {diff:.6f}")
print("T2V PARITY:", "PASS" if diff < 1.0 else "FAIL")

del mod_pipe, blocks
torch.cuda.empty_cache()

I2V standard:

ltx_i2v_standard.mp4

I2V modular:

ltx_i2v_modular.mp4
I2V code
from diffusers import LTXImageToVideoPipeline, LTXImage2VideoBlocks
from diffusers.utils import load_image

image = load_image("https://cdn.pixabay.com/photo/2014/11/30/14/11/cat-551554_640.jpg").resize((704, 480))
i2v_prompt = "A cat slowly turns its head and looks around"

print("=== Standard I2V ===")
std_pipe = LTXImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    image=image, prompt=i2v_prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output_type="np",
).frames
export_to_video(std_result[0], "/content/ltx_i2v_standard.mp4", fps=24)
print(f"Standard shape: {np.array(std_result).shape}")

del std_pipe
torch.cuda.empty_cache()

print("\n=== Modular I2V ===")
blocks = LTXImage2VideoBlocks()
pipe = blocks.init_pipeline(model_id)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = pipe(
    image=image, prompt=i2v_prompt, negative_prompt=negative_prompt,
    height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=cfg, generator=gen,
    output="videos",
)
export_to_video(mod_result[0], "/content/ltx_i2v_modular.mp4", fps=24)
print(f"Modular shape: {np.array(mod_result).shape}")

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"\nI2V MAD: {diff:.6f}")
print("I2V PARITY:", "PASS" if diff < 1.0 else "FAIL")

print("\n=== Done ===")
print("Videos saved: ltx_t2v_standard.mp4, ltx_t2v_modular.mp4, ltx_i2v_standard.mp4, ltx_i2v_modular.mp4")

Also verified that without CFG (guidance_scale=1.0), MAD drops to 0.008. The small visual difference with CFG enabled comes from the guider running cond/uncond as separate batches vs the standard pipeline's single concatenated batch. This is the same behavior as the Wan modular pipeline.

No CFG code
import torch
import numpy as np
from diffusers import LTXPipeline, LTXBlocks

model_id = "Lightricks/LTX-Video"
prompt = "A woman with long brown hair smiles"
height, width, num_frames = 480, 704, 41
steps, seed = 20, 42

# Standard - no CFG
std_pipe = LTXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
std_result = std_pipe(
    prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=1.0, generator=gen,
    output_type="np",
).frames

del std_pipe; torch.cuda.empty_cache()

# Modular - no CFG
blocks = LTXBlocks()
pipe = blocks.init_pipeline(model_id)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")
gen = torch.Generator("cuda").manual_seed(seed)
mod_result = pipe(
    prompt=prompt, height=height, width=width, num_frames=num_frames,
    num_inference_steps=steps, guidance_scale=1.0, generator=gen,
    output="videos",
)

diff = np.abs(np.array(std_result).astype(float) - np.array(mod_result).astype(float)).mean()
print(f"No-CFG MAD: {diff:.6f}")
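The cond/uncond batching difference is purely a floating-point reduction-order effect: the CFG combine is algebraically identical either way, but separate and concatenated forward passes hit different accumulation orders inside the kernels. A self-contained toy (values chosen deliberately to force the rounding) shows how summation order alone changes a float32 result:

```python
import numpy as np

# The CFG combine, eps = eps_uncond + g * (eps_cond - eps_uncond), is the
# same math whether cond/uncond run as one concatenated batch or as two
# separate passes; only the reduction order inside the kernels differs.
# Toy float32 sums with values chosen to expose order dependence:
a, b, c = np.float32(1e8), np.float32(1.0), np.float32(-1e8)

left = (a + c) + b   # large terms cancel first, small term survives -> 1.0
right = a + (b + c)  # small term is absorbed into -1e8 by rounding -> 0.0

print(left, right)   # 1.0 0.0
```

The per-pixel discrepancies this causes are tiny, but over 30 denoising steps they compound into the ~0.02 MAD observed with CFG on.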

Collaborator

@yiyixuxu yiyixuxu left a comment


thanks, I left some feedback!

return False


class LTXImage2VideoModularPipeline(LTXModularPipeline):
Collaborator


IMO we don't need a modular pipeline for I2V
for wan, we have T2V and I2V pipelines because they are different checkpoints/models
for LTX, it's the same checkpoint supporting multiple workflows, no?

Contributor Author


Makes sense! Removed LTXImage2VideoModularPipeline. Both workflows use LTXModularPipeline now.

logger = logging.get_logger(__name__)


class LTXCoreDenoiseStep(SequentialPipelineBlocks):
Collaborator


can you add docstrings to the assembled blocks using #auto_docstring? see the PR that introduced it: #12958
@claude are you there? can you help explain a bit more on #auto_docstring?

Contributor Author


done

@yiyixuxu
Collaborator

yiyixuxu commented Apr 1, 2026

@claude can you do a review here?

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

Claude finished @yiyixuxu's task in 4m 1s —— View job


Code Review: LTX Video Modular Pipeline

  • Read PR diff and source files
  • Review correctness issues
  • Answer #auto_docstring question
  • Post final review

On #auto_docstring

@yiyixuxu, @akshan-main: #auto_docstring is a doc-generation marker introduced in PR #12958. When placed before an assembled block class (like LTXBlocks or LTXImage2VideoBlocks), the documentation tooling introspects that class's inputs, intermediate_outputs, outputs, and description properties and automatically generates a comprehensive API docstring. This means developers don't have to manually maintain the docstring as the pipeline evolves — it stays in sync with the declared parameter specs automatically.

Usage looks like:

# #auto_docstring
class LTXBlocks(SequentialPipelineBlocks):
    ...
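For illustration only, a toy generator in this spirit just reads declared specs off a class and renders them; the class and attribute names below are hypothetical stand-ins, not the diffusers API:

```python
# Toy sketch of introspection-based docstring generation. `FakeBlock`
# and its attributes are hypothetical stand-ins for a block's declared
# `description` and `inputs` specs.
class FakeBlock:
    description = "Denoise step for LTX."
    inputs = [("prompt", "str"), ("num_inference_steps", "int")]

def render_docstring(block_cls):
    # Build the docstring from the class's declared specs, so it can
    # never drift out of sync with the actual parameters.
    lines = [block_cls.description, "", "Inputs:"]
    for name, typ in block_cls.inputs:
        lines.append(f"    {name} (`{typ}`)")
    return "\n".join(lines)

print(render_docstring(FakeBlock))
```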

Correctness Issues

src/diffusers/__init__.py

[src/diffusers/__init__.py:1214] LTXImage2VideoBlocks and LTXImage2VideoModularPipeline are missing from the TYPE_CHECKING block — they appear in the lazy _import_structure but not in the static import branch, so type-checkers (mypy/pyright) will not resolve them. Fix this →

Also, the four LTX entries are inserted between HeliosPyramidDistilledModularPipeline and HeliosPyramidModularPipeline, splitting the Helios group. Since H < L alphabetically, all LTX entries should be placed after HeliosPyramidModularPipeline.


src/diffusers/modular_pipelines/ltx/before_denoise.py

[before_denoise.py:484] LTXImage2VideoPrepareLatentsStep.__call__ — when generator is a list, the loop iterates range(batch_size) where batch_size = block_state.batch_size * block_state.num_videos_per_prompt. But image only contains block_state.batch_size frames, so image[i] will raise IndexError when num_videos_per_prompt > 1. Fix: iterate over range(block_state.batch_size) and handle video repetition after encoding. Fix this →
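A minimal sketch of the suggested fix, in plain NumPy with a stand-in `encode` in place of the VAE call: encode exactly one latent per input image, then repeat along the batch dimension for num_videos_per_prompt:

```python
import numpy as np

def prepare_image_latents(images, generators, num_videos_per_prompt, encode):
    # One generator and one encode call per *input image*, never per
    # expanded batch entry -- this avoids image[i] going out of range.
    batch_size = len(images)
    assert len(generators) == batch_size, "one generator per input image"
    latents = np.stack(
        [encode(images[i], generators[i]) for i in range(batch_size)]
    )
    # Expand for num_videos_per_prompt only after encoding.
    return np.repeat(latents, num_videos_per_prompt, axis=0)

images = [np.ones((4, 4)), np.zeros((4, 4))]
out = prepare_image_latents(images, generators=[0, 1],
                            num_videos_per_prompt=3,
                            encode=lambda img, gen: img * 0.5)
print(out.shape)  # (6, 4, 4)
```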


src/diffusers/modular_pipelines/ltx/encoders.py

[encoders.py:1330] LTXTextEncoderStep.__call__ — the early return when prompt_embeds is pre-provided skips generating negative_prompt_embeds, even if CFG is active. If a user passes prompt_embeds without negative_prompt_embeds, LTXLoopDenoiser will fail because it declares negative_prompt_embeds as required=True. At minimum, the guard should also check negative_prompt_embeds when requires_unconditional_embeds is true. Fix this →
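A hedged sketch of what that guard could look like; the function name and arguments are illustrative, not the PR's actual code:

```python
def needs_text_encoding(prompt_embeds, negative_prompt_embeds,
                        requires_unconditional_embeds):
    """Return True while the text encoder step still has work to do."""
    if prompt_embeds is None:
        return True
    # Pre-provided prompt_embeds are not enough on their own: with CFG
    # active the denoiser also requires negative_prompt_embeds.
    if requires_unconditional_embeds and negative_prompt_embeds is None:
        return True
    return False

# Conditional embeds provided, CFG on, no unconditional embeds: must encode.
print(needs_text_encoding("embeds", None, True))   # True
# Everything provided, or CFG off: encoding can be skipped safely.
print(needs_text_encoding("embeds", "neg", True))  # False
print(needs_text_encoding("embeds", None, False))  # False
```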

[encoders.py:1300] The type mismatch check type(prompt) is not type(negative_prompt) is dead code: by this point prompt has already been converted to a list (line 1283) and negative_prompt has also been converted to a list (line 1298), so they will always have the same type.


src/diffusers/modular_pipelines/ltx/denoise.py

[denoise.py:739,965] LTXLoopDenoiser and LTXImage2VideoLoopDenoiser both use a mutable dict as a default argument for guider_input_fields. This is a Python anti-pattern — if the dict is mutated by any caller, it affects all future calls using the default. Use None as the default and set the dict inside the function body instead. Fix this →
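The pitfall is easy to demonstrate in isolation; the sketch below is generic Python, not the PR's code:

```python
def bad(guider_input_fields={}):          # ONE dict shared by all calls
    guider_input_fields["seen"] = guider_input_fields.get("seen", 0) + 1
    return guider_input_fields

def good(guider_input_fields=None):       # fresh dict per call
    if guider_input_fields is None:
        guider_input_fields = {}
    guider_input_fields["seen"] = guider_input_fields.get("seen", 0) + 1
    return guider_input_fields

print(bad()["seen"], bad()["seen"])    # 1 2  <- mutation leaks across calls
print(good()["seen"], good()["seen"])  # 1 1
```

Default argument values are evaluated once at function definition time, so every call that omits the argument sees (and can mutate) the same dict object.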

[denoise.py:709,960] LTXLoopDenoiser and LTXImage2VideoLoopDenoiser are ~90% identical — the only runtime difference is that I2V uses block_state.timestep_adjusted instead of t.expand(...) for the timestep argument. Consider refactoring into a base class with a _get_timestep(block_state, t) hook to avoid duplicated maintenance surface.
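One possible shape for that refactor; class names, the dict-based state, and the timestep values are illustrative stand-ins for the real blocks:

```python
class LoopDenoiserBase:
    """Shared denoise-loop logic; subclasses override only the timestep."""

    def _get_timestep(self, block_state, t):
        # T2V default: broadcast the scalar timestep across the batch.
        return [t] * block_state["batch_size"]

    def step(self, block_state, t):
        timestep = self._get_timestep(block_state, t)
        # ... the ~90%-shared denoiser call would go here ...
        return timestep

class I2VLoopDenoiser(LoopDenoiserBase):
    def _get_timestep(self, block_state, t):
        # I2V: per-token timesteps adjusted for the conditioning frame.
        return block_state["timestep_adjusted"]

state = {"batch_size": 2, "timestep_adjusted": [999, 0]}
print(LoopDenoiserBase().step(state, 999))  # [999, 999]
print(I2VLoopDenoiser().step(state, 999))   # [999, 0]
```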


src/diffusers/modular_pipelines/ltx/decoders.py

[decoders.py:637] latents.to(block_state.dtype if hasattr(block_state, 'dtype') else torch.float32): dtype is set by LTXTextInputStep and should always be present by the time the decoder runs. The hasattr fallback silently hides a pipeline state management bug if dtype is somehow missing; better to let it fail clearly, or assert its presence.


src/diffusers/modular_pipelines/ltx/modular_blocks_ltx.py

[modular_blocks_ltx.py:32] Missing #auto_docstring on LTXBlocks and LTXImage2VideoBlocks as noted by @yiyixuxu.


Architecture Note

Echoing @yiyixuxu's comment: since LTX uses a single checkpoint for both T2V and I2V (unlike Wan where they're separate model families), LTXImage2VideoModularPipeline as a distinct pipeline class adds minimal value. The two workflows are differentiated entirely by the block set (LTXBlocks vs LTXImage2VideoBlocks) which is already the modular pattern's intended separation point. The ltx-i2v key in modular_pipeline.py's registry (and LTXImage2VideoModularPipeline class) could be dropped.

@akshan-main
Contributor Author

@yiyixuxu Would you like me to add the condition and latent upsampler workflows to this PR as well, or keep those for a follow-up? Also happy to do LTX2 modular as a next PR if that's useful.

The condition workflow uses the same checkpoint; the upsampler is a different one.

@yiyixuxu
Collaborator

yiyixuxu commented Apr 2, 2026

@akshan-main let's merge this one first and follow up with conditional + upsample
LTX2.3 I'm working on it already :)

@akshan-main
Contributor Author

Sounds good! I'll work on the conditional + upsampler follow-up PR next, then.

Meanwhile, I also opened a PR for HunyuanVideo 1.5 modular #13389. Both have MAD 0.000000 parity since the standard HunyuanVideo 1.5 pipeline already uses the guider abstraction, so there's no batching difference like with LTX's manual CFG concat. Would love your feedback on it when you get a chance!

@akshan-main
Contributor Author

added docstrings to assembled blocks using #auto_docstring


Inputs:
num_videos_per_prompt (`None`, *optional*, defaults to 1):
TODO: Add description.
Collaborator


@yiyixuxu
Collaborator

yiyixuxu commented Apr 3, 2026

@claude are you here? can you look at the PR and explain what steps need to be done for this to work

modular_ltx = ModularPipeline.from_pretrained("Lightricks/LTX-Video")
modular_ltx.load_components(torch_dtype=torch.bfloat16)

@github-actions
Contributor

github-actions bot commented Apr 3, 2026

Claude finished @yiyixuxu's task in 2m 27s —— View job


What's needed for ModularPipeline.from_pretrained("Lightricks/LTX-Video")

  • Read PR code and modular pipeline registry
  • Understand current from_pretrained dispatch mechanism
  • Explain what's needed

Here's what ModularPipeline.from_pretrained(repo) actually does (from modular_pipeline.py:2062):

# Step 1: try to load blocks from the repo's auto_map
blocks = ModularPipelineBlocks.from_pretrained(repo, ...)  # needs auto_map in config.json
# Step 2: load config from modular_model_index.json
config_dict = cls.load_config(repo)                        # needs modular_model_index.json
pipeline_class = _get_pipeline_class(cls, config=config_dict)
# Step 3: construct
pipeline = pipeline_class(blocks=blocks, pretrained_model_name_or_path=repo)

Lightricks/LTX-Video is a standard model repo — it has a model_index.json but no modular_model_index.json and no auto_map for blocks. So both steps above fail gracefully and you end up with blocks=None and an empty ModularPipeline.

Three things need to be in place:

1. A hub repo with modular_model_index.json

This JSON tells the pipeline where to load each component from:

{
  "scheduler": ["diffusers", "FlowMatchEulerDiscreteScheduler", {"repo": "Lightricks/LTX-Video", "subfolder": "scheduler"}],
  "vae":        ["diffusers", "AutoencoderKLLTXVideo",          {"repo": "Lightricks/LTX-Video", "subfolder": "vae"}],
  "text_encoder": [...],
  "tokenizer":    [...],
  "transformer":  [...]
}

This can either live in the official Lightricks/LTX-Video repo, or in a separate modular pipeline repo (e.g. hf-internal-testing/ltx-modular-pipe) that points back to the weights. The test already uses the latter pattern (akshan-main/tiny-ltx-modular-pipe).

2. A config.json with auto_map pointing to LTXBlocks

{ "auto_map": { "ModularPipelineBlocks": "modular_blocks_ltx.LTXBlocks" } }

This lives in the same repo as the modular_model_index.json. Without it, blocks=None and no pipeline logic is wired up.

3. A ("LTXModularPipeline", "LTXBlocks") entry in MODULAR_PIPELINE_BLOCKS_MAPPING

In modular_pipeline.py:66, the current mapping is:

MODULAR_PIPELINE_BLOCKS_MAPPING = OrderedDict([
    ("StableDiffusionXLModularPipeline", "StableDiffusionXLAutoBlocks"),
])

This is used in ModularPipeline.__init__ (line 1873) as a fallback when blocks=None. Without ("LTXModularPipeline", "LTXBlocks") here, even LTXModularPipeline(pretrained_model_name_or_path=...) won't wire up the right blocks automatically. This entry is missing from the PR.


Summary — current PR state vs. what's needed:

| Requirement | Status |
| --- | --- |
| ("ltx", _create_default_map_fn("LTXModularPipeline")) in MODULAR_PIPELINE_MAPPING | ✅ Added by PR (modular_pipeline.py) |
| ("LTXModularPipeline", "LTXBlocks") in MODULAR_PIPELINE_BLOCKS_MAPPING | ❌ Missing |
| Hub repo with modular_model_index.json + auto_map | ⚠️ Only in akshan-main/tiny-ltx-modular-pipe (needs to move to hf-internal-testing/) |

So the usage would be:

# Option A — using a dedicated modular pipeline repo (current pattern, works today)
pipe = ModularPipeline.from_pretrained("hf-internal-testing/ltx-modular-pipe")
pipe.load_components(torch_dtype=torch.bfloat16)

# Option B — directly from the model repo (requires Lightricks/LTX-Video
#             to have modular_model_index.json + auto_map added)
pipe = ModularPipeline.from_pretrained("Lightricks/LTX-Video")
pipe.load_components(torch_dtype=torch.bfloat16)

The PR as written only supports the explicit LTXBlocks().init_pipeline(model_id) entry point. To support the cleaner ModularPipeline.from_pretrained("Lightricks/LTX-Video") pattern, you'd need MODULAR_PIPELINE_BLOCKS_MAPPING updated and either the official model repo or a sidecar modular-pipeline repo published with the right JSON files.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yiyixuxu
Collaborator

yiyixuxu commented Apr 3, 2026

@claude

I think as long as MODULAR_PIPELINE_MAPPING is updated the pattern would work no?

To support the cleaner ModularPipeline.from_pretrained("Lightricks/LTX-Video") pattern, you'd need MODULAR_PIPELINE_BLOCKS_MAPPING updated and either the official model repo or a sidecar modular-pipeline repo published with the right JSON files.

can you verify and if agree with me, maybe we should add an item to the skill file? https://github.com/huggingface/diffusers/blob/main/.ai/skills/model-integration/modular-conversion.md

@github-actions
Contributor

github-actions bot commented Apr 3, 2026

Claude finished @yiyixuxu's task in 2m 47s —— View job


  • Read modular_pipeline.py to trace the from_pretrained flow
  • Read the skill file
  • Verify the claim and propose skill file update

Verification: is MODULAR_PIPELINE_BLOCKS_MAPPING sufficient?

Yes, @yiyixuxu is correct — my previous comment was wrong to list auto_map as a requirement. Here's why:

ModularPipeline.from_pretrained(repo) has two block-wiring paths:

| Path | Mechanism | When used |
| --- | --- | --- |
| A | ModularPipelineBlocks.from_pretrained(repo) → reads auto_map from the repo's config.json | Only when the hub repo explicitly has auto_map |
| B | MODULAR_PIPELINE_BLOCKS_MAPPING.get(pipeline_class.__name__) inside __init__ | When blocks=None (Path A returned nothing) |

Path B is the MODULAR_PIPELINE_BLOCKS_MAPPING path. It activates when blocks=None, which is the normal case for a standard model repo like Lightricks/LTX-Video. So auto_map is not required.
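A stripped-down model of the two paths in plain Python; the mapping name mirrors the real one, but the resolver logic here is only a sketch of the dispatch, not the diffusers implementation:

```python
from collections import OrderedDict

MODULAR_PIPELINE_BLOCKS_MAPPING = OrderedDict([
    ("LTXModularPipeline", "LTXBlocks"),  # the entry this PR would add
])

def resolve_blocks(pipeline_class_name, blocks_from_auto_map=None):
    # Path A: the hub repo's auto_map already produced blocks.
    if blocks_from_auto_map is not None:
        return blocks_from_auto_map
    # Path B: fall back to the in-library mapping keyed by class name,
    # which is why a plain model repo works without auto_map.
    return MODULAR_PIPELINE_BLOCKS_MAPPING.get(pipeline_class_name)

print(resolve_blocks("LTXModularPipeline"))            # LTXBlocks
print(resolve_blocks("LTXModularPipeline", "Custom"))  # Custom
```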

The corrected requirements for ModularPipeline.from_pretrained("Lightricks/LTX-Video"):

| Requirement | Previous comment | Corrected |
| --- | --- | --- |
| ("LTXModularPipeline", "LTXBlocks") in MODULAR_PIPELINE_BLOCKS_MAPPING | ❌ Missing (noted) | ❌ Missing — needs to be added |
| Hub repo with modular_model_index.json (has _class_name: "LTXModularPipeline" + component specs) | ⚠️ Needed | ⚠️ Still needed — so _get_pipeline_class can resolve LTXModularPipeline and load_components knows where to fetch weights |
| auto_map in hub repo's config.json | ⚠️ Listed as needed | Not required — Path B handles this |

So the minimal change to make ModularPipeline.from_pretrained("Lightricks/LTX-Video") work:

  1. Add ("LTXModularPipeline", "LTXBlocks") to MODULAR_PIPELINE_BLOCKS_MAPPING in modular_pipeline.py (this PR is missing it)
  2. The hub repo (e.g. hf-internal-testing/ltx-modular-pipe) needs modular_model_index.json with correct _class_name and component loading specs — auto_map is optional

Proposed skill file addition

For .ai/skills/model-integration/modular-conversion.md, I'd suggest adding to the conversion checklist:

- [ ] Add `("ltx", "LTXModularPipeline")` to `MODULAR_PIPELINE_MAPPING` in `modular_pipeline.py`
      (enables `blocks.init_pipeline(model_id)` to create the right pipeline class)
- [ ] Add `("LTXModularPipeline", "LTXBlocks")` to `MODULAR_PIPELINE_BLOCKS_MAPPING` in `modular_pipeline.py`
      (enables `ModularPipeline.from_pretrained(repo)` without needing `auto_map` in the hub repo — 
       `auto_map` is only needed if you want `ModularPipelineBlocks.from_pretrained()` to work directly)
- [ ] Publish a sidecar modular repo (e.g. `hf-internal-testing/<model>-modular-pipe`) with 
      `modular_model_index.json` pointing to the model weights; `auto_map` is optional

Since the .ai/ directory doesn't exist on this branch (it lives on main only), and the skill file is outside src/diffusers/, I'll leave the actual edit to you or @akshan-main. But I'd be glad to implement the missing MODULAR_PIPELINE_BLOCKS_MAPPING entry in modular_pipeline.py if you want — just say the word.

@yiyixuxu
Collaborator

yiyixuxu commented Apr 3, 2026

@akshan-main thanks, i will do a review now
just a heads up, we are testing our claude a little bit, so there will be some conversation back and forth with them; hope you don't mind :)

@akshan-main
Contributor Author

MODULAR_PIPELINE_BLOCKS_MAPPING doesn't exist in the codebase

@yiyixuxu
Collaborator

yiyixuxu commented Apr 3, 2026

ohh right, MODULAR_PIPELINE_MAPPING (you already added it, so don't worry about it)

Collaborator

@yiyixuxu yiyixuxu left a comment


thanks!
I left some comments!

@claude, can you look through my comments and put together a summary on the proposed change in the skill file?

return [
InputParam.template("prompt"),
InputParam.template("negative_prompt"),
InputParam.template("prompt_embeds"),
Collaborator


we don't need to list prompt embeds as input. We use this pattern in our standard pipelines to let users skip encoding, etc., but in modular it is not needed: you can just pop the text encoder block out and run it separately.
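The "pop a block out" idea can be modeled with a plain ordered mapping of steps; the code below is a toy illustration of the pattern, not the diffusers API:

```python
from collections import OrderedDict

# Toy stand-ins for pipeline blocks: each step is just a callable that
# reads from and extends a state dict.
blocks = OrderedDict(
    text_encoder=lambda s: {**s, "prompt_embeds": f"emb({s['prompt']})"},
    denoise=lambda s: {**s, "latents": f"denoised({s['prompt_embeds']})"},
)

# Pop the text encoder out and run it standalone...
text_encoder = blocks.pop("text_encoder")
state = text_encoder({"prompt": "a cat"})

# ...then feed its output into the remaining steps, so prompt_embeds
# never needs to be a declared input of the assembled pipeline.
for step in blocks.values():
    state = step(state)
print(state["latents"])  # denoised(emb(a cat))
```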

raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(block_state.prompt)}")

@staticmethod
def _get_t5_prompt_embeds(
Collaborator


can we make this a regular function? so custom blocks can use it as well

block_state = self.get_block_state(state)

# Set guidance_scale on guider so CFG is configured correctly
guidance_scale = getattr(block_state, "guidance_scale", 3.0)
Collaborator


we don't need to accept guidance_scale in the modular pipeline. Users can configure the guider separately: https://huggingface.co/docs/diffusers/modular_diffusers/guiders#changing-guider-parameters

as we support more guider types, each will have its own set of parameters, and we won't be able to forward all of them through the pipeline inputs.

@property
def intermediate_outputs(self) -> list[OutputParam]:
return [
OutputParam.template("latents"),
Collaborator


we cannot use the template here, because this is not the "denoise latent" as defined in the output param template

import torch

from ...models import LTXVideoTransformer3DModel
from ...pipelines.ltx.pipeline_ltx import LTXPipeline
Collaborator


let's not import the standard pipeline here
the modular and standard pipeline are meant to be parallel.

block_state.latents = randn_tensor(
shape, generator=block_state.generator, device=device, dtype=torch.float32
)
block_state.latents = LTXPipeline._pack_latents(
Collaborator


you can redefine it as a regular function here, or maybe use #Copied from

see example using #Copied from https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/wan/before_denoise.py#L495
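For reference, such a patchify helper can be written as a standalone function. The sketch below is a NumPy illustration of the reshape → transpose → flatten pattern behind latent packing; it is a hedged approximation of the idea, not necessarily the exact diffusers implementation:

```python
import numpy as np

def pack_latents(latents, patch_size=1, patch_size_t=1):
    # (B, C, F, H, W) -> (B, num_patches, C * pt * p * p): each
    # spatio-temporal patch becomes one token with channels last.
    b, c, f, h, w = latents.shape
    x = latents.reshape(b, c, f // patch_size_t, patch_size_t,
                        h // patch_size, patch_size,
                        w // patch_size, patch_size)
    x = x.transpose(0, 2, 4, 6, 1, 3, 5, 7)  # patch grid first, channels last
    num_patches = (f // patch_size_t) * (h // patch_size) * (w // patch_size)
    return x.reshape(b, num_patches, -1)

x = np.zeros((2, 8, 2, 4, 4), dtype=np.float32)  # B=2, C=8, F=2, H=W=4
print(pack_latents(x).shape)  # (2, 32, 8)
```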

if not isinstance(image, torch.Tensor):
from ...video_processor import VideoProcessor

processor = VideoProcessor(vae_scale_factor=components.vae_spatial_compression_ratio)
Collaborator


this should be a component, no?

else:
init_latents = [
retrieve_latents(
components.vae.encode(img.unsqueeze(0).unsqueeze(2).to(vae_dtype)), block_state.generator
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should extract the vae encoding into its own block in encoders.py (e.g. LTXVaeEncoderStep), and here this step should accept image_latents as input instead of raw image. This way users can run the VAE encoder standalone and pass pre-computed latents directly. See https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/wan/encoders.py#L470


from ...configuration_utils import FrozenDict
from ...models import AutoencoderKLLTXVideo
from ...pipelines.ltx.pipeline_ltx import LTXPipeline
Collaborator

same here: let's either redefine or copy the pipeline methods you need


latents = block_state.latents

if block_state.output_type == "latent":
Collaborator

we don't need to accept a latent output_type in modular.
Similar to encode_prompt, we can pop the decoder step out of the pipeline if we don't need the output decoded.

@github-actions github-actions bot added the size/L PR with diff > 200 LOC label Apr 8, 2026
@akshan-main
Contributor Author

@yiyixuxu addressed all in respective comments. Lmk if there are more things I need to put some work on!

@akshan-main
Contributor Author

friendly ping @yiyixuxu. does the current LTX state look good? happy to transfer the applicable changes over to my modular HunyuanVideo 1.5 PR #13389 as well

@akshan-main
Contributor Author

another friendly ping @yiyixuxu. Would be really glad to help the diffusers team ship this today!

Collaborator

@yiyixuxu yiyixuxu left a comment


thanks, i left some small comments

return timesteps, num_inference_steps


def _pack_latents(latents: torch.Tensor, patch_size: int = 1, patch_size_t: int = 1) -> torch.Tensor:
Collaborator

not required, but consider making a pachifier so that you can use it in different places, e.g.

class QwenImagePachifier(ConfigMixin):

Contributor Author

will consider a pachifier as a follow-up PR

Comment on lines +127 to +134
def _normalize_latents(
latents: torch.Tensor, latents_mean: torch.Tensor, latents_std: torch.Tensor, scaling_factor: float = 1.0
) -> torch.Tensor:
# Normalize latents across the channel dimension [B, C, F, H, W]
latents_mean = latents_mean.view(1, -1, 1, 1, 1).to(latents.device, latents.dtype)
latents_std = latents_std.view(1, -1, 1, 1, 1).to(latents.device, latents.dtype)
latents = (latents - latents_mean) * scaling_factor / latents_std
return latents
Collaborator

Suggested change
def _normalize_latents(
latents: torch.Tensor, latents_mean: torch.Tensor, latents_std: torch.Tensor, scaling_factor: float = 1.0
) -> torch.Tensor:
# Normalize latents across the channel dimension [B, C, F, H, W]
latents_mean = latents_mean.view(1, -1, 1, 1, 1).to(latents.device, latents.dtype)
latents_std = latents_std.view(1, -1, 1, 1, 1).to(latents.device, latents.dtype)
latents = (latents - latents_mean) * scaling_factor / latents_std
return latents

i think it is not used here

Contributor Author

removed

Comment on lines +347 to +348
from ...configuration_utils import FrozenDict
from ...guiders import ClassifierFreeGuidance
Collaborator

Suggested change
from ...configuration_utils import FrozenDict
from ...guiders import ClassifierFreeGuidance

Contributor Author

removed

> [!WARNING] > This is an experimental feature and is likely to change in the future.
"""

default_blocks_name = "LTXBlocks"
Collaborator

Contributor Author

done


mask_shape = (batch_size, 1, num_frames, height, width)

if block_state.latents is not None:
Collaborator

so the latents input from the user is the initial noise in T2V, but here it would be noised image latents, no? That's confusing. It might be an inconsistency we always had in our standard pipelines, but it's worse here, so let's fix it here: the latents input should always be initial pure noise regardless of workflow.

I think we can reuse LTXPrepareLatentsStep and have this block focus only on

  1. adding noise to image_latents
  2. creating the conditioning_mask

so for I2V it would use both LTXPrepareLatentsStep -> LTXImage2VideoPrepareLatentsStep
would this work?
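The proposed split could look roughly like this. A minimal numpy sketch of what the I2V block would add on top of pure-noise latents (first-frame conditioning as in the standard LTX I2V pipeline; the function name and shapes are illustrative, not the actual block):

```python
import numpy as np

def add_image_conditioning(noise: np.ndarray, image_latents: np.ndarray):
    """Blend pure-noise latents [B, C, F, H, W] with encoded-image latents
    on frame 0, returning the mixed latents and a conditioning mask that
    is 1.0 on the conditioned first frame and 0.0 elsewhere."""
    b, _, f, h, w = noise.shape
    mask = np.zeros((b, 1, f, h, w), dtype=noise.dtype)
    mask[:, :, 0] = 1.0  # condition on the first frame only
    latents = image_latents * mask + noise * (1.0 - mask)
    return latents, mask

noise = np.ones((1, 4, 3, 2, 2), dtype=np.float32)
image_latents = np.zeros_like(noise)
latents, mask = add_image_conditioning(noise, image_latents)
print(latents[0, 0, 0].max(), latents[0, 0, 1].min())  # 0.0 1.0
```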

Contributor Author

yes that works. implemented and verified (A100, LTX 0.9.1):

I2V latents now chains LTXPrepareLatentsStep (pure noise) -> LTXImage2VideoPrepareLatentsStep (mix with image_latents & conditioning mask)

MAD vs steps (256x256, 9 frames):

  • steps= 4 I2V modular: 0.009282 I2V auto: 0.009273
  • steps=10 I2V modular: 0.021779 I2V auto: 0.021794
  • steps=30 I2V modular: 0.020742 I2V auto: 0.020739

Full quality (480x704, 161 frames, 30 steps):

  • T2V MAD (LTXBlocks vs standard): 0.025949
  • I2V MAD (LTXImage2VideoBlocks vs standard): 0.046978
  • T2V MAD (LTXAutoBlocks vs standard): 0.025949
  • I2V MAD (LTXAutoBlocks vs standard): 0.046980
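The MAD figures above are the mean absolute difference between the decoded frames of the two pipelines, i.e. something like the following (a sketch, not the exact script used for these numbers):

```python
import numpy as np

def mad(a, b) -> float:
    """Mean absolute difference between two decoded video arrays
    of identical shape, e.g. (1, F, H, W, 3) with values in [0, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    assert a.shape == b.shape, "outputs must have identical shapes"
    return float(np.abs(a - b).mean())

print(mad(np.zeros((1, 2, 2, 2, 3)), np.full((1, 2, 2, 2, 3), 0.5)))  # 0.5
```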


@akshan-main akshan-main requested a review from yiyixuxu April 10, 2026 05:53
@akshan-main akshan-main requested a review from yiyixuxu April 10, 2026 18:23
"""
Auto blocks for LTX Video that support both text-to-video and image-to-video workflows.

Supported workflows:
Collaborator

ok so interesting, did you manually write this docstring?
a _workflow_map is missing here, but the workflows are documented. what happens if you run?

blocks = LTXAutoBlocks()
blocks.available_workflows

see the doc on the workflow map https://huggingface.co/docs/diffusers/modular_diffusers/auto_pipeline_blocks#workflows

this is the new doc on auto docstrings https://huggingface.co/docs/diffusers/main/en/modular_diffusers/auto_docstring

you need to run this after placing the marker:

python utils/modular_auto_docstring.py --fix_and_overwrite

Contributor Author

yup, wrote it manually following the flux2 pattern but missed _workflow_map. Added it and ran the auto docstring tool now.

pipeline_class = LTXModularPipeline
pipeline_blocks_class = LTXAutoBlocks
pretrained_model_name_or_path = "akshan-main/tiny-ltx-modular-pipe"

Collaborator

Contributor Author

@akshan-main akshan-main Apr 10, 2026

I will add this since the test is failing, and the pachifier as well. Will ping you once I'm done. Meanwhile, you might want to take a look at PR #13440, a small thing I noticed

Collaborator

@yiyixuxu yiyixuxu left a comment

thanks @akshan-main, it's a pleasure working with you:)
will merge once CI is green
