
Conversation


@yaoqih commented on Nov 8, 2025

PR: Add LTXI2VLongMultiPromptPipeline (ComfyUI-parity long I2V with multi-prompt sliding windows)

What does this PR do?

  • Introduces a new pipeline LTXI2VLongMultiPromptPipeline providing long-duration image-to-video generation using temporal sliding windows with multi-prompt scheduling, first-frame hard conditioning, window-tail fusion, optional AdaIN normalization, and tiled VAE decoding.
  • Aligns behavior with ComfyUI looping sampler semantics while following Diffusers conventions (device/generator handling, progress bar, error handling, docstrings, and API).
  • Adds documentation and examples to the LTX-Video docs page to showcase long I2V workflows, seeding strategy, and multi-stage refinement.

Primary implementation and docs

  • New/updated implementation:
    • Class: LTXI2VLongMultiPromptPipeline
  • Docs:
    • Adds a “Long image-to-video, multi-prompt sliding windows (ComfyUI parity)” example: diffusers/docs/source/en/api/pipelines/ltx_video.md

Motivation and context

  • Long video generation is a common need for LTX-Video users. This pipeline provides a Diffusers-native implementation that mirrors ComfyUI behavior while maintaining Diffusers style and ergonomics.
  • The pipeline enables multi-prompt text scheduling across temporal windows, first-frame I2V hard conditioning via per-token mask, and smooth fusion across windows for consistent motion and content.

Key features and changes

  • Temporal sliding windows only (no spatial sharding during denoising); autoregressive fusion across windows.
  • Multi-prompt segmentation per window, with transitions handled at the head of each window (a sketch of the window layout and prompt scheduling follows this list).
  • First-frame hard conditioning via per-token mask when cond_image is provided.
  • Reference and “negative index” latent injection at window head; optional AdaIN cross-window normalization for color/contrast consistency.
  • Per-window timesteps reset; ability to skip steps by sigma threshold for speed.
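The sketch below illustrates how the window layout, per-window prompt assignment, and the first-frame conditioning mask could be derived. The helper names (plan_windows, assign_prompts, first_frame_token_mask), the rounding rules, and the example numbers (latent-frame counts assuming the usual 8x temporal compression of the LTX VAE) are illustrative only and are not taken from the pipeline's internals:

import torch

def plan_windows(num_latent_frames: int, tile_size: int, overlap: int):
    """Return (start, end) latent-frame ranges for overlapping temporal windows."""
    stride = tile_size - overlap
    starts = range(0, max(num_latent_frames - overlap, 1), stride)
    return [(s, min(s + tile_size, num_latent_frames)) for s in starts]

def assign_prompts(schedule: str, windows):
    """Split a '|'-separated schedule and spread the segments over the windows in order."""
    segments = [p.strip() for p in schedule.split("|")]
    return [segments[min(i * len(segments) // len(windows), len(segments) - 1)]
            for i in range(len(windows))]

def first_frame_token_mask(num_latent_frames: int, tokens_per_frame: int):
    """Per-token mask: 0 for the first latent frame (hard-conditioned), 1 elsewhere."""
    mask = torch.ones(num_latent_frames * tokens_per_frame)
    mask[:tokens_per_frame] = 0.0  # 0 => treated as clean image tokens, not re-noised
    return mask

# num_frames=361 -> ~46 latent frames; temporal_tile_size=120 -> ~15; temporal_overlap=32 -> ~4
windows = plan_windows(num_latent_frames=46, tile_size=15, overlap=4)
prompts = assign_prompts(
    "a chimpanzee walks in the jungle | a chimpanzee stops and eats a snack | a chimpanzee lays on the ground",
    windows,
)
for (start, end), prompt in zip(windows, prompts):
    print(f"latent frames [{start:>2}, {end:>2}) -> {prompt!r}")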

Usage example

  • Basic long I2V multi-prompt run:
    • See diffusers/docs/source/en/api/pipelines/ltx_video.md for a complete example including:
      • Multi-prompt schedule string split by “|”
      • cond_image for first-frame hard conditioning
      • Sliding window configuration (temporal_tile_size, temporal_overlap); the overlap fusion is sketched after this list
      • Returning latent-space video and tiled decoding via the VAE
      • Optional spatial latent upsampling followed by a short refinement pass with a compact sigma schedule
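As a companion to the configuration above, here is a minimal sketch of the window-tail fusion and optional AdaIN normalization ideas: the new window's latents are statistics-matched to the previous window and then crossfaded over the overlapping latent frames. The function names, tensor shapes, and blending weights are assumptions for illustration, not the pipeline's actual fusion code:

import torch

def adain(x: torch.Tensor, ref: torch.Tensor, factor: float = 0.25) -> torch.Tensor:
    """Shift x's per-channel mean/std toward ref's (channel dim assumed to be dim=1)."""
    dims = [d for d in range(x.ndim) if d != 1]
    x_mu, x_std = x.mean(dims, keepdim=True), x.std(dims, keepdim=True) + 1e-6
    r_mu, r_std = ref.mean(dims, keepdim=True), ref.std(dims, keepdim=True) + 1e-6
    normalized = (x - x_mu) / x_std * r_std + r_mu
    return torch.lerp(x, normalized, factor)

def fuse_windows(prev: torch.Tensor, new: torch.Tensor, overlap: int, adain_factor: float = 0.0):
    """Crossfade `new` onto the tail of `prev` over `overlap` latent frames (dim=2 is time)."""
    if adain_factor > 0:
        new = adain(new, prev, adain_factor)
    weights = torch.linspace(0, 1, overlap, device=prev.device).view(1, 1, overlap, 1, 1)
    blended = prev[:, :, -overlap:] * (1 - weights) + new[:, :, :overlap] * weights
    return torch.cat([prev[:, :, :-overlap], blended, new[:, :, overlap:]], dim=2)

prev = torch.randn(1, 128, 15, 16, 24)  # (B, C, T_latent, H_latent, W_latent), shapes illustrative
new = torch.randn(1, 128, 15, 16, 24)
out = fuse_windows(prev, new, overlap=4, adain_factor=0.25)
print(out.shape)  # torch.Size([1, 128, 26, 16, 24])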

Breaking changes

  • None. New pipeline and docs are additive.

Docs updated

  • Long I2V multi-prompt example and notes: diffusers/docs/source/en/api/pipelines/ltx_video.md

Before submitting

  • This PR improves docs and adds a new pipeline.
  • Read contributor guidelines & docstring formatting guidelines.
  • Updated documentation with examples and autodoc entries.
  • Added or planned tests (these can be part of a follow-up PR if needed).

Test Case

This test aims to verify the visual parity of the LTXI2VLongMultiPromptPipeline output against the ComfyUI LTXVideo plugin when configured with identical parameters.

1. Test Setup

  • Input Image: chimpanzee_l.jpg

  • Core Parameter Alignment:
    To ensure a fair comparison, the key parameters used in the code below were kept identical between the ComfyUI and Diffusers implementations. They are taken from the ltxv-13b-i2v-long-multi-prompt.json workflow.

2. Diffusers Implementation Code

import torch
from diffusers import LTXI2VLongMultiPromptPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.modeling_latent_upsampler import LTXLatentUpsamplerModel
from diffusers.utils import export_to_video
from PIL import Image

# Stage A: Long I2V with sliding windows and multi-prompt scheduling
pipe = LTXI2VLongMultiPromptPipeline.from_pretrained(
    "Lightricks/LTX-Video-0.9.8-13B-distilled",
    torch_dtype=torch.bfloat16
).to("cuda")

schedule = "a chimpanzee walks in the jungle |a chimpanzee stops and eats a snack |a chimpanzee lays on the ground"
cond_image = Image.open("chimpanzee_l.jpg").convert("RGB")

# -- Base long video generation --
latents = pipe(
    prompt=schedule,
    negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
    width=768,
    height=512,
    num_frames=361,
    temporal_tile_size=120,
    temporal_overlap=32,
    sigmas=[1.0000, 0.9937, 0.9875, 0.9812, 0.9750, 0.9094, 0.7250, 0.4219, 0.0],
    guidance_scale=1.0,
    cond_image=cond_image,
    adain_factor=0.25,
    output_type="latent",
).frames

# Decode with VAE tiling to save memory
video_pil_base = pipe.vae_decode_tiled(latents, decode_timestep=0.05, decode_noise_scale=0.025, output_type="pil")[0]
export_to_video(video_pil_base, "ltx_i2v_long_base_diffusers.mp4", fps=24)
print("Stage A: Base long video generated and saved.")

# Stage B (Optional): Spatial latent upsampling + short refinement
upsampler = LTXLatentUpsamplerModel.from_pretrained(
    "linoyts/LTX-Video-spatial-upscaler-0.9.8", subfolder="latent_upsampler", torch_dtype=torch.bfloat16
)
pipe_upsample = LTXLatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=upsampler).to(torch.bfloat16).to("cuda")

up_latents = pipe_upsample(
    latents=latents,
    adain_factor=1.0,
    tone_map_compression_ratio=0.6,
    output_type="latent"
).frames

# -- Load LoRA and perform refinement --
try:
    pipe.load_lora_weights(
        "Lightricks/LTX-Video-ICLoRA-detailer-13b-0.9.8",
        weight_name="ltxv-098-ic-lora-detailer-diffusers.safetensors",
        adapter_name="ic-detailer",
    )
    pipe.fuse_lora(components=["transformer"], lora_scale=1.0)
    print("[Info] IC-LoRA detailer adapter loaded and fused.")
  
    # Short refinement pass (distilled; low steps)
    frames_refined = pipe(
        negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
        width=768,        # width/height match latents and are not scaled
        height=512,
        num_frames=up_latents.shape[2],
        temporal_tile_size=80,
        temporal_overlap=24,
        adain_factor=0.0, # Disable AdaIN in refinement
        latents=up_latents,
        guidance_latents=up_latents,
        sigmas=[0.99, 0.9094, 0.0], # Short sigma schedule
        output_type="pil",
    ).frames[0]

    export_to_video(frames_refined, "ltx_i2v_long_refined_diffusers.mp4", fps=24)
    print("Stage B: Refined video generated and saved.")

except Exception as e:
    print(f"[Warn] Failed to load IC-LoRA or run refinement: {e}. Skipping the second refinement sampling.")

3. Results Comparison

Stage A: Base Long Video Generation
  • ComfyUI: ltxv-base_00001.1.mp4 (original video link)
  • Diffusers (This PR): ltx_i2v_long_base.mp4 (original video link)

Stage B: Upsampling & Refinement
  • ComfyUI: ltxv-ic-lora_00008_compressed.webm (original video link)
  • Diffusers (This PR): ltx_i2v_long_refined.mp4 (original video link)

4. Limitation

  • First-Frame Blurriness: Despite being hard-conditioned on the input image, the initial frame may exhibit minor blurring or a slight loss of sharpness. This appears to be a characteristic of the current model's behavior.
  • Minor Parity Differences: While this pipeline achieves strong parity in motion, content, and overall style, variations may still exist when compared to the ComfyUI reference output.

@yiyixuxu
Collaborator

hi @yaoqih, thanks for the PR! Can we try turning off the tiled decoding to see if the first frame is still blurry?
