Video-Sys: Distributed Video Generation That Actually Scales
I needed to train a video generation model on 4K footage across 8 GPUs. Single-GPU training would have taken months, and my attempts at distributed training kept failing with NCCL timeouts and memory errors. Video-Sys looked promising, but getting it to run stably across multiple nodes took some work.
Problem 1: NCCL Timeouts in Multi-Node Training
Training would run for 10-15 minutes, then crash with NCCL timeout errors. This happened across 2 nodes with 4 GPUs each. The error indicated communication failure between nodes.
Error: NCCL error: unhandled system error, connection reset by peer
What I Tried
Attempt 1: Increased NCCL timeout from 10 minutes to 30 minutes. This just made it wait longer before failing.
Attempt 2: Reduced batch size. This reduced throughput but didn't fix the timeouts.
Attempt 3: Used Ethernet instead of InfiniBand. This made timeouts happen even faster.
Actual Fix
The issue was NCCL's InfiniBand transport interacting badly with Video-Sys's gradient accumulation. I disabled IB transport for inter-node communication and used TCP pinned to a specific socket interface. I also switched from AllReduce to ReduceScatter for gradient synchronization.
# Set NCCL environment variables for stability
export NCCL_TIMEOUT=1800 # 30 minutes
export NCCL_IB_DISABLE=1 # Disable IB transport
export NCCL_SOCKET_IFNAME=eth0 # Use specific interface
export NCCL_DEBUG=INFO # Debug info
# Prefer the tree algorithm and simple protocol for multi-node stability
export NCCL_ALGO=Tree # Tree algorithm for multi-node
export NCCL_PROTO=Simple # Simple protocol, more stable
# Configure Video-Sys with stable distributed settings
import torch
from video_sys import VideoSysConfig, Trainer

config = VideoSysConfig(
    # Model settings
    model_name="stable-video-diffusion",
    resolution=(720, 1280),
    num_frames=120,
    # Distributed settings
    distributed=True,
    backend="nccl",
    # Use reduce-scatter for better multi-node performance
    gradient_reduction="reduce_scatter",
    # Enable gradient checkpointing to save memory
    gradient_checkpointing=True,
    # Training settings
    batch_size_per_gpu=2,
    gradient_accumulation_steps=8,
    num_workers=4,
    # Mixed precision
    mixed_precision="bf16",
    loss_scale="dynamic"
)
trainer = Trainer(config)
# Launch with torchrun
# torchrun --nproc_per_node=4 --nnodes=2 train.py
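With `--nnodes=2`, each node also needs rendezvous information that the one-line comment above omits. A sketch of the full two-node launch, assuming node 0's address is `10.0.0.1` (substitute your actual master address and a free port):

```shell
# On node 0 (the master):
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 train.py

# On node 1:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 \
    --master_addr=10.0.0.1 --master_port=29500 train.py
```

The NCCL environment variables above must be exported on both nodes before launching, or they won't reach the worker processes.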
Problem 2: OOM When Generating Long Videos
When generating 120-frame videos at 1080p, GPU memory would overflow around frame 60. The model would crash with OOM errors even on 80GB A100 cards.
What I Tried
Attempt 1: Reduced resolution to 720p. This worked but quality was unacceptable.
Attempt 2: Enabled CPU offloading. This made generation 10x slower.
Attempt 3: Split generation into 3 segments. This created visible seams at segment boundaries.
Actual Fix
Used Video-Sys's chunked inference with temporal sliding windows. Generate in chunks of 24 frames with 8-frame overlap, then blend the overlaps. This keeps memory usage constant while maintaining temporal coherence.
# Chunked inference for long videos
from video_sys import VideoGenerator
generator = VideoGenerator.from_pretrained("your-model")
# Configure chunked inference
output = generator.generate(
    prompt="A person walking in the city",
    num_frames=120,
    height=1080,
    width=1920,
    # Chunked inference settings
    chunk_size=24,        # Process 24 frames at a time
    overlap=8,            # 8-frame overlap for smooth blending
    blend_mode="linear",  # Smooth blending between chunks
    # Memory optimization
    use_tiled_vae=True,   # Tiled VAE encoding/decoding
    vae_tile_size=512,
    enable_attention_slicing=True,
    # Quality settings
    guidance_scale=7.5,
    num_inference_steps=50
)
# With chunk_size=24 and overlap=8 the stride is 16 frames, so 120 frames take 7 chunks:
# Chunk 1: frames 0-23 (frames 16-23 blended with chunk 2)
# Chunk 2: frames 16-39 (frames 32-39 blended with chunk 3)
# ... and so on, up to chunk 7: frames 96-119
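The chunk schedule and blend weights are easy to sanity-check in plain Python. This is my own illustration, not Video-Sys internals; the function names are invented:

```python
def chunk_starts(num_frames, chunk_size, overlap):
    """Start frame of each chunk (stride = chunk_size - overlap).

    Assumes (num_frames - chunk_size) is divisible by the stride,
    as it is for 120 frames with chunk_size=24 and overlap=8.
    """
    stride = chunk_size - overlap
    return list(range(0, num_frames - chunk_size + 1, stride))

def linear_blend_weights(overlap):
    """Weight given to the newer chunk at each overlapped frame.

    The previous chunk's frame gets (1 - w), so the transition
    ramps smoothly from the old chunk to the new one.
    """
    return [(i + 1) / (overlap + 1) for i in range(overlap)]

print(chunk_starts(120, 24, 8))  # [0, 16, 32, 48, 64, 80, 96] -> 7 chunks
print(linear_blend_weights(8))   # 8 weights ramping from ~0.11 to ~0.89
```

Running this is how I confirmed the 7-chunk count: with a 16-frame stride the last chunk must start at frame 96 to end exactly at frame 120.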
Problem 3: Poor Multi-GPU Inference Scaling
With 8 GPUs, inference was only 3x faster than single GPU. Expected near-linear scaling, but got diminishing returns after 3 GPUs.
What I Tried
Attempt 1: Increased batch size. Hit memory limits.
Attempt 2: Used DeepSpeed ZeRO-3. This had too much overhead for inference.
Actual Fix
Switched to tensor parallelism for the attention layers instead of data parallelism: the attention heads are split across GPUs, so all eight GPUs cooperate on the same video frames rather than each holding a full model replica. Also used Flash Attention 2 for faster attention computation.
# Optimized multi-GPU inference
import torch

from video_sys import VideoGenerator
from video_sys.parallel import TensorParallelInference

# Use tensor parallelism instead of data parallelism
generator = VideoGenerator.from_pretrained(
    "your-model",
    torch_dtype=torch.float16
)

# Convert to tensor parallel
tp_generator = TensorParallelInference(
    model=generator,
    num_gpus=8,
    # Split attention heads across GPUs
    parallel_attention=True,
    # Use Flash Attention 2
    use_flash_attention_2=True,
    # Overlap computation with communication
    overlap_comm=True
)

# Now all 8 GPUs work on the same video
# Each GPU handles a different slice of the attention heads
output = tp_generator.generate(
    prompt="A person walking",
    num_frames=120,
    height=1080,
    width=1920,
    batch_size=1  # One video across all GPUs
)
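For intuition on the head split, here is a dependency-free sketch of how attention heads can be partitioned into contiguous per-rank slices. This is my own illustration of the general technique, not the `TensorParallelInference` internals:

```python
def heads_for_rank(num_heads, num_gpus, rank):
    """Contiguous slice of attention heads owned by one rank.

    Any remainder is spread across the first (num_heads % num_gpus)
    ranks, so slice sizes differ by at most one head.
    """
    base, rem = divmod(num_heads, num_gpus)
    start = rank * base + min(rank, rem)
    count = base + (1 if rank < rem else 0)
    return list(range(start, start + count))

# 16 attention heads over 8 GPUs -> 2 contiguous heads per rank
for rank in range(8):
    print(rank, heads_for_rank(16, 8, rank))
```

Each rank computes attention only for its own heads; the per-rank outputs are then concatenated (an all-gather in practice), which is why head counts that divide evenly by the GPU count scale best.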
What I Learned
- NCCL needs specific tuning: Default settings work for single-node but fail multi-node. Disable IB transport and use TCP with Tree algorithm for stability.
- Reduce-scatter > AllReduce for video: Video models have large gradient tensors. Reduce-scatter is significantly more memory efficient.
- Chunked inference is essential for long videos: Can't fit 120+ frames in GPU memory. Chunking with overlap blending maintains quality while keeping memory usage constant.
- Tensor parallelism for inference: Data parallelism wastes GPU memory. Tensor parallelism (splitting attention heads) scales much better for video generation.
- Gradient checkpointing is worth the slowdown: Enables 2-3x larger batch sizes, which improves overall throughput despite roughly 10% slower per-batch time.
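The reduce-scatter point can be illustrated without NCCL: with AllReduce every rank ends up holding the full reduced gradient, while ReduceScatter leaves each rank with only its 1/N shard. A pure-Python simulation of the two collectives (assuming equal-length gradients and a world size that divides the length):

```python
def all_reduce(grads_per_rank):
    """Every rank receives the full elementwise sum."""
    summed = [sum(vals) for vals in zip(*grads_per_rank)]
    return [summed[:] for _ in grads_per_rank]

def reduce_scatter(grads_per_rank):
    """Rank r receives only the r-th shard of the elementwise sum."""
    world = len(grads_per_rank)
    summed = [sum(vals) for vals in zip(*grads_per_rank)]
    shard = len(summed) // world
    return [summed[r * shard:(r + 1) * shard] for r in range(world)]

grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # 2 ranks, 4-element gradient
print(all_reduce(grads))      # each rank: [11, 22, 33, 44]
print(reduce_scatter(grads))  # rank 0: [11, 22]; rank 1: [33, 44]
```

After reduce-scatter each rank stores 1/N of the gradient instead of a full copy, which is where the memory savings for large video-model gradients come from; optimizer state can be sharded the same way.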
Production Setup
Complete distributed training and inference setup for video generation.
# Install Video-Sys with distributed support
git clone https://github.com/mu-sz/Video-Sys.git
cd Video-Sys
pip install -e ".[distributed]"
# Install Flash Attention 2 for faster attention
pip install flash-attn --no-build-isolation
# Install DeepSpeed for ZeRO optimization
pip install deepspeed
Production training script:
#!/usr/bin/env python
"""
Distributed video generation training with Video-Sys.
Run with: torchrun --nproc_per_node=4 train_distributed.py
"""
import torch
from video_sys import VideoSysConfig, Trainer, VideoDataset
from video_sys.optimizers import AdamW8bit


def main():
    # Configuration
    config = VideoSysConfig(
        # Model
        model_name="stable-video-diffusion-xt",
        resolution=(720, 1280),
        num_frames=120,
        # Distributed
        distributed=True,
        backend="nccl",
        gradient_reduction="reduce_scatter",
        find_unused_parameters=False,
        # Memory optimization
        gradient_checkpointing=True,
        mixed_precision="bf16",
        loss_scale="dynamic",
        # Training
        batch_size_per_gpu=2,
        gradient_accumulation_steps=8,
        num_workers=4,
        pin_memory=True,
        # Optimizer
        optimizer="adamw_8bit",
        learning_rate=1e-5,
        weight_decay=0.01,
        lr_scheduler="cosine",
        warmup_steps=1000,
        # Logging
        log_every_n_steps=10,
        save_every_n_steps=500,
        checkpointing=True
    )

    # Create trainer
    trainer = Trainer(config)

    # Load dataset
    train_dataset = VideoDataset(
        video_dir="/path/to/videos",
        resolution=(720, 1280),
        num_frames=120,
        augmentation=True
    )

    # Train
    trainer.fit(
        train_dataset=train_dataset,
        val_dataset=None,  # Or provide a validation dataset
        num_epochs=100,
        resume_from_checkpoint=None  # Or a path to a checkpoint
    )


if __name__ == "__main__":
    main()
Production inference script:
#!/usr/bin/env python
"""
Multi-GPU inference with Video-Sys.
Run with: torchrun --nproc_per_node=8 infer.py
"""
import os

import torch
from video_sys import VideoGenerator
from video_sys.parallel import TensorParallelInference


def generate_video_batch(
    prompts: list[str],
    output_dir: str,
    num_gpus: int = 8
):
    """
    Generate multiple videos in parallel across GPUs.
    """
    # Load model on each GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    generator = VideoGenerator.from_pretrained(
        "your-finetuned-model",
        torch_dtype=torch.float16,
        device=f"cuda:{local_rank}"
    )

    # Convert to tensor parallel
    tp_generator = TensorParallelInference(
        model=generator,
        num_gpus=num_gpus,
        use_flash_attention_2=True
    )

    # Generate videos, sharding prompts round-robin across ranks
    for i, prompt in enumerate(prompts):
        if i % num_gpus == local_rank:
            output = tp_generator.generate(
                prompt=prompt,
                num_frames=120,
                height=1080,
                width=1920,
                chunk_size=24,
                overlap=8,
                guidance_scale=7.5,
                num_inference_steps=50
            )
            # Save
            output_path = f"{output_dir}/video_{i}.mp4"
            tp_generator.save_video(output, output_path)
            print(f"GPU {local_rank}: Generated {output_path}")


if __name__ == "__main__":
    prompts = [
        "A person walking through a rainy city",
        "Drone shot of mountains at sunset",
        "Close-up of a flower blooming",
        # ... more prompts
    ]
    generate_video_batch(prompts, output_dir="./output")
Monitoring & Debugging
Essential metrics and debugging for distributed video generation.
Red Flags to Watch For
- NCCL timeout errors: Usually a network issue. Check the NCCL transport settings (IB vs. TCP) and inter-node connectivity.
- GPU memory > 90%: Will OOM soon. Reduce batch_size or enable gradient checkpointing.
- GPU utilization < 60%: CPU-bound or I/O-bound. Increase num_workers or use faster storage.
- Gradient norm > 10.0: Training is unstable. Reduce learning rate or enable gradient clipping.
- Loss not decreasing: Check learning rate, data quality, or model initialization.
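For the gradient-norm red flag, the standard mitigation is global-norm clipping (in PyTorch, `torch.nn.utils.clip_grad_norm_`). A dependency-free sketch of the same math, using nested lists in place of parameter tensors:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    `grads` is a list of per-parameter gradient lists; all of them are
    scaled by the same factor, which preserves the gradient direction.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in grads]

grads = [[3.0, 4.0], [0.0, 12.0]]          # global norm = sqrt(9+16+144) = 13
clipped = clip_by_global_norm(grads, 1.0)  # rescaled to global norm 1.0
```

Because every gradient is multiplied by the same factor, clipping caps the step size without changing the update direction, which is why it tames spikes without derailing training.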
Debug Commands
# Monitor GPU utilization
watch -n 1 nvidia-smi
# Check NCCL communication
NCCL_DEBUG=INFO python train.py 2>&1 | grep NCCL
# Profile training
torchrun --nproc_per_node=4 \
pytorch_profiler.py \
--output_dir ./profiler_logs
# Check distributed setup
python -m torch.distributed.run \
--nproc_per_node=4 \
test_distributed.py