Video-Sys: Distributed Video Generation That Actually Scales
I needed to train a video generation model on 4K footage across 8 GPUs. Single-GPU training would have taken months, and my attempts at distributed training kept failing with NCCL timeouts and memory errors. Video-Sys looked promising, but getting it to run stably across multiple nodes took some work.
Problem 1: NCCL Timeouts in Multi-Node Training
Training would run for 10-15 minutes, then crash with NCCL timeout errors. This happened across 2 nodes with 4 GPUs each. The error indicated communication failure between nodes.
Error: NCCL error: unhandled system error, connection reset by peer
What I Tried
Attempt 1: Increased NCCL timeout from 10 minutes to 30 minutes. This just made it wait longer before failing.
Attempt 2: Reduced batch size. This reduced throughput but didn't fix the timeouts.
Attempt 3: Used Ethernet instead of InfiniBand. This made timeouts happen even faster.
Actual Fix
The issue was NCCL's InfiniBand transport interacting badly with Video-Sys's gradient accumulation. I disabled IB transport for inter-node communication and used TCP pinned to a specific socket interface. I also switched from AllReduce to ReduceScatter for gradient synchronization.
# Set NCCL environment variables for stability
export NCCL_TIMEOUT=1800 # 30 minutes
export NCCL_IB_DISABLE=1 # Disable IB transport
export NCCL_SOCKET_IFNAME=eth0 # Use specific interface
export NCCL_DEBUG=INFO # Debug info
# Prefer the tree algorithm and simple protocol for multi-node stability
export NCCL_ALGO=Tree # Tree algorithm for multi-node
export NCCL_PROTO=Simple # Simple protocol, more stable
# Configure Video-Sys with stable distributed settings
import torch
from video_sys import VideoSysConfig, Trainer

config = VideoSysConfig(
    # Model settings
    model_name="stable-video-diffusion",
    resolution=(720, 1280),
    num_frames=120,
    # Distributed settings
    distributed=True,
    backend="nccl",
    # Use reduce-scatter for better multi-node performance
    gradient_reduction="reduce_scatter",
    # Enable gradient checkpointing to save memory
    gradient_checkpointing=True,
    # Training settings
    batch_size_per_gpu=2,
    gradient_accumulation_steps=8,
    num_workers=4,
    # Mixed precision
    mixed_precision="bf16",
    loss_scale="dynamic"
)
trainer = Trainer(config)
# Launch with torchrun
# torchrun --nproc_per_node=4 --nnodes=2 train.py
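With `--nnodes=2`, each node also needs rendezvous information that the one-line comment above omits. A sketch of the full two-node launch, assuming node 0's address is `10.0.0.1` (substitute your actual master address and a free port):

```shell
# On node 0 (the master):
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.0.1 --master_port=29500 train.py

# On node 1:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 \
    --master_addr=10.0.0.1 --master_port=29500 train.py
```

The NCCL environment variables above must be exported on both nodes before launching, or they won't reach the worker processes.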
Problem 2: OOM When Generating Long Videos
When generating 120-frame videos at 1080p, GPU memory would overflow around frame 60. The model would crash with OOM errors even on 80GB A100 cards.
What I Tried
Attempt 1: Reduced resolution to 720p. This worked but quality was unacceptable.
Attempt 2: Enabled CPU offloading. This made generation 10x slower.
Attempt 3: Split generation into 3 segments. This created visible seams at segment boundaries.
Actual Fix
Used Video-Sys's chunked inference with temporal sliding windows. Generate in chunks of 24 frames with 8-frame overlap, then blend the overlaps. This keeps memory usage constant while maintaining temporal coherence.
# Chunked inference for long videos
from video_sys import VideoGenerator
generator = VideoGenerator.from_pretrained("your-model")
# Configure chunked inference
output = generator.generate(
    prompt="A person walking in the city",
    num_frames=120,
    height=1080,
    width=1920,
    # Chunked inference settings
    chunk_size=24,        # Process 24 frames at a time
    overlap=8,            # 8-frame overlap for smooth blending
    blend_mode="linear",  # Smooth blending between chunks
    # Memory optimization
    use_tiled_vae=True,   # Tiled VAE encoding/decoding
    vae_tile_size=512,
    enable_attention_slicing=True,
    # Quality settings
    guidance_scale=7.5,
    num_inference_steps=50
)
# With chunk_size=24 and overlap=8 the stride is 16 frames, so 120 frames take 7 chunks:
# Chunk 1: frames 0-23 (frames 16-23 blended with chunk 2)
# Chunk 2: frames 16-39 (frames 32-39 blended with chunk 3)
# ... and so on, up to chunk 7: frames 96-119
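The chunk schedule and blend weights are easy to sanity-check in plain Python. This is my own illustration, not Video-Sys internals; the function names are invented:

```python
def chunk_starts(num_frames, chunk_size, overlap):
    """Start frame of each chunk (stride = chunk_size - overlap).

    Assumes (num_frames - chunk_size) is divisible by the stride,
    as it is for 120 frames with chunk_size=24 and overlap=8.
    """
    stride = chunk_size - overlap
    return list(range(0, num_frames - chunk_size + 1, stride))

def linear_blend_weights(overlap):
    """Weight given to the newer chunk at each overlapped frame.

    The previous chunk's frame gets (1 - w), so the transition
    ramps smoothly from the old chunk to the new one.
    """
    return [(i + 1) / (overlap + 1) for i in range(overlap)]

print(chunk_starts(120, 24, 8))  # [0, 16, 32, 48, 64, 80, 96] -> 7 chunks
print(linear_blend_weights(8))   # 8 weights ramping from ~0.11 to ~0.89
```

Running this is how I confirmed the 7-chunk count: with a 16-frame stride the last chunk must start at frame 96 to end exactly at frame 120.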
Problem 3: Poor Multi-GPU Inference Scaling
With 8 GPUs, inference was only 3x faster than single GPU. Expected near-linear scaling, but got diminishing returns after 3 GPUs.
What I Tried
Attempt 1: Increased batch size. Hit memory limits.
Attempt 2: Used DeepSpeed ZeRO-3. This had too much overhead for inference.
Actual Fix
Switched to tensor parallelism for the attention layers instead of data parallelism: the attention heads are split across GPUs, so all eight GPUs cooperate on the same video frames rather than each holding a full model replica. Also used Flash Attention 2 for faster attention computation.
# Optimized multi-GPU inference
import torch

from video_sys import VideoGenerator
from video_sys.parallel import TensorParallelInference

# Use tensor parallelism instead of data parallelism
generator = VideoGenerator.from_pretrained(
    "your-model",
    torch_dtype=torch.float16
)

# Convert to tensor parallel
tp_generator = TensorParallelInference(
    model=generator,
    num_gpus=8,
    # Split attention heads across GPUs
    parallel_attention=True,
    # Use Flash Attention 2
    use_flash_attention_2=True,
    # Overlap computation with communication
    overlap_comm=True
)

# Now all 8 GPUs work on the same video
# Each GPU handles a different slice of the attention heads
output = tp_generator.generate(
    prompt="A person walking",
    num_frames=120,
    height=1080,
    width=1920,
    batch_size=1  # One video across all GPUs
)
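For intuition on the head split, here is a dependency-free sketch of how attention heads can be partitioned into contiguous per-rank slices. This is my own illustration of the general technique, not the `TensorParallelInference` internals:

```python
def heads_for_rank(num_heads, num_gpus, rank):
    """Contiguous slice of attention heads owned by one rank.

    Any remainder is spread across the first (num_heads % num_gpus)
    ranks, so slice sizes differ by at most one head.
    """
    base, rem = divmod(num_heads, num_gpus)
    start = rank * base + min(rank, rem)
    count = base + (1 if rank < rem else 0)
    return list(range(start, start + count))

# 16 attention heads over 8 GPUs -> 2 contiguous heads per rank
for rank in range(8):
    print(rank, heads_for_rank(16, 8, rank))
```

Each rank computes attention only for its own heads; the per-rank outputs are then concatenated (an all-gather in practice), which is why head counts that divide evenly by the GPU count scale best.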
What I Learned
- NCCL needs specific tuning: Default settings work for single-node but fail multi-node. Disable IB transport and use TCP with Tree algorithm for stability.
- Reduce-scatter > AllReduce for video: Video models have large gradient tensors. Reduce-scatter is significantly more memory efficient.
- Chunked inference is essential for long videos: Can't fit 120+ frames in GPU memory. Chunking with overlap blending maintains quality while keeping memory usage constant.
- Tensor parallelism for inference: Data parallelism wastes GPU memory. Tensor parallelism (splitting attention heads) scales much better for video generation.
- Gradient checkpointing is worth the slowdown: Enables 2-3x larger batch sizes, which improves overall throughput despite roughly 10% slower per-batch time.
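The reduce-scatter point can be illustrated without NCCL: with AllReduce every rank ends up holding the full reduced gradient, while ReduceScatter leaves each rank with only its 1/N shard. A pure-Python simulation of the two collectives (assuming equal-length gradients and a world size that divides the length):

```python
def all_reduce(grads_per_rank):
    """Every rank receives the full elementwise sum."""
    summed = [sum(vals) for vals in zip(*grads_per_rank)]
    return [summed[:] for _ in grads_per_rank]

def reduce_scatter(grads_per_rank):
    """Rank r receives only the r-th shard of the elementwise sum."""
    world = len(grads_per_rank)
    summed = [sum(vals) for vals in zip(*grads_per_rank)]
    shard = len(summed) // world
    return [summed[r * shard:(r + 1) * shard] for r in range(world)]

grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # 2 ranks, 4-element gradient
print(all_reduce(grads))      # each rank: [11, 22, 33, 44]
print(reduce_scatter(grads))  # rank 0: [11, 22]; rank 1: [33, 44]
```

After reduce-scatter each rank stores 1/N of the gradient instead of a full copy, which is where the memory savings for large video-model gradients come from; optimizer state can be sharded the same way.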
Production Setup
Complete distributed training and inference setup for video generation.
# Install Video-Sys with distributed support
git clone https://github.com/mu-sz/Video-Sys.git
cd Video-Sys
pip install -e ".[distributed]"
# Install Flash Attention 2 for faster attention
pip install flash-attn --no-build-isolation
# Install DeepSpeed for ZeRO optimization
pip install deepspeed
Production training script:
#!/usr/bin/env python
"""
Distributed video generation training with Video-Sys.
Run with: torchrun --nproc_per_node=4 train_distributed.py
"""
import torch
from video_sys import VideoSysConfig, Trainer, VideoDataset
from video_sys.optimizers import AdamW8bit


def main():
    # Configuration
    config = VideoSysConfig(
        # Model
        model_name="stable-video-diffusion-xt",
        resolution=(720, 1280),
        num_frames=120,
        # Distributed
        distributed=True,
        backend="nccl",
        gradient_reduction="reduce_scatter",
        find_unused_parameters=False,
        # Memory optimization
        gradient_checkpointing=True,
        mixed_precision="bf16",
        loss_scale="dynamic",
        # Training
        batch_size_per_gpu=2,
        gradient_accumulation_steps=8,
        num_workers=4,
        pin_memory=True,
        # Optimizer
        optimizer="adamw_8bit",
        learning_rate=1e-5,
        weight_decay=0.01,
        lr_scheduler="cosine",
        warmup_steps=1000,
        # Logging
        log_every_n_steps=10,
        save_every_n_steps=500,
        checkpointing=True
    )

    # Create trainer
    trainer = Trainer(config)

    # Load dataset
    train_dataset = VideoDataset(
        video_dir="/path/to/videos",
        resolution=(720, 1280),
        num_frames=120,
        augmentation=True
    )

    # Train
    trainer.fit(
        train_dataset=train_dataset,
        val_dataset=None,  # Or provide a validation dataset
        num_epochs=100,
        resume_from_checkpoint=None  # Or a path to a checkpoint
    )


if __name__ == "__main__":
    main()
Production inference script:
#!/usr/bin/env python
"""
Multi-GPU inference with Video-Sys.
Run with: torchrun --nproc_per_node=8 infer.py
"""
import os

import torch
from video_sys import VideoGenerator
from video_sys.parallel import TensorParallelInference


def generate_video_batch(
    prompts: list[str],
    output_dir: str,
    num_gpus: int = 8
):
    """
    Generate multiple videos in parallel across GPUs.
    """
    # Load model on each GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    generator = VideoGenerator.from_pretrained(
        "your-finetuned-model",
        torch_dtype=torch.float16,
        device=f"cuda:{local_rank}"
    )

    # Convert to tensor parallel
    tp_generator = TensorParallelInference(
        model=generator,
        num_gpus=num_gpus,
        use_flash_attention_2=True
    )

    # Generate videos, sharding prompts round-robin across ranks
    for i, prompt in enumerate(prompts):
        if i % num_gpus == local_rank:
            output = tp_generator.generate(
                prompt=prompt,
                num_frames=120,
                height=1080,
                width=1920,
                chunk_size=24,
                overlap=8,
                guidance_scale=7.5,
                num_inference_steps=50
            )
            # Save
            output_path = f"{output_dir}/video_{i}.mp4"
            tp_generator.save_video(output, output_path)
            print(f"GPU {local_rank}: Generated {output_path}")


if __name__ == "__main__":
    prompts = [
        "A person walking through a rainy city",
        "Drone shot of mountains at sunset",
        "Close-up of a flower blooming",
        # ... more prompts
    ]
    generate_video_batch(prompts, output_dir="./output")
Monitoring & Debugging
Essential metrics and debugging for distributed video generation.
Red Flags to Watch For
- NCCL timeout errors: Usually a network issue. Check the NCCL transport settings (IB vs. TCP) and inter-node connectivity.
- GPU memory > 90%: Will OOM soon. Reduce batch_size or enable gradient checkpointing.
- GPU utilization < 60%: CPU-bound or I/O-bound. Increase num_workers or use faster storage.
- Gradient norm > 10.0: Training is unstable. Reduce learning rate or enable gradient clipping.
- Loss not decreasing: Check learning rate, data quality, or model initialization.
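For the gradient-norm red flag, the standard mitigation is global-norm clipping (in PyTorch, `torch.nn.utils.clip_grad_norm_`). A dependency-free sketch of the same math, using nested lists in place of parameter tensors:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    `grads` is a list of per-parameter gradient lists; all of them are
    scaled by the same factor, which preserves the gradient direction.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in grads]

grads = [[3.0, 4.0], [0.0, 12.0]]          # global norm = sqrt(9+16+144) = 13
clipped = clip_by_global_norm(grads, 1.0)  # rescaled to global norm 1.0
```

Because every gradient is multiplied by the same factor, clipping caps the step size without changing the update direction, which is why it tames spikes without derailing training.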
Debug Commands
# Monitor GPU utilization
watch -n 1 nvidia-smi
# Check NCCL communication
NCCL_DEBUG=INFO python train.py 2>&1 | grep NCCL
# Profile training
torchrun --nproc_per_node=4 \
pytorch_profiler.py \
--output_dir ./profiler_logs
# Check distributed setup
python -m torch.distributed.run \
--nproc_per_node=4 \
test_distributed.py