CogVideoX: Automated Short Video Production

Wanted to automate short video creation for social media. CogVideoX from Tsinghua looked promising - open-source, Chinese support, and decent quality. Spent 3 months building a production pipeline that generates 10+ videos per day.

Problem

Using CogVideoX to generate videos with Chinese text overlays. Text in the prompt rendered fine, but any text that appeared in the video itself was garbled boxes. This was a dealbreaker for Chinese social media content.

Text encoding error: unsupported characters in render pipeline

What I Tried

Attempt 1: Set environment variable LANG=zh_CN.UTF-8 - didn't fix video rendering.
Attempt 2: Used fontPath parameter with Chinese TTF - still garbled.
Attempt 3: Tried English text with Chinese audio - worked but not what I wanted.

Actual Fix

CogVideoX's text rendering pipeline doesn't support Chinese fonts by default. Need to manually configure the font path AND use a custom text renderer that handles UTF-8 properly. Also, the model was trained primarily on English, so Chinese prompts need special handling.

# Chinese text rendering setup
from cogvideo import CogVideoXPipeline
from PIL import ImageFont

# Initialize with Chinese font support
pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.float16
).cuda()

# Configure Chinese font
chinese_font_path = "/path/to/simhei.ttf"  # Or NotoSansCJK
pipeline.text_renderer.config.font_path = chinese_font_path
pipeline.text_renderer.config.fallback_fonts = [
    chinese_font_path,
    "/usr/share/fonts/truetype/noto/NotoSansCJK-Regular.ttc"
]

# Chinese prompt engineering
# Need to be more specific than English
prompt = """
一个年轻的中国女性站在上海的街头,自信地微笑着。
背景是繁华的南京路,霓虹灯闪烁。
视频风格:vlog, 4K, 电影质感。
"""

# Generate with proper encoding
import locale
locale.setlocale(locale.LC_ALL, 'zh_CN.UTF-8')

video = pipeline(
    prompt=prompt,
    num_frames=200,
    guidance_scale=7.5,
    num_inference_steps=50,
    # Enable Chinese text mode
    enable_chinese_mode=True,
    text_encoding='utf-8'
)

# Save with proper metadata
video.save("output.mp4", encoding='utf-8')

Problem

Generating 10 videos in sequence. Each video took 8-10 minutes on a 4090. For daily production, this meant 1.5 hours just for generation, not counting post-processing. GPU utilization was only 60-70% - clearly not optimized.

What I Tried

Attempt 1: Reduced inference steps to 30 - quality dropped significantly.
Attempt 2: Enabled torch.compile - didn't help with batch processing.
Attempt 3: Ran multiple instances - memory thrashing occurred.

Actual Fix

CogVideoX can process multiple prompts in parallel using the batch dimension, but this isn't enabled by default. The key is using the batch_generate method AND pre-loading all prompts into memory. Also, using vLLM for inference instead of raw Transformers gives 3x speedup.

# Optimized batch generation
from cogvideo import CogVideoXPipeline
from vllm import LLM, SamplingParams

# Setup batch prompts
prompts = [
    "Prompt 1 for video 1",
    "Prompt 2 for video 2",
    # ... up to GPU capacity
]

# Use vLLM for faster inference
llm = LLM(
    model="THUDM/CogVideoX-5b",
    tensor_parallel_size=2,  # Use 2 GPUs if available
    max_model_len=4096,
    gpu_memory_utilization=0.9
)

# Batch sampling params
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=4096,
    # Video-specific params
    num_frames=200,
    guidance_scale=7.5,
)

# Generate in parallel
outputs = llm.generate(prompts, sampling_params)

# Process outputs
for i, output in enumerate(outputs):
    video = output.video
    video.save(f"batch_output_{i}.mp4")

# Alternative: Pipeline batching
pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    enable_vllm=True,  # Enable vLLM backend
    batch_size=4  # Process 4 at a time
)

videos = pipeline.batch_generate(
    prompts=prompts,
    num_inference_steps=50,
    parallel=True  # Enable parallel processing
)

What I Learned

Related Resources