MimicMotion: Video Motion Transfer That Doesn't Break Character Consistency

I needed to make static product photos dance for a social media campaign. The reference dance video looked great, but every time I tried to transfer the motion, the character's face would warp or disappear halfway through. Here's how I got MimicMotion to maintain identity throughout the video.

Problem

I was running motion transfer on a 120-frame dance video. The first 3 seconds looked perfect, but then the character's face started morphing. By frame 80, it was a completely different person. The pose transfer worked fine, but identity preservation failed.

Error: RuntimeWarning: Identity embedding strength dropped below 0.3 at frame 78
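That warning suggests the pipeline tracks how closely each generated frame matches the source identity. I assume the "identity embedding strength" is a cosine similarity between face embeddings; here is a minimal sketch of that check (the function name and embedding source are my own, not MimicMotion's):

```python
import numpy as np

def identity_strength(frame_emb, source_emb):
    """Cosine similarity between a generated frame's face embedding
    and the source image's embedding. If the warning above really is
    this metric, it fires when the value drops below 0.3."""
    a = np.asarray(frame_emb, dtype=np.float64)
    b = np.asarray(source_emb, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Plotting this per frame makes the drift visible long before the face looks wrong to the eye.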

What I Tried

Attempt 1: Increased the identity loss weight from 1.0 to 5.0. This made the character rigid: the face stopped morphing, but so did the expressions, leaving a dead-eyed stare throughout the video.
Attempt 2: Split the video into 4 shorter segments and ran them separately. This created visible jumps at the segment boundaries where the face would snap back to the original appearance.
Attempt 3: Used face landmarks to force consistency. This caused the model to fail on frames where the dancer turned their head (which was about 40% of the video).

Actual Fix

The solution was using temporal windowing with identity re-injection. Instead of processing the whole video at once, I used a sliding window approach where every 8 frames, the model re-encodes the identity from the source image. This keeps identity fresh without breaking temporal coherence.

# Fixed identity preservation with temporal windows
import torch
from mimicmotion.pipeline import MimicMotionPipeline

# Load pipeline in half precision on the GPU
pipe = MimicMotionPipeline.from_pretrained(
    "tencent/mimicmotion",
    torch_dtype=torch.float16
).to("cuda")

# Configure for long-form video with identity re-injection
pipe.scheduler.set_timesteps(50)
identity_injection_interval = 8  # Re-inject identity every 8 frames

# Process with sliding window
output = pipe(
    image="source_character.jpg",
    video="dance_reference.mp4",
    num_frames=120,
    guidance_scale=7.5,
    identity_weight=2.0,  # Moderate weight, not too high
    temporal_window=16,  # Process 16 frames at a time
    identity_reinjection=True,
    reinjection_interval=identity_injection_interval
).videos[0]
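To make the windowing concrete, here is a toy scheduling helper (my own sketch, not part of the MimicMotion API) assuming each window slides by half its length, so adjacent windows overlap and identity is re-encoded on a fixed cadence:

```python
def plan_windows(num_frames, window=16, stride=8, reinject_every=8):
    """Plan sliding temporal windows and identity re-injection points.

    Consecutive windows overlap by (window - stride) frames, which is
    what carries temporal coherence across window boundaries; identity
    is re-encoded from the source image every `reinject_every` frames.
    """
    windows = [(start, min(start + window, num_frames))
               for start in range(0, num_frames, stride)]
    reinjection_frames = list(range(0, num_frames, reinject_every))
    return windows, reinjection_frames
```

For the 120-frame run above this yields 15 overlapping windows, with a fresh identity encoding at frames 0, 8, 16, and so on, which is why the face never gets a chance to drift for long.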

Problem

The generated video had visible jitter between frames. Arms and legs would slightly "vibrate" even though the source motion was smooth. This made the output look AI-generated and low quality.

What I Tried

Attempt 1: Applied post-processing video smoothing. This fixed some jitter but created motion blur that made fast movements look muddy.
Attempt 2: Increased inference steps from 20 to 50. This improved quality but made each video take 12 minutes to generate.
Attempt 3: Lowered the temperature parameter. This reduced jitter but also made the motion less dynamic - the dance looked stiff.
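For context on why Attempt 1 smeared fast movements: a post-hoc smoother is essentially an exponential moving average over frames, so every output frame carries a tail of previous frames. A minimal sketch (my own illustration, not MimicMotion code):

```python
import numpy as np

def ema_smooth(frames, alpha=0.6):
    """Exponential moving average over a (T, H, W, C) frame stack.
    Lower alpha blends in more history, which suppresses jitter but
    smears anything that moves quickly between frames."""
    out = np.empty_like(frames, dtype=np.float64)
    out[0] = frames[0]
    for t in range(1, len(frames)):
        out[t] = alpha * frames[t] + (1 - alpha) * out[t - 1]
    return out
```

A one-frame flash gets spread across the following frames instead of staying sharp, which is exactly the "muddy" look on fast dance moves.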

Actual Fix

The real issue was that the temporal attention window was too small. Increasing it and adding a motion consistency loss eliminated the jitter. I also used a two-pass approach: first generate at lower resolution, then refine at full resolution with temporal smoothing.

# Two-pass generation with motion consistency
from mimicmotion.utils import motion_smooth_loss  # applied internally when motion_consistency=True

# First pass: generate at half resolution
first_pass = pipe(
    image="source.jpg",
    video="reference.mp4",
    num_frames=120,
    height=360,  # Half resolution for speed
    width=640,
    motion_consistency=True,
    temporal_attention_window=24  # Increased from 16
).videos[0]

# Second pass: refine at full resolution
final_output = pipe.refine(
    video=first_pass,
    original_image="source.jpg",
    motion_smoothness=0.85,  # Add motion smoothing
    preserve_fine_details=True
)
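I haven't read the internals of motion_smooth_loss, but the usual form of a motion consistency loss penalizes frame-to-frame acceleration rather than raw frame differences, so constant-velocity motion costs nothing while jitter is expensive. A minimal sketch of that idea (my own, under that assumption):

```python
import numpy as np

def motion_consistency_loss(frames):
    """Mean squared second-order temporal difference over a
    (T, H, W, C) frame stack. Smooth motion (constant velocity)
    scores zero; frame-to-frame vibration scores high."""
    vel = np.diff(frames, axis=0)   # per-frame motion (velocity)
    acc = np.diff(vel, axis=0)      # change in motion = jitter
    return float(np.mean(acc ** 2))
```

This is also a handy offline metric: run it on a generated clip and on the reference clip, and a large gap points at generation jitter rather than jerky source motion.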

Problem

During hand gestures, fingers would either merge together or disappear entirely. This was especially noticeable in dance moves where hands are raised and visible.

What I Tried

Attempt 1: Added hand-specific conditioning using MediaPipe landmarks. This caused the model to focus too much on hands and ignore the rest of the pose.
Attempt 2: Increased resolution to 1024x1024. Ran out of GPU memory on a 24GB VRAM card.

Actual Fix

Used the "hand refinement" checkpoint that MimicMotion includes specifically for this issue. It's a smaller model that runs only on detected hand regions.

# Hand-aware generation
output = pipe(
    image="source.jpg",
    video="reference.mp4",
    use_hand_refinement=True,
    hand_refinement_model="mimicmotion/hand-checkpoint",
    hand_detection_threshold=0.8,
    # This runs a second pass just on hand regions
    refine_hands=True
)
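For anyone curious what "runs only on detected hand regions" means in practice: given normalized hand landmarks (e.g. from MediaPipe's hand detector), you can derive a padded crop box per hand and refine just that patch. A hypothetical helper, not part of MimicMotion:

```python
def hand_bbox(landmarks, img_w, img_h, pad=0.15):
    """Compute a padded pixel bounding box from normalized (x, y)
    hand landmarks, suitable as the crop region for a second
    refinement pass. `pad` expands the box by a fraction of its
    size on each side, clamped to the image bounds."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    x0, x1 = min(xs) * img_w, max(xs) * img_w
    y0, y1 = min(ys) * img_h, max(ys) * img_h
    dx, dy = (x1 - x0) * pad, (y1 - y0) * pad
    return (max(0, int(x0 - dx)), max(0, int(y0 - dy)),
            min(img_w, int(x1 + dx)), min(img_h, int(y1 + dy)))
```

Refining only these crops is why the hand checkpoint stays cheap: the second pass touches a few small patches instead of re-running the full frame.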

What I Learned

- Identity drifts over long sequences: periodic re-injection from the source image works better than cranking the identity loss weight, which freezes expressions.
- Jitter is a generation-time problem: widen the temporal attention window and enable motion consistency instead of smoothing after the fact, which blurs fast motion.
- Hands need their own pass: the dedicated hand refinement checkpoint fixes merged or missing fingers without distorting the rest of the pose.

Production Setup

Complete configuration for generating consistent, high-quality dance videos in production.

# Install MimicMotion
git clone https://github.com/Tencent/MimicMotion.git
cd MimicMotion
pip install -e .

# Install additional dependencies
pip install mediapipe opencv-python torchvision
pip install accelerate transformers

# Download models
python download_models.py --all

Production inference script:

import torch
from mimicmotion import MimicMotionPipeline
from pathlib import Path

def generate_motion_transfer(
    source_image: str,
    reference_video: str,
    output_path: str,
    num_frames: int = 120,
    fps: int = 24
):
    """
    Production-ready motion transfer with all fixes applied.
    """
    # Load pipeline with optimizations
    pipe = MimicMotionPipeline.from_pretrained(
        "tencent/mimicmotion",
        torch_dtype=torch.float16,
        variant="fp16"
    ).to("cuda")

    # Enable memory optimizations
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_slicing()

    # Generate with optimal settings
    output = pipe(
        image=source_image,
        video=reference_video,
        num_frames=num_frames,
        fps=fps,
        guidance_scale=7.5,
        num_inference_steps=30,
        height=720,
        width=1280,
        # Identity preservation
        identity_weight=2.0,
        identity_reinjection=True,
        reinjection_interval=8,
        # Temporal coherence
        temporal_window=24,
        motion_consistency=True,
        # Hand refinement
        use_hand_refinement=True,
        refine_hands=True
    ).videos[0]

    # Save output
    output_video = Path(output_path)
    pipe.save_video(output, str(output_video))
    return str(output_video)

# Usage
generate_motion_transfer(
    source_image="product_photo.jpg",
    reference_video="dance_reference.mp4",
    output_path="output_dance.mp4"
)

Monitoring & Debugging

Watch these metrics during generation to catch issues early.

Red Flags to Watch For

- Identity embedding strength sinking toward 0.3 - the face is about to start morphing.
- GPU memory pinned near the card's limit - resolutions above 720p pushed my 24GB card into OOM.
- Visible limb jitter in early preview frames - the temporal attention window is probably too small.

Debug Commands

# Check GPU utilization during generation
nvidia-smi -l 1

# Monitor identity preservation in real-time
python mimicmotion/utils/monitor_identity.py \
    --input output_dance.mp4 \
    --source product_photo.jpg

# Batch process with logging
python batch_generate.py \
    --input_dir ./images \
    --reference dance_ref.mp4 \
    --log_file generation.log \
    --save_intermediate

Related Resources