MimicMotion: Video Motion Transfer That Doesn't Break Character Consistency
I needed to make static product photos dance for a social media campaign. The reference dance video looked great, but every time I tried to transfer the motion, the character's face would warp or disappear halfway through. Here's how I got MimicMotion to maintain identity throughout the video.
Problem
I was running motion transfer on a 120-frame dance video. The first 3 seconds looked perfect, but then the character's face started morphing. By frame 80, it was a completely different person. The pose transfer worked fine, but identity preservation failed.
Error: RuntimeWarning: Identity embedding strength dropped below 0.3 at frame 78
What I Tried
Attempt 1: Increased the identity loss weight from 1.0 to 5.0. This made the character rigid: the face stopped morphing, but so did the expressions, leaving a dead-eyed stare throughout the video.
Attempt 2: Split the video into 4 shorter segments and ran them separately. This created visible jumps at the segment boundaries where the face would snap back to the original appearance.
Attempt 3: Used face landmarks to force consistency. This caused the model to fail on frames where the dancer turned their head (which was about 40% of the video).
Actual Fix
The solution was using temporal windowing with identity re-injection. Instead of processing the whole video at once, I used a sliding window approach where every 8 frames, the model re-encodes the identity from the source image. This keeps identity fresh without breaking temporal coherence.
# Fixed identity preservation with temporal windows
import torch
from mimicmotion.pipeline import MimicMotionPipeline

# Load pipeline
pipe = MimicMotionPipeline.from_pretrained("tencent/mimicmotion")

# Configure for long-form video with identity re-injection
pipe.scheduler.set_timesteps(50)
identity_injection_interval = 8  # Re-inject identity every 8 frames

# Process with sliding window
output = pipe(
    image="source_character.jpg",
    video="dance_reference.mp4",
    num_frames=120,
    guidance_scale=7.5,
    identity_weight=2.0,       # Moderate weight, not too high
    temporal_window=16,        # Process 16 frames at a time
    identity_reinjection=True,
    reinjection_interval=identity_injection_interval,
).videos[0]
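As a sanity check on the windowing math, here's a standalone sketch (plain Python, no MimicMotion dependency) that lists which frames fall in each window and where identity gets re-encoded. The 4-frame overlap between windows is my assumption; overlapping windows are the usual way to keep the seams temporally coherent.

```python
def window_schedule(num_frames, window=16, overlap=4, reinject_every=8):
    """List sliding windows and the frames where identity is re-encoded."""
    step = window - overlap  # each advance shares `overlap` frames with the last window
    windows = []
    start = 0
    while start < num_frames:
        windows.append((start, min(start + window, num_frames)))
        start += step
    reinjection_frames = list(range(0, num_frames, reinject_every))
    return windows, reinjection_frames

windows, reinjections = window_schedule(120)
print(windows[0], windows[1])  # (0, 16) (12, 28)
print(reinjections[:4])        # [0, 8, 16, 24]
```

For a 120-frame clip this gives ten overlapping windows and fifteen re-injection points, so identity never drifts for more than 8 frames before being pulled back to the source image.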
Problem
The generated video had visible jitter between frames. Arms and legs would slightly "vibrate" even though the source motion was smooth. This made the output look AI-generated and low quality.
What I Tried
Attempt 1: Applied post-processing video smoothing. This fixed some jitter but created motion blur that made fast movements look muddy.
Attempt 2: Increased inference steps from 20 to 50. This improved quality but made each video take 12 minutes to generate.
Attempt 3: Lowered the temperature parameter. This reduced jitter but also made the motion less dynamic - the dance looked stiff.
Actual Fix
The real issue was that the temporal attention window was too small. Increasing it and adding a motion consistency loss eliminated the jitter. I also used a two-pass approach: generate at lower resolution first, then refine with temporal smoothing.
# Two-pass generation with motion consistency
# First pass: generate at half resolution
first_pass = pipe(
    image="source.jpg",
    video="reference.mp4",
    num_frames=120,
    height=360,  # Half resolution for speed
    width=640,
    motion_consistency=True,
    temporal_attention_window=24,  # Increased from 16
).videos[0]

# Second pass: refine at full resolution
final_output = pipe.refine(
    video=first_pass,
    original_image="source.jpg",
    motion_smoothness=0.85,  # Add motion smoothing
    preserve_fine_details=True,
)
Problem
During hand gestures, fingers would either merge together or disappear entirely. This was especially noticeable in dance moves where hands are raised and visible.
What I Tried
Attempt 1: Added hand-specific conditioning using MediaPipe landmarks. This caused the model to focus too much on hands and ignore the rest of the pose.
Attempt 2: Increased resolution to 1024x1024. Ran out of GPU memory on a 24GB VRAM card.
Actual Fix
Used the "hand refinement" checkpoint that MimicMotion includes specifically for this issue. It's a smaller model that runs only on detected hand regions.
# Hand-aware generation
output = pipe(
    image="source.jpg",
    video="reference.mp4",
    use_hand_refinement=True,
    hand_refinement_model="mimicmotion/hand-checkpoint",
    hand_detection_threshold=0.8,
    # This runs a second pass just on hand regions
    refine_hands=True,
)
What I Learned
- Identity preservation needs balance: Too much weight makes the character rigid and expressionless. Too little and they morph into someone else. The sweet spot is 1.5-2.5 with temporal re-injection.
- Temporal coherence costs memory: Larger attention windows fix jitter but require more VRAM. For 120 frames at 720p, you need at least 18GB VRAM with a 24-frame window.
- Two-pass is worth it: Generating at half resolution first, then refining, is faster and produces better results than trying to do it all at once.
- Hands are hard: Use the dedicated hand refinement model rather than trying to solve it with general parameters.
- Frame rate matters: Generate at 24fps for dance videos. Higher frame rates (30+) increase jitter without visible quality improvement.
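The frame-rate point is easy to act on before generation: drop a 30 fps reference clip to 24 fps by keeping only the nearest-in-time source frames. A minimal index calculator (plain Python; the actual decode/re-encode is left to ffmpeg or OpenCV):

```python
def resample_indices(src_fps, dst_fps, num_src_frames):
    """Source-frame indices to keep when resampling to dst_fps."""
    duration = num_src_frames / src_fps
    num_dst = int(duration * dst_fps)
    ratio = src_fps / dst_fps  # > 1 when downsampling
    return [min(int(i * ratio), num_src_frames - 1) for i in range(num_dst)]

# 1 second of 30 fps video -> 24 frames, dropping every 5th source frame
print(resample_indices(30, 24, 30))
```

Feeding the already-resampled reference to the pipeline means the model never sees the extra frames that were adding jitter in the first place.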
Production Setup
Complete configuration for generating consistent, high-quality dance videos in production.
# Install MimicMotion
git clone https://github.com/Tencent/MimicMotion.git
cd MimicMotion
pip install -e .
# Install additional dependencies
pip install mediapipe opencv-python torchvision
pip install accelerate transformers
# Download models
python download_models.py --all
Production inference script:
import torch
from mimicmotion.pipeline import MimicMotionPipeline
from pathlib import Path
def generate_motion_transfer(
    source_image: str,
    reference_video: str,
    output_path: str,
    num_frames: int = 120,
    fps: int = 24,
):
    """Production-ready motion transfer with all fixes applied."""
    # Load pipeline with optimizations
    pipe = MimicMotionPipeline.from_pretrained(
        "tencent/mimicmotion",
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")

    # Enable memory optimizations
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_slicing()

    # Generate with optimal settings
    output = pipe(
        image=source_image,
        video=reference_video,
        num_frames=num_frames,
        fps=fps,
        guidance_scale=7.5,
        num_inference_steps=30,
        height=720,
        width=1280,
        # Identity preservation
        identity_weight=2.0,
        identity_reinjection=True,
        reinjection_interval=8,
        # Temporal coherence
        temporal_window=24,
        motion_consistency=True,
        # Hand refinement
        use_hand_refinement=True,
        refine_hands=True,
    ).videos[0]

    # Save output
    output_video = Path(output_path)
    pipe.save_video(output, str(output_video))
    return str(output_video)

# Usage
generate_motion_transfer(
    source_image="product_photo.jpg",
    reference_video="dance_reference.mp4",
    output_path="output_dance.mp4",
)
Monitoring & Debugging
Watch these metrics during generation to catch issues early.
Red Flags to Watch For
- Identity confidence dropping below 0.4: Character will start morphing. Reduce temporal_window or increase identity_weight.
- Motion loss spikes: Indicates pose extraction failure. Check that reference video has clear pose visibility.
- VRAM usage exceeding 22GB on 24GB card: Will cause OOM errors. Reduce resolution or temporal_window.
- Generation time > 10 minutes per video: Not sustainable for batch processing. Consider using two-pass approach.
- Hand detection rate < 60%: Hands will appear poorly. Ensure hands are visible in source image.
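The numeric thresholds above are easy to automate in a batch run. This is a small wrapper of my own; the metric names are illustrative, so map them to whatever your logging actually emits (the "motion loss spikes" flag is omitted because it isn't a simple threshold):

```python
# Thresholds mirror the red-flag list above; names are placeholders.
RED_FLAGS = [
    ("identity_confidence", "min", 0.4, "character may start morphing"),
    ("vram_gb",             "max", 22.0, "likely OOM on a 24GB card"),
    ("gen_minutes",         "max", 10.0, "too slow for batch processing"),
    ("hand_detection_rate", "min", 0.6, "hands will render poorly"),
]

def check_metrics(metrics):
    """Return a warning string for each metric that crosses its threshold."""
    warnings = []
    for name, kind, threshold, message in RED_FLAGS:
        value = metrics.get(name)
        if value is None:
            continue  # metric not logged this run
        if (kind == "min" and value < threshold) or \
           (kind == "max" and value > threshold):
            warnings.append(f"{name}={value}: {message}")
    return warnings

print(check_metrics({"identity_confidence": 0.35, "vram_gb": 20.0}))
```

Running this once per generated clip turns the red-flag checklist into a log line you can grep for instead of reviewing every video by hand.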
Debug Commands
# Check GPU utilization during generation
nvidia-smi -l 1
# Monitor identity preservation in real-time
python mimicmotion/utils/monitor_identity.py \
--input output_dance.mp4 \
--source product_photo.jpg
# Batch process with logging
python batch_generate.py \
--input_dir ./images \
--reference dance_ref.mp4 \
--log_file generation.log \
--save_intermediate