Fish-Speech 1.5+: Multilingual Voice Cloning
I needed voice cloning that could handle Chinese and English seamlessly. I tried Coqui, YourTTS, and Mimic 3; Fish-Speech 1.5+ outperformed them all. It cloned my voice from 30 seconds of audio, and the language switching is nearly indistinguishable from a real speaker.
Problem
I fed it mixed Chinese-English text, and language detection was inconsistent: Chinese text was often read with English pronunciation, or vice versa. That made the output unusable for bilingual content.
Detected language: 'en' (Confidence: 0.3) - actual text was Chinese
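To see why a single whole-string label produces logs like the one above, here is a toy script-ratio detector. This is not Fish-Speech's actual detector; `detect` is a made-up helper purely for illustration:

```python
# Toy script-ratio language guesser. NOT Fish-Speech's detector; detect()
# is an illustrative helper only.

def detect(text: str) -> tuple[str, float]:
    """Guess 'zh' or 'en' from the fraction of CJK characters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "en", 0.0
    cjk = sum(1 for c in letters if "\u4e00" <= c <= "\u9fff")
    ratio = cjk / len(letters)
    return ("zh", ratio) if ratio >= 0.5 else ("en", 1.0 - ratio)

# One label gets applied to the entire string, so the minority-language
# words are guaranteed to be mispronounced:
print(detect("Hello 今天 is a beautiful day."))  # mostly Latin -> 'en'
print(detect("今天我们要讨论 topic."))            # mostly CJK -> 'zh'
```

Whichever language "wins" the ratio, the other language's words are read with the wrong pronunciation, which matches the behavior I saw.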
What I Tried
Attempt 1: Manually specified language tags in text - tedious and error-prone.
Attempt 2: Split text by language and processed separately - lost continuity.
Attempt 3: Adjusted language detection threshold - still inconsistent.
Actual Fix
Fish-Speech 1.5+ has an improved language detector, but it needs to be explicitly enabled for mixed content. The fix involves using the "auto" language mode AND providing context hints. Also, training with bilingual audio samples helps significantly.
# Mixed-language configuration
from fish_speech import FishSpeechModel

model = FishSpeechModel.from_pretrained(
    "fish-speech-1.5",
    device="cuda",
)

# Configure mixed-language mode
model.config.language_mode = "auto"  # not "zh" or "en"
model.config.language_detection = {
    "enabled": True,
    "min_confidence": 0.4,  # lower threshold catches more switches
    "context_window": 50,   # characters of context per decision
}

# Provide language hints (optional, but helps)
text = """
[EN] Hello, welcome to the show.
[ZH] 今天我们要讨论一个非常重要的话题。
[EN] Let's dive right in.
"""

# Or use automatic detection with hints
result = model.synthesize(
    text="Hello 今天 is a beautiful day. Let's begin 现在开始.",
    language_hints=["en", "zh"],  # suggest both languages
    enable_code_switching=True,   # allow mid-sentence switching
    preserve_accent=True,         # keep the original accent
)

# For best results, fine-tune on bilingual audio
model.fine_tune(
    audio_files=["bilingual_sample.wav"],
    epochs=10,
    learning_rate=1e-5,
)
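Typing the `[EN]`/`[ZH]` tags by hand was what made Attempt 1 so tedious, but they can be generated automatically. Below is a stdlib sketch that emits the same tagged-line format shown above; `tag_runs` is my own helper, not part of the Fish-Speech API, and the regex only distinguishes CJK from everything else:

```python
# Auto-generate [EN]/[ZH] hint tags from plain mixed text. tag_runs() is
# my own helper, not a Fish-Speech API; it splits on contiguous runs of
# CJK characters (plus common CJK punctuation) vs. everything else.
import re

CJK_RUN = re.compile(r"[\u4e00-\u9fff][\u4e00-\u9fff\u3000-\u303f，。！？]*")

def tag_runs(text: str) -> str:
    out, pos = [], 0
    for m in CJK_RUN.finditer(text):
        before = text[pos:m.start()].strip()
        if before:
            out.append(f"[EN] {before}")
        out.append(f"[ZH] {m.group().strip()}")
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        out.append(f"[EN] {tail}")
    return "\n".join(out)

print(tag_runs("Hello, welcome. 今天我们要讨论一个话题。 Let's dive in."))
```

This only handles the two-script case; anything non-CJK is assumed to be English, which was fine for my content.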
Problem
I recorded 30 seconds of my voice for cloning. The generated speech had the same pitch and rhythm, but the timbre was off: it sounded like a different person, and a side-by-side comparison made that obvious.
What I Tried
Attempt 1: Recorded 3 minutes of audio instead - slightly better, still not great.
Attempt 2: Used studio-quality recording - minimal improvement.
Attempt 3: Adjusted timbre preservation parameter - made voice sound robotic.
Actual Fix
The reference audio quality matters, but more importantly, it needs diversity in phonemes and emotional range. 30 seconds of just reading a script doesn't capture the full voice characteristics. Fish-Speech 1.5+ has a "diversity sampling" feature that helps.
# Improved voice cloning
from fish_speech import VoiceCloner

cloner = VoiceCloner(
    model_name="fish-speech-1.5",
    diversity_sampling=True,  # key feature in 1.5+
)

# Record diverse audio samples.
# Instead of 30 seconds of continuous reading, use:
samples = [
    "reading_script.wav",    # neutral speech
    "conversational.wav",    # natural talking
    "emotional_happy.wav",   # different emotions
    "emotional_serious.wav",
    "whispering.wav",        # different intensity
    "loud_projection.wav",
]

# Clone with diversity
voice_model = cloner.clone(
    audio_samples=samples,
    config={
        "preserve_timbre": 0.9,  # high timbre preservation
        "capture_prosody": True,
        "emotion_range": "full",
        "min_samples": 5,        # need variety
    },
)

# Test the clone
result = voice_model.synthesize(
    text="This is a test of my cloned voice.",
    reference_style="conversational",  # match the recording style
)
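Before running a clone, it can be worth sanity-checking that the reference takes actually differ in intensity, since a set of six near-identical readings defeats the point of diversity sampling. A minimal stdlib sketch, assuming 16-bit mono WAV files; the 2x RMS-spread threshold and the `diverse_enough` helper are my own rules of thumb, not Fish-Speech features:

```python
# Pre-flight check: do the reference samples vary in loudness at all?
# Pure stdlib. The 2x RMS-spread threshold is an arbitrary rule of thumb,
# not a Fish-Speech requirement. Assumes 16-bit mono PCM WAV input.
import math
import struct
import wave

def rms(path: str) -> float:
    """Root-mean-square level of a 16-bit mono WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def diverse_enough(paths: list[str], spread: float = 2.0) -> bool:
    """True if the loudest sample is at least `spread` times the quietest."""
    levels = sorted(rms(p) for p in paths)
    return levels[0] > 0 and levels[-1] / levels[0] >= spread

# Demo with synthetic tones standing in for whisper/normal/loud takes:
for name, amp in [("quiet.wav", 1000), ("normal.wav", 8000), ("loud.wav", 20000)]:
    with wave.open(name, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(16000)
        w.writeframes(b"".join(
            struct.pack("<h", int(amp * math.sin(2 * math.pi * 220 * t / 16000)))
            for t in range(16000)))

print(diverse_enough(["quiet.wav", "normal.wav", "loud.wav"]))  # True
```

A whisper take next to a projected take passes this easily; six takes of flat script reading will not, which is roughly the failure mode I hit with my original 30-second recording.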
What I Learned
- Lesson 1: Mixed-language text needs explicit auto mode - the default language selection is poor.
- Lesson 2: Voice cloning needs diverse samples - 30 seconds of one speaking style isn't enough.
- Lesson 3: Fish-Speech 1.5+ is the best multilingual TTS I've used, well ahead of Coqui.
- Overall: For multilingual voice cloning, Fish-Speech is unmatched in 2026.