Fish-Speech 1.5+: Multilingual Voice Cloning
I needed voice cloning that could handle Chinese and English seamlessly. I tried Coqui, YourTTS, and Mimic 3; Fish-Speech 1.5+ outperformed them all. It cloned my voice from 30 seconds of audio, and the language switching is nearly indistinguishable from a real speaker.
Problem
I fed it mixed Chinese-English text, and language detection was inconsistent: Chinese text was often read with English pronunciation, or vice versa. That made the output unusable for bilingual content.
Detected language: 'en' (Confidence: 0.3) - actual text was Chinese
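To see why a single whole-string label produces logs like the one above, here is a toy script-ratio detector. This is not Fish-Speech's actual detector; `detect` is a made-up helper purely for illustration:

```python
# Toy script-ratio language guesser. NOT Fish-Speech's detector; detect()
# is an illustrative helper only.

def detect(text: str) -> tuple[str, float]:
    """Guess 'zh' or 'en' from the fraction of CJK characters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "en", 0.0
    cjk = sum(1 for c in letters if "\u4e00" <= c <= "\u9fff")
    ratio = cjk / len(letters)
    return ("zh", ratio) if ratio >= 0.5 else ("en", 1.0 - ratio)

# One label gets applied to the entire string, so the minority-language
# words are guaranteed to be mispronounced:
print(detect("Hello 今天 is a beautiful day."))  # mostly Latin -> 'en'
print(detect("今天我们要讨论 topic."))            # mostly CJK -> 'zh'
```

Whichever language "wins" the ratio, the other language's words are read with the wrong pronunciation, which matches the behavior I saw.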
What I Tried
Attempt 1: Manually specified language tags in text - tedious and error-prone.
Attempt 2: Split text by language and processed separately - lost continuity.
Attempt 3: Adjusted language detection threshold - still inconsistent.
Actual Fix
Fish-Speech 1.5+ has an improved language detector, but it needs to be explicitly enabled for mixed content. The fix involves using the "auto" language mode AND providing context hints. Also, training with bilingual audio samples helps significantly.
# Mixed-language configuration
from fish_speech import FishSpeechModel

model = FishSpeechModel.from_pretrained(
    "fish-speech-1.5",
    device="cuda",
)

# Configure mixed-language mode
model.config.language_mode = "auto"  # not "zh" or "en"
model.config.language_detection = {
    "enabled": True,
    "min_confidence": 0.4,  # lower threshold catches more switches
    "context_window": 50,   # characters of context per decision
}

# Provide language hints (optional, but helps)
text = """
[EN] Hello, welcome to the show.
[ZH] 今天我们要讨论一个非常重要的话题。
[EN] Let's dive right in.
"""

# Or use automatic detection with hints
result = model.synthesize(
    text="Hello 今天 is a beautiful day. Let's begin 现在开始.",
    language_hints=["en", "zh"],  # suggest both languages
    enable_code_switching=True,   # allow mid-sentence switching
    preserve_accent=True,         # keep the original accent
)

# For best results, fine-tune on bilingual audio
model.fine_tune(
    audio_files=["bilingual_sample.wav"],
    epochs=10,
    learning_rate=1e-5,
)
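Typing the `[EN]`/`[ZH]` tags by hand was what made Attempt 1 so tedious, but they can be generated automatically. Below is a stdlib sketch that emits the same tagged-line format shown above; `tag_runs` is my own helper, not part of the Fish-Speech API, and the regex only distinguishes CJK from everything else:

```python
# Auto-generate [EN]/[ZH] hint tags from plain mixed text. tag_runs() is
# my own helper, not a Fish-Speech API; it splits on contiguous runs of
# CJK characters (plus common CJK punctuation) vs. everything else.
import re

CJK_RUN = re.compile(r"[\u4e00-\u9fff][\u4e00-\u9fff\u3000-\u303f，。！？]*")

def tag_runs(text: str) -> str:
    out, pos = [], 0
    for m in CJK_RUN.finditer(text):
        before = text[pos:m.start()].strip()
        if before:
            out.append(f"[EN] {before}")
        out.append(f"[ZH] {m.group().strip()}")
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        out.append(f"[EN] {tail}")
    return "\n".join(out)

print(tag_runs("Hello, welcome. 今天我们要讨论一个话题。 Let's dive in."))
```

This only handles the two-script case; anything non-CJK is assumed to be English, which was fine for my content.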
Problem
I recorded 30 seconds of my voice for cloning. The generated speech had the same pitch and rhythm, but the timbre was off: it sounded like a different person, and a side-by-side comparison made that obvious.
What I Tried
Attempt 1: Recorded 3 minutes of audio instead - slightly better, still not great.
Attempt 2: Used studio-quality recording - minimal improvement.
Attempt 3: Adjusted timbre preservation parameter - made voice sound robotic.
Actual Fix
The reference audio quality matters, but more importantly, it needs diversity in phonemes and emotional range. 30 seconds of just reading a script doesn't capture the full voice characteristics. Fish-Speech 1.5+ has a "diversity sampling" feature that helps.
# Improved voice cloning
from fish_speech import VoiceCloner

cloner = VoiceCloner(
    model_name="fish-speech-1.5",
    diversity_sampling=True,  # key feature in 1.5+
)

# Record diverse audio samples.
# Instead of 30 seconds of continuous reading, use:
samples = [
    "reading_script.wav",    # neutral speech
    "conversational.wav",    # natural talking
    "emotional_happy.wav",   # different emotions
    "emotional_serious.wav",
    "whispering.wav",        # different intensity
    "loud_projection.wav",
]

# Clone with diversity
voice_model = cloner.clone(
    audio_samples=samples,
    config={
        "preserve_timbre": 0.9,  # high timbre preservation
        "capture_prosody": True,
        "emotion_range": "full",
        "min_samples": 5,        # need variety
    },
)

# Test the clone
result = voice_model.synthesize(
    text="This is a test of my cloned voice.",
    reference_style="conversational",  # match the recording style
)
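Before running a clone, it can be worth sanity-checking that the reference takes actually differ in intensity, since a set of six near-identical readings defeats the point of diversity sampling. A minimal stdlib sketch, assuming 16-bit mono WAV files; the 2x RMS-spread threshold and the `diverse_enough` helper are my own rules of thumb, not Fish-Speech features:

```python
# Pre-flight check: do the reference samples vary in loudness at all?
# Pure stdlib. The 2x RMS-spread threshold is an arbitrary rule of thumb,
# not a Fish-Speech requirement. Assumes 16-bit mono PCM WAV input.
import math
import struct
import wave

def rms(path: str) -> float:
    """Root-mean-square level of a 16-bit mono WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def diverse_enough(paths: list[str], spread: float = 2.0) -> bool:
    """True if the loudest sample is at least `spread` times the quietest."""
    levels = sorted(rms(p) for p in paths)
    return levels[0] > 0 and levels[-1] / levels[0] >= spread

# Demo with synthetic tones standing in for whisper/normal/loud takes:
for name, amp in [("quiet.wav", 1000), ("normal.wav", 8000), ("loud.wav", 20000)]:
    with wave.open(name, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(16000)
        w.writeframes(b"".join(
            struct.pack("<h", int(amp * math.sin(2 * math.pi * 220 * t / 16000)))
            for t in range(16000)))

print(diverse_enough(["quiet.wav", "normal.wav", "loud.wav"]))  # True
```

A whisper take next to a projected take passes this easily; six takes of flat script reading will not, which is roughly the failure mode I hit with my original 30-second recording.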
What I Learned
- Lesson 1: Mixed-language text needs explicit auto mode - the default language selection is poor.
- Lesson 2: Voice cloning needs diverse samples - 30 seconds of one speaking style isn't enough.
- Lesson 3: Fish-Speech 1.5+ is the best multilingual TTS I've used, well ahead of Coqui.
- Overall: For multilingual voice cloning, Fish-Speech is unmatched in 2026.