GPT-SoVITS: Clone Any Voice with 5 Seconds of Audio

March 2024 · AI Voice

Started making YouTube videos last month. Needed voiceovers but didn't want to record every time. Looked into hiring voice actors - $50 to $200 per video. That's not sustainable for a small channel.

Tried free TTS tools - sounded robotic. Paid services like ElevenLabs were better but $22/month adds up. Then someone mentioned GPT-SoVITS in a forum. Open-source, free, and the quality is insane.

The key insight

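You only need about 5 seconds of someone's voice. Extract audio from any YouTube video, feed the clip to the model, and make that person say anything you type. The project README calls this zero-shot TTS and says fine-tuning on about 1 minute of audio improves similarity further. To my ears the quality rivals paid commercial tools.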

Problem is the setup is brutal. Official docs are in Chinese, dependencies everywhere, 10GB+ downloads, confusing environment variables. Spent two days getting it working.

So here's what I wish I had: a simple setup process with two paths. Either a one-click Windows portable version, or a cloud option if you don't have a GPU.

What you need

Hardware requirements:

  • NVIDIA GPU with 6GB+ VRAM (GTX 1060 or better)
  • 16GB RAM recommended (8GB minimum)
  • 15GB free storage space
  • Windows 10/11 (Linux works but more complex)

No GPU? Use the cloud option below instead.
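
Before downloading anything, it's worth confirming the drive has room. A minimal Python sketch, assuming the 15GB figure from the list above and checking whichever drive the current folder is on. The function names are my own, not part of GPT-SoVITS:

```python
import shutil

REQUIRED_FREE_GB = 15  # storage figure from the requirements list


def free_space_gb(path="."):
    """Return free disk space at `path` in GB (1 GB = 1024**3 bytes)."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_space(path=".", required_gb=REQUIRED_FREE_GB):
    """True if the drive holding `path` has at least `required_gb` free."""
    return free_space_gb(path) >= required_gb


print(f"Free space: {free_space_gb():.1f} GB, enough: {has_enough_space()}")
```

Pass the folder you plan to extract into as `path` if it lives on a different drive.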

Option A: Windows portable version (recommended)

This is what worked for me. Someone made a portable version with everything pre-configured:

Step 1: Download

Search Google for "GPT-SoVITS portable version" or check community releases. Look for a 7z file around 8GB named something like "GPT-SoVITS_vX.X.X.7z"

Size: ~8GB compressed
Contains: Everything pre-configured

Step 2: Extract and run

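Extract to a folder with a short path (no spaces or special characters), then double-click "go-web.bat". In the official integrated package the launcher is named "go-webui.bat", so look for that if your download doesn't have the shorter name.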

Extract → Open folder → Double-click go-web.bat
Browser opens at: http://localhost:9872

Step 3: Prepare audio sample

Download a YouTube video and extract 5-10 seconds of clear speech:

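# Use yt-dlp to download the audio track only and convert it to WAV
# (-x and --audio-format need ffmpeg installed)
yt-dlp -f bestaudio -x --audio-format wav "https://www.youtube.com/watch?v=VIDEO_ID"

# Then trim to 5-10 seconds with an audio editor or a free online trimmer
# Export as WAV: 16kHz, 16-bit, mono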
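
If you'd rather script the trim than use an online tool, something like this should work. It assumes ffmpeg is installed and on your PATH (ffmpeg is my choice for this sketch, any trimmer will do). The file names and timestamps are placeholders:

```python
import subprocess


def build_trim_command(src, dst, start_s, duration_s):
    """Build an ffmpeg command that cuts `duration_s` seconds starting at
    `start_s` and writes 16kHz, 16-bit, mono WAV to `dst`."""
    return [
        "ffmpeg", "-y",          # overwrite dst if it already exists
        "-ss", str(start_s),     # seek to the start of the clip
        "-i", src,
        "-t", str(duration_s),   # keep this many seconds
        "-ac", "1",              # mono
        "-ar", "16000",          # 16kHz sample rate
        "-c:a", "pcm_s16le",     # 16-bit PCM
        dst,
    ]


cmd = build_trim_command("downloaded.wav", "sample.wav", start_s=12, duration_s=8)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps file names with spaces from breaking the call.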

Step 4: Train model

In the web interface (use the interface language selector, "界面选择", if needed):

1. Click "语音训练" (Voice Training)
2. Upload your WAV file
3. Set model name in English
4. Click "开始训练" (Start Training)
5. Wait 10-30 minutes depending on GPU

Step 5: Generate speech

Go to "语音推理" (Voice Inference) section:

1. Select your trained model from dropdown
2. Type or paste your text
3. Click "生成音频" (Generate Audio)
4. Download the result

Option B: Cloud deployment (no GPU)

Don't have a powerful GPU? Run it on Google Colab's free tier:

Google Colab method

Search for "GPT-SoVITS Colab" - several community notebooks exist:

1. Open the Colab notebook
2. Click "Copy to Drive"
3. Change runtime to GPU (Runtime → Change runtime type → T4)
4. Run all cells in order
5. Upload audio when prompted
6. Type your text and generate speech directly in the browser

Free tier gives you T4 GPU for a few hours per day. Good enough for occasional use.

Hugging Face Spaces

Community-hosted versions:

1. Visit huggingface.co/spaces
2. Search: "GPT-SoVITS"
3. Choose a space with GPU (might have queue)
4. Upload audio → Type text → Download

Slower than local GPU but no setup required.

Getting better results

Audio sample quality

  • Clear speech, no background noise
  • 5-10 seconds is enough (more doesn't help)
  • Natural tone, not exaggerated
  • One person speaking only
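
Noise and tone you have to judge by ear, but the mechanical parts of the checklist can be verified before uploading. A rough sketch using Python's built-in wave module, assuming the 16kHz / 16-bit / mono export settings and the 5-10 second range from this guide. The function name and messages are mine:

```python
import wave


def check_sample(path, min_s=5.0, max_s=10.0):
    """Return a list of problems found in a WAV file (empty list = looks fine)."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        width_bytes = w.getsampwidth()
        duration = w.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if rate != 16000:
        problems.append(f"expected 16000 Hz, got {rate} Hz")
    if width_bytes != 2:
        problems.append(f"expected 16-bit, got {width_bytes * 8}-bit")
    if not (min_s <= duration <= max_s):
        problems.append(f"expected {min_s}-{max_s} s, got {duration:.1f} s")
    return problems


# Usage (sample.wav is a placeholder name):
# for problem in check_sample("sample.wav"):
#     print("Fix:", problem)
```

The wave module only reads uncompressed PCM WAV, so an error on open usually means the file was exported in another format.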

Text formatting

Use punctuation for pauses:
"Hello. This is a test."

Ellipsis for longer pauses:
"Wait for it... [3 seconds] ... here it is."

Emphasis affects tone:
"THIS IS LOUD."
"whisper this"
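
Since sentence punctuation is what controls the pauses, I find it handy to break a long script into sentences first and paste them in one chunk at a time. That workflow is my own habit, not something GPT-SoVITS requires. A small sketch of the splitting step:

```python
import re


def split_sentences(text):
    """Split text after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_sentences("Hello. This is a test."))
# → ['Hello.', 'This is a test.']
```

It is deliberately naive: abbreviations like "Dr. Smith" will be split too, so skim the result before generating.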

Common issues fixed

  • Robot voice: Sample had noise
  • Mispronunciation: Add phonetics: "[Jee-Pee-Tee]"
  • Artifacts: Lower sampling rate
  • Training failed: Close other GPU apps

Speed tips

  • First training: 10-30 minutes
  • Subsequent: 5-10 minutes
  • Generation: 2-5 seconds per sentence
  • GPU usage: ~4GB VRAM during training
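
To get a feel for how long a full script takes, multiply it out. A quick sketch using the 2-5 seconds per sentence figure above, which is from my own GPU and will differ on yours:

```python
def estimate_generation_seconds(n_sentences, per_sentence=(2, 5)):
    """Return (best_case, worst_case) generation time in seconds."""
    low, high = per_sentence
    return n_sentences * low, n_sentences * high


best, worst = estimate_generation_seconds(30)
print(f"30 sentences: {best}-{worst} seconds")
# → 30 sentences: 60-150 seconds
```

So a 30-sentence voiceover lands somewhere between one and two and a half minutes of generation time.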

Issues I ran into

CUDA out of memory

Even with 6GB VRAM:

Check GPU usage: nvidia-smi
Close browser & other GPU apps
Reduce batch size in settings
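
To see how much VRAM is actually free before training, you can ask nvidia-smi for machine-readable output. A sketch that wraps its CSV query mode (values are reported in MiB). The helper names are mine:

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse lines like 'NVIDIA GeForce GTX 1060 6GB, 1500, 6144'."""
    gpus = []
    for line in text.strip().splitlines():
        # Split from the right so a comma in the GPU name would not break parsing
        name, used, total = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({
            "name": name,
            "used_mib": int(used),
            "free_mib": int(total) - int(used),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi and return one dict per GPU (needs NVIDIA drivers)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Usage on a machine with an NVIDIA GPU:
# for gpu in query_gpus():
#     print(gpu["name"], "free:", gpu["free_mib"], "MiB")
```

If the free figure is well under the ~4GB training needs, close whatever is holding the memory before starting.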

Port 9872 already in use

Something is already running on the port:

netstat -ano | findstr :9872
taskkill /PID [PID] /F

# Or change port in config.ini
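
If you'd rather check from Python than read netstat output, here is a small sketch that tests whether anything is listening on a port and looks for the next free one. Port 9872 comes from this guide, the helper names are my own:

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0


def find_free_port(start=9872, tries=20, host="127.0.0.1"):
    """Return the first port in [start, start + tries) with no listener, or None."""
    for port in range(start, start + tries):
        if not port_in_use(port, host):
            return port
    return None


print("9872 in use:", port_in_use(9872))
```

This only tells you a port is busy, not which program owns it. For that you still need the netstat and taskkill commands above.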

Training stuck at 0%

GPU not being used:

Reinstall PyTorch with CUDA support
Make sure NVIDIA drivers are up to date
Check: torch.cuda.is_available()
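
The torch.cuda.is_available() check needs to run inside the same Python environment GPT-SoVITS uses. A small sketch that also handles the case where PyTorch isn't installed at all (the function name is mine):

```python
def cuda_status():
    """Describe whether PyTorch can see a CUDA GPU in this environment."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed but cannot see a CUDA GPU"
    return f"CUDA OK: {torch.cuda.get_device_name(0)}"


print(cuda_status())
```

If it reports that PyTorch can't see a GPU, the usual causes are a CPU-only PyTorch build or outdated NVIDIA drivers, which is what the two fixes above address.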

Audio has digital noise

Try these:

Lower sampling rate in settings
Use cleaner audio sample
Update to latest version

Can't process English text

Make sure you're using:

The multilingual model
Or download English-specific model
Check model dropdown in interface

Cost comparison

Tool         Monthly   Yearly   Quality
GPT-SoVITS   $0        $0       Excellent
ElevenLabs   $22       $264     Excellent
Murf.ai      $26       $312     Good
Play.ht      $31       $372     Good

First month alone saved me $200 in voice actor fees. Setup time investment: ~3 hours.
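
The yearly column is just the monthly price times 12. If prices change, it's easy to redo the math. A sketch using the monthly figures from the table as I found them when writing this:

```python
# Monthly prices in USD, copied from the comparison table
MONTHLY_USD = {
    "GPT-SoVITS": 0,
    "ElevenLabs": 22,
    "Murf.ai": 26,
    "Play.ht": 31,
}


def yearly_costs(monthly):
    """Return a dict of tool -> cost for 12 months."""
    return {tool: price * 12 for tool, price in monthly.items()}


for tool, cost in yearly_costs(MONTHLY_USD).items():
    print(f"{tool}: ${cost}/year")
```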

What I use it for

YouTube voiceovers

Clone my own voice, don't record every time

Content dubbing

Translate videos, keep original voice

VTuber avatar

Consistent character without speaking

Audiobooks

Different voices for different characters

Podcast promos

Generate episode previews

Client presentations

Professional voiceover for slides

My take

Quality is professional-grade now. I've uploaded 20+ videos with GPT-SoVITS voiceovers - nobody can tell it's AI.

Cloud option makes it accessible even without a powerful GPU. I run it on Colab when generating voiceovers for client work.

Setup is frustrating the first time. But once it works, it's: upload audio → type text → download. Much faster than recording myself, and zero ongoing cost.

GitHub: github.com/RVC-Boss/GPT-SoVITS

Note: Use responsibly. Don't clone voices without permission, especially for misleading content or impersonation.