GPT-SoVITS: Clone Any Voice with 5 Seconds of Audio
March 2024 · AI Voice
Started making YouTube videos last month. Needed voiceovers but didn't want to record every time. Looked into hiring voice actors - $50 to $200 per video. That's not sustainable for a small channel.
Tried free TTS tools - sounded robotic. Paid services like ElevenLabs were better but $22/month adds up. Then someone mentioned GPT-SoVITS in a forum. Open-source, free, and the quality is insane.
The key insight
You only need 5 seconds of someone's voice. Extract audio from any YouTube video, train the model, and make that person say anything you type. The quality rivals paid commercial tools.
Problem is the setup is brutal. Official docs are in Chinese, dependencies everywhere, 10GB+ downloads, confusing environment variables. Spent two days getting it working.
So here's what I wish I had: a simple setup process with two paths - a one-click Windows installer, or a cloud option if you don't have a GPU.
What you need
Hardware requirements:
- NVIDIA GPU with 6GB+ VRAM (GTX 1060 or better)
- 16GB RAM recommended (8GB minimum)
- 15GB free storage space
- Windows 10/11 (Linux works but is more complex)
No GPU? Use the cloud option below instead.
Option A: Windows portable version (recommended)
This is what worked for me. Someone made a portable version with everything pre-configured:
Step 1: Download
Search Google for "GPT-SoVITS portable version" or check community releases. Look for a 7z file around 8GB named something like "GPT-SoVITS_vX.X.X.7z".
Size: ~8GB compressed
Contains: Everything pre-configured
Step 2: Extract and run
Extract to a folder with a short path (no special characters), then double-click "go-web.bat" (the official integrated package names this launcher "go-webui.bat").
Extract → Open folder → Double-click go-web.bat
Browser opens at: http://localhost:9872
Step 3: Prepare audio sample
Download a YouTube video and extract 5-10 seconds of clear speech:
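# Use yt-dlp to download only the audio track and convert it to WAV (requires ffmpeg)
yt-dlp -f bestaudio -x --audio-format wav "https://www.youtube.com/watch?v=VIDEO_ID"
# Then trim to 5-10 seconds with an audio editor or a free online audio trimmer
# Export as WAV: 16kHz, 16-bit, mono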
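Before uploading, it can save a failed training run to confirm the exported file really matches that format. Here is a minimal sketch using only Python's standard `wave` module. The function name `check_sample` and the 5-10 second bounds are my own choices for illustration, and the 16kHz / 16-bit / mono target is simply the export setting mentioned above, not something I've confirmed the tool strictly requires.

```python
import wave

# Target format taken from the export note above; adjust if your build expects something else.
TARGET_RATE_HZ = 16000
TARGET_CHANNELS = 1
TARGET_SAMPLE_WIDTH_BYTES = 2  # 2 bytes per sample = 16-bit


def check_sample(path, min_seconds=5.0, max_seconds=10.0):
    """Return a list of problems found in a WAV file; an empty list means it matches."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        duration = wav.getnframes() / rate

    problems = []
    if rate != TARGET_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE_HZ}")
    if channels != TARGET_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    if width != TARGET_SAMPLE_WIDTH_BYTES:
        problems.append(f"{width * 8}-bit samples, expected 16-bit")
    if not min_seconds <= duration <= max_seconds:
        problems.append(f"duration is {duration:.1f} s, expected {min_seconds}-{max_seconds} s")
    return problems
```

Calling `check_sample("sample.wav")` (a hypothetical file name) returns an empty list when the clip looks right, or one message per mismatch so you know what to fix in your audio editor.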
Step 4: Train model
In the web interface (switch the interface language if needed):
1. Click "语音训练" (Voice Training)
2. Upload your WAV file
3. Set a model name in English
4. Click "开始训练" (Start Training)
5. Wait 10-30 minutes depending on GPU
Step 5: Generate speech
Go to the "语音推理" (Voice Inference) section:
1. Select your trained model from the dropdown
2. Type or paste your text
3. Click "生成音频" (Generate Audio)
4. Download the result
Option B: Cloud deployment (no GPU)
Don't have a powerful GPU? Run it on Google Colab's free tier:
Google Colab method
Search for "GPT-SoVITS Colab" - several community notebooks exist:
1. Open the Colab notebook
2. Click "Copy to Drive"
3. Change runtime to GPU (Runtime → Change runtime type → T4)
4. Run all cells in order
5. Upload audio when prompted
6. Generate speech directly in the browser
The free tier usually gives you a T4 GPU for a few hours per day, though availability varies. Good enough for occasional use.
Hugging Face Spaces
Community-hosted versions:
1. Visit huggingface.co/spaces
2. Search: "GPT-SoVITS"
3. Choose a space with GPU (might have a queue)
4. Upload audio → Type text → Download
Slower than a local GPU but no setup required.
Getting better results
Audio sample quality
- Clear speech, no background noise
- 5-10 seconds is enough for a basic clone (the project README suggests about 1 minute of audio for fine-tuning)
- Natural tone, not exaggerated
- One person speaking only
Text formatting
Use punctuation for pauses: "Hello. This is a test."
Ellipsis for longer pauses: "Wait for it... [3 seconds] ... here it is."
Emphasis affects tone: "THIS IS LOUD." "whisper this"
Common issues fixed
- Robot voice: Sample had noise
- Mispronunciation: Add phonetics: "[Jee-Pee-Tee]"
- Artifacts: Lower sampling rate
- Training failed: Close other GPU apps
Speed tips
- First training: 10-30 minutes
- Subsequent: 5-10 minutes
- Generation: 2-5 seconds per sentence
- GPU usage: ~4GB VRAM during training
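To get a rough idea of how long a full script will take, you can count sentences and multiply by the 2-5 seconds per sentence figure above. This is only a back-of-the-envelope sketch: the helper names are hypothetical, and the splitter is a simple regex that breaks after `.`, `!`, or `?` only when the next word starts with a capital letter, so a mid-sentence ellipsis stays in one piece. It will miscount text with abbreviations like "Dr. Smith".

```python
import re

# Per-sentence generation time range quoted in the speed tips above.
SECONDS_PER_SENTENCE = (2, 5)


def split_sentences(text):
    """Split text after sentence-ending punctuation that is followed by a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]


def estimate_seconds(text):
    """Return a (low, high) estimate of generation time in seconds."""
    count = len(split_sentences(text))
    low, high = SECONDS_PER_SENTENCE
    return count * low, count * high


script = "Hello. This is a test. Wait for it... here it is."
print(split_sentences(script))
print(estimate_seconds(script))
```

For the three-sentence sample script this gives an estimate of 6 to 15 seconds. Real timings depend on your GPU and sentence length, so treat it as a planning number only.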
Issues I ran into
CUDA out of memory
Even with 6GB VRAM:
Check GPU usage: nvidia-smi
Close browser & other GPU apps
Reduce batch size in settings
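If you'd rather check free VRAM from a script than read the `nvidia-smi` table by eye, something like the following could work. It is a sketch under assumptions: it relies on `nvidia-smi` being on your PATH and uses its CSV query mode, the function names are made up for this example, and the 4096 MiB threshold is just the ~4GB training figure from the speed tips, not an official requirement.

```python
import subprocess

# nvidia-smi's CSV query mode prints one "used, total" line (in MiB) per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def read_nvidia_smi():
    """Run nvidia-smi and return its raw CSV output (raises if the tool is missing or fails)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def parse_gpu_memory(output):
    """Turn lines like '1500, 6144' into a list of (used_mib, total_mib) tuples."""
    gpus = []
    for line in output.strip().splitlines():
        used, total = (int(value) for value in line.split(","))
        gpus.append((used, total))
    return gpus


def gpus_with_room(output, needed_mib=4096):
    """Return the indexes of GPUs that have at least needed_mib MiB free."""
    return [
        index
        for index, (used, total) in enumerate(parse_gpu_memory(output))
        if total - used >= needed_mib
    ]
```

Running `gpus_with_room(read_nvidia_smi())` before training returns an empty list when no GPU has enough free memory, which is a hint to close the browser and other GPU apps first.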
Port 9872 already in use
Something is already running on the port:
netstat -ano | findstr :9872
taskkill /PID [PID] /F
# Or change port in config.ini
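A cross-platform way to check the port before launching is to try binding to it from Python. This is a minimal sketch using the standard `socket` module; `port_is_free` and `first_free_port` are hypothetical helper names, and 9872 is simply the port used in this guide.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind to host:port, meaning nothing else is using it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start=9872, tries=20):
    """Return the first free port in [start, start + tries), or None if all are taken."""
    for port in range(start, start + tries):
        if port_is_free(port):
            return port
    return None
```

If `port_is_free(9872)` returns False, `first_free_port()` suggests a nearby alternative you could put in the config instead of killing the other process.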
Training stuck at 0%
GPU not being used:
Reinstall PyTorch with CUDA support
Make sure NVIDIA drivers are up to date
Check: python -c "import torch; print(torch.cuda.is_available())"
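For a slightly more informative check than a bare True/False, a small script along these lines can tell apart "PyTorch missing", "CPU-only build or driver problem", and "GPU visible". It is only a sketch: `cuda_status` is a name I made up, and it needs to run with the same Python environment the portable package uses, which may not be your system Python.

```python
def cuda_status():
    """Describe whether PyTorch in this environment can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this Python environment"

    if not torch.cuda.is_available():
        # torch.version.cuda is None on CPU-only builds of PyTorch.
        if torch.version.cuda is None:
            return "PyTorch is a CPU-only build; reinstall it with CUDA support"
        return "PyTorch has CUDA support but sees no GPU; check the NVIDIA driver"

    return f"CUDA OK: {torch.cuda.get_device_name(0)}"


print(cuda_status())
```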
Audio has digital noise
Try these:
Lower sampling rate in settings
Use cleaner audio sample
Update to latest version
Can't process English text
Make sure you're using:
The multilingual model
Or download English-specific model
Check model dropdown in interface
Cost comparison
| Tool | Monthly | Yearly | Quality |
|---|---|---|---|
| GPT-SoVITS | $0 | $0 | Excellent |
| ElevenLabs | $22 | $264 | Excellent |
| Murf.ai | $26 | $312 | Good |
| Play.ht | $31 | $372 | Good |
First month alone saved me $200 in voice actor fees. Setup time investment: ~3 hours.
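The yearly column is just the monthly price times twelve. Here is the arithmetic as a short sketch so you can plug in current prices, since subscription tiers change; the monthly figures are the ones from the table above, and the function name is only illustrative.

```python
# Monthly prices in USD, copied from the comparison table above.
MONTHLY_USD = {
    "GPT-SoVITS": 0,
    "ElevenLabs": 22,
    "Murf.ai": 26,
    "Play.ht": 31,
}


def yearly_cost(monthly_usd):
    """Yearly cost of a subscription billed monthly at a flat rate."""
    return monthly_usd * 12


for tool, monthly in MONTHLY_USD.items():
    print(f"{tool}: ${yearly_cost(monthly)} per year")
```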
What I use it for
- YouTube voiceovers: Clone my own voice, don't record every time
- Content dubbing: Translate videos, keep original voice
- VTuber avatar: Consistent character without speaking
- Audiobooks: Different voices for different characters
- Podcast promos: Generate episode previews
- Client presentations: Professional voiceover for slides
My take
Quality is professional-grade now. I've uploaded 20+ videos with GPT-SoVITS voiceovers - nobody can tell it's AI.
Cloud option makes it accessible even without a powerful GPU. I run it on Colab when generating voiceovers for client work.
Setup is frustrating the first time. But once it works, it's: upload audio → type text → download. Much faster than recording myself, and zero ongoing cost.
GitHub: github.com/RVC-Boss/GPT-SoVITS
Note: Use responsibly. Don't clone voices without permission, especially for misleading content or impersonation.