Why I needed this
Had hours of meeting recordings, interview audio, lecture videos. Needed them transcribed. Looked at options - Rev charges $1.25/minute, Otter.ai is $20/month, and the free alternatives were garbage.
Then someone mentioned Whisper. OpenAI's open-source speech recognition model. Runs locally, free, supports like 90 languages. Sounded too good to be true.
Tried it on a 2-hour meeting recording. 10 minutes later, had a near-perfect transcript. Punctuation correct, even caught technical terms. Was pretty blown away.
What Whisper actually does
Speech to text that actually works. Handles multiple speakers, different languages, background noise, technical terms, accents. Runs on your computer - no API calls, no privacy concerns.
Best part: it's completely free. Well, except for the electricity to run it.
So what is Whisper
Whisper is an automatic speech recognition (ASR) system trained by OpenAI on 680,000 hours of multilingual data. That scale and diversity of training data is what makes it so robust to accents, noise, and jargon compared to most alternatives.
The cool stuff:
- Insanely accurate: Better than most paid services I've tried
- Multiple languages: Works with 90+ languages out of the box
- Timestamped output: Start/end times for every segment, ready for subtitles or search (speaker labels need an extra tool - more on that later)
- Robust: Handles background noise, accents, technical terms
- Runs locally: Your audio never leaves your machine
- Free: Open source, no API costs
- Multiple models: tiny through large - pick based on your hardware
Model sizes (trading speed for accuracy):
- tiny: Fastest, least accurate (~1GB RAM)
- base: Good balance (~1GB RAM)
- small: Better accuracy (~2GB RAM)
- medium: Even better (~5GB RAM)
- large: Best accuracy, slowest (~10GB RAM)
I use "small" for most stuff. It's accurate enough and doesn't take forever. Only use "large" for important stuff where accuracy matters more than speed.
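As a rough rule of thumb, those RAM numbers can drive model choice in scripts. A minimal sketch (`pick_model` is my own helper, and the thresholds are just the figures from the list above, not an official API):

```python
# Approximate RAM needs in GB per model, from the list above
MODEL_RAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def pick_model(available_ram_gb: float) -> str:
    """Return the largest model that fits the given RAM budget."""
    fitting = [name for name, gb in MODEL_RAM_GB.items() if gb <= available_ram_gb]
    return fitting[-1] if fitting else "tiny"

print(pick_model(8))  # prints: medium
```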
Getting Whisper running
Prerequisites
You'll need:
- Python 3.8 or higher
- ffmpeg (for audio processing)
- A decent GPU (optional but 10x faster)
- Some RAM and disk space
Install ffmpeg
Required for audio processing:
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# Windows
# Download from https://ffmpeg.org/download.html
Install Whisper
Simplest way with pip:
pip install openai-whisper
That's it
You can now transcribe audio:
whisper audio.mp3
First run downloads the model (from ~75MB for tiny up to ~3GB for large). After that, it's cached locally under ~/.cache/whisper.
Using Whisper
Basic transcription
Simple command line usage:
# Transcribe with the default model
whisper audio.mp3
# Use a specific model
whisper audio.mp3 --model small
# Write a plain-text transcript (creates audio.txt next to the audio file)
whisper audio.mp3 --output_format txt
# Or pick where the output files go
whisper audio.mp3 --output_format txt --output_dir transcripts/
# All output formats
whisper audio.mp3 --output_format all
With timestamps
Get timestamps for each segment:
whisper audio.mp3 --model small --output_format srt
Creates an SRT file with timestamps like:
1
00:00:00,000 --> 00:00:05,000
Hello everyone, welcome to today's meeting.
2
00:00:05,000 --> 00:00:10,000
Let's start by reviewing the project status.
Python API
Use it in your Python code:
import whisper
# Load model (downloads on first run)
model = whisper.load_model("small")
# Transcribe audio file
result = model.transcribe("audio.mp3")
# Get text
print(result["text"])
# Get segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
Translate audio
Transcribe and translate to English:
# Transcribe Spanish audio to English text
whisper spanish_audio.mp3 --model medium --task translate
# Or in Python
result = model.transcribe("spanish_audio.mp3", task="translate")
Different languages
Whisper auto-detects language, but you can specify:
# Specify language
whisper audio.mp3 --language Japanese
# List supported languages (they're enumerated under --language in the help)
whisper --help
Stuff I use it for
Meeting recordings
Record Zoom/Google Meet meetings, transcribe automatically:
# After meeting (creates meeting_recording.txt next to the video)
whisper meeting_recording.mp4 --model medium --output_format txt
# Search for specific topics
grep "deadline" meeting_recording.txt
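grep drops the timestamps, though. If you transcribe through the Python API instead, you can search the segments and keep them. `find_keyword` below is my own helper, not part of Whisper, and the sample segments are hand-written:

```python
def find_keyword(segments, keyword):
    """Return (start_seconds, text) for every segment mentioning the keyword."""
    keyword = keyword.lower()
    return [(seg["start"], seg["text"].strip())
            for seg in segments if keyword in seg["text"].lower()]

# In practice: segments = model.transcribe("meeting_recording.mp4")["segments"]
segments = [
    {"start": 0.0, "end": 5.0, "text": " Welcome to today's meeting."},
    {"start": 5.0, "end": 10.0, "text": " The deadline moved to Friday."},
]
print(find_keyword(segments, "deadline"))  # [(5.0, 'The deadline moved to Friday.')]
```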
Lecture videos
Transcribe course videos for note-taking:
# Batch process all videos in a folder
for video in lectures/*.mp4; do
  whisper "$video" --model small --output_format txt
done
Podcast episodes
Searchable transcripts of favorite podcasts:
import whisper
model = whisper.load_model("medium")
# Process entire episode
result = model.transcribe("podcast_ep42.mp3")
# Save with metadata
with open("podcast_ep42.txt", "w") as f:
    f.write(result["text"])
# Now searchable!
# Can also extract quotes, find topics, etc.
Voice notes
Record thoughts on phone, transcribe later:
# Quick voice memo transcription
whisper voice_memo.m4a --model tiny --output_format txt > notes.txt
# Tiny model is fast enough for real-time-ish use
Subtitles generation
Auto-generate subtitles for videos:
# Generate SRT subtitles
whisper my_video.mp4 --model small --output_format srt
# Creates my_video.srt ready to use with video players
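The CLI writes the SRT for you, but if you're already in the Python API, segments are easy to format by hand. A minimal sketch (`fmt_time` and `to_srt` are my own helpers; the timestamp math is the only fiddly part):

```python
def fmt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render Whisper-style segments as numbered SRT subtitle blocks."""
    blocks = [f"{i}\n{fmt_time(seg['start'])} --> {fmt_time(seg['end'])}\n"
              f"{seg['text'].strip()}\n"
              for i, seg in enumerate(segments, start=1)]
    return "\n".join(blocks)

# In practice: segments = model.transcribe("my_video.mp4")["segments"]
segments = [{"start": 0.0, "end": 5.0, "text": " Hello everyone, welcome to today's meeting."}]
print(to_srt(segments))
```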
Getting better results
Faster with GPU
If you have an NVIDIA GPU:
import whisper
# Use CUDA (NVIDIA GPU)
model = whisper.load_model("small", device="cuda")
# Or specify GPU
model = whisper.load_model("small", device="cuda:0")
Speed difference on my machine:
- CPU (M1 MacBook): ~10x slower than real-time (10 min audio = 100 min processing)
- GPU (RTX 3080): ~10x faster than real-time (10 min audio = 1 min processing)
Better accuracy
Tips for improving transcription:
# Use larger model for important stuff
whisper critical_audio.mp3 --model large
# Specify language if known (avoids detection errors)
whisper audio.mp3 --language English
# Keep temperature at 0 for deterministic output (raising it adds variety but hurts accuracy)
whisper audio.mp3 --temperature 0.0
Batch processing
Process multiple files efficiently:
import whisper
from pathlib import Path
model = whisper.load_model("small")
# Process all audio files in a folder
audio_files = Path("recordings").glob("*.mp3")
for audio_file in audio_files:
    print(f"Processing {audio_file.name}...")
    result = model.transcribe(str(audio_file))
    # Save transcript
    output_file = audio_file.with_suffix(".txt")
    with open(output_file, "w") as f:
        f.write(result["text"])
print("Done!")
Speaker diarization
Whisper doesn't natively separate speakers, but you can combine it with other tools:
# Use with pyannote.audio for speaker separation
pip install pyannote.audio
# Python code to combine Whisper + pyannote
# (More complex, but works for multi-speaker audio)
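The alignment step itself is simple: tag each Whisper segment with whichever speaker turn overlaps it most. A sketch of just that step (`assign_speakers` is my own helper; the turns here are hand-written, but in practice they'd come out of a pyannote diarization pipeline):

```python
def assign_speakers(segments, turns):
    """Label each Whisper segment with the speaker whose turn overlaps it most.

    segments: [{"start": s, "end": e, "text": ...}, ...]  (from Whisper)
    turns:    [(start, end, speaker), ...]                (from diarization)
    """
    labeled = []
    for seg in segments:
        overlaps = [(min(seg["end"], te) - max(seg["start"], ts), spk)
                    for ts, te, spk in turns]
        best = max(overlaps, default=(0, "UNKNOWN"))
        labeled.append((best[1] if best[0] > 0 else "UNKNOWN", seg["text"].strip()))
    return labeled

turns = [(0.0, 6.0, "SPEAKER_00"), (6.0, 12.0, "SPEAKER_01")]
segments = [{"start": 0.0, "end": 5.0, "text": " Welcome."},
            {"start": 7.0, "end": 10.0, "text": " Thanks."}]
print(assign_speakers(segments, turns))
```

Segments with no overlapping turn get "UNKNOWN"; real audio usually needs extra care around overlapping speech.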
Building stuff with Whisper
Simple transcription service
Web API for transcription:
from fastapi import FastAPI, UploadFile
from fastapi.responses import JSONResponse
import tempfile
import whisper

app = FastAPI()
model = whisper.load_model("small")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Save the upload to a temp file (concurrent requests won't clobber each other)
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
        tmp.write(await file.read())
    # Transcribe
    result = model.transcribe(tmp.name)
    return JSONResponse({
        "text": result["text"],
        "language": result["language"]
    })

# Run with: uvicorn main:app --reload
Real-time transcription
With a bit more work, you can do live transcription:
# Requires: pip install pyaudio numpy
import numpy as np
import pyaudio
import whisper

model = whisper.load_model("tiny")
RATE = 16000  # Whisper expects 16 kHz mono

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=RATE)
while True:
    # Record a 5-second chunk, convert int16 -> float32 in [-1, 1], transcribe
    raw = stream.read(RATE * 5)
    audio = np.frombuffer(raw, np.int16).astype(np.float32) / 32768.0
    print(model.transcribe(audio, fp16=False)["text"])
Stuff that went wrong
ffmpeg not found
Whisper kept complaining about missing ffmpeg.
# Fixed by
# Installing ffmpeg (see installation section)
# Or adding to PATH if already installed
Out of memory
Large model crashed on my 8GB RAM machine.
# Fixed by
1. Using smaller model (small instead of large)
2. Closing other programs
3. Processing shorter audio chunks
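For fix 3, you can split long audio into fixed-size chunks and transcribe them one at a time - the Python API accepts 16 kHz float32 sample arrays directly, so each chunk can go straight into model.transcribe. A minimal sketch with a hand-rolled `chunk_audio` helper:

```python
def chunk_audio(samples, chunk_seconds=60, rate=16000):
    """Split a list/array of 16 kHz mono samples into fixed-length chunks."""
    size = chunk_seconds * rate
    return [samples[i:i + size] for i in range(0, len(samples), size)]

# Five minutes of silence splits into five one-minute chunks
chunks = chunk_audio([0.0] * (300 * 16000))
print(len(chunks))  # prints: 5
```

In practice you'd add a second or two of overlap between chunks so words at the boundaries don't get cut in half.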
Wrong language detection
Sometimes detected wrong language automatically.
# Fixed by
whisper audio.mp3 --language English # specify manually
Poor audio quality
Noisy recordings had lots of errors.
# Fixed by
1. Improving audio quality at source
2. Using larger model (medium/large)
3. Pre-processing audio with noise reduction
Too slow on CPU
Hour-long recording took all day.
# Fixed by
1. Using smaller model (tiny/base)
2. Getting a GPU (huge speedup)
3. Being patient (it's free after all)
Whisper vs alternatives
| | Whisper | Paid Services | Other Open Source |
|---|---|---|---|
| Cost | Free | $1-2/min | Free |
| Accuracy | ★★★★★ | ★★★★★ | ★★★☆☆ |
| Languages | 90+ | 10-50 | 10-30 |
| Privacy | 100% local | Cloud | Local |
| Speed | Slow (CPU) / Fast (GPU) | Fast | Varies |
| Setup | pip install | Web upload | Varies |
Whisper matches or beats paid services in accuracy. The only downside is speed if you don't have a GPU.
Would I recommend it?
Absolutely. I've transcribed maybe 50 hours of audio with Whisper. Meetings, lectures, podcasts, voice notes. Accuracy is consistently good, even with technical content.
The fact that it runs locally is huge. I can process sensitive meetings without worrying about privacy. No uploading to cloud services, no API costs.
Speed is the only downside. Without a GPU, long recordings take a while. But I just start it and come back later. It's free - can't really complain.
If you need transcription regularly, Whisper saves hundreds of dollars compared to paid services. And it's often more accurate.
Links: github.com/openai/whisper | Paper: arxiv.org/abs/2212.04356