Had hours of meeting recordings, interview audio, lecture videos. Needed them transcribed. Looked at options - Rev charges $1.25/minute, Otter.ai is $20/month, and the free alternatives were garbage.
Then someone mentioned Whisper. OpenAI's open-source speech recognition model. Runs locally, free, supports like 90 languages. Sounded too good to be true.
Tried it on a 2-hour meeting recording. 10 minutes later, had a near-perfect transcript. Punctuation correct, sentences segmented cleanly, even caught technical terms. Was pretty blown away. (One caveat: Whisper doesn't label who's speaking - that's a separate problem called diarization.)
What Whisper actually does
Whisper is an automatic speech recognition system trained on 680,000 hours of multilingual data. That scale is why it generalizes so well to the accents, noise, and jargon that trip up other free tools.
It copes with multi-speaker audio (without labeling the speakers), different languages, background noise, technical terms, accents. Runs on your computer so your audio never leaves your machine. Completely free, open source.
Model sizes
Whisper comes in different sizes that trade speed for accuracy:
- tiny: Fastest, least accurate (~1GB RAM)
- base: Good balance (~1GB RAM)
- small: Better accuracy (~2GB RAM)
- medium: Even better (~5GB RAM)
- large: Best accuracy, slowest (~10GB RAM)
I use "small" for most things. It's accurate enough and doesn't take forever. I only reach for "large" when accuracy matters more than speed.
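If you want to make that choice programmatically, here's a rough helper that picks the biggest model fitting a RAM budget. The function and the idea of auto-picking are mine, not part of Whisper; the RAM figures come from the list above:

```python
# Approximate RAM needs from the list above (GB)
MODEL_RAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}
ORDER = ["large", "medium", "small", "base", "tiny"]  # most accurate first

def pick_model(ram_budget_gb: float) -> str:
    """Return the most accurate model that fits within the RAM budget."""
    for name in ORDER:
        if MODEL_RAM_GB[name] <= ram_budget_gb:
            return name
    return "tiny"  # fallback for very tight budgets

print(pick_model(8))   # "medium" is the biggest model that fits in 8 GB
```

Then pass the result straight to whisper.load_model().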
Getting it installed
You'll need Python 3.8+, ffmpeg, and ideally a GPU.
Install ffmpeg first
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# Windows - download from ffmpeg.org
Then install Whisper
pip install openai-whisper
That's it. You can now transcribe audio with whisper audio.mp3. First run downloads the model (~460MB for small, ~3GB for large).
Basic usage
# Transcribe with the default model
whisper audio.mp3
# Use a specific model
whisper audio.mp3 --model small
# Write a plain-text transcript (Whisper creates audio.txt itself - no shell redirect needed)
whisper audio.mp3 --output_format txt
# Choose where the output files go
whisper audio.mp3 --output_format txt --output_dir transcripts/
# Get all output formats (txt, srt, vtt, tsv, json)
whisper audio.mp3 --output_format all
With timestamps
Get timestamps for each segment:
whisper audio.mp3 --model small --output_format srt
Creates an SRT file with timestamps like:
1
00:00:00,000 --> 00:00:05,000
Hello everyone, welcome to today's meeting.
2
00:00:05,000 --> 00:00:10,000
Let's start by reviewing the project status.
Python API
Use it in your Python code:
import whisper
# Load model (downloads on first run)
model = whisper.load_model("small")
# Transcribe audio file
result = model.transcribe("audio.mp3")
# Get text
print(result["text"])
# Get segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
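Those segment dicts carry everything needed to build subtitle files by hand, which is handy if you want custom formatting. A minimal sketch - the srt_timestamp helper and the sample segments are mine, not part of Whisper's API (for normal use, --output_format srt already does this):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Turn Whisper-style segment dicts into SRT subtitle text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> "
            f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Sample data shaped like result["segments"]
sample = [
    {"start": 0.0, "end": 5.0, "text": " Hello everyone, welcome to today's meeting."},
    {"start": 5.0, "end": 10.0, "text": " Let's start by reviewing the project status."},
]
print(segments_to_srt(sample))
```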
Translation and languages
Transcribe and translate to English:
# Transcribe Spanish audio to English text
whisper spanish_audio.mp3 --model medium --task translate
# Or in Python
result = model.transcribe("spanish_audio.mp3", task="translate")
Whisper auto-detects language, but you can specify:
whisper audio.mp3 --language Japanese
Stuff I use it for
Meeting recordings
Record Zoom/Google Meet meetings, transcribe automatically:
# After meeting
whisper meeting_recording.mp4 --model medium --output_format txt
# Search the generated transcript for specific topics
grep "deadline" meeting_recording.txt
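Grepping the plain-text transcript loses the timestamps. A small sketch that searches Whisper-style segments instead and reports when a keyword was said - the find_mentions helper and the sample data are illustrative, not part of Whisper:

```python
def find_mentions(segments, keyword):
    """Return (start_seconds, text) for each segment containing the keyword."""
    kw = keyword.lower()
    return [(seg["start"], seg["text"].strip())
            for seg in segments if kw in seg["text"].lower()]

# Sample data shaped like result["segments"]
segments = [
    {"start": 12.0, "text": " The deadline moved to Friday."},
    {"start": 47.5, "text": " Any other blockers?"},
]
for start, text in find_mentions(segments, "deadline"):
    print(f"{start:7.1f}s  {text}")   # prints "   12.0s  The deadline moved to Friday."
```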
Lecture videos
Transcribe course videos for note-taking:
# Batch process all videos in a folder
for video in lectures/*.mp4; do
    whisper "$video" --model small --output_format txt
done
Podcast episodes
Searchable transcripts of favorite podcasts:
import whisper
model = whisper.load_model("medium")
# Process entire episode
result = model.transcribe("podcast_ep42.mp3")
# Save with metadata
with open("podcast_ep42.txt", "w") as f:
    f.write(result["text"])
# Now searchable!
Subtitles generation
Auto-generate subtitles for videos:
# Generate SRT subtitles
whisper my_video.mp4 --model small --output_format srt
# Creates my_video.srt ready to use with video players
Faster with GPU
If you have an NVIDIA GPU:
import whisper
# Use CUDA (NVIDIA GPU)
model = whisper.load_model("small", device="cuda")
Speed difference on my machine:
- CPU (M1 MacBook): ~10x slower than real time (10 min of audio ≈ 100 min of processing)
- GPU (RTX 3080): ~10x faster than real time (10 min of audio ≈ 1 min of processing)
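Whisper runs on PyTorch under the hood, so you can pick the device at runtime instead of hardcoding "cuda". A sketch of the usual idiom - it falls back to CPU when PyTorch has no CUDA support (or isn't installed at all):

```python
# Pick the best available device; fall back to CPU if PyTorch
# lacks CUDA support or isn't installed.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"

print(device)
# model = whisper.load_model("small", device=device)
```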
Batch processing
Process multiple files efficiently:
import whisper
from pathlib import Path
model = whisper.load_model("small")
# Process all audio files in a folder
audio_files = Path("recordings").glob("*.mp3")
for audio_file in audio_files:
    print(f"Processing {audio_file.name}...")
    result = model.transcribe(str(audio_file))
    # Save transcript next to the audio file
    output_file = audio_file.with_suffix(".txt")
    with open(output_file, "w") as f:
        f.write(result["text"])
print("Done!")
Building a transcription service
Simple web API with FastAPI:
from fastapi import FastAPI, UploadFile
from fastapi.responses import JSONResponse
import os
import tempfile
import whisper

app = FastAPI()
model = whisper.load_model("small")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Save the upload to a per-request temp file (a fixed filename
    # would clash under concurrent requests)
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as buffer:
        buffer.write(await file.read())
    # Transcribe, then clean up the temp file
    result = model.transcribe(buffer.name)
    os.remove(buffer.name)
    return JSONResponse({
        "text": result["text"],
        "language": result["language"]
    })
Run with: uvicorn main:app --reload
Stuff that went wrong
ffmpeg not found
Whisper kept complaining about missing ffmpeg. Fixed by installing ffmpeg (see above) or adding to PATH if already installed.
Out of memory
Large model crashed on my 8GB RAM machine. Fixed by using smaller model (small instead of large), closing other programs, or processing shorter audio chunks.
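For the chunking route, one option is ffmpeg's segment muxer, which splits audio into fixed-length pieces without re-encoding. A sketch that just builds the command - the helper name and the 10-minute chunk length are my choices; run the result with subprocess once ffmpeg is on your PATH:

```python
def ffmpeg_chunk_cmd(src, chunk_seconds=600):
    """Build an ffmpeg command that splits audio into fixed-length chunks
    without re-encoding, producing e.g. meeting_000.mp3, meeting_001.mp3, ..."""
    return [
        "ffmpeg", "-i", src,
        "-f", "segment",                    # use the segment muxer
        "-segment_time", str(chunk_seconds),
        "-c", "copy",                       # copy streams, no re-encode
        src.rsplit(".", 1)[0] + "_%03d.mp3",
    ]

print(ffmpeg_chunk_cmd("meeting.mp3"))
# Then: subprocess.run(ffmpeg_chunk_cmd("meeting.mp3"), check=True)
# ...and feed each chunk to Whisper separately.
```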
Wrong language detection
Sometimes it auto-detected the wrong language, especially on short or noisy clips. Fixed by specifying it manually: whisper audio.mp3 --language English
Whisper vs alternatives
| Feature | Whisper | Paid Services | Other Open Source |
|---|---|---|---|
| Cost | Free | $1-2/min | Free |
| Accuracy | Excellent | Excellent | Okay |
| Languages | 90+ | 10-50 | 10-30 |
| Privacy | 100% local | Cloud | Local |
| Speed | Slow CPU / Fast GPU | Fast | Varies |
Whisper matches or beats paid services in accuracy. The only downside is speed if you don't have a GPU.
Bottom line
I've transcribed maybe 50 hours of audio with Whisper. Meetings, lectures, podcasts, voice notes. Accuracy is consistently good, even with technical content.
The fact that it runs locally is huge. I can process sensitive meetings without worrying about privacy. No uploading to cloud services, no API costs.
Speed is the only downside. Without a GPU, long recordings take a while. But I just start it and come back later. It's free - can't really complain.
If you need transcription regularly, Whisper saves hundreds of dollars compared to paid services. And it's often more accurate.
Links: github.com/openai/whisper | Paper: arxiv.org/abs/2212.04356