Whisper: Finally, Speech to Text That Actually Works

Tired of paying for transcription? Whisper is free, runs locally, and is remarkably accurate. Open source from OpenAI. Here's how to set it up.

Why I needed this

Had hours of meeting recordings, interview audio, lecture videos. Needed them transcribed. Looked at options - Rev charges $1.25/minute, Otter.ai is $20/month, and the free alternatives were garbage.

Then someone mentioned Whisper. OpenAI's open-source speech recognition model. Runs locally, free, supports like 90 languages. Sounded too good to be true.

Tried it on a 2-hour meeting recording. 10 minutes later, had a near-perfect transcript. Punctuation correct, timestamps on every segment, even caught technical terms. Was pretty blown away.

What Whisper actually does

Speech to text that actually works. Handles multiple speakers, different languages, background noise, technical terms, accents. Runs on your computer - no API calls, no privacy concerns.

Best part: it's completely free. Well, except for the electricity to run it.

So what is Whisper

Whisper is an automatic speech recognition (ASR) system trained by OpenAI on 680,000 hours of multilingual data. That scale and diversity of training data is why it handles accents, background noise, and jargon so much better than most alternatives.

The cool stuff:

  • Insanely accurate: Better than most paid services I've tried
  • Multiple languages: Works with 90+ languages out of the box
  • Timestamps: Segment-level timestamps out of the box (separating speakers needs an extra tool - more on that later)
  • Robust: Handles background noise, accents, technical terms
  • Runs locally: Your audio never leaves your machine
  • Free: Open source, no API costs
  • Multiple models: Tiny to giant - pick based on your hardware

Model sizes (trading speed for accuracy):

  • tiny: Fastest, least accurate (~1GB RAM)
  • base: Good balance (~1GB RAM)
  • small: Better accuracy (~2GB RAM)
  • medium: Even better (~5GB RAM)
  • large: Best accuracy, slowest (~10GB RAM)

I use "small" for most stuff. It's accurate enough and doesn't take forever. Only use "large" for important stuff where accuracy matters more than speed.
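If you want to make that choice in code, a rough heuristic works fine. A minimal sketch - the thresholds just mirror the RAM table above, they're not anything official:

```python
# Approximate RAM needs from the table above (GB)
MODEL_RAM = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def pick_model(free_ram_gb, preference=("large", "medium", "small", "base", "tiny")):
    """Return the most accurate model that fits in the given RAM."""
    for name in preference:
        if MODEL_RAM[name] <= free_ram_gb:
            return name
    return "tiny"  # fall back to the smallest model

print(pick_model(6))    # medium fits in 6 GB
print(pick_model(1.5))  # only tiny/base fit, so base
```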

Getting Whisper running

Prerequisites

You'll need:

  • Python 3.8 or higher
  • ffmpeg (for audio processing)
  • A decent GPU (optional but 10x faster)
  • Some RAM and disk space

Install ffmpeg

Required for audio processing:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html

Install Whisper

Simplest way with pip:

pip install openai-whisper

That's it

You can now transcribe audio:

whisper audio.mp3

First run downloads the model (~460MB for small, ~3GB for large) to ~/.cache/whisper. After that, it's cached locally.

Using Whisper

Basic transcription

Simple command line usage:

# Transcribe with the default model
whisper audio.mp3

# Use a specific model
whisper audio.mp3 --model small

# Write transcript files next to the audio (creates audio.txt)
whisper audio.mp3 --output_format txt

# All output formats at once (txt, vtt, srt, tsv, json)
whisper audio.mp3 --output_format all

With timestamps

Get timestamps for each segment:

whisper audio.mp3 --model small --output_format srt

Creates an SRT file with timestamps like:

1
00:00:00,000 --> 00:00:05,000
Hello everyone, welcome to today's meeting.

2
00:00:05,000 --> 00:00:10,000
Let's start by reviewing the project status.
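SRT is just numbered blocks of plain text, so if you ever need the segments back in Python, a few lines of stdlib code will do. A quick parser sketch (no external libraries, sample data is the snippet above):

```python
import re

def parse_srt(srt_text):
    """Parse SRT content into (start, end, text) tuples."""
    pattern = re.compile(
        r"\d+\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.*?)(?=\n\n|\Z)",
        re.S,
    )
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in pattern.finditer(srt_text)]

srt = """1
00:00:00,000 --> 00:00:05,000
Hello everyone, welcome to today's meeting.

2
00:00:05,000 --> 00:00:10,000
Let's start by reviewing the project status.
"""
for start, end, text in parse_srt(srt):
    print(start, "->", end, ":", text)
```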

Python API

Use it in your Python code:

import whisper

# Load model (downloads on first run)
model = whisper.load_model("small")

# Transcribe audio file
result = model.transcribe("audio.mp3")

# Get text
print(result["text"])

# Get segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
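If you'd rather format those segment times SRT-style yourself, the conversion is simple. A small helper (my own, not part of the whisper package):

```python
def to_srt_timestamp(seconds):
    """Convert float seconds to an SRT timestamp like 00:01:23,500."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

print(to_srt_timestamp(83.5))     # 00:01:23,500
print(to_srt_timestamp(3661.25))  # 01:01:01,250
```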

Translate audio

Transcribe and translate to English:

# Transcribe Spanish audio to English text
whisper spanish_audio.mp3 --model medium --task translate

# Or in Python
result = model.transcribe("spanish_audio.mp3", task="translate")

Different languages

Whisper auto-detects language, but you can specify:

# Specify language
whisper audio.mp3 --language Japanese

# Supported languages are listed under --language in the help output
whisper --help

Stuff I use it for

Meeting recordings

Record Zoom/Google Meet meetings, transcribe automatically:

# After meeting (writes meeting_recording.txt next to the video)
whisper meeting_recording.mp4 --model medium --output_format txt

# Search the transcript for specific topics
grep "deadline" meeting_recording.txt
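grep works, but a few lines of Python give you context around each hit, which is nicer for long meetings. A sketch (the transcript text here is made up):

```python
def find_mentions(text, keyword, context=40):
    """Return snippets of text surrounding each (case-insensitive) keyword hit."""
    hits, lower = [], text.lower()
    start = lower.find(keyword.lower())
    while start != -1:
        lo = max(0, start - context)
        hi = min(len(text), start + len(keyword) + context)
        hits.append(text[lo:hi])
        start = lower.find(keyword.lower(), start + 1)
    return hits

transcript = "We agreed the deadline is Friday. John will confirm the deadline with the client."
for snippet in find_mentions(transcript, "deadline"):
    print("...", snippet, "...")
```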

Lecture videos

Transcribe course videos for note-taking:

# Batch process all videos in a folder
for video in lectures/*.mp4; do
    whisper "$video" --model small --output_format txt
done

Podcast episodes

Searchable transcripts of favorite podcasts:

import whisper

model = whisper.load_model("medium")

# Process entire episode
result = model.transcribe("podcast_ep42.mp3")

# Save with metadata
with open("podcast_ep42.txt", "w") as f:
    f.write(result["text"])

# Now searchable!
# Can also extract quotes, find topics, etc.

Voice notes

Record thoughts on phone, transcribe later:

# Quick voice memo transcription (writes voice_memo.txt)
whisper voice_memo.m4a --model tiny --output_format txt

# Tiny model is fast enough for real-time-ish use

Subtitles generation

Auto-generate subtitles for videos:

# Generate SRT subtitles
whisper my_video.mp4 --model small --output_format srt

# Creates my_video.srt ready to use with video players

Getting better results

Faster with GPU

If you have an NVIDIA GPU:

import whisper

# Use CUDA (NVIDIA GPU)
model = whisper.load_model("small", device="cuda")

# Or specify GPU
model = whisper.load_model("small", device="cuda:0")

Speed difference on my machine:

  • CPU (M1 MacBook): ~10x audio length (10 min audio = ~100 min processing)
  • GPU (RTX 3080): ~0.1x audio length (10 min audio = ~1 min processing)
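Those factors make it easy to estimate how long a job will take before you start it. A quick calculator - the numbers are just the rough throughput from my machine, not benchmarks:

```python
# Rough throughput as a multiple of audio duration (from my machine)
SPEED_FACTOR = {"cpu": 10.0, "gpu": 0.1}

def estimate_minutes(audio_minutes, device="cpu"):
    """Estimate processing time in minutes for a given audio length."""
    return audio_minutes * SPEED_FACTOR[device]

print(estimate_minutes(10, "cpu"))  # ~100 minutes
print(estimate_minutes(10, "gpu"))  # ~1 minute
```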

Better accuracy

Tips for improving transcription:

# Use larger model for important stuff
whisper critical_audio.mp3 --model large

# Specify language if known (avoids detection errors)
whisper audio.mp3 --language English

# Keep temperature at 0 for deterministic, more accurate output
# (higher values add randomness, which usually hurts transcription)
whisper audio.mp3 --temperature 0.0

Batch processing

Process multiple files efficiently:

import whisper
from pathlib import Path

model = whisper.load_model("small")

# Process all audio files in a folder
audio_files = Path("recordings").glob("*.mp3")

for audio_file in audio_files:
    print(f"Processing {audio_file.name}...")
    result = model.transcribe(str(audio_file))

    # Save transcript
    output_file = audio_file.with_suffix(".txt")
    with open(output_file, "w") as f:
        f.write(result["text"])

print("Done!")

Speaker diarization

Whisper doesn't natively separate speakers, but you can combine it with other tools:

# Install a diarization library, e.g. pyannote.audio
pip install pyannote.audio

# Run diarization and Whisper separately, then align the two sets of
# timestamps (more complex, but works for multi-speaker audio)
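The alignment step usually means matching Whisper's timestamped segments against the diarizer's speaker turns. A minimal sketch of that matching in pure Python - the segment/turn dict shapes here are illustrative, not pyannote's actual API:

```python
def label_segments(segments, turns):
    """Assign each transcript segment the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best, best_overlap = "unknown", 0.0
        for turn in turns:
            # Overlap between [seg start, seg end] and [turn start, turn end]
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best})
    return labeled

segments = [{"start": 0.0, "end": 4.0, "text": "Hello everyone."},
            {"start": 4.0, "end": 9.0, "text": "Thanks, let's begin."}]
turns = [{"start": 0.0, "end": 4.5, "speaker": "SPEAKER_00"},
         {"start": 4.5, "end": 9.0, "speaker": "SPEAKER_01"}]

for seg in label_segments(segments, turns):
    print(seg["speaker"], ":", seg["text"])
```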

Building stuff with Whisper

Simple transcription service

Web API for transcription:

from pathlib import Path
import tempfile

from fastapi import FastAPI, UploadFile
from fastapi.responses import JSONResponse
import whisper

app = FastAPI()
model = whisper.load_model("small")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Save the upload to a unique temp file so concurrent requests don't clash
    suffix = Path(file.filename or "audio.mp3").suffix
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as buffer:
        buffer.write(await file.read())

    # Transcribe, then clean up the temp file
    result = model.transcribe(buffer.name)
    Path(buffer.name).unlink(missing_ok=True)

    return JSONResponse({
        "text": result["text"],
        "language": result["language"]
    })

# Run with: uvicorn main:app --reload

Real-time transcription

With a bit more work, you can do live transcription:

# Chunked "near real-time" transcription: record a few seconds from the
# microphone, transcribe, repeat. Not true streaming, but doable.

import numpy as np
import pyaudio
import whisper

model = whisper.load_model("tiny")
RATE = 16000  # Whisper expects 16 kHz mono audio

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=1024)

while True:
    # Record ~5 seconds, convert to float32 in [-1, 1], transcribe the chunk
    frames = [stream.read(1024) for _ in range(5 * RATE // 1024)]
    audio = np.frombuffer(b"".join(frames), np.int16).astype(np.float32) / 32768.0
    print(model.transcribe(audio, fp16=False)["text"])

Stuff that went wrong

ffmpeg not found

Whisper kept complaining about missing ffmpeg.

# Fixed by
# Installing ffmpeg (see installation section)
# Or adding to PATH if already installed

Out of memory

Large model crashed on my 8GB RAM machine.

# Fixed by
1. Using smaller model (small instead of large)
2. Closing other programs
3. Processing shorter audio chunks
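For the chunking fix, you can compute overlapping windows up front and feed each slice to ffmpeg or Whisper separately. A sketch of just the boundary math - the small overlap keeps words at a cut point from being lost:

```python
def chunk_bounds(total_seconds, chunk_seconds=600, overlap_seconds=5):
    """Return (start, end) windows covering the audio, with a small overlap."""
    bounds, start = [], 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        if end == total_seconds:
            break
        start = end - overlap_seconds  # back up so chunks overlap slightly
    return bounds

print(chunk_bounds(1500))  # [(0, 600), (595, 1195), (1190, 1500)]
```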

Wrong language detection

Sometimes detected wrong language automatically.

# Fixed by
whisper audio.mp3 --language English  # specify manually

Poor audio quality

Noisy recordings had lots of errors.

# Fixed by
1. Improving audio quality at source
2. Using larger model (medium/large)
3. Pre-processing audio with noise reduction

Too slow on CPU

Hour-long recording took all day.

# Fixed by
1. Using smaller model (tiny/base)
2. Getting a GPU (huge speedup)
3. Being patient (it's free after all)

Whisper vs alternatives

              Whisper                    Paid Services    Other Open Source
Cost          Free                       $1-2/min         Free
Accuracy      ★★★★★                      ★★★★★            ★★★☆☆
Languages     90+                        10-50            10-30
Privacy       100% local                 Cloud            Local
Speed         Slow (CPU) / Fast (GPU)    Fast             Varies
Setup         pip install                Web upload       Varies

Whisper matches or beats paid services in accuracy. The only downside is speed if you don't have a GPU.

Would I recommend it?

Absolutely. I've transcribed maybe 50 hours of audio with Whisper. Meetings, lectures, podcasts, voice notes. Accuracy is consistently good, even with technical content.

The fact that it runs locally is huge. I can process sensitive meetings without worrying about privacy. No uploading to cloud services, no API costs.

Speed is the only downside. Without a GPU, long recordings take a while. But I just start it and come back later. It's free - can't really complain.

If you need transcription regularly, Whisper saves hundreds of dollars compared to paid services. And it's often more accurate.

Links: github.com/openai/whisper | Paper: arxiv.org/abs/2212.04356