Whisper: Finally, Speech to Text That Actually Works

Tired of paying for transcription? Whisper is free, runs locally, and is remarkably accurate. Open source from OpenAI. Here's how to set it up.

Why I needed this

Had hours of meeting recordings, interview audio, lecture videos. Needed them transcribed. Looked at options - Rev charges $1.25/minute, Otter.ai is $20/month, and the free alternatives were garbage.

Then someone mentioned Whisper. OpenAI's open-source speech recognition model. Runs locally, free, supports like 90 languages. Sounded too good to be true.

Tried it on a 2-hour meeting recording. 10 minutes later, had a near-perfect transcript. Punctuation correct, timestamps on every segment, even caught technical terms. Was pretty blown away.

What Whisper actually does

Speech to text that actually works. Handles multiple speakers, different languages, background noise, technical terms, accents. Runs on your computer - no API calls, no privacy concerns.

Best part: it's completely free. Well, except for the electricity to run it.

So what is Whisper

Whisper is an automatic speech recognition (ASR) system trained by OpenAI on 680,000 hours of multilingual data. That scale and diversity of training data is why it handles accents, background noise, and jargon so much better than most alternatives.

The cool stuff:

  • Insanely accurate: Better than most paid services I've tried
  • Multiple languages: Works with 90+ languages out of the box
  • Timestamps: Segment-level timestamps out of the box (separating speakers needs an extra tool - more on that later)
  • Robust: Handles background noise, accents, technical terms
  • Runs locally: Your audio never leaves your machine
  • Free: Open source, no API costs
  • Multiple models: Tiny to giant - pick based on your hardware

Model sizes (trading speed for accuracy):

  • tiny: Fastest, least accurate (~1GB RAM)
  • base: Good balance (~1GB RAM)
  • small: Better accuracy (~2GB RAM)
  • medium: Even better (~5GB RAM)
  • large: Best accuracy, slowest (~10GB RAM)

I use "small" for most stuff. It's accurate enough and doesn't take forever. Only use "large" for important stuff where accuracy matters more than speed.
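If you want to make that choice in code, a rough heuristic works fine. A minimal sketch - the thresholds just mirror the RAM table above, they're not anything official:

```python
# Approximate RAM needs from the table above (GB)
MODEL_RAM = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def pick_model(free_ram_gb, preference=("large", "medium", "small", "base", "tiny")):
    """Return the most accurate model that fits in the given RAM."""
    for name in preference:
        if MODEL_RAM[name] <= free_ram_gb:
            return name
    return "tiny"  # fall back to the smallest model

print(pick_model(6))    # medium fits in 6 GB
print(pick_model(1.5))  # only tiny/base fit, so base
```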

Getting Whisper running

Prerequisites

You'll need:

  • Python 3.8 or higher
  • ffmpeg (for audio processing)
  • A decent GPU (optional but 10x faster)
  • Some RAM and disk space

Install ffmpeg

Required for audio processing:

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html

Install Whisper

Simplest way with pip:

pip install openai-whisper

That's it

You can now transcribe audio:

whisper audio.mp3

First run downloads the model (~460MB for small, ~3GB for large) to ~/.cache/whisper. After that, it's cached locally.

Using Whisper

Basic transcription

Simple command line usage:

# Transcribe with the default model
whisper audio.mp3

# Use a specific model
whisper audio.mp3 --model small

# Write transcript files next to the audio (creates audio.txt)
whisper audio.mp3 --output_format txt

# All output formats at once (txt, vtt, srt, tsv, json)
whisper audio.mp3 --output_format all

With timestamps

Get timestamps for each segment:

whisper audio.mp3 --model small --output_format srt

Creates an SRT file with timestamps like:

1
00:00:00,000 --> 00:00:05,000
Hello everyone, welcome to today's meeting.

2
00:00:05,000 --> 00:00:10,000
Let's start by reviewing the project status.
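SRT is just numbered blocks of plain text, so if you ever need the segments back in Python, a few lines of stdlib code will do. A quick parser sketch (no external libraries, sample data is the snippet above):

```python
import re

def parse_srt(srt_text):
    """Parse SRT content into (start, end, text) tuples."""
    pattern = re.compile(
        r"\d+\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.*?)(?=\n\n|\Z)",
        re.S,
    )
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in pattern.finditer(srt_text)]

srt = """1
00:00:00,000 --> 00:00:05,000
Hello everyone, welcome to today's meeting.

2
00:00:05,000 --> 00:00:10,000
Let's start by reviewing the project status.
"""
for start, end, text in parse_srt(srt):
    print(start, "->", end, ":", text)
```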

Python API

Use it in your Python code:

import whisper

# Load model (downloads on first run)
model = whisper.load_model("small")

# Transcribe audio file
result = model.transcribe("audio.mp3")

# Get text
print(result["text"])

# Get segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text']}")
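If you'd rather format those segment times SRT-style yourself, the conversion is simple. A small helper (my own, not part of the whisper package):

```python
def to_srt_timestamp(seconds):
    """Convert float seconds to an SRT timestamp like 00:01:23,500."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

print(to_srt_timestamp(83.5))     # 00:01:23,500
print(to_srt_timestamp(3661.25))  # 01:01:01,250
```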

Translate audio

Transcribe and translate to English:

# Transcribe Spanish audio to English text
whisper spanish_audio.mp3 --model medium --task translate

# Or in Python
result = model.transcribe("spanish_audio.mp3", task="translate")

Different languages

Whisper auto-detects language, but you can specify:

# Specify language
whisper audio.mp3 --language Japanese

# Supported languages are listed under --language in the help output
whisper --help

Stuff I use it for

Meeting recordings

Record Zoom/Google Meet meetings, transcribe automatically:

# After meeting (writes meeting_recording.txt next to the video)
whisper meeting_recording.mp4 --model medium --output_format txt

# Search the transcript for specific topics
grep "deadline" meeting_recording.txt
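grep works, but a few lines of Python give you context around each hit, which is nicer for long meetings. A sketch (the transcript text here is made up):

```python
def find_mentions(text, keyword, context=40):
    """Return snippets of text surrounding each (case-insensitive) keyword hit."""
    hits, lower = [], text.lower()
    start = lower.find(keyword.lower())
    while start != -1:
        lo = max(0, start - context)
        hi = min(len(text), start + len(keyword) + context)
        hits.append(text[lo:hi])
        start = lower.find(keyword.lower(), start + 1)
    return hits

transcript = "We agreed the deadline is Friday. John will confirm the deadline with the client."
for snippet in find_mentions(transcript, "deadline"):
    print("...", snippet, "...")
```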

Lecture videos

Transcribe course videos for note-taking:

# Batch process all videos in a folder
for video in lectures/*.mp4; do
    whisper "$video" --model small --output_format txt
done

Podcast episodes

Searchable transcripts of favorite podcasts:

import whisper

model = whisper.load_model("medium")

# Process entire episode
result = model.transcribe("podcast_ep42.mp3")

# Save with metadata
with open("podcast_ep42.txt", "w") as f:
    f.write(result["text"])

# Now searchable!
# Can also extract quotes, find topics, etc.

Voice notes

Record thoughts on phone, transcribe later:

# Quick voice memo transcription (writes voice_memo.txt)
whisper voice_memo.m4a --model tiny --output_format txt

# Tiny model is fast enough for real-time-ish use

Subtitles generation

Auto-generate subtitles for videos:

# Generate SRT subtitles
whisper my_video.mp4 --model small --output_format srt

# Creates my_video.srt ready to use with video players

Getting better results

Faster with GPU

If you have an NVIDIA GPU:

import whisper

# Use CUDA (NVIDIA GPU)
model = whisper.load_model("small", device="cuda")

# Or specify GPU
model = whisper.load_model("small", device="cuda:0")

Speed difference on my machine:

  • CPU (M1 MacBook): ~10x audio length (10 min audio = ~100 min processing)
  • GPU (RTX 3080): ~0.1x audio length (10 min audio = ~1 min processing)
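Those factors make it easy to estimate how long a job will take before you start it. A quick calculator - the numbers are just the rough throughput from my machine, not benchmarks:

```python
# Rough throughput as a multiple of audio duration (from my machine)
SPEED_FACTOR = {"cpu": 10.0, "gpu": 0.1}

def estimate_minutes(audio_minutes, device="cpu"):
    """Estimate processing time in minutes for a given audio length."""
    return audio_minutes * SPEED_FACTOR[device]

print(estimate_minutes(10, "cpu"))  # ~100 minutes
print(estimate_minutes(10, "gpu"))  # ~1 minute
```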

Better accuracy

Tips for improving transcription:

# Use larger model for important stuff
whisper critical_audio.mp3 --model large

# Specify language if known (avoids detection errors)
whisper audio.mp3 --language English

# Keep temperature at 0 for deterministic, more accurate output
# (higher values add randomness, which usually hurts transcription)
whisper audio.mp3 --temperature 0.0

Batch processing

Process multiple files efficiently:

import whisper
from pathlib import Path

model = whisper.load_model("small")

# Process all audio files in a folder
audio_files = Path("recordings").glob("*.mp3")

for audio_file in audio_files:
    print(f"Processing {audio_file.name}...")
    result = model.transcribe(str(audio_file))

    # Save transcript
    output_file = audio_file.with_suffix(".txt")
    with open(output_file, "w") as f:
        f.write(result["text"])

print("Done!")

Speaker diarization

Whisper doesn't natively separate speakers, but you can combine it with other tools:

# Install a diarization library, e.g. pyannote.audio
pip install pyannote.audio

# Run diarization and Whisper separately, then align the two sets of
# timestamps (more complex, but works for multi-speaker audio)
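The alignment step usually means matching Whisper's timestamped segments against the diarizer's speaker turns. A minimal sketch of that matching in pure Python - the segment/turn dict shapes here are illustrative, not pyannote's actual API:

```python
def label_segments(segments, turns):
    """Assign each transcript segment the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best, best_overlap = "unknown", 0.0
        for turn in turns:
            # Overlap between [seg start, seg end] and [turn start, turn end]
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best})
    return labeled

segments = [{"start": 0.0, "end": 4.0, "text": "Hello everyone."},
            {"start": 4.0, "end": 9.0, "text": "Thanks, let's begin."}]
turns = [{"start": 0.0, "end": 4.5, "speaker": "SPEAKER_00"},
         {"start": 4.5, "end": 9.0, "speaker": "SPEAKER_01"}]

for seg in label_segments(segments, turns):
    print(seg["speaker"], ":", seg["text"])
```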

Building stuff with Whisper

Simple transcription service

Web API for transcription:

from pathlib import Path
import tempfile

from fastapi import FastAPI, UploadFile
from fastapi.responses import JSONResponse
import whisper

app = FastAPI()
model = whisper.load_model("small")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Save the upload to a unique temp file so concurrent requests don't clash
    suffix = Path(file.filename or "audio.mp3").suffix
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as buffer:
        buffer.write(await file.read())

    # Transcribe, then clean up the temp file
    result = model.transcribe(buffer.name)
    Path(buffer.name).unlink(missing_ok=True)

    return JSONResponse({
        "text": result["text"],
        "language": result["language"]
    })

# Run with: uvicorn main:app --reload

Real-time transcription

With a bit more work, you can do live transcription:

# Chunked "near real-time" transcription: record a few seconds from the
# microphone, transcribe, repeat. Not true streaming, but doable.

import numpy as np
import pyaudio
import whisper

model = whisper.load_model("tiny")
RATE = 16000  # Whisper expects 16 kHz mono audio

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=1024)

while True:
    # Record ~5 seconds, convert to float32 in [-1, 1], transcribe the chunk
    frames = [stream.read(1024) for _ in range(5 * RATE // 1024)]
    audio = np.frombuffer(b"".join(frames), np.int16).astype(np.float32) / 32768.0
    print(model.transcribe(audio, fp16=False)["text"])

Stuff that went wrong

ffmpeg not found

Whisper kept complaining about missing ffmpeg.

# Fixed by
# Installing ffmpeg (see installation section)
# Or adding to PATH if already installed

Out of memory

Large model crashed on my 8GB RAM machine.

# Fixed by
1. Using smaller model (small instead of large)
2. Closing other programs
3. Processing shorter audio chunks
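For the chunking fix, you can compute overlapping windows up front and feed each slice to ffmpeg or Whisper separately. A sketch of just the boundary math - the small overlap keeps words at a cut point from being lost:

```python
def chunk_bounds(total_seconds, chunk_seconds=600, overlap_seconds=5):
    """Return (start, end) windows covering the audio, with a small overlap."""
    bounds, start = [], 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        if end == total_seconds:
            break
        start = end - overlap_seconds  # back up so chunks overlap slightly
    return bounds

print(chunk_bounds(1500))  # [(0, 600), (595, 1195), (1190, 1500)]
```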

Wrong language detection

Sometimes detected wrong language automatically.

# Fixed by
whisper audio.mp3 --language English  # specify manually

Poor audio quality

Noisy recordings had lots of errors.

# Fixed by
1. Improving audio quality at source
2. Using larger model (medium/large)
3. Pre-processing audio with noise reduction

Too slow on CPU

Hour-long recording took all day.

# Fixed by
1. Using smaller model (tiny/base)
2. Getting a GPU (huge speedup)
3. Being patient (it's free after all)

Whisper vs alternatives

              Whisper                    Paid Services    Other Open Source
Cost          Free                       $1-2/min         Free
Accuracy      ★★★★★                      ★★★★★            ★★★☆☆
Languages     90+                        10-50            10-30
Privacy       100% local                 Cloud            Local
Speed         Slow (CPU) / Fast (GPU)    Fast             Varies
Setup         pip install                Web upload       Varies

Whisper matches or beats paid services in accuracy. The only downside is speed if you don't have a GPU.

Would I recommend it?

Absolutely. I've transcribed maybe 50 hours of audio with Whisper. Meetings, lectures, podcasts, voice notes. Accuracy is consistently good, even with technical content.

The fact that it runs locally is huge. I can process sensitive meetings without worrying about privacy. No uploading to cloud services, no API costs.

Speed is the only downside. Without a GPU, long recordings take a while. But I just start it and come back later. It's free - can't really complain.

If you need transcription regularly, Whisper saves hundreds of dollars compared to paid services. And it's often more accurate.

Links: github.com/openai/whisper | Paper: arxiv.org/abs/2212.04356