
Oobabooga: The Only LLM UI You Actually Need

Tried everything - Text-generation-webui is the one that stuck. GGUF, ExLlamaV2, 200+ extensions. Gets 85 tok/s on my RTX 4090. Here's what actually works.

Why I Stuck With Oobabooga

Installation

# Clone repository
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui

# Windows: one-click installer
# Run start_windows.bat from the repo root (it sets up the environment for you)

# macOS/Linux: Manual install
# Install CUDA toolkit first (if using GPU)
# Check: https://developer.nvidia.com/cuda-downloads

# Create conda environment
conda create -n textgen python=3.11 -y
conda activate textgen

# Install dependencies
pip install -r requirements.txt

# Optional: Install specific loaders
pip install bitsandbytes  # For 4-bit loading
pip install exllamav2  # For ExLlamaV2 (fastest)
pip install llama-cpp-python  # For GGUF

Model Loading Strategies

GGUF Models (Recommended for CPU or CPU+GPU splits)

# Download a GGUF model into the webui's models/ directory
cd models

# Using HuggingFace CLI
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .

# Or direct download
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Start webui with GGUF
cd ..
python server.py \
  --loader llama.cpp \
  --model mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --n-gpu-layers -1  # all layers on GPU (use 0 for CPU-only)

ExLlamaV2 (Fastest for NVIDIA GPUs)

# Install ExLlamaV2
pip install exllamav2

# Download an ExLlamaV2-converted model (quants live on branches, so use --revision)
huggingface-cli download turboderp/Mistral-7B-instruct-exl2 \
  --revision 4.0bpw --local-dir models/Mistral-7B-instruct-exl2-4.0bpw

# Start with ExLlamaV2 (fastest option)
python server.py \
  --loader exllamav2 \
  --model Mistral-7B-instruct-exl2-4.0bpw \
  --gpu-split 10  # GB of VRAM per GPU; omit to auto-split

# ExLlamaV2 features:
# - 2-3x faster than llama.cpp
# - Lowest VRAM usage
# - Only works with NVIDIA GPUs
# - Requires ExLlamaV2-converted models

4-bit Loading with bitsandbytes

# Load any HuggingFace model in 4-bit
# No conversion needed, loads directly

python server.py \
  --model TheBloke/Mistral-7B-Instruct-v0.2 \
  --load-in-4bit \
  --compute_dtype float16 \
  --quant_type nf4  # or fp4

# Quantization types:
# nf4: NormalFloat 4 (recommended, best quality)
# fp4: Float 4 (slightly faster, lower quality)

# For 8-bit (better quality, more VRAM)
python server.py \
  --model TheBloke/Mistral-7B-Instruct-v0.2 \
  --load-in-8bit
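Before picking a precision, a back-of-envelope check helps: weight memory is just parameter count times bits per weight. This is a rough sketch in plain Python; the ~1GB allowance for activations and buffers is an assumption, not a measured figure:

```python
def model_vram_gb(n_params: float, bits_per_weight: float, overhead_gb: float = 1.0) -> float:
    """Approximate VRAM needed for the model weights alone.

    n_params: parameter count (e.g. 7e9 for a 7B model)
    bits_per_weight: 4 for nf4/fp4, 8 for int8, 16 for fp16
    overhead_gb: rough allowance for activations/buffers (assumed)
    """
    return n_params * bits_per_weight / 8 / 1024**3 + overhead_gb

# A 7B model at different precisions:
for bits in (4, 8, 16):
    print(f"{bits}-bit: ~{model_vram_gb(7e9, bits):.1f} GB")
```

This is why a 4-bit 7B model fits in roughly 4-5GB while the fp16 original needs around 14GB.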

Common Problems & Solutions

Issue #4521: "CUDA out of memory" with 24GB GPU
github.com/oobabooga/text-generation-webui/issues/4521

Problem: Loading 7B model in 4-bit with 24GB VRAM causes OOM error at 4096 context.

What I Tried: Reduced context to 2048, switched to 8-bit - still crashes.

Actual Fix: KV cache uses significant VRAM. Need compression and cache optimization:

# The problem: the KV cache grows linearly with context length, so the
# model can load fine and still OOM once generation fills the context

# Solution 1: quantize the KV cache (ExLlamaV2 loader)
python server.py \
  --loader exllamav2 \
  --model mistral-7b-exl2 \
  --cache_4bit  # compress the KV cache to 4-bit

# Solution 2: reduce the maximum context length
python server.py \
  --loader exllamav2 \
  --model mistral-7b-exl2 \
  --max_seq_len 2048

# Solution 3: partially offload to CPU/RAM with llama.cpp
python server.py \
  --loader llama.cpp \
  --model mistral-7b-instruct.Q4_K_M.gguf \
  --n-gpu-layers 24  # remaining layers stay in system RAM

ExLlamaV2 with 4-bit cache is the most efficient option for NVIDIA GPUs.
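The cache arithmetic is easy to sketch. The geometry below is Mistral-7B's published configuration (32 layers, 8 KV heads of dimension 128 via grouped-query attention); the bytes-per-element values correspond to an fp16, 8-bit, or 4-bit cache:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: float) -> float:
    """KV cache size: keys + values for every layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

# Mistral-7B-style geometry at 4096 context:
for name, nbytes in (("fp16", 2), ("q8", 1), ("q4", 0.5)):
    print(f"{name} cache: {kv_cache_gb(32, 8, 128, 4096, nbytes):.2f} GB")
```

Quartering the cache matters most at long contexts, where the linear growth in `context_len` dominates.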

Issue #4678: "ValueError: Tokenizer mismatch" error
github.com/oobabooga/text-generation-webui/issues/4678

Problem: Loading a model shows "tokenizer mismatch" warning and outputs garbage characters.

What I Tried: Downloaded fresh tokenizer, used different tokenizer version - same issue.

Actual Fix: GGUF models have embedded tokenizer. Using external tokenizer causes mismatch:

# The problem: wrapping a GGUF in an external tokenizer via llamacpp_HF
python server.py \
  --loader llamacpp_HF \
  --model mistral-7b-gguf-folder  # .gguf plus mismatched tokenizer files (WRONG pairing)

# Solution: use the plain llama.cpp loader for GGUF files
python server.py \
  --loader llama.cpp \
  --model mistral-7b-instruct.Q4_K_M.gguf
  # The tokenizer is embedded in the GGUF file itself

# For non-GGUF HuggingFace models, the tokenizer ships inside the model
# folder; re-download the whole folder rather than mixing tokenizer files
# from different versions

# If the model's tokenizer needs custom code, allow it explicitly
python server.py \
  --model TheBloke/Mistral-7B-Instruct-v0.2 \
  --trust-remote-code

Issue #4792: Extension "SuperCUDA" breaks model loading
github.com/oobabooga/text-generation-webui/issues/4792

Problem: After installing SuperCUDA extension, models fail to load with "multiple CUDA streams" error.

What I Tried: Reinstalled extension, cleared cache - error persists.

Actual Fix: SuperCUDA modifies CUDA initialization. Conflicts with ExLlamaV2 and bitsandbytes. Disable conflicting extensions:

# List all extensions
ls extensions/

# Disable extensions that conflict with your loader
# Create extensions_disabled/ directory
mkdir extensions_disabled

# Move conflicting extensions
mv extensions/supercuda extensions_disabled/
mv extensions/openai extensions_disabled/  # If using different API

# Or selectively enable via command line
python server.py \
  --loader exllamav2 \
  --model mistral-7b-exl2 \
  --extensions openai gallery  # only enable these

# Check what an extension hooks into (each extension ships a script.py)
grep -ril "loader" extensions/*/script.py

# Read an extension's script.py or README before pairing it with a new loader

Enable extensions one at a time to identify conflicts when troubleshooting.
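With many extensions installed, the one-at-a-time search is slow; a simple bisection finds the culprit in O(log n) relaunches. This sketch is generic Python: `fails()` is a hypothetical stand-in for "relaunch the server with only this subset enabled and check whether loading breaks", and it assumes a single extension is responsible:

```python
def find_conflicting_extension(extensions, fails):
    """Bisect a list of extensions to find the one whose presence breaks loading.

    `fails(subset)` is a hypothetical predicate: launch with only `subset`
    enabled and report whether the failure reproduces.
    Assumes exactly one extension is responsible.
    """
    candidates = list(extensions)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the failure
        candidates = half if fails(half) else candidates[len(half):]
    return candidates[0]

# Example with a stand-in predicate: "supercuda" is the culprit
exts = ["api", "notebook", "supercuda", "training", "gallery"]
print(find_conflicting_extension(exts, lambda subset: "supercuda" in subset))  # supercuda
```

Ten extensions means four relaunches instead of ten.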

Issue #4889: Very slow generation on RTX 4090
github.com/oobabooga/text-generation-webui/issues/4889

Problem: Only getting 15 tokens/sec on RTX 4090 with 7B model, should be much faster.

What I Tried: Disabled all extensions, switched loaders - minimal improvement.

Actual Fix: Multiple bottlenecks: wrong loader, CPU offloading, and synchronous generation:

# The problem: Using llama.cpp loader (CPU-optimized)
python server.py \
  --loader llama.cpp \
  --model mistral-7b.gguf \
  --n-gpu-layers 20  # Partial CPU offload
# Result: 15 tok/s

# Solution: use ExLlamaV2 (GPU-optimized; flash attention is on by default)
python server.py \
  --loader exllamav2 \
  --model mistral-7b-exl2
# Result: 85+ tok/s on RTX 4090

# Additional knobs (MoE routing for Mixtral, NTK RoPE scaling for 8k+ context):
python server.py \
  --loader exllamav2 \
  --model mixtral-8x7b-exl2 \
  --num_experts_per_token 2 \
  --alpha_value 2.0

# Token streaming is enabled by default in both the UI and the API,
# so responses start appearing before generation finishes

# To measure speed, watch the console: after every generation the server
# prints a line like "Output generated in 2.4 seconds (85.1 tokens/s, ...)"

Expected speeds on RTX 4090: ExLlamaV2 80-100 tok/s, GPTQ 40-50 tok/s, GGUF 30-40 tok/s.
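To compare speeds across runs, you can scrape the per-generation line the server prints to the console. The exact wording varies between versions, so treat the regex below as an assumption to adapt rather than a stable interface:

```python
import re

# Matches lines of the form "... (85.1 tokens/s, 200 tokens, ...)";
# the format is assumed from typical console output and may differ per version
LOG_RE = re.compile(r"\(([\d.]+) tokens/s, (\d+) tokens")

def parse_speed(line: str):
    """Extract (tokens_per_sec, n_tokens) from a console log line, or None."""
    m = LOG_RE.search(line)
    return (float(m.group(1)), int(m.group(2))) if m else None

line = "Output generated in 2.35 seconds (85.1 tokens/s, 200 tokens, context 512)"
print(parse_speed(line))  # (85.1, 200)
```

Piping the log through this while testing loaders gives you an honest average instead of one cherry-picked run.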

Issue #4934: API returns 500 error on long prompts
github.com/oobabooga/text-generation-webui/issues/4934

Problem: API requests with prompts >2000 characters return HTTP 500, but UI works fine.

What I Tried: Increased timeout, adjusted chunk size - didn't help.

Actual Fix: the API applies its own truncation and length defaults, separate from the UI; raise them per request and make sure you hit the right endpoint:

# The problem: relying on the UI's settings; API requests that omit
# truncation_length and max_new_tokens fall back to low defaults
python server.py \
  --api \
  --model mistral-7b-instruct

# Solution: start the legacy API on its two ports and set the limits
# per request (they are request parameters, not server flags)
python server.py \
  --api \
  --api-blocking-port 5000 \
  --api-streaming-port 5005 \
  --model mistral-7b-instruct

# Use the correct endpoints:
# POST /api/v1/generate      (blocking, port 5000)
# websocket /api/v1/stream   (streaming, port 5005)

curl http://localhost:5000/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Long prompt...",
    "max_new_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.9,
    "truncation_length": 4096
  }'

# For the OpenAI-compatible API, enable the openai extension
# (it listens on port 5001 by default; override with OPENEDAI_PORT)
python server.py --extensions openai

curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [{"role": "user", "content": "..."}],
    "max_tokens": 2048
  }'
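The same request works from Python with only the standard library. This is a minimal sketch: the base URL and port mirror the curl example above, and nothing else here is webui-specific:

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str, max_tokens: int = 2048,
                 temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def post_chat(base_url: str, payload: dict) -> dict:
    """POST the payload to an OpenAI-compatible endpoint and decode the reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Assumes a local server is running with the OpenAI-compatible API enabled
    body = chat_payload("mistral-7b-instruct", "Hello!", max_tokens=64)
    print(post_chat("http://localhost:5000", body)["choices"][0]["message"]["content"])
```

Because the shape matches OpenAI's, swapping in the official client later only requires changing the base URL back.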

Essential Extensions

OpenAI API Extension

# The openai extension ships with the webui under extensions/openai;
# no separate clone is needed

# Enable extension
python server.py --extensions openai

# Override the default port (5001) via environment variable
OPENEDAI_PORT=5000 python server.py --extensions openai

# Drop-in replacement for the OpenAI API
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Works with existing OpenAI clients
# Just change base_url to http://localhost:5000/v1

Training Extension (LoRA)

# Enable training extension
python server.py --extensions training

# Via web UI: Extensions → Training
# Upload dataset (JSONL format)
# Configure hyperparameters:
# - LoRA rank: 8-64 (higher = more capacity)
# - Learning rate: 1e-4 to 5e-4
# - Batch size: 1-4 (depending on VRAM)
# - Epochs: 1-3 (more than that risks overfitting)

# Training dataset format (JSONL):
{"text": "User: What is X?\nAssistant: X is..."}
{"text": "User: How do I Y?\nAssistant: To Y,..."}
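A quick way to produce and sanity-check that format (plain Python; the `training.jsonl` file name matches the examples in this section):

```python
import json

def write_dataset(path, pairs):
    """Write (question, answer) pairs in the single-'text'-field JSONL format."""
    with open(path, "w", encoding="utf-8") as f:
        for q, a in pairs:
            f.write(json.dumps({"text": f"User: {q}\nAssistant: {a}"}) + "\n")

def validate_dataset(path):
    """Check every line is valid JSON with a non-empty 'text' field; return row count."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            row = json.loads(line)
            assert row.get("text"), f"line {i}: missing 'text' field"
    return i

write_dataset("training.jsonl", [("What is X?", "X is..."), ("How do I Y?", "To Y,...")])
print(validate_dataset("training.jsonl"))  # 2
```

Running the validator before a training run catches the malformed-line errors that otherwise surface minutes into a job.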

# Fine-tuning runs from the web UI's Training tab (there is no separate
# training CLI): load the base model, select training.jsonl, set the
# hyperparameters above, and start the run. The finished LoRA is saved
# under loras/.

# Load with LoRA
python server.py \
  --model mistral-7b-instruct \
  --lora-dir ./loras \
  --lora my-lora

Notebook Mode

# Launch straight into notebook mode (newer builds expose it as a UI tab)
python server.py --notebook

# Access at http://localhost:7860

# Notebook mode is a raw, free-form prompt box: your text is sent to the
# model verbatim, with no chat template applied. Useful for:
# - testing base (non-instruct) models
# - debugging prompt templates character by character
# - quick generation experiments without conversation state

Production Deployment

# Run as background service
mkdir -p logs
nohup python server.py \
  --loader exllamav2 \
  --model mistral-7b-exl2 \
  --api \
  --listen \
  --listen-port 7860 \
  > logs/textgen.log 2>&1 &

# Systemd service (adjust User and WorkingDirectory to your install)
sudo tee /etc/systemd/system/textgen.service <<'EOF'
[Unit]
Description=text-generation-webui
After=network.target

[Service]
Type=simple
User=textgen
WorkingDirectory=/opt/text-generation-webui
ExecStart=/usr/bin/python3 server.py --loader exllamav2 --model mistral-7b-exl2 --api --listen
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now textgen

Performance Comparison

Loader                 Model                 GPU             Speed (tok/s)   VRAM
ExLlamaV2              Mistral 7B (4bpw)     RTX 4090        85-95           5.5GB
GGUF (llama.cpp)       Mistral 7B (Q4_K_M)   RTX 4090        35-45           4.5GB
bitsandbytes (4-bit)   Mistral 7B (NF4)      RTX 4090        40-50           5.5GB
ExLlamaV2              Mixtral 8x7B (4bpw)   2x RTX 4090     45-55           20GB
GGUF (llama.cpp)       Mistral 7B (Q4_K_M)   CPU (16-core)   3-5             8GB RAM

Benchmarked at 2048 context with temperature=1.0.

Comparison with Alternatives

GPT4All: simpler, CPU-focused
Open-WebUI: ChatGPT-like web UI
FastChat: model training + serving

Oobabooga GitHub: official repository