Oobabooga: The Only LLM UI You Actually Need
Tried everything; text-generation-webui is the one that stuck. GGUF, ExLlamaV2, 200+ extensions, and 85 tok/s on my RTX 4090. Here's what actually works.
Why I Stuck With Oobabooga
- Every model format: GGUF, ExLlamaV2, GPTQ, AWQ - just works
- Extension ecosystem: 200+ plugins. Training, API, notebooks - whatever you need
- 4-bit loading: Mistral 7B in 4.5GB VRAM, still fast
- ExLlamaV2 speed: 85-95 tok/s on RTX 4090. Nothing else comes close
- Daily updates: breaking changes sometimes, but always the latest features
Installation
# Clone repository
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
# Windows: One-click installer
# Run start_windows.bat from the cloned repo (it sets up its own environment)
# macOS/Linux: Manual install
# Install CUDA toolkit first (if using GPU)
# Check: https://developer.nvidia.com/cuda-downloads
# Create conda environment
conda create -n textgen python=3.11 -y
conda activate textgen
# Install dependencies (inside the environment, or with plain pip)
pip install -r requirements.txt
# Optional: Install specific loaders
pip install bitsandbytes # For 4-bit loading
pip install exllamav2 # For ExLlamaV2 (fastest)
pip install llama-cpp-python # For GGUF
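A quick way to confirm the optional loaders actually installed is to check whether their modules import. This is a stdlib-only sketch; note that llama-cpp-python imports as llama_cpp.

```python
# Sanity check after install: which loader backends are importable?
import importlib.util

def available_loaders(modules=("bitsandbytes", "exllamav2", "llama_cpp")):
    """Map each backend module name to whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

for name, ok in available_loaders().items():
    print(f"{name}: {'OK' if ok else 'missing'}")
```

Run it once before launching server.py; a "missing" entry means the matching pip install above failed or went into the wrong environment.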
Model Loading Strategies
GGUF Models (Recommended for CPU)
# Download GGUF model
mkdir models
cd models
# Using HuggingFace CLI
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .
# Or direct download
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
# Start webui with GGUF
cd text-generation-webui
python server.py \
--loader llama.cpp \
--model ../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
--n-gpu-layers -1 # All layers on GPU (or 0 for CPU)
ExLlamaV2 (Fastest for NVIDIA GPUs)
# Install ExLlamaV2
pip install exllamav2
# Download ExLlamaV2-converted model
huggingface-cli download turboderp/Mistral-7B-instruct-exl2 \
  --revision 4.0bpw --local-dir Mistral-7B-instruct-exl2-4.0bpw
# Start with ExLlamaV2 (fastest option)
python server.py \
--loader exllamav2 \
--model ../models/Mistral-7B-instruct-exl2-4.0bpw \
--gpu-memory 10 # GB VRAM to use (auto-detect if omitted)
# ExLlamaV2 features:
# - 2-3x faster than llama.cpp
# - Lowest VRAM usage
# - Only works with NVIDIA GPUs
# - Requires ExLlamaV2-converted models
4-bit Loading with bitsandbytes
# Load any HuggingFace model in 4-bit
# No conversion needed, loads directly
python server.py \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--load-in-4bit \
--compute_dtype float16 \
--quant_type nf4 # or fp4
# Quantization types:
# nf4: NormalFloat 4 (recommended, best quality)
# fp4: Float 4 (slightly faster, lower quality)
# For 8-bit (better quality, more VRAM)
python server.py \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--load-in-8bit
Common Problems & Solutions
Problem: Loading a 7B model in 4-bit on 24GB of VRAM hits an OOM error at 4096 context.
What I Tried: Reduced context to 2048, switched to 8-bit - still crashes.
Actual Fix: KV cache uses significant VRAM. Need compression and cache optimization:
# The problem: the KV cache grows linearly with context length and batch size,
# so the weights alone don't tell you the real VRAM footprint
# Solution 1: Enable KV cache quantization
python server.py \
--model mistral-7b-instruct \
--load-in-4bit \
--cache-quantization 4bit # Compress KV cache to 4-bit
# Result: 4GB weights + 3.5GB cache = 7.5GB total
# Solution 2: Use ExLlamaV2 (better cache management)
python server.py \
--loader exllamav2 \
--model mistral-7b-exl2 \
--cache-q4 # 4-bit cache
# Solution 3: Reduce context and use sliding window
python server.py \
--loader exllamav2 \
--model mistral-7b-exl2 \
--context-size 2048 \
--sliding-window 512 # Only keep last 512 tokens
# Solution 4: Offload to CPU/RAM
python server.py \
--loader exllamav2 \
--model mistral-7b-exl2 \
--gpu-memory 8 \
--compress-pos-embed # Cap GPU use at 8GB and compress positional embeddings
ExLlamaV2 with 4-bit cache is the most efficient option for NVIDIA GPUs.
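For intuition about where the VRAM goes, here is a rough cache-size estimator. It's a sketch assuming a Llama/Mistral-style decoder; the layer and head counts below are Mistral 7B's published config, and real usage also includes loader overhead and activations.

```python
# Rough KV-cache size: 2 tensors (K and V) per layer, each n_kv_heads * head_dim
# wide, one entry per cached token, at the given precision.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch=1, bytes_per_elem=2.0):
    return int(2 * n_layers * n_kv_heads * head_dim * context_len
               * batch * bytes_per_elem)

# Mistral 7B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128
fp16 = kv_cache_bytes(32, 8, 128, 4096)
q4 = kv_cache_bytes(32, 8, 128, 4096, bytes_per_elem=0.5)
print(f"fp16 cache @ 4096 ctx: {fp16 / 2**20:.0f} MiB")  # 512 MiB
print(f"4-bit cache:           {q4 / 2**20:.0f} MiB")    # 128 MiB
# The cache scales linearly with context and batch, so 32k context at
# batch 8 is 64x larger - that's where the OOMs come from.
```

Plugging in your own model's config.json values gives a quick sanity check before picking a cache quantization setting.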
Problem: Loading a model shows "tokenizer mismatch" warning and outputs garbage characters.
What I Tried: Downloaded fresh tokenizer, used different tokenizer version - same issue.
Actual Fix: GGUF files embed their own tokenizer; pointing the loader at an external one causes the mismatch:
# The problem:
python server.py \
--model mistral-7b-instruct.Q4_K_M.gguf \
--tokenizer TheBloke/Mistral-7B-Instruct-v0.2 # WRONG!
# Solution: Don't specify tokenizer for GGUF
python server.py \
--model mistral-7b-instruct.Q4_K_M.gguf
# Tokenizer is embedded in GGUF file
# For non-GGUF models, ensure matching tokenizer
python server.py \
--model TheBloke/Mistral-7B-Instruct-v0.2 \
--tokenizer TheBloke/Mistral-7B-Instruct-v0.2
# If using custom tokenizer, explicitly trust remote code
python server.py \
--model model.gguf \
--tokenizer-path ./custom-tokenizer \
--trust-remote-code
Problem: After installing SuperCUDA extension, models fail to load with "multiple CUDA streams" error.
What I Tried: Reinstalled extension, cleared cache - error persists.
Actual Fix: SuperCUDA modifies CUDA initialization. Conflicts with ExLlamaV2 and bitsandbytes. Disable conflicting extensions:
# List all extensions
ls extensions/
# Disable extensions that conflict with your loader
# Create extensions_disabled/ directory
mkdir extensions_disabled
# Move conflicting extensions
mv extensions/supercuda extensions_disabled/
mv extensions/openai extensions_disabled/ # If using different API
# Or selectively enable via command line
python server.py \
--loader exllamav2 \
--model mistral-7b-exl2 \
--extensions api notebook training # Only enable these
# Check extension compatibility (each extension lives in its own folder with a script.py)
grep -i "loader" extensions/*/script.py
# Most extensions specify compatible loaders in metadata
# Example: "# Requires: transformers, llama.cpp"
Enable extensions one at a time to identify conflicts when troubleshooting.
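The grep above can be automated. This sketch scans each extension folder for the `# Requires:` comment convention mentioned above; the convention itself is an assumption (extensions declare compatibility in different ways), so treat it as a starting point.

```python
# Scan extension folders for "# Requires: ..." comments in their .py files.
import re
from pathlib import Path

def scan_extension_requirements(extensions_dir):
    """Map extension name -> list of declared requirements, if any."""
    results = {}
    for ext in sorted(Path(extensions_dir).iterdir()):
        if not ext.is_dir():
            continue
        reqs = []
        for script in ext.glob("*.py"):
            for line in script.read_text(errors="ignore").splitlines():
                m = re.match(r"#\s*Requires:\s*(.+)", line)
                if m:
                    reqs.extend(part.strip() for part in m.group(1).split(","))
        results[ext.name] = reqs
    return results
```

Extensions that come back with an empty list simply don't declare anything; those still need the one-at-a-time test.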
Problem: Only getting 15 tokens/sec on RTX 4090 with 7B model, should be much faster.
What I Tried: Disabled all extensions, switched loaders - minimal improvement.
Actual Fix: Multiple bottlenecks: wrong loader, CPU offloading, and synchronous generation:
# The problem: llama.cpp loader with only partial GPU offload
python server.py \
--loader llama.cpp \
--model mistral-7b.gguf \
--n-gpu-layers 20 # Partial CPU offload
# Result: 15 tok/s
# Solution: Use ExLlamaV2 (GPU-optimized)
python server.py \
--loader exllamav2 \
--model mistral-7b-exl2 \
--gpu-memory 24 \
--flash-attn # Use all 24GB; enable Flash Attention 2
# Result: 85+ tok/s on RTX 4090
# Additional optimizations:
python server.py \
--loader exllamav2 \
--model mistral-7b-exl2 \
--gpu-memory 24 \
--flash-attn \
--num-experts 2 \
--rope-scale 1.0 \
--long-context # num-experts is for MoE models (Mixtral); long-context for 8k+
# Enable streaming for faster perceived response
python server.py \
--loader exllamav2 \
--model mistral-7b-exl2 \
--stream
# Benchmark command line
python server.py --benchmark
Expected speeds on RTX 4090: ExLlamaV2 80-100 tok/s, GPTQ 40-50 tok/s, GGUF 30-40 tok/s.
Problem: API requests with prompts >2000 characters return HTTP 500, but UI works fine.
What I Tried: Increased timeout, adjusted chunk size - didn't help.
Actual Fix: API has separate max_length setting that defaults too low. Need to configure API and model settings independently:
# The problem: API max_length too short
python server.py \
--api \
--model mistral-7b-instruct
# API max_length defaults to 512!
# Solution: Set API-specific limits
python server.py \
--api \
--api-blocking-port 5000 \
--api-streaming-port 5005 \
--model mistral-7b-instruct \
--max-new-tokens 2048 \
--context-size 4096
# In the API request, make sure you're hitting the right endpoint
# Blocking: POST /api/v1/generate (blocking port, 5000 here)
# Streaming: websocket /api/v1/stream (streaming port, 5005 here)
curl http://localhost:5000/api/v1/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "Long prompt...",
"max_new_tokens": 2048,
"temperature": 0.7,
"top_p": 0.9,
"truncation_length": 4096
}'
# For OpenAI-compatible API
python server.py --api --openai-api
curl http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-7b-instruct",
"messages": [{"role": "user", "content": "..."}],
"max_tokens": 2048
}'
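For scripting against the OpenAI-compatible endpoint, here is a stdlib-only client sketch; the base URL, port, and model name are placeholders for whatever your local setup uses.

```python
# Build a chat-completions request for the local OpenAI-compatible API.
import json
import urllib.request

def build_chat_request(base_url, prompt, model="local-model", max_tokens=2048):
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:5000", "Hello!")
# with urllib.request.urlopen(req) as resp:   # uncomment with the server up
#     print(json.load(resp))
```

Keeping max_tokens in the payload matters here: as noted above, relying on the server-side default is what bites long prompts.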
Essential Extensions
OpenAI API Extension
# The OpenAI-compatible API extension ships with the main repo
# under extensions/openai - no separate clone needed
# Enable extension
python server.py --extensions openai
# Configure in web UI: Settings → OpenAI
# Or via command line:
python server.py \
--extensions openai \
--openai-api-port 5000 \
--openai-api-key your-api-key
# Drop-in replacement for OpenAI API
curl http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "local-model",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Works with existing OpenAI clients
# Just change base_url to http://localhost:5000/v1
Training Extension (LoRA)
# Enable training extension
python server.py --extensions training
# Via web UI: Extensions → Training
# Upload dataset (JSONL format)
# Configure hyperparameters:
# - LoRA rank: 8-64 (higher = more capacity)
# - Learning rate: 1e-4 to 5e-4
# - Batch size: 1-4 (depending on VRAM)
# - Epochs: 1-3 (more risks overfitting)
# Training dataset format (JSONL):
{"text": "User: What is X?\nAssistant: X is..."}
{"text": "User: How do I Y?\nAssistant: To Y,..."}
# Fine-tune on custom data
python train.py \
--model mistral-7b-instruct \
--data training.jsonl \
--lora-rank 32 \
--lora-alpha 64 \
--learning-rate 2e-4 \
--batch-size 2 \
--epochs 2 \
--output-dir ./loras \
--output-name my-lora
# Load with LoRA
python server.py \
--model mistral-7b-instruct \
--lora-dir ./loras \
--lora my-lora
Notebook Mode
# Launch the web UI in notebook mode
python server.py --notebook
# Access at http://localhost:7860
# Features:
# - Code cells with markdown support
# - Variable sharing between cells
# - Export to Python script
# - Model state inspection
# Example notebook workflow:
# Cell 1: Load model
model = load_model("mistral-7b-instruct")
# Cell 2: Test generation
output = generate("Hello, world!", max_tokens=100)
print(output)
# Cell 3: Benchmark
import time
start = time.time()
generate("Long prompt...", max_tokens=1000)
print(f"Time: {time.time() - start:.2f}s")
Production Deployment
# Run as background service
nohup python server.py \
--loader exllamav2 \
--model mistral-7b-exl2 \
--api \
--listen \
--listen-port 5000 \
> logs/textgen.log 2>&1 &
# Systemd service (adjust paths, user, and flags to your setup)
sudo tee /etc/systemd/system/textgen.service <<'EOF'
[Unit]
Description=text-generation-webui
After=network.target

[Service]
WorkingDirectory=/opt/text-generation-webui
ExecStart=/opt/text-generation-webui/venv/bin/python server.py --loader exllamav2 --model mistral-7b-exl2 --api --listen
Restart=on-failure
User=textgen

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now textgen
Performance Comparison
| Loader | Model | GPU | Speed (tok/s) | VRAM |
|---|---|---|---|---|
| ExLlamaV2 | Mistral 7B (4bpw) | RTX 4090 | 85-95 | 5.5GB |
| GGUF (llama.cpp) | Mistral 7B (Q4_K_M) | RTX 4090 | 35-45 | 4.5GB |
| bitsandbytes (4-bit) | Mistral 7B (NF4) | RTX 4090 | 40-50 | 5.5GB |
| ExLlamaV2 | Mixtral 8x7B (4bpw) | 2x RTX 4090 | 45-55 | 20GB |
| GGUF | Mistral 7B (Q4_K_M) | CPU (16 core) | 3-5 | 8GB RAM |
Benchmarked at 2048 context with temperature=1.0.
Comparison with Alternatives
- Simpler, CPU-focused
- ChatGPT-like web UI
- Model training + serving
- Official repository