FastChat: How I Trained My Own Vicuna
FastChat is what powers Vicuna. I used it to fine-tune models on my own data. Got multi-GPU serving working after some pain. Here's the real setup.
Why FastChat Over Just Using Oobabooga
- This is Vicuna's home: the actual platform that trained Vicuna, not just a UI
- Training that works: LLaMA fine-tuning on my docs - the pipeline just works
- Distributed serving: got 70B running across 2 GPUs after some config fighting
- OpenAI drop-in: changed one URL in my app, now it talks to my model
- MT-Bench included: actually benchmark my models against GPT-4
Installation and Setup
# Clone repository
git clone https://github.com/lm-sys/FastChat
cd FastChat
# Install with pip
pip install -e .
# Or with specific dependencies
pip install -e ".[model_worker,webui,eval]"
# For GPU training (optional)
pip install -e ".[train]"
# Verify installation (downloads ~13GB of weights on first run)
python -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --num-gpus 1
Downloading Vicuna Model Weights
# Option 1: Direct download from HuggingFace
# Vicuna v1.5 (recommended)
huggingface-cli download lmsys/vicuna-7b-v1.5
# Vicuna v1.3 (older, lighter)
huggingface-cli download lmsys/vicuna-7b-v1.3
# Option 2: Using Python API
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "lmsys/vicuna-7b-v1.5"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_4bit=True,  # for memory efficiency; needs bitsandbytes installed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Option 3: Let FastChat fetch it - any serve command given a
# HuggingFace repo id downloads the weights automatically on first run
python -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5
Running the Service
# Three-controller architecture for production
# Terminal 1: Controller (central dispatcher)
python -m fastchat.serve.controller
# Terminal 2: Model Worker (GPU inference)
python -m fastchat.serve.model_worker \
--model-path lmsys/vicuna-7b-v1.5 \
--device cuda \
--num-gpus 1 \
    --max-gpu-memory 20GB \
    --load-8bit \
    --conv-template vicuna_v1.1
# Terminal 3: Web Server (Gradio UI)
python -m fastchat.serve.gradio_web_server
# Terminal 4: OpenAI API Server
python -m fastchat.serve.openai_api_server \
--host localhost \
--port 8000
# Access web UI: http://localhost:7860
# API endpoint: http://localhost:8000/v1
Common Problems & Solutions
Problem: Loading Vicuna 7B on 24GB GPU fails with OOM, even with 4-bit quantization.
What I Tried: Reduced max-gpu-memory, switched to 8-bit - still crashes.
Actual Fix: at serve time the GPU holds the model weights plus a growing KV cache and activation buffers (no optimizer states are involved - that's training). Cap GPU memory and offload the remainder to CPU:
# The problem: Loading full model into GPU
python -m fastchat.serve.model_worker \
--model-path vicuna-7b-v1.5 \
--num-gpus 1
# OOM! fp16 weights alone are ~14GB; KV cache + activations
# under load can push past 24GB
# Solution 1: Cap GPU memory and offload the remainder to CPU
# (FastChat's --cpu-offloading requires --load-8bit)
python -m fastchat.serve.model_worker \
    --model-path vicuna-7b-v1.5 \
    --load-8bit \
    --cpu-offloading \
    --max-gpu-memory 15GB
# Solution 2: 8-bit quantization (FastChat's built-in flag;
# for 4-bit, serve GPTQ/AWQ-quantized weights instead)
python -m fastchat.serve.model_worker \
    --model-path vicuna-7b-v1.5 \
    --load-8bit \
    --device cuda \
    --num-gpus 1
# Solution 3: Multi-GPU with model parallel
python -m fastchat.serve.model_worker \
--model-path vicuna-7b-v1.5 \
--num-gpus 2 \
--max-gpu-memory 20GB
# Solution 4: Don't reach for vicuna-7b-v1.5-16k to save memory -
# it's the SAME parameter count with a much bigger KV cache.
# 7B is already the smallest Vicuna; if it still doesn't fit,
# quantize harder or shorten the context
python -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --load-8bit
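A rough way to see why quantization rescues the OOM: resident weight memory is just parameter count times bits per weight. A back-of-envelope sketch (ignores KV cache, activations, and framework overhead, which is exactly the slack that bites you at serve time):

```python
def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate: parameters x bits per weight.
    Ignores KV cache, activations, and framework overhead."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Vicuna-7B at different precisions (weights only)
for bits, name in [(16, "fp16"), (8, "int8"), (4, "int4")]:
    print(f"{name}: ~{weights_gb(7, bits):.1f} GB")
# fp16: ~14.0 GB, int8: ~7.0 GB, int4: ~3.5 GB
```

With ~14GB of fp16 weights on a 24GB card, only ~10GB is left for cache and activations; at 8-bit that headroom roughly doubles.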
Problem: API server returns "No worker available" even though model worker is running.
What I Tried: Restarted worker, checked logs - no errors shown.
Actual Fix: the worker, controller, and web server must all agree on the controller address, and the worker's advertised --worker-address must be reachable from the controller (the controller calls the worker back at that address during registration and status checks). Defaults work on one machine, but any custom host, Docker bridge, or firewall silently breaks registration:
# The problem: addresses drift apart
# Controller listens on :21001
python -m fastchat.serve.controller
# Worker advertises a --worker-address the controller can't reach back
python -m fastchat.serve.model_worker --model-path vicuna-7b
# Registration fails or heartbeats expire -> "No worker available"
# Solution: Explicitly set controller address
# Terminal 1:
python -m fastchat.serve.controller \
--host localhost \
--port 21001
# Terminal 2: Match controller address
python -m fastchat.serve.model_worker \
--model-path vicuna-7b-v1.5 \
--controller-address http://localhost:21001 \
--worker-address http://localhost:21002
# Terminal 3: Match controller address
python -m fastchat.serve.gradio_web_server \
    --controller-url http://localhost:21001 \
    --model-list-mode reload
# Verify registration (list_models is a POST endpoint)
curl -X POST http://localhost:21001/list_models
# Should return: {"models": ["vicuna-7b-v1.5"]}
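Under the hood the controller is little more than a registry from model names to worker addresses. A toy sketch of that logic (illustrative only - this is not FastChat's actual code, just the shape of the handshake):

```python
class ToyController:
    """Minimal stand-in for the controller's registry logic."""

    def __init__(self):
        self.workers = {}  # worker_address -> list of model names

    def register_worker(self, worker_address, model_names):
        # Real controller also calls the worker back to verify it's reachable
        self.workers[worker_address] = list(model_names)

    def list_models(self):
        models = set()
        for names in self.workers.values():
            models.update(names)
        return sorted(models)

    def get_worker(self, model_name):
        # Real controller load-balances; this just picks any match
        for addr, names in self.workers.items():
            if model_name in names:
                return addr
        return None  # -> the dreaded "No worker available"

ctl = ToyController()
ctl.register_worker("http://localhost:21002", ["vicuna-7b-v1.5"])
print(ctl.list_models())                 # ['vicuna-7b-v1.5']
print(ctl.get_worker("vicuna-7b-v1.5"))  # http://localhost:21002
print(ctl.get_worker("llama-13b"))       # None
```

If registration never succeeds (unreachable worker address), the registry stays empty and every request gets the "no worker" error, even though the worker process itself is happily running.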
Problem: Fine-tuning Vicuna on custom data starts well, but loss spikes after 500 steps.
What I Tried: Reduced learning rate, added gradient clipping - that delayed the spike but didn't prevent divergence.
Actual Fix: Learning rate schedule and data quality issues. Need cosine annealing and data cleaning:
"""
Fine-tune Vicuna on custom data
"""
# The problem: a constant learning rate causes instability
training_args = dict(
    learning_rate=2e-5,
    lr_scheduler_type="constant",  # BAD! (note: the HF key is lr_scheduler_type)
    num_train_epochs=3,
)
# Loss spikes when the model hits unstable regions

# Solution: cosine annealing with warmup
training_args = dict(
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,  # 3% warmup
    num_train_epochs=3,
    max_grad_norm=1.0,  # gradient clipping for stability
    weight_decay=0.01,  # regularization
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch: 2 * 8 = 16
    fp16=True,  # or bf16=True if supported
    gradient_checkpointing=True,  # memory optimization
    optim="adamw_torch",
)
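To see why this schedule helps, here's a self-contained sketch of linear warmup followed by cosine decay (the same shape HF's cosine scheduler with warmup_ratio produces - the learning rate ramps up gently, peaks, then anneals toward zero instead of hammering the model at full rate forever):

```python
import math

def lr_at(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # ramp up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # anneal down

total = 1000  # 3% warmup -> 30 warmup steps
print(lr_at(0, total))     # 0.0 - gentle start
print(lr_at(30, total))    # peak (base_lr) at the end of warmup
print(lr_at(500, total))   # roughly half of base_lr mid-run
print(lr_at(1000, total))  # ~0 at the end
```

By step 500+ the rate is low enough that hitting an "unstable region" no longer launches the loss into orbit, which is exactly the failure mode a constant schedule invites.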
# Data quality: Filter out bad examples
def clean_dataset(data):
    cleaned = []
    for example in data:
        # Remove too-short examples
        if len(example["text"]) < 50:
            continue
        # Remove examples with special-token issues
        if "<|endoftext|>" in example["text"]:
            continue
        # Balance instruction/response lengths
        instruction = example["instruction"]
        response = example["output"]
        if len(response) < len(instruction) * 0.5:
            continue  # Response too short
        if len(response) > len(instruction) * 5:
            continue  # Response too long
        cleaned.append(example)
    return cleaned
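The thresholds above are heuristics; here's a quick standalone sanity check of the length-ratio logic, condensed to a single predicate with the same cutoffs:

```python
# Same thresholds as clean_dataset above, condensed to one predicate
def keep(example):
    text = example["text"]
    if len(text) < 50 or "<|endoftext|>" in text:
        return False
    instr, resp = example["instruction"], example["output"]
    # Response must be between 0.5x and 5x the instruction length
    return 0.5 * len(instr) <= len(resp) <= 5 * len(instr)

samples = [
    {"text": "x" * 60, "instruction": "Explain recursion in detail",
     "output": "ok"},  # response far too short -> dropped
    {"text": "x" * 60, "instruction": "Explain recursion in detail",
     "output": "Recursion is when a function calls itself on a smaller input."},
]
print([keep(s) for s in samples])  # [False, True]
```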
# Run training
python -m fastchat.train.train \
--model_name_or_path lmsys/vicuna-7b-v1.5 \
--data_path data.json \
--bf16 True \
--output_dir ./vicuna-finetuned \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--evaluation_strategy steps \
--eval_steps 100 \
--save_steps 100 \
--learning_rate 2e-5 \
--weight_decay 0.01 \
--warmup_ratio 0.03 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--tf32 True \
--gradient_checkpointing True \
--dataloader_num_workers 4
Problem: OpenAI API client requests to /v1/completions return 404, but chat works.
What I Tried: Checked model registration, verified worker running - no issues.
Actual Fix: older FastChat builds only exposed /v1/chat/completions; /v1/completions was added later. Upgrade, make sure you're hitting the API server's port (the Gradio UI on :7860 will 404 any API path), and use the exact model name the controller has registered:
# Upgrade to a release that includes /v1/completions
pip install -U "fschat[model_worker,webui]"
# For chat (recommended)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7
  }'
# Legacy completions style (works on current releases)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "prompt": "Hello!",
    "temperature": 0.7
  }'
# List the registered model names if unsure
curl http://localhost:8000/v1/models
# Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # required by the client but not checked by FastChat
)
response = client.chat.completions.create(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Problem: Running MT-Bench evaluation results in "generation timeout" errors.
What I Tried: Increased timeout, reduced concurrency - still timeouts.
Actual Fix: MT-Bench doesn't live under fastchat.serve or fastchat.eval - it's the llm_judge scripts inside the repo, and capping generation length is what fixes the timeouts:
# MT-Bench lives in the repo checkout, not in an installed module
cd FastChat/fastchat/llm_judge
# Step 1: Generate answers (cap new tokens; long generations time out)
python gen_model_answer.py \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-id vicuna-7b-v1.5 \
    --max-new-token 1024
# Step 2: Judge the answers with GPT-4 (needs an API key)
export OPENAI_API_KEY=sk-...
python gen_judgment.py \
    --model-list vicuna-7b-v1.5 \
    --mode single
# Step 3: Print the scores
python show_result.py --model-list vicuna-7b-v1.5
# Browse questions, answers, and judgments in a web UI
python qa_browser.py --share
Training Custom Models
Data Preparation
"""
Prepare training data for FastChat
"""
import json
# Format for instruction tuning
# Each example: {"id": str, "conversations": []}
training_data = [
    {
        "id": "identity_0",
        "conversations": [
            {
                "from": "human",
                "value": "Who are you?"
            },
            {
                "from": "gpt",
                "value": "I am a custom assistant trained on specific data."
            }
        ]
    },
    {
        "id": "code_0",
        "conversations": [
            {
                "from": "human",
                "value": "Write a Python function to reverse a string."
            },
            {
                "from": "gpt",
                "value": "Here's a Python function:\n\n```python\ndef reverse_string(s):\n    return s[::-1]\n```\n\nExample usage:\n```python\nprint(reverse_string('hello'))  # 'olleh'\n```"
            }
        ]
    }
]

# Save to JSON
with open("custom_data.json", "w") as f:
    json.dump(training_data, f, indent=2)
# Quality checks
def validate_data(data):
    errors = []
    for i, example in enumerate(data):
        # Check required fields
        if "id" not in example:
            errors.append(f"Example {i}: Missing 'id'")
        if "conversations" not in example:
            errors.append(f"Example {i}: Missing 'conversations'")
        # Check conversation structure
        if "conversations" in example:
            conv = example["conversations"]
            if len(conv) < 2:
                errors.append(f"Example {i}: Too few turns")
            # Check roles alternate
            roles = [c.get("from") for c in conv]
            for j in range(len(roles) - 1):
                if roles[j] == roles[j + 1]:
                    errors.append(f"Example {i}: Consecutive same role")
            # Check value fields
            for j, turn in enumerate(conv):
                if "value" not in turn:
                    errors.append(f"Example {i}, turn {j}: Missing 'value'")
                if len(turn.get("value", "")) < 10:
                    errors.append(f"Example {i}, turn {j}: Value too short")
    return errors

errors = validate_data(training_data)
if errors:
    print("Errors found:")
    for error in errors[:10]:  # Show first 10
        print(f"  {error}")
else:
    print("Data validation passed!")
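Raw data often arrives in Alpaca style instead ({"instruction", "input", "output"} records). A hedged converter sketch into the conversations format above - the input-side field names are the common Alpaca convention and an assumption here, so adjust them if your dataset differs:

```python
import json

def alpaca_to_fastchat(records):
    """Convert Alpaca-style records into FastChat's conversations format."""
    out = []
    for i, rec in enumerate(records):
        prompt = rec["instruction"]
        if rec.get("input"):  # optional context field in Alpaca data
            prompt += "\n\n" + rec["input"]
        out.append({
            "id": f"converted_{i}",
            "conversations": [
                {"from": "human", "value": prompt},
                {"from": "gpt", "value": rec["output"]},
            ],
        })
    return out

records = [{"instruction": "Reverse a string in Python.",
            "input": "",
            "output": "s[::-1]"}]
converted = alpaca_to_fastchat(records)
print(json.dumps(converted, indent=2))
```

Run validate_data on the converted output afterward; the converter guarantees alternating roles, but it can't fix short or degenerate responses in the source data.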
Full Training Pipeline
# Full fine-tuning (heavy: weights + gradients + AdamW states for 7B
# far exceed a single 48GB card; FastChat's own recipes use multiple A100s)
python -m fastchat.train.train \
--model_name_or_path lmsys/vicuna-7b-v1.5 \
--data_path custom_data.json \
--eval_data_path eval_data.json \
--bf16 True \
--output_dir ./vicuna-custom \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy steps \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--weight_decay 0.01 \
--warmup_ratio 0.03 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--tf32 True \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
# LoRA fine-tuning (much lower VRAM requirement)
python -m fastchat.train.train_lora \
--model_name_or_path lmsys/vicuna-7b-v1.5 \
--data_path custom_data.json \
--bf16 True \
--output_dir ./vicuna-lora \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 2e-4 \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--lora_target_modules q_proj,v_proj \
--gradient_checkpointing True
# Merge LoRA weights into the base model
python -m fastchat.model.apply_lora \
    --base-model-path lmsys/vicuna-7b-v1.5 \
    --lora-path ./vicuna-lora \
    --target-model-path ./vicuna-merged
# Convert to GGUF for deployment
# First, install llama.cpp conversion tools
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
python convert-hf-to-gguf.py \
    ../vicuna-merged \
    --outfile vicuna-custom-f16.gguf \
    --outtype f16
# (newer llama.cpp versions rename the script to convert_hf_to_gguf.py
#  and the quantize binary to llama-quantize)
./quantize vicuna-custom-f16.gguf vicuna-custom-Q4_K_M.gguf Q4_K_M
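For planning disk space, the GGUF file size is roughly parameters times average bits per weight. Q4_K_M averages somewhere around 4.8 bits per weight - an approximate figure, since llama.cpp mixes quant types per tensor - so treat this as a back-of-envelope sketch:

```python
def gguf_size_gb(n_params_b: float, avg_bits_per_weight: float) -> float:
    """Back-of-envelope GGUF file size: params x average bits per weight.
    Ignores metadata and the few tensors kept at higher precision."""
    return n_params_b * 1e9 * avg_bits_per_weight / 8 / 1e9

print(f"f16:    ~{gguf_size_gb(7, 16):.1f} GB")
print(f"Q4_K_M: ~{gguf_size_gb(7, 4.8):.1f} GB")  # assumed ~4.8 bits/weight
```

So for a 7B model, expect the f16 intermediate to be around 14GB and the Q4_K_M output a bit over 4GB; budget disk for both during conversion.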
Multi-GPU Serving
# --num-gpus splits the model's layers across GPUs (naive model
# parallelism via HuggingFace device_map - not true tensor parallelism)
# For a 13B model on 2x24GB GPUs
python -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-13b-v1.5 \
    --num-gpus 2 \
    --max-gpu-memory 20GB
# For the 33B model on 4x24GB GPUs (33B is the largest Vicuna, v1.3 only)
python -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 4 \
    --max-gpu-memory 22GB
# True tensor parallelism: use the vLLM worker instead
# (the plain model_worker has no --tensor-parallel-size flag)
pip install vllm
python -m fastchat.serve.vllm_worker \
    --model-path lmsys/vicuna-13b-v1.5 \
    --tensor-parallel-size 2
# Multi-node deployment: bind workers to 0.0.0.0 and advertise an
# address the controller can reach back
# Node 1 (with 4 GPUs):
python -m fastchat.serve.controller \
    --host 0.0.0.0 \
    --port 21001
python -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 4 \
    --host 0.0.0.0 \
    --controller-address http://node1-ip:21001 \
    --worker-address http://node1-ip:21002
# Node 2 (with 4 more GPUs):
python -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 4 \
    --host 0.0.0.0 \
    --controller-address http://node1-ip:21001 \
    --worker-address http://node2-ip:21002
Comparison with Alternatives
- More model loaders, simpler UI
- Better web interface
- Full RAG and workflow platform
- Official repository