FastChat: How I Trained My Own Vicuna
FastChat is what powers Vicuna. I used it to fine-tune models on my own data. Got multi-GPU serving working after some pain. Here's the real setup.
Why FastChat Over Just Using Oobabooga
- This is Vicuna's home: the actual platform that trained Vicuna, not just a UI
- Training that works: LLaMA fine-tuning on my docs - the pipeline just works
- Distributed serving: got 70B running across 2 GPUs after some config fighting
- OpenAI drop-in: changed one URL in my app, now it talks to my model
- MT-Bench included: actually benchmark my models against GPT-4
Installation and Setup
# Clone repository
git clone https://github.com/lm-sys/FastChat
cd FastChat
# Install with pip
pip install -e .
# Or with specific dependencies
pip install -e ".[model_worker,webui,eval]"
# For GPU training (optional)
pip install -e ".[train]"
# Verify installation (downloads ~13GB of weights on first run)
python -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5 --num-gpus 1
Downloading Vicuna Model Weights
# Option 1: Direct download from HuggingFace
# Vicuna v1.5 (recommended)
huggingface-cli download lmsys/vicuna-7b-v1.5
# Vicuna v1.3 (older, lighter)
huggingface-cli download lmsys/vicuna-7b-v1.3
# Option 2: Using Python API
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "lmsys/vicuna-7b-v1.5"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_4bit=True,  # for memory efficiency; needs bitsandbytes installed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Option 3: Let FastChat fetch it - any serve command given a
# HuggingFace repo id downloads the weights automatically on first run
python -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5
Running the Service
# Three-controller architecture for production
# Terminal 1: Controller (central dispatcher)
python -m fastchat.serve.controller
# Terminal 2: Model Worker (GPU inference)
python -m fastchat.serve.model_worker \
--model-path lmsys/vicuna-7b-v1.5 \
--device cuda \
--num-gpus 1 \
    --max-gpu-memory 20GB \
    --load-8bit \
    --conv-template vicuna_v1.1
# Terminal 3: Web Server (Gradio UI)
python -m fastchat.serve.gradio_web_server
# Terminal 4: OpenAI API Server
python -m fastchat.serve.openai_api_server \
--host localhost \
--port 8000
# Access web UI: http://localhost:7860
# API endpoint: http://localhost:8000/v1
Common Problems & Solutions
Problem: Loading Vicuna 7B on 24GB GPU fails with OOM, even with 4-bit quantization.
What I Tried: Reduced max-gpu-memory, switched to 8-bit - still crashes.
Actual Fix: at serve time the GPU holds the model weights plus a growing KV cache and activation buffers (no optimizer states are involved - that's training). Cap GPU memory and offload the remainder to CPU:
# The problem: Loading full model into GPU
python -m fastchat.serve.model_worker \
--model-path vicuna-7b-v1.5 \
--num-gpus 1
# OOM! fp16 weights alone are ~14GB; KV cache + activations
# under load can push past 24GB
# Solution 1: Cap GPU memory and offload the remainder to CPU
# (FastChat's --cpu-offloading requires --load-8bit)
python -m fastchat.serve.model_worker \
    --model-path vicuna-7b-v1.5 \
    --load-8bit \
    --cpu-offloading \
    --max-gpu-memory 15GB
# Solution 2: 8-bit quantization (FastChat's built-in flag;
# for 4-bit, serve GPTQ/AWQ-quantized weights instead)
python -m fastchat.serve.model_worker \
    --model-path vicuna-7b-v1.5 \
    --load-8bit \
    --device cuda \
    --num-gpus 1
# Solution 3: Multi-GPU with model parallel
python -m fastchat.serve.model_worker \
--model-path vicuna-7b-v1.5 \
--num-gpus 2 \
--max-gpu-memory 20GB
# Solution 4: Don't reach for vicuna-7b-v1.5-16k to save memory -
# it's the SAME parameter count with a much bigger KV cache.
# 7B is already the smallest Vicuna; if it still doesn't fit,
# quantize harder or shorten the context
python -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --load-8bit
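A rough way to see why quantization rescues the OOM: resident weight memory is just parameter count times bits per weight. A back-of-envelope sketch (ignores KV cache, activations, and framework overhead, which is exactly the slack that bites you at serve time):

```python
def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate: parameters x bits per weight.
    Ignores KV cache, activations, and framework overhead."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

# Vicuna-7B at different precisions (weights only)
for bits, name in [(16, "fp16"), (8, "int8"), (4, "int4")]:
    print(f"{name}: ~{weights_gb(7, bits):.1f} GB")
# fp16: ~14.0 GB, int8: ~7.0 GB, int4: ~3.5 GB
```

With ~14GB of fp16 weights on a 24GB card, only ~10GB is left for cache and activations; at 8-bit that headroom roughly doubles.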
Problem: API server returns "No worker available" even though model worker is running.
What I Tried: Restarted worker, checked logs - no errors shown.
Actual Fix: the worker, controller, and web server must all agree on the controller address, and the worker's advertised --worker-address must be reachable from the controller (the controller calls the worker back at that address during registration and status checks). Defaults work on one machine, but any custom host, Docker bridge, or firewall silently breaks registration:
# The problem: addresses drift apart
# Controller listens on :21001
python -m fastchat.serve.controller
# Worker advertises a --worker-address the controller can't reach back
python -m fastchat.serve.model_worker --model-path vicuna-7b
# Registration fails or heartbeats expire -> "No worker available"
# Solution: Explicitly set controller address
# Terminal 1:
python -m fastchat.serve.controller \
--host localhost \
--port 21001
# Terminal 2: Match controller address
python -m fastchat.serve.model_worker \
--model-path vicuna-7b-v1.5 \
--controller-address http://localhost:21001 \
--worker-address http://localhost:21002
# Terminal 3: Match controller address
python -m fastchat.serve.gradio_web_server \
    --controller-url http://localhost:21001 \
    --model-list-mode reload
# Verify registration (list_models is a POST endpoint)
curl -X POST http://localhost:21001/list_models
# Should return: {"models": ["vicuna-7b-v1.5"]}
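Under the hood the controller is little more than a registry from model names to worker addresses. A toy sketch of that logic (illustrative only - this is not FastChat's actual code, just the shape of the handshake):

```python
class ToyController:
    """Minimal stand-in for the controller's registry logic."""

    def __init__(self):
        self.workers = {}  # worker_address -> list of model names

    def register_worker(self, worker_address, model_names):
        # Real controller also calls the worker back to verify it's reachable
        self.workers[worker_address] = list(model_names)

    def list_models(self):
        models = set()
        for names in self.workers.values():
            models.update(names)
        return sorted(models)

    def get_worker(self, model_name):
        # Real controller load-balances; this just picks any match
        for addr, names in self.workers.items():
            if model_name in names:
                return addr
        return None  # -> the dreaded "No worker available"

ctl = ToyController()
ctl.register_worker("http://localhost:21002", ["vicuna-7b-v1.5"])
print(ctl.list_models())                 # ['vicuna-7b-v1.5']
print(ctl.get_worker("vicuna-7b-v1.5"))  # http://localhost:21002
print(ctl.get_worker("llama-13b"))       # None
```

If registration never succeeds (unreachable worker address), the registry stays empty and every request gets the "no worker" error, even though the worker process itself is happily running.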
Problem: Fine-tuning Vicuna on custom data starts well, but loss spikes after 500 steps.
What I Tried: Reduced learning rate, added gradient clipping - that delayed the spike but didn't prevent divergence.
Actual Fix: Learning rate schedule and data quality issues. Need cosine annealing and data cleaning:
"""
Fine-tune Vicuna on custom data
"""
# The problem: a constant learning rate causes instability
training_args = dict(
    learning_rate=2e-5,
    lr_scheduler_type="constant",  # BAD! (note: the HF key is lr_scheduler_type)
    num_train_epochs=3,
)
# Loss spikes when the model hits unstable regions

# Solution: cosine annealing with warmup
training_args = dict(
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,  # 3% warmup
    num_train_epochs=3,
    max_grad_norm=1.0,  # gradient clipping for stability
    weight_decay=0.01,  # regularization
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch: 2 * 8 = 16
    fp16=True,  # or bf16=True if supported
    gradient_checkpointing=True,  # memory optimization
    optim="adamw_torch",
)
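To see why this schedule helps, here's a self-contained sketch of linear warmup followed by cosine decay (the same shape HF's cosine scheduler with warmup_ratio produces - the learning rate ramps up gently, peaks, then anneals toward zero instead of hammering the model at full rate forever):

```python
import math

def lr_at(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # ramp up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # anneal down

total = 1000  # 3% warmup -> 30 warmup steps
print(lr_at(0, total))     # 0.0 - gentle start
print(lr_at(30, total))    # peak (base_lr) at the end of warmup
print(lr_at(500, total))   # roughly half of base_lr mid-run
print(lr_at(1000, total))  # ~0 at the end
```

By step 500+ the rate is low enough that hitting an "unstable region" no longer launches the loss into orbit, which is exactly the failure mode a constant schedule invites.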
# Data quality: Filter out bad examples
def clean_dataset(data):
    cleaned = []
    for example in data:
        # Remove too-short examples
        if len(example["text"]) < 50:
            continue
        # Remove examples with special-token issues
        if "<|endoftext|>" in example["text"]:
            continue
        # Balance instruction/response lengths
        instruction = example["instruction"]
        response = example["output"]
        if len(response) < len(instruction) * 0.5:
            continue  # Response too short
        if len(response) > len(instruction) * 5:
            continue  # Response too long
        cleaned.append(example)
    return cleaned
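The thresholds above are heuristics; here's a quick standalone sanity check of the length-ratio logic, condensed to a single predicate with the same cutoffs:

```python
# Same thresholds as clean_dataset above, condensed to one predicate
def keep(example):
    text = example["text"]
    if len(text) < 50 or "<|endoftext|>" in text:
        return False
    instr, resp = example["instruction"], example["output"]
    # Response must be between 0.5x and 5x the instruction length
    return 0.5 * len(instr) <= len(resp) <= 5 * len(instr)

samples = [
    {"text": "x" * 60, "instruction": "Explain recursion in detail",
     "output": "ok"},  # response far too short -> dropped
    {"text": "x" * 60, "instruction": "Explain recursion in detail",
     "output": "Recursion is when a function calls itself on a smaller input."},
]
print([keep(s) for s in samples])  # [False, True]
```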
# Run training
python -m fastchat.train.train \
--model_name_or_path lmsys/vicuna-7b-v1.5 \
--data_path data.json \
--bf16 True \
--output_dir ./vicuna-finetuned \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--evaluation_strategy steps \
--eval_steps 100 \
--save_steps 100 \
--learning_rate 2e-5 \
--weight_decay 0.01 \
--warmup_ratio 0.03 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--tf32 True \
--gradient_checkpointing True \
--dataloader_num_workers 4
Problem: OpenAI API client requests to /v1/completions return 404, but chat works.
What I Tried: Checked model registration, verified worker running - no issues.
Actual Fix: older FastChat builds only exposed /v1/chat/completions; /v1/completions was added later. Upgrade, make sure you're hitting the API server's port (the Gradio UI on :7860 will 404 any API path), and use the exact model name the controller has registered:
# Upgrade to a release that includes /v1/completions
pip install -U "fschat[model_worker,webui]"
# For chat (recommended)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7
  }'
# Legacy completions style (works on current releases)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "prompt": "Hello!",
    "temperature": 0.7
  }'
# List the registered model names if unsure
curl http://localhost:8000/v1/models
# Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # required by the client but not checked by FastChat
)
response = client.chat.completions.create(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Problem: Running MT-Bench evaluation results in "generation timeout" errors.
What I Tried: Increased timeout, reduced concurrency - still timeouts.
Actual Fix: MT-Bench doesn't live under fastchat.serve or fastchat.eval - it's the llm_judge scripts inside the repo, and capping generation length is what fixes the timeouts:
# MT-Bench lives in the repo checkout, not in an installed module
cd FastChat/fastchat/llm_judge
# Step 1: Generate answers (cap new tokens; long generations time out)
python gen_model_answer.py \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-id vicuna-7b-v1.5 \
    --max-new-token 1024
# Step 2: Judge the answers with GPT-4 (needs an API key)
export OPENAI_API_KEY=sk-...
python gen_judgment.py \
    --model-list vicuna-7b-v1.5 \
    --mode single
# Step 3: Print the scores
python show_result.py --model-list vicuna-7b-v1.5
# Browse questions, answers, and judgments in a web UI
python qa_browser.py --share
Training Custom Models
Data Preparation
"""
Prepare training data for FastChat
"""
import json
# Format for instruction tuning
# Each example: {"id": str, "conversations": []}
training_data = [
    {
        "id": "identity_0",
        "conversations": [
            {
                "from": "human",
                "value": "Who are you?"
            },
            {
                "from": "gpt",
                "value": "I am a custom assistant trained on specific data."
            }
        ]
    },
    {
        "id": "code_0",
        "conversations": [
            {
                "from": "human",
                "value": "Write a Python function to reverse a string."
            },
            {
                "from": "gpt",
                "value": "Here's a Python function:\n\n```python\ndef reverse_string(s):\n    return s[::-1]\n```\n\nExample usage:\n```python\nprint(reverse_string('hello'))  # 'olleh'\n```"
            }
        ]
    }
]

# Save to JSON
with open("custom_data.json", "w") as f:
    json.dump(training_data, f, indent=2)
# Quality checks
def validate_data(data):
    errors = []
    for i, example in enumerate(data):
        # Check required fields
        if "id" not in example:
            errors.append(f"Example {i}: Missing 'id'")
        if "conversations" not in example:
            errors.append(f"Example {i}: Missing 'conversations'")
        # Check conversation structure
        if "conversations" in example:
            conv = example["conversations"]
            if len(conv) < 2:
                errors.append(f"Example {i}: Too few turns")
            # Check roles alternate
            roles = [c.get("from") for c in conv]
            for j in range(len(roles) - 1):
                if roles[j] == roles[j + 1]:
                    errors.append(f"Example {i}: Consecutive same role")
            # Check value fields
            for j, turn in enumerate(conv):
                if "value" not in turn:
                    errors.append(f"Example {i}, turn {j}: Missing 'value'")
                if len(turn.get("value", "")) < 10:
                    errors.append(f"Example {i}, turn {j}: Value too short")
    return errors

errors = validate_data(training_data)
if errors:
    print("Errors found:")
    for error in errors[:10]:  # Show first 10
        print(f"  {error}")
else:
    print("Data validation passed!")
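Raw data often arrives in Alpaca style instead ({"instruction", "input", "output"} records). A hedged converter sketch into the conversations format above - the input-side field names are the common Alpaca convention and an assumption here, so adjust them if your dataset differs:

```python
import json

def alpaca_to_fastchat(records):
    """Convert Alpaca-style records into FastChat's conversations format."""
    out = []
    for i, rec in enumerate(records):
        prompt = rec["instruction"]
        if rec.get("input"):  # optional context field in Alpaca data
            prompt += "\n\n" + rec["input"]
        out.append({
            "id": f"converted_{i}",
            "conversations": [
                {"from": "human", "value": prompt},
                {"from": "gpt", "value": rec["output"]},
            ],
        })
    return out

records = [{"instruction": "Reverse a string in Python.",
            "input": "",
            "output": "s[::-1]"}]
converted = alpaca_to_fastchat(records)
print(json.dumps(converted, indent=2))
```

Run validate_data on the converted output afterward; the converter guarantees alternating roles, but it can't fix short or degenerate responses in the source data.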
Full Training Pipeline
# Full fine-tuning (heavy: weights + gradients + AdamW states for 7B
# far exceed a single 48GB card; FastChat's own recipes use multiple A100s)
python -m fastchat.train.train \
--model_name_or_path lmsys/vicuna-7b-v1.5 \
--data_path custom_data.json \
--eval_data_path eval_data.json \
--bf16 True \
--output_dir ./vicuna-custom \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy steps \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--weight_decay 0.01 \
--warmup_ratio 0.03 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--tf32 True \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
# LoRA fine-tuning (much lower VRAM requirement)
python -m fastchat.train.train_lora \
--model_name_or_path lmsys/vicuna-7b-v1.5 \
--data_path custom_data.json \
--bf16 True \
--output_dir ./vicuna-lora \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 2e-4 \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--lora_target_modules q_proj,v_proj \
--gradient_checkpointing True
# Merge LoRA weights into the base model
python -m fastchat.model.apply_lora \
    --base-model-path lmsys/vicuna-7b-v1.5 \
    --lora-path ./vicuna-lora \
    --target-model-path ./vicuna-merged
# Convert to GGUF for deployment
# First, install llama.cpp conversion tools
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
python convert-hf-to-gguf.py \
    ../vicuna-merged \
    --outfile vicuna-custom-f16.gguf \
    --outtype f16
# (newer llama.cpp versions rename the script to convert_hf_to_gguf.py
#  and the quantize binary to llama-quantize)
./quantize vicuna-custom-f16.gguf vicuna-custom-Q4_K_M.gguf Q4_K_M
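For planning disk space, the GGUF file size is roughly parameters times average bits per weight. Q4_K_M averages somewhere around 4.8 bits per weight - an approximate figure, since llama.cpp mixes quant types per tensor - so treat this as a back-of-envelope sketch:

```python
def gguf_size_gb(n_params_b: float, avg_bits_per_weight: float) -> float:
    """Back-of-envelope GGUF file size: params x average bits per weight.
    Ignores metadata and the few tensors kept at higher precision."""
    return n_params_b * 1e9 * avg_bits_per_weight / 8 / 1e9

print(f"f16:    ~{gguf_size_gb(7, 16):.1f} GB")
print(f"Q4_K_M: ~{gguf_size_gb(7, 4.8):.1f} GB")  # assumed ~4.8 bits/weight
```

So for a 7B model, expect the f16 intermediate to be around 14GB and the Q4_K_M output a bit over 4GB; budget disk for both during conversion.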
Multi-GPU Serving
# --num-gpus splits the model's layers across GPUs (naive model
# parallelism via HuggingFace device_map - not true tensor parallelism)
# For a 13B model on 2x24GB GPUs
python -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-13b-v1.5 \
    --num-gpus 2 \
    --max-gpu-memory 20GB
# For the 33B model on 4x24GB GPUs (33B is the largest Vicuna, v1.3 only)
python -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 4 \
    --max-gpu-memory 22GB
# True tensor parallelism: use the vLLM worker instead
# (the plain model_worker has no --tensor-parallel-size flag)
pip install vllm
python -m fastchat.serve.vllm_worker \
    --model-path lmsys/vicuna-13b-v1.5 \
    --tensor-parallel-size 2
# Multi-node deployment: bind workers to 0.0.0.0 and advertise an
# address the controller can reach back
# Node 1 (with 4 GPUs):
python -m fastchat.serve.controller \
    --host 0.0.0.0 \
    --port 21001
python -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 4 \
    --host 0.0.0.0 \
    --controller-address http://node1-ip:21001 \
    --worker-address http://node1-ip:21002
# Node 2 (with 4 more GPUs):
python -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-33b-v1.3 \
    --num-gpus 4 \
    --host 0.0.0.0 \
    --controller-address http://node1-ip:21001 \
    --worker-address http://node2-ip:21002
Comparison with Alternatives
- More model loaders, simpler UI
- Better web interface
- Full RAG and workflow platform
- Official repository