PrivateGPT vs LocalGPT: Which One Actually Works?
Spent two weeks getting document chat working locally. PrivateGPT won on my CPU-only laptop, but LocalGPT was better with GPU. Here's the breakdown and real issues I hit.
PrivateGPT vs LocalGPT: My Experience
- PrivateGPT: used llama.cpp, worked great on my MacBook. ChromaDB for storage, simpler setup
- LocalGPT: more flexible but needs a GPU. Used LangChain + HuggingFace, better if you have CUDA
- PrivateGPT wins when: CPU-only, you don't want to fight with CUDA, you just want it to work
- LocalGPT wins when: you have a decent GPU, want to swap models, need more control
- Reality: PrivateGPT for most people. LocalGPT if you're serious about LLM ops
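Before committing to either tool, it's worth checking whether the machine actually has a usable NVIDIA GPU. A stdlib-only sketch (no torch dependency) that just probes for nvidia-smi:

```python
import shutil
import subprocess

def has_usable_gpu() -> bool:
    """True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        out = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True, timeout=10)
        return out.returncode == 0 and "GPU" in out.stdout
    except (OSError, subprocess.TimeoutExpired):
        return False

print("LocalGPT is viable" if has_usable_gpu() else "CPU only: lean PrivateGPT")
```

Apple Silicon GPUs won't show up this way; there the relevant question is whether llama.cpp's Metal backend is available.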
Production Installation
Install with CUDA support for GPU acceleration:
# Clone repository
git clone https://github.com/zylon-ai/private-gpt
cd private-gpt
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install with GPU support (CUDA): build llama-cpp-python with CUDA enabled
pip install uv
CMAKE_ARGS="-DGGML_CUDA=on" uv pip install -e ".[llama-cpp]"
# For CPU-only installation:
# uv pip install -e ".[llama-cpu]"
# Download models (create models folder first)
mkdir models
cd models
# Download Mistral 7B Instruct v0.2 (4-bit quantized)
# Option 1: Using huggingface-cli
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .
# Option 2: Direct download
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
Configuration for Production
# settings.yaml
server:
  port: 8001
  cors:
    enabled: true
    allowed_origins: ["*"]  # Tighten for real deployments

llm:
  mode: llama-cpp
  max_new_tokens: 2048
  context_window: 8192
  tokenizer: mistralai/Mistral-7B-Instruct-v0.2  # HF repo for the tokenizer, not the GGUF file

llama_cpp:
  model_path: models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
  n_ctx: 8192
  n_batch: 512
  n_gpu_layers: 35  # Set to -1 for all layers on GPU, 0 for CPU
  verbose: true

embedding:
  mode: huggingface
  ingest_mode: simple
  huggingface:
    model_name: BAAI/bge-small-en-v1.5

vectorstore:
  database: chroma

chunks:
  size: 512
  overlap: 50

# For better retrieval on code/technical docs
retrieval:
  mode: chroma
  k: 5                  # Number of documents to retrieve
  score_threshold: 0.3  # Lower = more matches
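The chunks settings translate directly into how many embeddings you will store: each chunk advances by size minus overlap characters. A quick sketch of that arithmetic, using character counts as a rough proxy for tokens:

```python
import math

def estimate_chunks(doc_chars: int, size: int = 512, overlap: int = 50) -> int:
    """Sliding-window chunk count: each new chunk advances by (size - overlap)."""
    if doc_chars <= size:
        return 1
    stride = size - overlap
    return 1 + math.ceil((doc_chars - size) / stride)

print(estimate_chunks(100_000))  # → 217 chunks at size=512, overlap=50
```

Halving the chunk size roughly doubles the vector count, which is why smaller chunks trade faster per-chunk embedding for more total work.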
Common Problems & Solutions
Problem: Running Mistral 7B with 35 GPU layers causes OOM on RTX 4090 (24GB VRAM).
What I Tried: Reduced context window, disabled quantization - still crashed on large documents.
Actual Fix: The issue is n_gpu_layers combined with n_batch. Offload fewer layers and increase batch size for better memory utilization:
# In settings.yaml:
llama_cpp:
  model_path: models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
  n_gpu_layers: 20   # Not 35! Let the CPU handle some layers
  n_batch: 1024      # Increase from 512 for better GPU utilization
  n_ctx: 4096        # Reduce from 8192
  f16_kv: true       # FP16 KV cache (saves VRAM)
  use_mmap: true     # Memory-map the model file
  use_mlock: false   # Disable on GPU systems
  numa: false        # Disable NUMA on single-GPU systems

# Alternative: use a smaller model
llama_cpp:
  model_path: models/mistral-7b-instruct-v0.2.Q3_K_M.gguf  # More aggressive quantization
  n_gpu_layers: 30   # Q3 fits more layers in the same VRAM
Monitor VRAM usage with watch -n 1 nvidia-smi while ingesting documents to find optimal settings.
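A back-of-envelope way to pick a starting n_gpu_layers before that tuning loop: split the GGUF file size evenly across layers (32 for Mistral 7B) and reserve room for the KV cache. All numbers here are rough assumptions, not measurements:

```python
def max_gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
                   kv_cache_gb: float = 2.0, headroom_gb: float = 1.0) -> int:
    """Rough estimate: how many layers fit after reserving KV cache + headroom."""
    per_layer = model_gb / n_layers
    budget = vram_gb - kv_cache_gb - headroom_gb
    return max(0, min(n_layers, int(budget / per_layer)))

# Mistral 7B Q4_K_M (~4.4 GB on disk, 32 layers) on a 6 GB card:
print(max_gpu_layers(4.4, 32, 6.0))  # → 21 layers
```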
Problem: Large PDFs (>100 pages) appear to hang during ingestion. Process reaches 99% and never completes.
What I Tried: Increased timeout, switched to different PDF parsers - no change.
Actual Fix: The embedding model gets stuck on long chunks. Need to chunk more aggressively and limit chunk length:
# In settings.yaml:
chunks:
  size: 256    # Reduce from 512 (smaller chunks process faster)
  overlap: 50

# Add custom chunking for large docs in code:
# Create ingest_custom.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_optimized_chunker():
    return RecursiveCharacterTextSplitter(
        chunk_size=256,
        chunk_overlap=50,
        length_function=len,
        separators=[
            "\n\n",  # Paragraphs first
            "\n",    # Lines
            " ",     # Words
            "",      # Characters (last resort)
        ],
        keep_separator=False,  # Don't include the separator in the chunk
    )
# Use batch ingestion to prevent memory buildup
# (ingest_folder.py ships in the private-gpt repo's scripts/ folder)
python scripts/ingest_folder.py documents/
For PDFs with tables/images, pre-process with pdfplumber to extract text separately.
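If you drive ingestion from your own script instead, the same batching idea is easy to reproduce: process files in small groups so memory is released between batches. A stdlib-only sketch; ingest_one is a placeholder for whatever ingestion entry point you actually call:

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, List

def batched(paths: Iterable[Path], size: int) -> Iterator[List[Path]]:
    """Yield lists of at most `size` paths."""
    batch: List[Path] = []
    for p in paths:
        batch.append(p)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def ingest_in_batches(folder: str, ingest_one: Callable[[Path], None],
                      batch_size: int = 10) -> int:
    count = 0
    for batch in batched(sorted(Path(folder).glob("*.pdf")), batch_size):
        for pdf in batch:
            ingest_one(pdf)  # placeholder: your ingestion call goes here
            count += 1
    return count
```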
Problem: Queries about code examples return irrelevant text chunks, missing the actual code blocks.
What I Tried: Adjusted score_threshold, increased k value - retrieved more but still irrelevant.
Actual Fix: Default embedding model (all-MiniLM) doesn't understand code well. Switch to code-aware embedding and use hybrid search:
# Use code-specific embedding
embedding:
  mode: huggingface
  huggingface:
    # Better for code/technical content
    model_name: BAAI/bge-base-en-v1.5
    # Or for mixed code+text:
    # model_name: intfloat/e5-large-v2

# Enable hybrid search (keyword + semantic)
vectorstore:
  database: chroma
  chroma:
    collection_name: my_documents
    # Enable BM25 hybrid search
    hybrid_search: true
    bm25_weight: 0.3  # 30% keyword, 70% semantic

# Adjust retrieval
retrieval:
  mode: chroma
  k: 10                  # Retrieve more candidates
  score_threshold: 0.2   # Lower threshold
  rerank: true           # Enable reranking
  rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2  # Better final ranking
For API documentation, pre-process to keep endpoints and their descriptions together.
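Whether your vector store exposes bm25_weight under exactly this key is configuration-dependent, but the blend it describes is just a weighted sum of a (normalized) keyword score and a semantic similarity score:

```python
def hybrid_score(bm25: float, semantic: float, bm25_weight: float = 0.3) -> float:
    """Blend a normalized BM25 score with a semantic similarity score."""
    assert 0.0 <= bm25_weight <= 1.0
    return bm25_weight * bm25 + (1.0 - bm25_weight) * semantic

# A chunk that matches keywords exactly but is semantically middling:
print(round(hybrid_score(bm25=0.9, semantic=0.4), 2))  # → 0.55
```

For code queries this is exactly the case that matters: identifiers like function names score high on BM25 even when the embedding model fails to place them well semantically.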
Problem: After switching to local embedding model with llama-cpp LLM, getting dimension mismatch errors.
What I Tried: Recreated vector database, cleared cache - error persists.
Actual Fix: llama.cpp can generate embeddings itself, but at a different dimension (4096 for Mistral) than HuggingFace embedding models (384 for bge-small), so you need one consistent embedding source:
# Option 1: Use HuggingFace for both (recommended)
embedding:
  mode: huggingface
  huggingface:
    model_name: BAAI/bge-small-en-v1.5  # Dimension: 384

llm:
  mode: llama-cpp
  llama_cpp:
    # Use the LLM only for generation, not embedding
    embedding_mode: false
    n_gpu_layers: -1

# Option 2: Use llama-cpp for both
embedding:
  mode: llama-cpp
  llama_cpp:
    model_path: models/mistral-7b-instruct-v0.2.Q4_K_M.gguf  # Dimension: 4096 (Mistral)

llm:
  mode: llama-cpp

# CRITICAL: you must delete the old ChromaDB when changing embedding models,
# because different dimensions mean incompatible vectors
rm -rf chroma_db  # Delete and re-ingest
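You can catch the mismatch before ingestion instead of mid-crash by comparing a probe embedding's length against the dimension the collection was built with. A stdlib-only sketch, using the dimensions of the two models above:

```python
KNOWN_DIMS = {
    "BAAI/bge-small-en-v1.5": 384,   # HuggingFace embedding model
    "mistral-7b (llama-cpp)": 4096,  # llama.cpp hidden-state embedding
}

def check_dims(probe_vector, expected_dim: int) -> None:
    """Fail fast with a clear message instead of letting the store raise mid-ingestion."""
    if len(probe_vector) != expected_dim:
        raise ValueError(
            f"Embedding dim {len(probe_vector)} != collection dim {expected_dim}; "
            "delete the vector store and re-ingest with one embedding source."
        )

check_dims([0.0] * 384, KNOWN_DIMS["BAAI/bge-small-en-v1.5"])  # passes silently
```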
Problem: System retrieves relevant documents but ignores them completely, answering from training data.
What I Tried: Lowered score_threshold, verified documents are retrieved - still ignored.
Actual Fix: The prompt template needs to explicitly reference retrieved context. Default template may not inject context properly:
# Create a custom prompt template
# In prompts/custom_system_prompt.txt:
You are a helpful assistant that answers questions based on the provided context.
If the answer cannot be found in the context, say "I don't have enough information to answer this."
Context information is below.
---------------------
{context}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query}
Answer:

# In settings.yaml, point to the custom prompt:
prompt:
  mode: custom
  template_file: prompts/custom_system_prompt.txt

# Or use the built-in template with context injection
prompt:
  mode: default
  include_context: true      # Ensure context is actually passed
  max_context_tokens: 2048   # Increase from the default 1024
Use the API response's sources field to verify documents were actually retrieved.
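Context injection is ultimately just string templating: the retrieved chunks are joined and substituted for {context}. A sketch of what the template above expands to, useful for debugging whether chunks actually reach the LLM:

```python
TEMPLATE = """You are a helpful assistant that answers questions based on the provided context.
If the answer cannot be found in the context, say "I don't have enough information to answer this."
Context information is below.
---------------------
{context}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query}
Answer:"""

def build_prompt(chunks: list, query: str) -> str:
    """Join retrieved chunks and substitute them into the template."""
    return TEMPLATE.format(context="\n\n".join(chunks), query=query)

prompt = build_prompt(["Chunk about X.", "Chunk about Y."], "What is X?")
print(prompt)
```

If the printed prompt contains no chunk text, retrieval is returning nothing and the model has no choice but to answer from training data.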
Performance Optimization
GPU Optimization
# Check GPU utilization
nvidia-smi

# Pin inference to a single GPU on multi-GPU machines
export CUDA_VISIBLE_DEVICES=0
# Note: recent llama.cpp builds use CUDA graphs automatically; no env var is needed

# Quantization comparison for Mistral 7B on RTX 4090:
# Q4_K_M: ~4.5GB VRAM, 45 tok/s
# Q5_K_M: ~5.5GB VRAM, 38 tok/s
# Q8_0:   ~8.5GB VRAM, 28 tok/s
# Q4_K_M is the best speed/quality tradeoff for most use cases
CPU Optimization
# Enable all CPU optimizations
export OMP_NUM_THREADS=8  # Set to your physical core count
export MKL_NUM_THREADS=8

# For Apple Silicon (M1/M2/M3), build llama-cpp-python with the Metal backend
# (brew's llama.cpp is the standalone CLI, not the Python bindings PrivateGPT uses):
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

# Settings for Metal:
llama_cpp:
  n_gpu_layers: -1  # All layers on the GPU
  use_mmap: true
  use_mlock: false
Retrieval Optimization
# Multi-stage retrieval for better results
retrieval:
  mode: advanced
  # Stage 1: broad retrieval
  initial_k: 20
  initial_threshold: 0.1
  # Stage 2: rerank
  rerank: true
  rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
  # Stage 3: final selection
  final_k: 5
  diversity_penalty: 0.1  # Encourage diverse results
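The three stages reduce to: over-retrieve, re-score, then pick final_k while penalizing near-duplicates. A toy sketch of the final stage with an MMR-style diversity penalty; candidate scores stand in for the real cross-encoder output, and the similarity function is whatever you supply:

```python
from typing import Callable, List, Tuple

def select_diverse(candidates: List[Tuple[str, float]],
                   similarity: Callable[[str, str], float],
                   final_k: int = 5,
                   diversity_penalty: float = 0.1) -> List[str]:
    """Greedy selection: best remaining score minus a penalty for
    similarity to anything already selected (MMR-style)."""
    selected: List[str] = []
    pool = dict(candidates)
    while pool and len(selected) < final_k:
        best = max(
            pool,
            key=lambda d: pool[d] - diversity_penalty * max(
                (similarity(d, s) for s in selected), default=0.0),
        )
        selected.append(best)
        del pool[best]
    return selected
```

With the penalty at 0.1 a slightly lower-scored but dissimilar chunk beats a near-duplicate of one already chosen, which is exactly the behavior the diversity_penalty setting describes.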
Deployment as API Service
# Run as background service
nohup python -m private_gpt > privategpt.log 2>&1 &
# Or use systemd for production: create /etc/systemd/system/private-gpt.service,
# then: sudo systemctl daemon-reload && sudo systemctl enable --now private-gpt
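A minimal sketch of such a unit file; the user, working directory, venv path, and profile name are assumptions to adapt to your setup:

```ini
# /etc/systemd/system/private-gpt.service
[Unit]
Description=PrivateGPT API service
After=network.target

[Service]
Type=simple
User=privategpt
WorkingDirectory=/opt/private-gpt
Environment=PGPT_PROFILES=local
ExecStart=/opt/private-gpt/venv/bin/python -m private_gpt
Restart=on-failure

[Install]
WantedBy=multi-user.target
```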
API Usage Examples
import requests

BASE_URL = "http://localhost:8001"

# 1. Ingest a document (multipart upload)
with open("document.pdf", "rb") as f:
    response = requests.post(f"{BASE_URL}/v1/ingest/file", files={"file": f})
print(response.json())

# 2. Ask a question grounded in your documents
response = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        "prompt": "What does the document say about X?",
        "use_context": True,      # answer from ingested docs, not training data
        "include_sources": True,  # return the chunks the answer was based on
    },
)
choice = response.json()["choices"][0]
print(choice["message"]["content"])
print("Sources:", [s["document"]["doc_metadata"] for s in choice["sources"]])

# 3. Retrieve matching chunks without LLM generation
response = requests.post(
    f"{BASE_URL}/v1/chunks",
    json={"text": "machine learning", "limit": 5},
)
chunks = response.json()
Comparison with Alternatives
- When to use each based on your hardware
- Full-featured platform vs simple solution
- Add a web UI to the PrivateGPT backend
- Official repository and issues