PrivateGPT vs LocalGPT: Which One Actually Works?
Spent two weeks getting document chat working locally. PrivateGPT won on my CPU-only laptop, but LocalGPT was better with GPU. Here's the breakdown and real issues I hit.
PrivateGPT vs LocalGPT: My Experience
- PrivateGPT: used llama.cpp, worked great on my MacBook. ChromaDB for storage, simpler setup
- LocalGPT: more flexible but needs a GPU. Used LangChain + HuggingFace, better if you have CUDA
- PrivateGPT wins when: CPU-only, you don't want to fight with CUDA, you just want it to work
- LocalGPT wins when: you have a decent GPU, want to swap models, need more control
- Reality: PrivateGPT for most people. LocalGPT if you're serious about LLM ops
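Before committing to either tool, it's worth checking whether the machine actually has a usable NVIDIA GPU. A stdlib-only sketch (no torch dependency) that just probes for nvidia-smi:

```python
import shutil
import subprocess

def has_usable_gpu() -> bool:
    """True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        out = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True, timeout=10)
        return out.returncode == 0 and "GPU" in out.stdout
    except (OSError, subprocess.TimeoutExpired):
        return False

print("LocalGPT is viable" if has_usable_gpu() else "CPU only: lean PrivateGPT")
```

Apple Silicon GPUs won't show up this way; there the relevant question is whether llama.cpp's Metal backend is available.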
Production Installation
Install with CUDA support for GPU acceleration:
# Clone repository
git clone https://github.com/zylon-ai/private-gpt
cd private-gpt
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install with GPU support (CUDA): build llama-cpp-python with CUDA enabled
pip install uv
CMAKE_ARGS="-DGGML_CUDA=on" uv pip install -e ".[llama-cpp]"
# For CPU-only installation:
# uv pip install -e ".[llama-cpu]"
# Download models (create models folder first)
mkdir models
cd models
# Download Mistral 7B Instruct v0.2 (4-bit quantized)
# Option 1: Using huggingface-cli
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .
# Option 2: Direct download
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
Configuration for Production
# settings.yaml
server:
  port: 8001
  cors:
    enabled: true
    allowed_origins: ["*"]  # Tighten for real deployments

llm:
  mode: llama-cpp
  max_new_tokens: 2048
  context_window: 8192
  tokenizer: mistralai/Mistral-7B-Instruct-v0.2  # HF repo for the tokenizer, not the GGUF file

llama_cpp:
  model_path: models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
  n_ctx: 8192
  n_batch: 512
  n_gpu_layers: 35  # Set to -1 for all layers on GPU, 0 for CPU
  verbose: true

embedding:
  mode: huggingface
  ingest_mode: simple
  huggingface:
    model_name: BAAI/bge-small-en-v1.5

vectorstore:
  database: chroma

chunks:
  size: 512
  overlap: 50

# For better retrieval on code/technical docs
retrieval:
  mode: chroma
  k: 5                  # Number of documents to retrieve
  score_threshold: 0.3  # Lower = more matches
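The chunks settings translate directly into how many embeddings you will store: each chunk advances by size minus overlap characters. A quick sketch of that arithmetic, using character counts as a rough proxy for tokens:

```python
import math

def estimate_chunks(doc_chars: int, size: int = 512, overlap: int = 50) -> int:
    """Sliding-window chunk count: each new chunk advances by (size - overlap)."""
    if doc_chars <= size:
        return 1
    stride = size - overlap
    return 1 + math.ceil((doc_chars - size) / stride)

print(estimate_chunks(100_000))  # → 217 chunks at size=512, overlap=50
```

Halving the chunk size roughly doubles the vector count, which is why smaller chunks trade faster per-chunk embedding for more total work.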
Common Problems & Solutions
Problem: Running Mistral 7B with 35 GPU layers causes OOM on RTX 4090 (24GB VRAM).
What I Tried: Reduced context window, disabled quantization - still crashed on large documents.
Actual Fix: The issue is n_gpu_layers combined with n_batch. Offload fewer layers and increase batch size for better memory utilization:
# In settings.yaml:
llama_cpp:
  model_path: models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
  n_gpu_layers: 20   # Not 35! Let the CPU handle some layers
  n_batch: 1024      # Increase from 512 for better GPU utilization
  n_ctx: 4096        # Reduce from 8192
  f16_kv: true       # FP16 KV cache (saves VRAM)
  use_mmap: true     # Memory-map the model file
  use_mlock: false   # Disable on GPU systems
  numa: false        # Disable NUMA on single-GPU systems

# Alternative: use a smaller model
llama_cpp:
  model_path: models/mistral-7b-instruct-v0.2.Q3_K_M.gguf  # More aggressive quantization
  n_gpu_layers: 30   # Q3 fits more layers in the same VRAM
Monitor VRAM usage with watch -n 1 nvidia-smi while ingesting documents to find optimal settings.
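A back-of-envelope way to pick a starting n_gpu_layers before that tuning loop: split the GGUF file size evenly across layers (32 for Mistral 7B) and reserve room for the KV cache. All numbers here are rough assumptions, not measurements:

```python
def max_gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
                   kv_cache_gb: float = 2.0, headroom_gb: float = 1.0) -> int:
    """Rough estimate: how many layers fit after reserving KV cache + headroom."""
    per_layer = model_gb / n_layers
    budget = vram_gb - kv_cache_gb - headroom_gb
    return max(0, min(n_layers, int(budget / per_layer)))

# Mistral 7B Q4_K_M (~4.4 GB on disk, 32 layers) on a 6 GB card:
print(max_gpu_layers(4.4, 32, 6.0))  # → 21 layers
```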
Problem: Large PDFs (>100 pages) appear to hang during ingestion. Process reaches 99% and never completes.
What I Tried: Increased timeout, switched to different PDF parsers - no change.
Actual Fix: The embedding model gets stuck on long chunks. Need to chunk more aggressively and limit chunk length:
# In settings.yaml:
chunks:
  size: 256    # Reduce from 512 (smaller chunks process faster)
  overlap: 50

# Add custom chunking for large docs in code:
# Create ingest_custom.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_optimized_chunker():
    return RecursiveCharacterTextSplitter(
        chunk_size=256,
        chunk_overlap=50,
        length_function=len,
        separators=[
            "\n\n",  # Paragraphs first
            "\n",    # Lines
            " ",     # Words
            "",      # Characters (last resort)
        ],
        keep_separator=False,  # Don't include the separator in the chunk
    )
# Use batch ingestion to prevent memory buildup
# (ingest_folder.py ships in the private-gpt repo's scripts/ folder)
python scripts/ingest_folder.py documents/
For PDFs with tables/images, pre-process with pdfplumber to extract text separately.
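If you drive ingestion from your own script instead, the same batching idea is easy to reproduce: process files in small groups so memory is released between batches. A stdlib-only sketch; ingest_one is a placeholder for whatever ingestion entry point you actually call:

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, List

def batched(paths: Iterable[Path], size: int) -> Iterator[List[Path]]:
    """Yield lists of at most `size` paths."""
    batch: List[Path] = []
    for p in paths:
        batch.append(p)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def ingest_in_batches(folder: str, ingest_one: Callable[[Path], None],
                      batch_size: int = 10) -> int:
    count = 0
    for batch in batched(sorted(Path(folder).glob("*.pdf")), batch_size):
        for pdf in batch:
            ingest_one(pdf)  # placeholder: your ingestion call goes here
            count += 1
    return count
```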
Problem: Queries about code examples return irrelevant text chunks, missing the actual code blocks.
What I Tried: Adjusted score_threshold, increased k value - retrieved more but still irrelevant.
Actual Fix: Default embedding model (all-MiniLM) doesn't understand code well. Switch to code-aware embedding and use hybrid search:
# Use code-specific embedding
embedding:
  mode: huggingface
  huggingface:
    # Better for code/technical content
    model_name: BAAI/bge-base-en-v1.5
    # Or for mixed code+text:
    # model_name: intfloat/e5-large-v2

# Enable hybrid search (keyword + semantic)
vectorstore:
  database: chroma
  chroma:
    collection_name: my_documents
    # Enable BM25 hybrid search
    hybrid_search: true
    bm25_weight: 0.3  # 30% keyword, 70% semantic

# Adjust retrieval
retrieval:
  mode: chroma
  k: 10                  # Retrieve more candidates
  score_threshold: 0.2   # Lower threshold
  rerank: true           # Enable reranking
  rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2  # Better final ranking
For API documentation, pre-process to keep endpoints and their descriptions together.
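Whether your vector store exposes bm25_weight under exactly this key is configuration-dependent, but the blend it describes is just a weighted sum of a (normalized) keyword score and a semantic similarity score:

```python
def hybrid_score(bm25: float, semantic: float, bm25_weight: float = 0.3) -> float:
    """Blend a normalized BM25 score with a semantic similarity score."""
    assert 0.0 <= bm25_weight <= 1.0
    return bm25_weight * bm25 + (1.0 - bm25_weight) * semantic

# A chunk that matches keywords exactly but is semantically middling:
print(round(hybrid_score(bm25=0.9, semantic=0.4), 2))  # → 0.55
```

For code queries this is exactly the case that matters: identifiers like function names score high on BM25 even when the embedding model fails to place them well semantically.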
Problem: After switching to local embedding model with llama-cpp LLM, getting dimension mismatch errors.
What I Tried: Recreated vector database, cleared cache - error persists.
Actual Fix: llama.cpp can generate embeddings itself, but at a different dimension (4096 for Mistral) than HuggingFace embedding models (384 for bge-small), so you need one consistent embedding source:
# Option 1: Use HuggingFace for both (recommended)
embedding:
  mode: huggingface
  huggingface:
    model_name: BAAI/bge-small-en-v1.5  # Dimension: 384

llm:
  mode: llama-cpp
  llama_cpp:
    # Use the LLM only for generation, not embedding
    embedding_mode: false
    n_gpu_layers: -1

# Option 2: Use llama-cpp for both
embedding:
  mode: llama-cpp
  llama_cpp:
    model_path: models/mistral-7b-instruct-v0.2.Q4_K_M.gguf  # Dimension: 4096 (Mistral)

llm:
  mode: llama-cpp

# CRITICAL: you must delete the old ChromaDB when changing embedding models,
# because different dimensions mean incompatible vectors
rm -rf chroma_db  # Delete and re-ingest
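You can catch the mismatch before ingestion instead of mid-crash by comparing a probe embedding's length against the dimension the collection was built with. A stdlib-only sketch, using the dimensions of the two models above:

```python
KNOWN_DIMS = {
    "BAAI/bge-small-en-v1.5": 384,   # HuggingFace embedding model
    "mistral-7b (llama-cpp)": 4096,  # llama.cpp hidden-state embedding
}

def check_dims(probe_vector, expected_dim: int) -> None:
    """Fail fast with a clear message instead of letting the store raise mid-ingestion."""
    if len(probe_vector) != expected_dim:
        raise ValueError(
            f"Embedding dim {len(probe_vector)} != collection dim {expected_dim}; "
            "delete the vector store and re-ingest with one embedding source."
        )

check_dims([0.0] * 384, KNOWN_DIMS["BAAI/bge-small-en-v1.5"])  # passes silently
```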
Problem: System retrieves relevant documents but ignores them completely, answering from training data.
What I Tried: Lowered score_threshold, verified documents are retrieved - still ignored.
Actual Fix: The prompt template needs to explicitly reference retrieved context. Default template may not inject context properly:
# Create a custom prompt template
# In prompts/custom_system_prompt.txt:
You are a helpful assistant that answers questions based on the provided context.
If the answer cannot be found in the context, say "I don't have enough information to answer this."
Context information is below.
---------------------
{context}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query}
Answer:

# In settings.yaml, point to the custom prompt:
prompt:
  mode: custom
  template_file: prompts/custom_system_prompt.txt

# Or use the built-in template with context injection
prompt:
  mode: default
  include_context: true      # Ensure context is actually passed
  max_context_tokens: 2048   # Increase from the default 1024
Use the API response's sources field to verify documents were actually retrieved.
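Context injection is ultimately just string templating: the retrieved chunks are joined and substituted for {context}. A sketch of what the template above expands to, useful for debugging whether chunks actually reach the LLM:

```python
TEMPLATE = """You are a helpful assistant that answers questions based on the provided context.
If the answer cannot be found in the context, say "I don't have enough information to answer this."
Context information is below.
---------------------
{context}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query}
Answer:"""

def build_prompt(chunks: list, query: str) -> str:
    """Join retrieved chunks and substitute them into the template."""
    return TEMPLATE.format(context="\n\n".join(chunks), query=query)

prompt = build_prompt(["Chunk about X.", "Chunk about Y."], "What is X?")
print(prompt)
```

If the printed prompt contains no chunk text, retrieval is returning nothing and the model has no choice but to answer from training data.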
Performance Optimization
GPU Optimization
# Check GPU utilization
nvidia-smi

# Pin inference to a single GPU on multi-GPU machines
export CUDA_VISIBLE_DEVICES=0
# Note: recent llama.cpp builds use CUDA graphs automatically; no env var is needed

# Quantization comparison for Mistral 7B on RTX 4090:
# Q4_K_M: ~4.5GB VRAM, 45 tok/s
# Q5_K_M: ~5.5GB VRAM, 38 tok/s
# Q8_0:   ~8.5GB VRAM, 28 tok/s
# Q4_K_M is the best speed/quality tradeoff for most use cases
CPU Optimization
# Enable all CPU optimizations
export OMP_NUM_THREADS=8  # Set to your physical core count
export MKL_NUM_THREADS=8

# For Apple Silicon (M1/M2/M3), build llama-cpp-python with the Metal backend
# (brew's llama.cpp is the standalone CLI, not the Python bindings PrivateGPT uses):
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

# Settings for Metal:
llama_cpp:
  n_gpu_layers: -1  # All layers on the GPU
  use_mmap: true
  use_mlock: false
Retrieval Optimization
# Multi-stage retrieval for better results
retrieval:
  mode: advanced
  # Stage 1: broad retrieval
  initial_k: 20
  initial_threshold: 0.1
  # Stage 2: rerank
  rerank: true
  rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2
  # Stage 3: final selection
  final_k: 5
  diversity_penalty: 0.1  # Encourage diverse results
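The three stages reduce to: over-retrieve, re-score, then pick final_k while penalizing near-duplicates. A toy sketch of the final stage with an MMR-style diversity penalty; candidate scores stand in for the real cross-encoder output, and the similarity function is whatever you supply:

```python
from typing import Callable, List, Tuple

def select_diverse(candidates: List[Tuple[str, float]],
                   similarity: Callable[[str, str], float],
                   final_k: int = 5,
                   diversity_penalty: float = 0.1) -> List[str]:
    """Greedy selection: best remaining score minus a penalty for
    similarity to anything already selected (MMR-style)."""
    selected: List[str] = []
    pool = dict(candidates)
    while pool and len(selected) < final_k:
        best = max(
            pool,
            key=lambda d: pool[d] - diversity_penalty * max(
                (similarity(d, s) for s in selected), default=0.0),
        )
        selected.append(best)
        del pool[best]
    return selected
```

With the penalty at 0.1 a slightly lower-scored but dissimilar chunk beats a near-duplicate of one already chosen, which is exactly the behavior the diversity_penalty setting describes.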
Deployment as API Service
# Run as background service
nohup python -m private_gpt > privategpt.log 2>&1 &
# Or use systemd for production: create /etc/systemd/system/private-gpt.service,
# then: sudo systemctl daemon-reload && sudo systemctl enable --now private-gpt
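A minimal sketch of such a unit file; the user, working directory, venv path, and profile name are assumptions to adapt to your setup:

```ini
# /etc/systemd/system/private-gpt.service
[Unit]
Description=PrivateGPT API service
After=network.target

[Service]
Type=simple
User=privategpt
WorkingDirectory=/opt/private-gpt
Environment=PGPT_PROFILES=local
ExecStart=/opt/private-gpt/venv/bin/python -m private_gpt
Restart=on-failure

[Install]
WantedBy=multi-user.target
```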
API Usage Examples
import requests

BASE_URL = "http://localhost:8001"

# 1. Ingest a document (multipart upload)
with open("document.pdf", "rb") as f:
    response = requests.post(f"{BASE_URL}/v1/ingest/file", files={"file": f})
print(response.json())

# 2. Ask a question grounded in your documents
response = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        "prompt": "What does the document say about X?",
        "use_context": True,      # answer from ingested docs, not training data
        "include_sources": True,  # return the chunks the answer was based on
    },
)
choice = response.json()["choices"][0]
print(choice["message"]["content"])
print("Sources:", [s["document"]["doc_metadata"] for s in choice["sources"]])

# 3. Retrieve matching chunks without LLM generation
response = requests.post(
    f"{BASE_URL}/v1/chunks",
    json={"text": "machine learning", "limit": 5},
)
chunks = response.json()
Comparison with Alternatives
- When to use each based on your hardware
- Full-featured platform vs simple solution
- Add a web UI to the PrivateGPT backend
- Official repository and issues